arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

Weiqi Wang; Jiefu Ou; Yangqiu Song; Benjamin Van Durme; Daniel Khashabi

arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi

Abstract

Literature review tables are essential for summarizing and comparing collections of scientific papers. In this paper, we study automatic generation of such tables from a pool of papers to satisfy a user’s information need. Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via semantically related but out-of-scope distractor papers verified by human annotators, and (iii) introducing a lightweight, annotation-free, utilization-oriented evaluation that decomposes utility (schema coverage, unary cell fidelity, pairwise relational consistency) and measures paper selection via a two-way QA procedure (gold→system and system→gold) with recall, precision, and F1. To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands. We also develop an iterative, batch-based generation method that co-refines paper filtering and schema over multiple rounds. We validate the evaluation protocol with human audits and cross-evaluator checks. Extensive experiments show that our method consistently improves over strong baselines, while absolute scores remain modest, underscoring the task’s difficulty. Our data and code is available at https://github.com/JHU-CLSP/arXiv2Table.

Anthology ID:: 2026.acl-long.346
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 7602–7624
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.346/
DOI:
Bibkey:
Cite (ACL):: Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, and Daniel Khashabi. 2026. arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7602–7624, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation (Wang et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.346.pdf
Checklist:: 2026.acl-long.346.checklist.pdf

PDF Cite Search Checklist Fix data