AnalystBench: Benchmarking professional long-form report generation with web-mined multimodal tasks

Chau Minh Pham; Zichao Wang; Puneet Mathur; Alexa Siu; Akriti Jain; Aparna Garimella; Ananya B. Sai; Nedim Lipka; Mohit Iyyer; Varun Manjunatha

Abstract

Large language models are increasingly used to draft long-form multimodal documents, but their end-to-end performance on professional report generation remains systematically understudied. We introduce AnalystBench, a continually extensible benchmark of 20 real-world report generation tasks grounded in multimodal document collections, where models must process millions of input tokens to produce long-form professional reports. Using expert-validated quality checklists and groundedness evaluation, we evaluate LLMs and coding agents and find that the best model, GPT-5.1, scores highly on executive summarization tasks (exceeding 90% on quality checklists) but degrades substantially on tasks requiring long-horizon synthesis over large inputs (dropping to 25-40%). Agent-based generation substantially benefits strong closed-source models such as GPT-5.1, with checklist scores improving by 20.24 percentage points and visual coverage by 37.41 points over vanilla generation, but offers little or negative gains for open-source models like DeepSeek-R1 (-3.02 points). Expert reviewers note that while generated reports are grounded and clearly separate factual description from interpretation, they often fall short in actionability, clarity, and quantitative precision, which highlights the gap between system performance and real-world professional needs.

Anthology ID: 2026.findings-acl.1197
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 23894–23926
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1197/
Cite (ACL): Chau Minh Pham, Zichao Wang, Puneet Mathur, Alexa Siu, Akriti Jain, Aparna Garimella, Ananya B. Sai, Nedim Lipka, Mohit Iyyer, and Varun Manjunatha. 2026. AnalystBench: Benchmarking professional long-form report generation with web-mined multimodal tasks. In Findings of the Association for Computational Linguistics: ACL 2026, pages 23894–23926, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AnalystBench: Benchmarking professional long-form report generation with web-mined multimodal tasks (Pham et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1197.pdf
Checklist: 2026.findings-acl.1197.checklist.pdf
