LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin
Abstract
Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world relevance and verifiable assessment through Target-Anchored Evaluation (TAE). TAE constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and anchors based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase. The dataset will be made publicly available.
- Anthology ID:
- 2025.findings-emnlp.549
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10386–10417
- URL:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.549/
- DOI:
- 10.18653/v1/2025.findings-emnlp.549
- Cite (ACL):
- Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, and Junyang Lin. 2025. LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10386–10417, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability (Xiao et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.549.pdf