Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets
Chinonso Cynthia Osuji, Simon Mille, Ornait O’Connell, Thiago Castro Ferreira, Anya Belz, Brian Davis
Abstract
The ability of LLMs to write coherent, faithful long texts from structured data inputs remains relatively uncharted, in part because nearly all public data-to-text datasets contain only short input-output pairs. To address this gap, we benchmark six LLMs, a rule-based system, and human-written texts on a new long-input dataset in English and Irish via LLM-based evaluation. We find substantial differences between models and languages.
- Anthology ID: 2025.inlg-main.47
- Volume: Proceedings of the 18th International Natural Language Generation Conference
- Month: October
- Year: 2025
- Address: Hanoi, Vietnam
- Editors: Lucie Flek, Shashi Narayan, Lê Hồng Phương, Jiahuan Pei
- Venue: INLG
- SIG: SIGGEN
- Publisher: Association for Computational Linguistics
- Pages: 810–822
- URL: https://aclanthology.org/2025.inlg-main.47/
- Cite (ACL): Chinonso Cynthia Osuji, Simon Mille, Ornait O’Connell, Thiago Castro Ferreira, Anya Belz, and Brian Davis. 2025. Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets. In Proceedings of the 18th International Natural Language Generation Conference, pages 810–822, Hanoi, Vietnam. Association for Computational Linguistics.
- Cite (Informal): Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets (Osuji et al., INLG 2025)
- PDF: https://aclanthology.org/2025.inlg-main.47.pdf