Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets

Chinonso Cynthia Osuji, Simon Mille, Ornait O’Connell, Thiago Castro Ferreira, Anya Belz, Brian Davis

Abstract
The ability of LLMs to write coherent, faithful long texts from structured data inputs remains largely unexplored, in part because nearly all public data-to-text datasets contain only short input-output pairs. To address this gap, we benchmark six LLMs, a rule-based system, and human-written texts on a new long-input dataset in English and Irish, using LLM-based evaluation. We find substantial differences between models and languages.
Anthology ID:
2025.inlg-main.47
Volume:
Proceedings of the 18th International Natural Language Generation Conference
Month:
October
Year:
2025
Address:
Hanoi, Vietnam
Editors:
Lucie Flek, Shashi Narayan, Lê Hồng Phương, Jiahuan Pei
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
810–822
URL:
https://aclanthology.org/2025.inlg-main.47/
Cite (ACL):
Chinonso Cynthia Osuji, Simon Mille, Ornait O’Connell, Thiago Castro Ferreira, Anya Belz, and Brian Davis. 2025. Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets. In Proceedings of the 18th International Natural Language Generation Conference, pages 810–822, Hanoi, Vietnam. Association for Computational Linguistics.
Cite (Informal):
Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets (Osuji et al., INLG 2025)
PDF:
https://aclanthology.org/2025.inlg-main.47.pdf