A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs

Olle Torstensson, Oskar Holmström


Abstract
We investigate whether synthetic pretraining data generated from a formal grammar modeling syntactic dependencies can improve English language models. Building upon the structured pretraining data approach of Papadimitriou and Jurafsky (2023), we develop a grammar that more closely mirrors empirical dependency structures. Our results are negative: this type of pretraining significantly degrades model performance, and both our pretraining approach and theirs perform worse than no pretraining at all. We analyze potential explanations for these findings and discuss implications for future work on structured-data pretraining.
Anthology ID:
2025.cgmta-1.7
Volume:
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Trond Trosterud, Linda Wiechetek, Flammie Pirinen
Venues:
cgmta | WS
Publisher:
University of Tartu Library
Pages:
45–49
URL:
https://preview.aclanthology.org/moar-dois/2025.cgmta-1.7/
Cite (ACL):
Olle Torstensson and Oskar Holmström. 2025. A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs. In Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP, pages 45–49, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs (Torstensson & Holmström, cgmta 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.cgmta-1.7.pdf