Evaluating distillation methods for data-efficient syntax learning
Takateru Yamakoshi, Thomas L. Griffiths, R. Thomas McCoy, Robert D. Hawkins
Abstract
Data-efficient training requires strong inductive biases. To the extent that transformer attention matrices encode syntactic relationships, we would predict that knowledge distillation (KD) targeting attention should selectively accelerate syntax acquisition relative to conventional logit-based KD. To test this hypothesis, we train GPT-2 student models on datasets ranging from 10K to 5M sentences using both distillation methods, evaluating them on both syntactic benchmarks and perplexity. Surprisingly, while logit-based KD dramatically improves data-efficiency, attention-based KD provides minimal benefit even for syntactic tasks. This suggests that output distributions provide sufficient supervisory signal for syntax acquisition, indicating that syntactic knowledge may be distributed throughout the network rather than localized in attention patterns.
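To make the contrast between the two objectives concrete, below is a minimal sketch (not the authors' released code) of a logit-based and an attention-based KD loss for a GPT-2 teacher/student pair. The student size, the temperature, and the teacher-to-student layer mapping are illustrative assumptions; the paper's exact configuration may differ.

```python
# Hedged sketch of the two distillation objectives compared in the paper.
# Model sizes, temperature, and layer mapping are assumptions for illustration.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Config, GPT2TokenizerFast

teacher = GPT2LMHeadModel.from_pretrained("gpt2").eval()     # assumed 12-layer teacher
student = GPT2LMHeadModel(GPT2Config(n_layer=6, n_head=12))  # assumed 6-layer student
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Conventional logit-based KD: KL divergence between softened output distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def attention_kd_loss(student_attns, teacher_attns):
    """Attention-based KD: match student attention maps to teacher attention maps.
    Here each student layer is paired with every other teacher layer (an assumption)."""
    step = len(teacher_attns) // len(student_attns)
    loss = 0.0
    for i, s_attn in enumerate(student_attns):
        t_attn = teacher_attns[(i + 1) * step - 1]
        loss = loss + F.mse_loss(s_attn, t_attn)
    return loss / len(student_attns)

# One training example; a real run would iterate over the training corpus.
batch = tokenizer(["The keys to the cabinet are on the table."], return_tensors="pt")
with torch.no_grad():
    t_out = teacher(**batch, output_attentions=True)
s_out = student(**batch, output_attentions=True)

loss_logit = logit_kd_loss(s_out.logits, t_out.logits)
loss_attn = attention_kd_loss(s_out.attentions, t_out.attentions)
```

In this framing, the two methods supervise different parts of the network: the logit loss constrains only the output distribution, while the attention loss directly constrains intermediate attention patterns.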
- Anthology ID: 2025.findings-emnlp.801
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 14834–14847
- URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.801/
- DOI: 10.18653/v1/2025.findings-emnlp.801
- Cite (ACL): Takateru Yamakoshi, Thomas L. Griffiths, R. Thomas McCoy, and Robert D. Hawkins. 2025. Evaluating distillation methods for data-efficient syntax learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14834–14847, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Evaluating distillation methods for data-efficient syntax learning (Yamakoshi et al., Findings 2025)
- PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.801.pdf