Richard He Bai
2025
Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto | Maartje Ter Hoeve | Richard He Bai | Natalie Schluter | David Grangier
Findings of the Association for Computational Linguistics: ACL 2025
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language with insufficient pretraining data for training a high performing language model by enlisting data from an auxiliary language for which high quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
2024
Divide-or-Conquer? Which Part Should You Distill Your LLM?
Zhuofeng Wu | Richard He Bai | Aonan Zhang | Jiatao Gu | V.G. Vinod Vydiswaran | Navdeep Jaitly | Yizhe Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper we devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. Further, we hypothesize that the decomposition should be easier to distill into a smaller model compared to the problem solving because the latter requires large amounts of domain knowledge while the former only requires learning general problem solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost. We find that we can distill the problem decomposition phase and at the same time achieve good generalization across tasks, datasets, and models. However, it is harder to distill the problem solving capability without losing performance and the resulting distilled model struggles with generalization. These results indicate that by using smaller, distilled problem decomposition models in combination with problem solving LLMs we can achieve reasoning with cost-efficient inference and local adaptation.
Co-authors
- David Grangier 1
- Jiatao Gu 1
- Navdeep Jaitly 1
- Natalie Schluter 1
- Skyler Seto 1