Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis

Sabri Boughorbel, Md Rizwan Parvez, Majd Hawasly


Abstract
Training LLMs in low-resource languages usually relies on machine translation (MT) to augment data from English. However, translation brings a number of challenges: translating and curating huge amounts of content with high-end machine translation solutions is costly; the translated content carries over cultural biases; and if the translation is not faithful and accurate, the quality of the data degrades, causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year-old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story-generation models with 1M-33M parameters on this data and identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models on a small dataset of high-quality stories synthesized in Arabic by a capable LLM, amounting to 1% of the original training data. Using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, we show that the suggested approach is a practical means to resolve some of the pitfalls of translation. We illustrate the improvement through case studies of linguistic and cultural bias issues.
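For readers who want a concrete picture of the translation step described in the abstract, the following is a minimal sketch of English-to-Arabic translation with NLLB via Hugging Face Transformers. The checkpoint name ("facebook/nllb-200-3.3B"), language codes, and generation settings are assumptions for illustration; the paper's exact translation pipeline is not specified on this page.

```python
# Minimal sketch of the English->Arabic translation step with NLLB,
# assuming the "facebook/nllb-200-3.3B" checkpoint and default greedy
# generation; the authors' actual pipeline and settings may differ.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/nllb-200-3.3B"  # assumed checkpoint for "NLLB-3B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate_story(story_en: str) -> str:
    """Translate one English short story into Modern Standard Arabic."""
    inputs = tokenizer(story_en, return_tensors="pt", truncation=True, max_length=1024)
    generated = model.generate(
        **inputs,
        # Force the decoder to start in Arabic (NLLB language code "arb_Arab").
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"),
        max_length=1024,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate_story("Once upon a time, there was a little girl named Lily."))
```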
Anthology ID:
2024.arabicnlp-1.7
Volume:
Proceedings of The Second Arabic Natural Language Processing Conference
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini
Venues:
ArabicNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
73–88
URL:
https://aclanthology.org/2024.arabicnlp-1.7
DOI:
10.18653/v1/2024.arabicnlp-1.7
Cite (ACL):
Sabri Boughorbel, Md Rizwan Parvez, and Majd Hawasly. 2024. Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis. In Proceedings of The Second Arabic Natural Language Processing Conference, pages 73–88, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis (Boughorbel et al., ArabicNLP-WS 2024)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.arabicnlp-1.7.pdf