Abstract
Open source large language models (LLMs) have improved greatly in recent times, but many of these models focus solely on widely spoken languages. We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages, consisting of human-generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in six languages, finding that our multilingual model outperforms previous state-of-the-art open source LLMs in each language. We further find that training on more multilingual data benefits performance in a chosen target language (Japanese) more than training only on data in that language. These results indicate the necessity of training on large amounts of high-quality multilingual data to make LLMs more accessible.
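For readers who want to inspect the data themselves, here is a minimal sketch of loading the Tagengo dataset with the Hugging Face `datasets` library. The dataset ID `lightblue/tagengo-gpt4` and the column names (`language`, `conversations` in ShareGPT style) are assumptions based on the public release accompanying the paper, not details stated in this abstract.

```python
from collections import Counter

from datasets import load_dataset

# Assumed dataset ID and schema for the public Tagengo release;
# adjust the ID and column names if the hosted version differs.
ds = load_dataset("lightblue/tagengo-gpt4", split="train")

# Count prompt-response pairs per language to see the 74-language spread.
lang_counts = Counter(ds["language"])
for lang, n in lang_counts.most_common(10):
    print(f"{lang}: {n}")

# Each example pairs a human-generated prompt with a synthetic response,
# assumed here to be stored as a ShareGPT-style `conversations` list.
for turn in ds[0]["conversations"]:
    print(turn["from"], ":", turn["value"][:80])
```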
- Anthology ID: 2024.mrl-1.6
- Volume: Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Jonne Sälevä, Abraham Owodunni
- Venue: MRL
- Publisher: Association for Computational Linguistics
- Pages: 106–113
- URL: https://aclanthology.org/2024.mrl-1.6
- DOI: 10.18653/v1/2024.mrl-1.6
- Cite (ACL): Peter Devine. 2024. Tagengo: A Multilingual Chat Dataset. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 106–113, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Tagengo: A Multilingual Chat Dataset (Devine, MRL 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.mrl-1.6.pdf