TEMA: Token Embeddings Mapping for Enriching Low-Resource Language Models

Rodolfo Zevallos, Núria Bel, Mireia Farrús


Abstract
The objective of the research we present is to remedy the low quality of language models for low-resource languages. We introduce an algorithm, the Token Embedding Mapping Algorithm (TEMA), that maps the token embeddings of a richly pre-trained model L1 to a poorly trained model L2, thus creating a richer L2’ model. Our experiments show that the L2’ model reduces perplexity with respect to the original monolingual model L2, and that for downstream tasks, including SuperGLUE, the results are state-of-the-art or better for the most semantically oriented tasks. The models obtained with TEMA are also competitive with or better than multilingual or extended models proposed as solutions for mitigating the low-resource language problem.
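The page gives only this high-level description of TEMA, not the algorithm itself, so the paper's exact procedure is not reproduced here. As an illustration of what "mapping token embeddings from an L1 space to an L2 space" can look like in general, the sketch below learns an orthogonal linear map between two embedding matrices on a set of anchor (shared) tokens via the classic orthogonal Procrustes solution, then projects L1 embeddings into the L2 space. The function name, the anchor-token setup, and the choice of an orthogonal map are assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def map_embeddings(E1_anchor, E2_anchor, E1_full):
    """Learn an orthogonal map W from the L1 embedding space to the
    L2 space using anchor (shared) tokens, then project all L1 rows.

    Solves the orthogonal Procrustes problem
        W* = argmin_{W orthogonal} ||E1_anchor @ W - E2_anchor||_F
    via the SVD of E1_anchor.T @ E2_anchor.

    NOTE: this is a generic embedding-mapping sketch, not the TEMA
    algorithm from the paper.
    """
    # SVD of the cross-covariance between the two anchor matrices
    U, _, Vt = np.linalg.svd(E1_anchor.T @ E2_anchor)
    W = U @ Vt              # orthogonal d x d mapping matrix
    return E1_full @ W      # L1 embeddings expressed in the L2 space

# Toy example: 4 anchor tokens with 3-dimensional embeddings, where
# the L1 space is an exact rotation of the L2 space.
rng = np.random.default_rng(0)
E2 = rng.normal(size=(4, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
E1 = E2 @ R.T                                  # rotate L2 into "L1 space"
mapped = map_embeddings(E1, E2, E1)
print(np.allclose(mapped, E2, atol=1e-8))      # True: rotation recovered
```

Because the toy L1 space is an exact rotation of the L2 space, the Procrustes map recovers the L2 embeddings perfectly; with real models the two spaces only align approximately, and methods differ in how they pick anchors and whether they allow non-orthogonal or nonlinear maps.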
Anthology ID: 2024.emnlp-main.638
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 11423–11435
URL: https://aclanthology.org/2024.emnlp-main.638
DOI: 10.18653/v1/2024.emnlp-main.638
Cite (ACL): Rodolfo Zevallos, Núria Bel, and Mireia Farrús. 2024. TEMA: Token Embeddings Mapping for Enriching Low-Resource Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11423–11435, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): TEMA: Token Embeddings Mapping for Enriching Low-Resource Language Models (Zevallos et al., EMNLP 2024)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.638.pdf