@inproceedings{tashu-tudor-2025-mapping,
title = "Mapping Cross-Lingual Sentence Representations for Low-Resource Language Pairs Using Pre-trained Language Models",
author = "Tashu, Tsegaye Misikir and
Tudor, Andreea Ioana",
editor = "Hettiarachchi, Hansi and
Ranasinghe, Tharindu and
Rayson, Paul and
Mitkov, Ruslan and
Gaber, Mohamed and
Premasiri, Damith and
Tan, Fiona Anting and
Uyangodage, Lasitha",
booktitle = "Proceedings of the First Workshop on Language Models for Low-Resource Languages",
month = jan,
year = "2025",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/add-emnlp-2024-awards/2025.loreslm-1.20/",
pages = "249--257",
abstract = "In this work, we explore different linear mapping techniques to learn cross-lingual document representations from pre-trained multilingual large language models for low-resource languages. Three different mapping techniques namely Linear Concept Approximation (LCA), Linear Concept Compression (LCC), and Neural Concept Approximation (NCA) and four multilingual language models such as mBERT, mT5, XLM-R, and ErnieM were used to extract embeddings. The inter-lingual representations were created mappings the monolingual representation extracted from multilingual language models. The experimental results showed that LCA and LCC significantly outperform NCA, with models like ErnieM achieving the highest alignment quality. Language pairs exhibit variable performance, influenced by linguistic similarity and data availability, with the Amharic-English pair yielding particularly high scores. The results showed the utility of LCA and LCC in enabling cross-lingual tasks for low-resource languages."
}
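
To make the abstract's core idea concrete, below is a minimal sketch of a generic linear cross-lingual mapping: given parallel sentence embeddings from a multilingual encoder, fit a linear map by least squares and score alignment by nearest-neighbour retrieval. This is an illustrative baseline only, not the paper's LCA, LCC, or NCA implementations; the function names, the 768-dimensional embeddings, and the random stand-in data are all assumptions.

```python
import numpy as np

def fit_linear_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Least-squares solution W minimising ||XW - Y||_F (a generic
    linear-mapping baseline, not the paper's LCA/LCC/NCA)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def precision_at_1(X: np.ndarray, Y: np.ndarray, W: np.ndarray) -> float:
    """How often a mapped source embedding's nearest target neighbour
    (by cosine similarity) is its true parallel sentence."""
    mapped = X @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    targets = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = mapped @ targets.T  # pairwise cosine similarities
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(X))))

# Random stand-ins for embeddings that would come from a multilingual
# encoder such as mBERT or XLM-R (768 dims assumed, as in BERT-base).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))                     # "source" sentences
Y = X @ rng.standard_normal((768, 768)) / np.sqrt(768)  # linear "target" view
Y += 0.05 * rng.standard_normal(Y.shape)                # plus a little noise
W = fit_linear_map(X[:400], Y[:400])                    # fit on a train split
print(f"P@1 on held-out pairs: {precision_at_1(X[400:], Y[400:], W):.2f}")
```

In practice, X and Y would hold sentence embeddings of translation pairs extracted from the multilingual models named in the abstract, and retrieval precision would serve as one possible measure of the alignment quality the paper compares across techniques.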