Jianing Liu
2025
Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
Zeli Su | Ziyin Zhang | Guixian Xu | Jianing Liu | Xu Han | Ting Zhang | Yushuang Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu | Zeli Su | Ziyin Zhang | Jianing Liu | Xu Han | Ting Zhang | Yushuang Dong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.
Co-authors
- Yushuang Dong 2
- Xu Han (韩旭) 2
- Zeli Su 2
- Guixian Xu 2
- Ziyin Zhang 2