2025
Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu
Renhao Pei | Yihong Liu | Peiqin Lin | François Yvon | Hinrich Schuetze
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an enciphered version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap a conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.
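A minimal sketch of how such resources might be assembled into a single translation prompt; the resource formats, the function name build_prompt, and the selection heuristics are illustrative assumptions, not the setup used in the paper.

```python
# Hypothetical sketch: combining a dictionary, grammar notes, and retrieved
# parallel examples into one in-context MT prompt. All formats are assumed.

def build_prompt(source_sentence, dictionary, grammar_notes, parallel_examples):
    """Assemble selected linguistic resources into a prompt for a translation LLM."""
    # Keep only dictionary entries for words that actually occur in the input.
    relevant_entries = [
        f"{word}: {gloss}"
        for word, gloss in dictionary.items()
        if word in source_sentence.split()
    ]
    lines = ["Translate the following Manchu sentence into English.", ""]
    if relevant_entries:
        lines += ["Dictionary entries:"] + relevant_entries + [""]
    if grammar_notes:
        lines += ["Grammar notes:"] + list(grammar_notes) + [""]
    for src, tgt in parallel_examples:            # retrieved parallel examples
        lines += [f"Manchu: {src}", f"English: {tgt}", ""]
    lines += [f"Manchu: {source_sentence}", "English:"]
    return "\n".join(lines)
```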
Taxi1500: A Dataset for Multilingual Text Classification in 1500 Languages
Chunlan Ma | Ayyoob Imani | Haotian Ye | Renhao Pei | Ehsaneddin Asgari | Hinrich Schuetze
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
While broad-coverage multilingual natural language processing tools have been developed, a significant portion of the world’s over 7000 languages are still neglected. One reason is the lack of evaluation datasets that cover a diverse range of languages, particularly those that are low-resource or endangered. To address this gap, we present a large-scale text classification dataset encompassing 1504 languages, many of which have otherwise limited or no annotated data. This dataset is constructed using parallel translations of the Bible. We develop relevant topics, annotate the English data through crowdsourcing, and project these annotations onto other languages via aligned verses. We benchmark a range of existing multilingual models on this dataset. We make our dataset and code available to the public.
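A minimal sketch of the annotation-projection step described above, assuming verse IDs shared across Bible translations; the data layout and the function project_labels are hypothetical.

```python
# Hypothetical sketch: projecting English verse-level topic labels onto another
# language via shared verse IDs. Input formats are assumptions for illustration.

def project_labels(english_labels, target_verses):
    """Map each labeled English verse to the same verse in the target language."""
    projected = {}
    for verse_id, label in english_labels.items():
        if verse_id in target_verses:             # verse exists in the target translation
            projected[verse_id] = (target_verses[verse_id], label)
    return projected

english_labels = {"MAT 5:3": "Faith", "PRO 10:4": "Economics"}
target_verses = {"MAT 5:3": "<target-language verse text>",
                 "PRO 10:4": "<target-language verse text>"}
print(project_labels(english_labels, target_verses))
```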
2023
A Crosslingual Investigation of Conceptualization in 1335 Languages
Yihong Liu | Haotian Ye | Leonie Weissweiler | Philipp Wicke | Renhao Pei | Robert Zangenfeind | Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity between two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracies between 54% and 87%.
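A minimal sketch of the 1-1 correspondence idea behind crosslingual stability, assuming a simple mapping from each language to the target strings aligned with a concept; the dominance threshold and the data layout are illustrative assumptions, not the paper's definition.

```python
# Hypothetical sketch: a concept is treated as having a 1-1 correspondence in a
# language if one target string dominates its alignments; stability is the share
# of languages where that holds. Threshold and input format are assumed.

from collections import Counter

def stability(alignments, dominance=0.9):
    """alignments: {language: [target strings aligned with the concept]}."""
    one_to_one = 0
    for strings in alignments.values():
        if not strings:
            continue
        counts = Counter(strings)
        top_count = counts.most_common(1)[0][1]
        if top_count / sum(counts.values()) >= dominance:   # a single dominant string
            one_to_one += 1
    return one_to_one / len(alignments) if alignments else 0.0
```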
Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs
Yihong Liu | Haotian Ye | Leonie Weissweiler | Renhao Pei | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2023
In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet’s nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train ColexNet+, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate ColexNet+ on roundtrip translation, sentence retrieval and sentence classification and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.