2023
pdf
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages
Viktor Hangya
|
Silvia Severini
|
Radoslav Ralev
|
Alexander Fraser
|
Hinrich Schütze
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
pdf
abs
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob ImaniGooghari
|
Peiqin Lin
|
Amir Hossein Kargaran
|
Silvia Severini
|
Masoud Jalili Sabet
|
Nora Kassner
|
Chunlan Ma
|
Helmut Schmid
|
André Martins
|
François Yvon
|
Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, “help” from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should notlimit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at
https://github.com/cisnlp/Glot500.
2022
pdf
abs
Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages
Silvia Severini
|
Ayyoob ImaniGooghari
|
Philipp Dufter
|
Hinrich Schütze
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.
pdf
abs
Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging
Ayyoob ImaniGooghari
|
Silvia Severini
|
Masoud Jalili Sabet
|
François Yvon
|
Hinrich Schütze
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
pdf
abs
Don’t Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings
Silvia Severini
|
Viktor Hangya
|
Masoud Jalili Sabet
|
Alexander Fraser
|
Hinrich Schütze
Proceedings of the BUCC Workshop within LREC 2022
Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words with an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that such cheap signals work well and that they outperform using more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In addition, they are even competitive with the use of high-quality lexicons in supervised approaches. Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
2020
pdf
abs
LMU Bilingual Dictionary Induction System with Word Surface Similarity Scores for BUCC 2020
Silvia Severini
|
Viktor Hangya
|
Alexander Fraser
|
Hinrich Schütze
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
The task of Bilingual Dictionary Induction (BDI) consists of generating translations for source language words which is important in the framework of machine translation (MT). The aim of the BUCC 2020 shared task is to perform BDI on various language pairs using comparable corpora. In this paper, we present our approach to the task of English-German and English-Russian language pairs. Our system relies on Bilingual Word Embeddings (BWEs) which are often used for BDI when only a small seed lexicon is available making them particularly effective in a low-resource setting. On the other hand, they perform well on high frequency words only. In order to improve the performance on rare words as well, we combine BWE based word similarity with word surface similarity methods, such as orthography In addition to the often used top-n translation method, we experiment with a margin based approach aiming for dynamic number of translations for each source word. We participate in both the open and closed tracks of the shared task and we show improved results of our method compared to simple vector similarity based approaches. Our system was ranked in the top-3 teams and achieved the best results for English-Russian.
pdf
abs
Combining Word Embeddings with Bilingual Orthography Embeddings for Bilingual Dictionary Induction
Silvia Severini
|
Viktor Hangya
|
Alexander Fraser
|
Hinrich Schütze
Proceedings of the 28th International Conference on Computational Linguistics
Bilingual dictionary induction (BDI) is the task of accurately translating words to the target language. It is of great importance in many low-resource scenarios where cross-lingual training data is not available. To perform BDI, bilingual word embeddings (BWEs) are often used due to their low bilingual training signal requirements. They achieve high performance, but problematic cases still remain, such as the translation of rare words or named entities, which often need to be transliterated. In this paper, we enrich BWE-based BDI with transliteration information by using Bilingual Orthography Embeddings (BOEs). BOEs represent source and target language transliteration word pairs with similar vectors. A key problem in our BDI setup is to decide which information source – BWEs (or semantics) vs. BOEs (or orthography) – is more reliable for a particular word pair. We propose a novel classification-based BDI system that uses BWEs, BOEs and a number of other features to make this decision. We test our system on English-Russian BDI and show improved performance. In addition, we show the effectiveness of our BOEs by successfully using them for transliteration mining based on cosine similarity.