Shuo Sun


MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset
Marina Fomicheva | Shuo Sun | Erick Fonseca | Chrysoula Zerva | Frédéric Blain | Vishrav Chaudhary | Francisco Guzmán | Nina Lopatina | Lucia Specia | André F. T. Martins
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains annotations for eleven language pairs, including both high- and low-resource languages. Specifically, it is annotated for translation quality with human labels for up to 10,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level binary good/bad labels. Apart from the quality-related scores, each source-translation sentence pair is accompanied by the corresponding post-edited sentence, as well as titles of the articles where the sentences were extracted from, and information on the neural MT models used to translate the text. We provide a thorough description of the data collection and annotation process as well as an analysis of the annotation distribution for each language pair. We also report the performance of baseline systems trained on the MLQE-PE dataset. The dataset is freely available and has already been used for several WMT shared tasks.

AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages
Odunayo Ogundepo | Xinyu Zhang | Shuo Sun | Kevin Duh | Jimmy Lin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Language diversity in NLP is critical in enabling the development of tools for a wide range of users.However, there are limited resources for building such tools for many languages, particularly those spoken in Africa.For search, most existing datasets feature few or no African languages, directly impacting researchers’ ability to build and improve information access capabilities in those languages.Motivated by this, we created AfriCLIRMatrix, a test collection for cross-lingual information retrieval research in 15 diverse African languages.In total, our dataset contains 6 million queries in English and 23 million relevance judgments automatically mined from Wikipedia inter-language links, covering many more African languages than any existing information retrieval test collection.In addition, we release BM25, dense retrieval, and sparse–dense hybrid baselines to provide a starting point for the development of future systems.We hope that these efforts can spur additional work in search for African languages.AfriCLIRMatrix can be downloaded at


Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications
Shuo Sun | Ahmed El-Kishky | Vishrav Chaudhary | James Cross | Lucia Specia | Francisco Guzmán
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence-level Quality estimation (QE) of machine translation is traditionally formulated as a regression task, and the performance of QE models is typically measured by Pearson correlation with human labels. Recent QE models have achieved previously-unseen levels of correlation with human judgments, but they rely on large multilingual contextualized language models that are computationally expensive and make them infeasible for real-world applications. In this work, we evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting. We observe that a full model parameterization is required to achieve SoTA results in a regression task. However, we argue that the level of expressiveness of a model in a continuous range is unnecessary given the downstream applications of QE, and show that reframing QE as a classification problem and evaluating QE models using classification metrics would better reflect their actual performance in real-world applications.

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk | Vishrav Chaudhary | Shuo Sun | Hongyu Gong | Francisco Guzmán
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but we systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 16720 different language pairs, out of which only 34M are aligned with English. This corpus is freely available. To get an indication on the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 languages pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.


Collecting Verified COVID-19 Question Answer Pairs
Adam Poliak | Max Fleming | Cash Costello | Kenton Murray | Mahsa Yarmohammadi | Shivani Pandya | Darius Irani | Milind Agarwal | Udit Sharma | Shuo Sun | Nicola Ivanov | Lingxi Shang | Kaushik Srinivasan | Seolhwa Lee | Xu Han | Smisha Agarwal | João Sedoc
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We release a dataset of over 2,100 COVID19 related Frequently asked Question-Answer pairs scraped from over 40 trusted websites. We include an additional 24, 000 questions pulled from online sources that have been aligned by experts with existing answered questions from our dataset. This paper describes our efforts in collecting the dataset and summarizes the resulting data. Our dataset is automatically updated daily and available at scraping-qas. So far, this data has been used to develop a chatbot providing users information about COVID-19. We encourage others to build analytics and tools upon this dataset as well.

BERGAMOT-LATTE Submissions for the WMT20 Quality Estimation Shared Task
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Vishrav Chaudhary | Mark Fishel | Francisco Guzmán | Lucia Specia
Proceedings of the Fifth Conference on Machine Translation

This paper presents our submission to the WMT2020 Shared Task on Quality Estimation (QE). We participate in Task and Task 2 focusing on sentence-level prediction. We explore (a) a black-box approach to QE based on pre-trained representations; and (b) glass-box approaches that leverage various indicators that can be extracted from the neural MT systems. In addition to training a feature-based regression model using glass-box quality indicators, we also test whether they can be used to predict MT quality directly with no supervision. We assess our systems in a multi-lingual setting and show that both types of approaches generalise well across languages. Our black-box QE models tied for the winning submission in four out of seven language pairs inTask 1, thus demonstrating very strong performance. The glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.

An Exploratory Study on Multilingual Quality Estimation
Shuo Sun | Marina Fomicheva | Frédéric Blain | Vishrav Chaudhary | Ahmed El-Kishky | Adithya Renduchintala | Francisco Guzmán | Lucia Specia
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.

Unsupervised Quality Estimation for Neural Machine Translation
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Francisco Guzmán | Mark Fishel | Nikolaos Aletras | Vishrav Chaudhary | Lucia Specia
Transactions of the Association for Computational Linguistics, Volume 8

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it is aimed to inform the user on the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.

Multi-Reward based Reinforcement Learning for Neural Machine Translation
Shuo Sun | Hongxu Hou | Nier Wu | Ziyue Guo | Chaowei Zhang
Proceedings of the 19th Chinese National Conference on Computational Linguistics

Reinforcement learning (RL) has made remarkable progress in neural machine translation (NMT). However, it exists the problems with uneven sampling distribution, sparse rewards and high variance in training phase. Therefore, we propose a multi-reward reinforcement learning training strategy to decouple action selection and value estimation. Meanwhile, our method combines with language model rewards to jointly optimize model parameters. In addition, we add Gumbel noise in sampling to obtain more effective semantic information. To verify the robustness of our method, we not only conducted experiments on large corpora, but also performed on low-resource languages. Experimental results show that our work is superior to the baselines in WMT14 English-German, LDC2014 Chinese-English and CWMT2018 Mongolian-Chinese tasks, which fully certificates the effectiveness of our method.

Are we Estimating or Guesstimating Translation Quality?
Shuo Sun | Francisco Guzmán | Lucia Specia
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent advances in pre-trained multilingual language models lead to state-of-the-art results on the task of quality estimation (QE) for machine translation. A carefully engineered ensemble of such models won the QE shared task at WMT19. Our in-depth analysis, however, shows that the success of using pre-trained language models for QE is over-estimated due to three issues we observed in current QE datasets: (i) The distributions of quality scores are imbalanced and skewed towards good quality scores; (iii) QE models can perform well on these datasets while looking at only source or translated sentences; (iii) They contain statistical artifacts that correlate well with human-annotated QE labels. Our findings suggest that although QE models might capture fluency of translated sentences and complexity of source sentences, they cannot model adequacy of translations effectively.

CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task
Shuo Sun | Suzanna Sia | Kevin Duh
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present CLIReval, an easy-to-use toolkit for evaluating machine translation (MT) with the proxy task of cross-lingual information retrieval (CLIR). Contrary to what the project name might suggest, CLIReval does not actually require any annotated CLIR dataset. Instead, it automatically transforms translations and references used in MT evaluations into a synthetic CLIR dataset; it then sets up a standard search engine (Elasticsearch) and computes various information retrieval metrics (e.g., mean average precision) by treating the translations as documents to be retrieved. The idea is to gauge the quality of MT by its impact on the document translation approach to CLIR. As a case study, we run CLIReval on the “metrics shared task” of WMT2019; while this extrinsic metric is not intended to replace popular intrinsic metrics such as BLEU, results suggest CLIReval is competitive in many language pairs in terms of correlation to human judgments of quality. CLIReval is publicly available at

CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval
Shuo Sun | Kevin Duh
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date. This collection is intended to support research in end-to-end neural information retrieval and is publicly available at [url]. We provide baseline neural model results on BI-139, and evaluate MULTI-8 in both single-language retrieval and mix-language retrieval settings.


Cross-Lingual Learning-to-Rank with Shared Representations
Shota Sasaki | Shuo Sun | Shigehiko Schamoni | Kevin Duh | Kentaro Inui
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Cross-lingual information retrieval (CLIR) is a document retrieval task where the documents are written in a language different from that of the user’s query. This is a challenging problem for data-driven approaches due to the general lack of labeled training data. We introduce a large-scale dataset derived from Wikipedia to support CLIR research in 25 languages. Further, we present a simple yet effective neural learning-to-rank model that shares representations across languages and reduces the data requirement. This model can exploit training data in, for example, Japanese-English CLIR to improve the results of Swahili-English CLIR.