2025
Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension with Open-Ended Questions
Marta R. Costa-jussà | Joy Chen | Ife Adebara | Joe Chuang | Christophe Ropers | Eduardo Sánchez
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)
The purpose of this work is to share an English-Yorùbá evaluation dataset for open-book reading comprehension with open-ended questions, to assess the performance of models in both a high- and a low-resource language. The dataset contains 358 questions and answers on 338 English documents and 208 Yorùbá documents. Experiments show a consistent disparity in performance between the two languages, with Yorùbá falling behind English on automatic metrics even though documents are much shorter in this language. On a small set of documents of comparable length, Yorùbá performance drops by a factor of 2.5, and this comparison is validated with human evaluation. When analyzing performance by length, we observe that Yorùbá performance degrades dramatically for documents that reach 1,500 words, while English performance is barely affected at that length. Our dataset opens the door to testing whether the reading comprehension capabilities of English LLMs extend to Yorùbá, which for the evaluated LLMs is not the case.
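As a purely illustrative sketch of the by-length analysis described above (all field names, lengths, and scores below are invented, not taken from the paper), the comparison amounts to bucketing per-document scores by word count and comparing the two languages' curves:

```python
# Illustrative sketch: bucket per-document evaluation scores by document
# length to compare degradation across languages. All numbers and field
# names are hypothetical placeholders, not data from the paper.
from statistics import mean

docs = [
    {"lang": "en", "words": 1600, "score": 0.62},
    {"lang": "en", "words": 400,  "score": 0.65},
    {"lang": "yo", "words": 1600, "score": 0.18},
    {"lang": "yo", "words": 400,  "score": 0.45},
]

def bucket(words: int, width: int = 500) -> str:
    # Map a word count to a half-open length bucket, e.g. 1600 -> "1500-2000".
    lo = (words // width) * width
    return f"{lo}-{lo + width}"

groups: dict[tuple[str, str], list[float]] = {}
for d in docs:
    groups.setdefault((d["lang"], bucket(d["words"])), []).append(d["score"])

for (lang, b), scores in sorted(groups.items()):
    print(f"{lang} {b} words: mean score {mean(scores):.2f}")
```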
On the Role of Speech Data in Reducing Toxicity Detection Bias
Samuel Bell | Mariano Coria Meglioli | Megan Richards | Eduardo Sánchez | Christophe Ropers | Skyler Wang | Adina Williams | Levent Sagun | Marta R. Costa-jussà
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-based biases are mitigated by speech-based systems, we produce a set of high-quality group annotations for the multilingual MuTOX dataset, and then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions, particularly for ambiguous and disagreement-inducing samples. Our results also suggest that improving classifiers, rather than transcription pipelines, is more helpful for reducing group bias. We publicly release our annotations and provide recommendations for future toxicity dataset construction.
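As a rough, hypothetical illustration of the group-bias measurement this abstract refers to (the data layout and group labels below are assumptions, not the authors' code), the disproportionate false-positive rates can be quantified per demographic group on gold non-toxic samples:

```python
# Illustrative sketch: false-positive rate of a toxicity classifier on
# gold non-toxic samples, broken down by the demographic group mentioned.
# The sample tuples and group names are hypothetical.
from collections import defaultdict

samples = [
    # (predicted_toxic, gold_toxic, group_mentioned)
    (True,  False, "group_a"),
    (False, False, "group_a"),
    (True,  False, "group_b"),
    (False, False, "group_b"),
    (False, False, "group_b"),
]

fp = defaultdict(int)  # false positives per group
n = defaultdict(int)   # gold non-toxic samples per group

for pred, gold, group in samples:
    if not gold:                # restrict to gold non-toxic samples
        n[group] += 1
        fp[group] += int(pred)  # predicted toxic on non-toxic => false positive

for group in sorted(n):
    print(f"{group}: FPR = {fp[group] / n[group]:.2f}")
```

A gap between the per-group rates is the kind of group bias the paper measures for speech- versus text-based classifiers.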
2024
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models
Kenza Benkirane | Laura Gongas | Shahar Pelles | Naomi Fuchs | Joshua Darmon | Pontus Stenetorp | David Ifeoluwa Adelani | Eduardo Sánchez
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best-performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
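As an illustrative sketch (not the authors' implementation) of the embedding-similarity baseline and MCC scoring mentioned above, assuming some multilingual sentence encoder has already produced source and translation embeddings:

```python
# Illustrative sketch: sentence-level hallucination detection via semantic
# similarity in a shared multilingual embedding space, scored with MCC.
# The embeddings, labels, and threshold below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_hallucinations(src_emb, mt_emb, threshold=0.5):
    # Flag a translation as hallucinated (1) when its embedding is
    # insufficiently similar to the source-sentence embedding.
    return [int(cosine(s, t) < threshold) for s, t in zip(src_emb, mt_emb)]

rng = np.random.default_rng(0)
src = rng.normal(size=(8, 256))     # stand-ins for encoder outputs
noise = rng.normal(size=(8, 256))
gold = np.array([0, 0, 1, 0, 1, 0, 0, 1])  # hypothetical labels: 1 = hallucination
# Faithful translations stay close to the source embedding; hallucinations do not.
mt = np.where(gold[:, None] == 0, src + 0.1 * noise, noise)

pred = detect_hallucinations(src, mt)
print("MCC:", matthews_corrcoef(gold, pred))
```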
Overview of the Shared Task on Machine Translation Gender Bias Evaluation with Multilingual Holistic Bias
Marta Costa-jussà | Pierre Andrews | Christine Basta | Juan Ciro | Agnieszka Falenska | Seraphina Goldfarb-Tarrant | Rafael Mosquera | Debora Nozza | Eduardo Sánchez
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
We describe the details of the Shared Task of the 5th ACL Workshop on Gender Bias in Natural Language Processing (GeBNLP 2024). The task uses the Multilingual Holistic Bias dataset to investigate the quality of Machine Translation systems on a particular case of gender robustness. We report baseline results as well as the results of the first participants. The shared task will remain permanently available on the Dynabench platform.
Gender-specific Machine Translation with Large Language Models
Eduardo Sánchez | Pierre Andrews | Pontus Stenetorp | Mikel Artetxe | Marta R. Costa-jussà
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only large language models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder neural machine translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompting. In this study, we leverage this flexibility to explore Llama's capability to produce gender-specific translations. Our results indicate that Llama can generate gender-specific translations with translation quality and gender bias comparable to NLLB, a state-of-the-art multilingual NMT system.
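The prompting idea in this abstract can be sketched as follows; the template wording below is a hypothetical example, not the prompt used in the paper:

```python
# Illustrative sketch: controlling the grammatical gender of a translation
# through the prompt given to a decoder-only LLM. The template is a
# hypothetical example, not the paper's actual prompt.
def gender_specific_prompt(source: str, src_lang: str, tgt_lang: str,
                           speaker_gender: str) -> str:
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}, "
        f"assuming the speaker is {speaker_gender}. Output only the "
        f"translation.\n{src_lang}: {source}\n{tgt_lang}:"
    )

print(gender_specific_prompt("I am tired.", "English", "Spanish", "female"))
# A female-speaker Spanish reference would be "Estoy cansada.";
# a male-speaker reference would be "Estoy cansado."
```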