Rafael Mosquera


2025

Building Data Infrastructure for Low-Resource Languages
Sarah K. K. Luger | Rafael Mosquera | Pedro Ortiz Suarez
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)

The MLCommons Datasets Working Group presents a comprehensive initiative to advance the development and accessibility of artificial intelligence (AI) training and testing resources. This paper introduces three key projects aimed at addressing critical gaps in the AI data ecosystem: the Unsupervised People’s Speech Dataset, containing over 821,000 hours of speech across 89+ languages; a strategic collaboration with Common Crawl to enhance web crawling capabilities for low-resource languages; and a framework for knowledge graph extraction evaluation. By focusing on languages other than English (LOTE) and creating permissively licensed, high-quality datasets, these initiatives aim to democratize AI development and improve model performance across diverse linguistic contexts. This work represents a significant step toward more inclusive and capable AI systems that can serve global communities.

2024

Overview of the Shared Task on Machine Translation Gender Bias Evaluation with Multilingual Holistic Bias
Marta Costa-jussà | Pierre Andrews | Christine Basta | Juan Ciro | Agnieszka Falenska | Seraphina Goldfarb-Tarrant | Rafael Mosquera | Debora Nozza | Eduardo Sánchez
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

We describe the details of the Shared Task of the 5th ACL Workshop on Gender Bias in Natural Language Processing (GeBNLP 2024). The task uses the Multilingual Holistic Bias dataset to investigate the quality of Machine Translation systems on a particular case of gender robustness. We report baseline results as well as the results of the first participants. The shared task will remain permanently available on the Dynabench platform.

2023

Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
Alex Warstadt | Aaron Mueller | Leshem Choshen | Ethan Wilcox | Chengxu Zhuang | Juan Ciro | Rafael Mosquera | Bhargavi Paranjabe | Adina Williams | Tal Linzen | Ryan Cotterell
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt | Aaron Mueller | Leshem Choshen | Ethan Wilcox | Chengxu Zhuang | Juan Ciro | Rafael Mosquera | Bhargavi Paranjabe | Adina Williams | Tal Linzen | Ryan Cotterell
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning