2025
Scaling Low-Resource MT via Synthetic Data Generation with LLMs
Ona de Gibert | Joseph Attieh | Teemu Vahtola | Mikko Aulamo | Zihao Li | Raúl Vázquez | Tiancheng Hu | Jörg Tiedemann
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its overall high quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, (iii) studying the effect of varying training data size, and (iv) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
Hengyu Luo | Zihao Li | Joseph Attieh | Sawal Devkota | Ona de Gibert | Xu Huang | Shaoxiong Ji | Peiqin Lin | Bhavani Sai Praneeth Varma Mantina | Ananda Sreenidhi | Raúl Vázquez | Mengjie Wang | Samea Yusofi | Fei Yuan | Jörg Tiedemann
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are inconsistent across benchmarks and disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this critical challenge of fragmented and inconsistent multilingual evaluation, we introduce GlotEval, a unified and lightweight framework that systematically integrates 27 benchmarks under a standardized ISO 639-3 language identifier system, allowing for seamless incorporation of new benchmarks. Supporting nine key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, intrinsic evaluation, instruction following and reasoning), spanning dozens to hundreds of languages, GlotEval uniquely enables language-specific, cross-benchmark analysis and non-English-centric evaluations at a scale previously less practical for many researchers. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.
A Comparative Study of PEFT Methods for Python Code Generation
Johanna Männistö | Joseph Attieh | Jörg Tiedemann
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Fine-tuning language models incurs high costs in training, inference and storage. Parameter-efficient fine-tuning (PEFT) methods have emerged as a more cost-effective alternative to full fine-tuning. However, limited work has compared different PEFT approaches for tasks like code generation. In this study, we examine the effect of various PEFT training methods on model performance in the task of Python code generation. We fine-tune four model families, ranging from 124M to 7B parameters, using three PEFT approaches alongside standard full fine-tuning. Our findings reveal that the effectiveness of each PEFT method varies with the model size and the corpus used.
SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Raul Vazquez | Timothee Mickus | Elaine Zosa | Teemu Vahtola | Jörg Tiedemann | Aman Sinha | Vincent Segonne | Fernando Sanchez-Vega | Alessandro Raganato | Jindřich Libovický | Jussi Karlgren | Shaoxiong Ji | Jindřich Helcl | Liane Guillou | Ona De Gibert | Jaione Bengoetxea | Joseph Attieh | Marianna Apidianaki
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
We present the Mu-SHROOM shared task, which focuses on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The very high number of submissions highlights the interest of the community in hallucination detection. We present the results of the participating systems and provide an empirical analysis in order to better understand the factors that can lead to strong performance in this task. We also underscore current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
2024
Isotropy, Clusters, and Classifiers
Timothee Mickus | Stig-Arne Grönroos | Joseph Attieh
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Whether embedding spaces use all their dimensions equally, i.e., whether they are isotropic, has been a recent subject of discussion. Evidence has been accrued both for and against enforcing isotropy in embedding spaces. In the present paper, we stress that isotropy imposes requirements on the embedding space that are not compatible with the presence of clusters—which also negatively impacts linear classification objectives. We demonstrate this fact both empirically and mathematically and use it to shed light on previous results from the literature.
System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task
Joseph Attieh | Zachary Hopton | Yves Scherrer | Tanja Samardžić
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
This paper presents the system description of the NordicsAlps team for the AmericasNLP 2024 Machine Translation Shared Task 1. We investigate the effect of tokenization on translation quality by exploring two different tokenization schemes: byte-level and redundancy-driven tokenization. We submitted three runs per language pair. The redundancy-driven tokenization ranked first among all submissions, scoring the highest average chrF2++, chrF, and BLEU metrics (averaged across all languages). These findings demonstrate the importance of carefully tailoring the tokenization strategies of machine translation systems, particularly in resource-constrained scenarios.
MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
Timothee Mickus | Stig-Arne Grönroos | Joseph Attieh | Michele Boggia | Ona De Gibert | Shaoxiong Ji | Niki Andreas Loppi | Alessandro Raganato | Raúl Vázquez | Jörg Tiedemann
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled. The trend goes to modularization, a necessary step in the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future information. The toolkit is publicly available online at https://github.com/Helsinki-NLP/mammoth.
I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures
Timothee Mickus | Raul Vazquez | Joseph Attieh
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP
Modularity is a paradigm of machine translation with the potential of bringing forth models that are large at training time and small during inference. Within this field of study, modular approaches, and in particular attention bridges, have been argued to improve the generalization capabilities of models by fostering language-independent representations. In the present paper, we study whether modularity affects translation quality, as well as how well modular architectures generalize across different evaluation scenarios. For a given computational budget, we find non-modular architectures to be always comparable or preferable to all modular designs we study.
2022
Arabic Dialect Identification and Sentiment Classification using Transformer-based Models
Joseph Attieh | Fadi Hassan
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
In this paper, we present two deep learning approaches that are based on AraBERT, submitted to the Nuanced Arabic Dialect Identification (NADI) shared task of the Seventh Workshop for Arabic Natural Language Processing (WANLP 2022). NADI consists of two main sub-tasks, namely country-level dialect identification and sentiment identification for dialectal Arabic. We present one system per sub-task. The first system is a multi-task learning model that consists of a shared AraBERT encoder with three task-specific classification layers. This model is trained to jointly learn the country-level dialect of the tweet as well as the region-level and area-level dialects. The second system is a distilled model of an ensemble of models trained using K-fold cross-validation. Each model in the ensemble consists of an AraBERT model and a classifier, fine-tuned on (K-1) folds of the training set. Our team Pythoneers achieved rank 6 on the first test set of the first sub-task, rank 9 on the second test set of the first sub-task, and rank 4 on the test set of the second sub-task.
Pythoneers at WANLP 2022 Shared Task: Monolingual AraBERT for Arabic Propaganda Detection and Span Extraction
Joseph Attieh | Fadi Hassan
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
In this paper, we present two deep learning approaches that are based on AraBERT, submitted to the Propaganda Detection shared task of the Seventh Workshop for Arabic Natural Language Processing (WANLP 2022). Propaganda detection consists of two main sub-tasks, namely propaganda identification and span extraction. We present one system per sub-task. The first system is a multi-task learning model that consists of a shared AraBERT encoder with task-specific binary classification layers. This model is trained to jointly learn one binary classification task per propaganda method. The second system is an AraBERT model with a Conditional Random Field (CRF) layer. We achieved rank 3 on the first sub-task and rank 1 on the second sub-task.