Fedor Vitiugin
2026
Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Stephan Oepen | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Maja Buljan | Laurie V. Burchell | Lucas Georges Gabriel Charpentier | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Barry Haddow | Jan Hajič | Jindrich Helcl | Andrey Kutuzov | Veronika Laippala | Zihao Li | Bhavitvya Malik | Vladislav Mikhailov | Amanda Myntti | Dayyán O'Brien | Lucie Polakova | Gema Ramírez-Sánchez | Janine Siewert | Pavel Stepachev | Joerg Tiedemann | Teemu Vahtola | Dusan Varis | Fedor Vitiugin | Jaume Zaragoza
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for some 20 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder–decoder models, as well as about 30 “smallish” monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
2024
Ensemble-based Multilingual Euphemism Detection: a Behavior-Guided Approach
Fedor Vitiugin | Henna Paakki
Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024)
This paper describes the system submitted by our team to the Multilingual Euphemism Detection Shared Task for the Fourth Workshop on Figurative Language Processing (FigLang 2024). We propose a novel model for multilingual euphemism detection, combining contextual and behavior-related features. The system classifies texts that potentially contain euphemistic terms with an ensemble classifier based on outputs from behavior-related fine-tuned models. Our results show that, for this kind of task, our model outperforms baselines and state-of-the-art euphemism detection methods. On the leaderboard, our classification model achieved a macro-averaged F1 score of [anonymized], reaching [anonymized] place.
Co-authors
- Nikolay Arefyev 1
- Mikko Aulamo 1
- Marta Bañón 1
- Maja Buljan 1
- Laurie Burchell 1
- Pinzhen Chen 1
- Mariia Fedorova 1
- Lucas Georges Gabriel Charpentier 1
- Barry Haddow 1
- Jan Hajic 1
- Jindřich Helcl 1
- Andrey Kutuzov 1
- Veronika Laippala 1
- Zihao Li 1
- Bhavitvya Malik 1
- Vladislav Mikhailov 1
- Amanda Myntti 1
- Stephan Oepen 1
- Dayyán O’Brien 1
- Henna Paakki 1
- Lucie Polakova 1
- Gema Ramírez-Sánchez 1
- Janine Siewert 1
- Pavel Stepachev 1
- Jörg Tiedemann 1
- Teemu Vahtola 1
- Dusan Varis 1
- Jaume Zaragoza 1
- Ona de Gibert 1