Mariia Fedorova
2026
DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova | Andrey Kutuzov | Khonzoda Umarova
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field.
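One common use of such pre-computed type embeddings is to score a word's semantic change as the cosine distance between its vectors from two time periods (after the two vector spaces have been aligned). The sketch below is illustrative only: the words, vectors, and dimensionality are invented, and real DHPLT embeddings would require an alignment step first.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy type embeddings for the same words in two time periods
# (invented 3-dimensional vectors for illustration only).
emb_2011_2015 = {"cloud": [0.9, 0.1, 0.0], "bread": [0.2, 0.8, 0.1]}
emb_2024_now = {"cloud": [0.1, 0.2, 0.9], "bread": [0.25, 0.75, 0.1]}

scores = {w: cosine_distance(emb_2011_2015[w], emb_2024_now[w])
          for w in emb_2011_2015}
# Rank words by estimated degree of semantic change.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setup, "cloud" moves far in the space while "bread" stays put, so "cloud" ranks first; with real data, such a ranking is what gets evaluated against graded change annotations.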
OpenLID-v3: Improving the Precision of Closely Related Language Identification – An Experience Report
Mariia Fedorova | Nikolay Arefyev | Maja Buljan | Jindřich Helcl | Stephan Oepen | Egil Rønningstad | Yves Scherrer
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work, we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages.
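The precision-coverage tradeoff of ensembling can be illustrated with a toy agreement ensemble: a label is accepted only when two classifiers agree, and the document is discarded otherwise. The classifiers, labels, and data below are invented stand-ins, not the actual OpenLID-v3 setup.

```python
# Toy gold labels and predictions from two hypothetical LID classifiers
# (ISO 639-3 codes; all values here are invented for illustration).
gold = ["hrv", "srp", "bos", "hrv", "srp", "nno"]
clf_a = ["hrv", "srp", "hrv", "hrv", "bos", "nno"]
clf_b = ["hrv", "srp", "srp", "nob", "bos", "nno"]

# Agreement ensemble: keep a prediction only when both classifiers agree.
kept = [(g, a) for g, a, b in zip(gold, clf_a, clf_b) if a == b]

coverage = len(kept) / len(gold)
precision = sum(g == p for g, p in kept) / len(kept)
# Single-classifier accuracy over *all* documents, for comparison.
clf_a_precision = sum(g == a for g, a in zip(gold, clf_a)) / len(gold)

print(f"coverage={coverage:.2f}, precision={precision:.2f}, "
      f"single-classifier={clf_a_precision:.2f}")
```

On this toy data, the ensemble's precision exceeds either single classifier's, but a third of the documents are dropped, which mirrors the coverage loss the paper reports for low-resource languages.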
2025
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell | Ona de Gibert | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Pinzhen Chen | Mariia Fedorova | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Mateusz Klimaszewski | Ville Komulainen | Andrey Kutuzov | Joona Kytöniemi | Veronika Laippala | Petter Mæhlum | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Nikita Moghe | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Proyag Pal | Jousia Piha | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Tereza Vojtěchová | Jaume Zaragoza-Bernabeu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
Explaining novel senses using definition generation with open language models
Mariia Fedorova | Andrey Kutuzov | Francesco Periti | Yves Scherrer
Findings of the Association for Computational Linguistics: EMNLP 2025
We apply definition generators based on open-weights large language models to the task of creating explanations of novel senses, taking target word usages as input. To this end, we employ the datasets from the AXOLOTL’24 shared task on explainable semantic change modeling, which features Finnish, Russian and German. We fine-tune and publicly release open-source models that perform better than the best submissions to the aforementioned shared task, which employed closed proprietary LLMs. In addition, we find that encoder-decoder definition generators perform on par with their decoder-only counterparts.
HPLT’s Second Data Release
Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Laurie Burchell | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Andrey Kutuzov | Veronika Laippala | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu
Proceedings of Machine Translation Summit XX: Volume 2
We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data that produces 1,275 language pairs, and we release the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, as well as analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.
Multi-label Scandinavian Language Identification (SLIDE)
Mariia Fedorova | Jonas Sebulon Frydenberg | Victoria Handford | Victoria Ovedie Chruickshank Langø | Solveig Helene Willoch | Marthe Løken Midtgaard | Yves Scherrer | Petter Mæhlum | David Samuel
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
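Why single-label LID is systematically penalized on such data can be seen from a toy evaluation: when the gold annotation is a *set* of valid languages, a per-sentence Jaccard score rewards predicting all of them, and a classifier forced to pick one language hits a ceiling below 1.0. The labels and annotations below are invented for illustration, not drawn from SLIDE.

```python
# Gold annotations are *sets* of valid languages per sentence; a short
# Scandinavian sentence is often valid in several languages at once.
# All sets below are invented toy data.
gold = [{"nob", "dan"}, {"swe"}, {"nno"}, {"nob", "nno", "dan"}]

single_label = [{"nob"}, {"swe"}, {"nno"}, {"dan"}]  # forced to pick one
multi_label = [{"nob", "dan"}, {"swe"}, {"nno"}, {"nob", "nno", "dan"}]

def mean_jaccard(golds, preds):
    """Average per-sentence Jaccard overlap of gold and predicted sets."""
    return sum(len(g & p) / len(g | p)
               for g, p in zip(golds, preds)) / len(golds)

print(mean_jaccard(gold, single_label))  # below 1.0: ambiguity penalized
print(mean_jaccard(gold, multi_label))   # 1.0: all valid labels predicted
```

Even a single-label classifier that is never "wrong" (every prediction is in the gold set) cannot reach a perfect score here, which is the core argument for multi-label LID models on closely related languages.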
2024
Definition generation for lexical semantic change detection
Mariia Fedorova | Andrey Kutuzov | Yves Scherrer
Findings of the Association for Computational Linguistics: ACL 2024
We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as ‘senses’, and the change score of a target word is retrieved by comparing their distributions in the two time periods under comparison. Using five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior unsupervised sense-based LSCD methods. At the same time, it preserves interpretability and allows one to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.
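The distribution comparison can be sketched as follows: generated definitions are treated as discrete sense labels, and the change score is a divergence between their relative frequencies in the two periods. This is a simplified sketch with invented definitions and Jensen-Shannon divergence as one plausible choice of measure; the paper's actual scoring may differ in details.

```python
from collections import Counter
from math import log2

def js_divergence(c1, c2):
    """Jensen-Shannon divergence (base 2) of two Counters of sense labels."""
    keys = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    p = {k: c1[k] / n1 for k in keys}
    q = {k: c2[k] / n2 for k in keys}
    m = {k: (p[k] + q[k]) / 2 for k in keys}

    def kl(a, b):
        return sum(a[k] * log2(a[k] / b[k]) for k in keys if a[k] > 0)

    return (kl(p, m) + kl(q, m)) / 2

# Invented definitions-as-senses for the word "mouse" in two periods:
# counts of how often each generated definition occurred.
period1 = Counter({"a small rodent": 9, "a timid person": 1})
period2 = Counter({"a small rodent": 4, "a hand-held pointing device": 6})

change_score = js_divergence(period1, period2)
print(round(change_score, 3))
```

With base-2 logarithms the score is bounded in [0, 1], so it can directly serve to rank target words by degree of change; a word whose definition distribution is identical in both periods scores exactly 0.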
AXOLOTL’24 Shared Task on Multilingual Explainable Semantic Change Modeling
Mariia Fedorova | Timothee Mickus | Niko Partanen | Janine Siewert | Elena Spaziani | Andrey Kutuzov
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change
Enriching Word Usage Graphs with Cluster Definitions
Andrey Kutuzov | Mariia Fedorova | Dominik Schlechtweg | Nikolay Arefyev
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. These definitions are generated from scratch by fine-tuned encoder-decoder language models. Human evaluation shows that they match the existing clusters in WUGs better than the definitions chosen from WordNet by two baseline systems. At the same time, the method is straightforward to use and easy to extend to new languages. The resulting enriched datasets can be extremely helpful for moving on to explainable semantic change modeling.