Farrokh Mehryary


2025

pdf bib
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell | Ona De Gibert Bonet | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Pinzhen Chen | Mariia Fedorova | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Mateusz Klimaszewski | Ville Komulainen | Andrey Kutuzov | Joona Kytöniemi | Veronika Laippala | Petter Mæhlum | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Nikita Moghe | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Proyag Pal | Jousia Piha | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Tereza Vojtěchová | Jaume Zaragoza-Bernabeu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

pdf bib
HPLT’s Second Data Release
Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Laurie Burchell | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Andrey Kutuzov | Veronika Laippala | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu
Proceedings of Machine Translation Summit XX: Volume 2

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

2017

pdf bib
End-to-End System for Bacteria Habitat Extraction
Farrokh Mehryary | Kai Hakala | Suwisa Kaewphan | Jari Björne | Tapio Salakoski | Filip Ginter
Proceedings of the 16th BioNLP Workshop

We introduce an end-to-end system capable of named-entity detection, normalization and relation extraction for extracting information about bacteria and their habitats from biomedical literature. Our system is based on deep learning, CRF classifiers and vector space models. We train and evaluate the system on the BioNLP 2016 Shared Task Bacteria Biotope data. The official evaluation shows that the joint performance of our entity detection and relation extraction models outperforms the winning team of the Shared Task by 19pp on F1-score, establishing a new top score for the task. We also achieve state-of-the-art results in the normalization task. Our system is open source and freely available at https://github.com/TurkuNLP/BHE.

pdf bib
Detecting mentions of pain and acute confusion in Finnish clinical text
Hans Moen | Kai Hakala | Farrokh Mehryary | Laura-Maria Peltonen | Tapio Salakoski | Filip Ginter | Sanna Salanterä
Proceedings of the 16th BioNLP Workshop

We study and compare two different approaches to the task of automatic assignment of predefined classes to clinical free-text narratives. In the first approach this is treated as a traditional mention-level named-entity recognition task, while the second approach treats it as a sentence-level multi-label classification task. Performance comparison across these two approaches is conducted in the form of sentence-level evaluation and state-of-the-art methods for both approaches are evaluated. The experiments are done on two data sets consisting of Finnish clinical text, manually annotated with respect to the topics pain and acute confusion. Our results suggest that the mention-level named-entity recognition approach outperforms sentence-level classification overall, but the latter approach still manages to achieve the best prediction scores on several annotation classes.

2016

pdf bib
Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task 2016
Farrokh Mehryary | Jari Björne | Sampo Pyysalo | Tapio Salakoski | Filip Ginter
Proceedings of the 4th BioNLP Shared Task Workshop