Vit Suchomel

Also published as: Vít Suchomel

2026

FeedFetcher: A Resilient Web Feed Downloader for Corpus Construction
Ondřej Herman | Jan Kraus | Vit Suchomel
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Building large-scale, timestamped monitor corpora requires robust and efficient tools for continuous web data acquisition. We present FeedFetcher, an open-source, lightweight yet resilient downloader designed to collect linguistic data from RSS/Atom web feeds. The tool enables continuous corpus updates by harvesting newly published web content with minimal downtime and high data integrity. Implemented in Rust for performance, memory safety, and scalable concurrency, FeedFetcher supports thousands of simultaneous connections while maintaining server politeness. The software is available under the GPL-3.0 license on https://github.com/ondra/feed_fetcher. In our setup, the entire workflow integrates FeedFetcher with downstream text-processing pipelines for tokenization, lemmatization, corpus compilation and deployment. The system is currently used to update monitor corpora in 64 languages, producing approximately two billion tokens per month. These corpora are available in Sketch Engine. We also describe methods for discovering new web feeds, combining manual exploration with automated extraction from large-scale web crawls to expand linguistic coverage. We demonstrate the system’s applicability through a time-based analysis of word-frequency change, showing how long-term accumulation of timestamped data supports the study of lexical dynamics and language evolution.

The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Taja Kuzman Pungeršek | Peter Rupnik | Vit Suchomel | Nikola Ljubešić
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure - the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains - a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.

2024

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining
Nikola Ljubešić | Vít Suchomel | Peter Rupnik | Taja Kuzman | Rik van Noord
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models for a set of very closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by setting up a diverse benchmark for these languages and comparing the trained-from-scratch models with the new models constructed via additional pretraining of existing multilingual models. We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.

2023

We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. To date, parallel and monolingual corpora have been produced for seven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.

2022

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.

2020

Current Challenges in Web Corpus Building
Miloš Jakubíček | Vojtěch Kovář | Pavel Rychlý | Vit Suchomel
Proceedings of the 12th Web as Corpus Workshop

In this paper we discuss some of the current challenges in web corpus building that we have faced in recent years when expanding the corpora in Sketch Engine. The purpose of the paper is to provide an overview and raise discussion on possible solutions, rather than to offer ready-made solutions to the readers. For every issue we try to assess its severity and briefly discuss possible mitigation options.

2016

DSL Shared Task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation–Maximization and Chunk-based Language Model
Ondřej Herman | Vít Suchomel | Vít Baisa | Pavel Rychlý
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we investigate two approaches to the discrimination of similar languages: the expectation–maximization algorithm for estimating the conditional probability P(word|language), and byte-level language models similar to compression-based language modelling methods. These methods reached accuracies of 86.6% and 88.3%, respectively, on set A of the DSL Shared Task 2016 competition.

2014

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both the corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.

Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics
