Cristian García-Romero

2022

Building Domain-specific Corpora from the Web: the Case of European Digital Service Infrastructures
Rik van Noord | Cristian García-Romero | Miquel Esplà-Gomis | Leopoldo Pla Sempere | Antonio Toral
Proceedings of the BUCC Workshop within LREC 2022

An important goal of the MaCoCu project is to improve EU-specific NLP systems that concern the EU's Digital Service Infrastructures (DSIs). In this paper we aim at boosting the creation of such domain-specific NLP systems. To do so, we explore the feasibility of building an automatic classifier that can identify which segments in a generic (potentially parallel) corpus are relevant for a particular DSI. We create an evaluation data set by crawling DSI-specific web domains and then compare different strategies to build our DSI classifier for text in three languages: English, Spanish and Dutch. We use pre-trained (multilingual) language models to perform the classification, with zero-shot classification for Spanish and Dutch. The results are promising, as we are able to classify DSIs with 70 to 80% accuracy, even without in-language training data. A manual annotation of the data revealed that we can also find DSI-specific data in crawled texts from general web domains with reasonable accuracy. We publicly release all data, predictions and code, so that future work can investigate whether exploiting this DSI-specific data actually leads to improved performance on particular applications, such as machine translation.

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.

Co-authors

Mikel L. Forcada 1

Taja Kuzman 1

Nikola Ljubešić 1

Gema Ramírez‐Sánchez 1

Peter Rupnik 1

Vit Suchomel 1

Tobias van der Werff 1

Jaume Zaragoza 1

Venues

bucc 1
eamt 1