Taja Kuzman


2023

pdf
Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora
Taja Kuzman | Peter Rupnik | Nikola Ljubešić
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Collecting texts from the web enables a rapid creation of monolingual and parallel corpora of unprecedented size. However, unlike manually-collected corpora, authors and end users do not know which texts make up the web collections. In this work, we analyse the content of seven European parallel web corpora, collected from national top-level domains, by analysing the English variety and genre distribution in them. We develop and provide a lexicon-based British-American variety classifier, which we use to identify the English variety. In addition, we apply a Transformer-based genre classifier to corpora to analyse genre distribution and the interplay between genres and English varieties. The results reveal significant differences among the seven corpora in terms of different genre distribution and different preference for English varieties.

pdf
BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian
Peter Rupnik | Taja Kuzman | Nikola Ljubešić
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South-Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains - a Twitter and a news dataset - selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with the baseline SVM models, based on character n-grams, which perform nicely in-dataset, but do not generalize well in cross-dataset experiments. Thus, we introduce another approach, exploiting only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on less than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and achieves good levels of generalization across datasets.

2022

pdf
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Miquel Esplà-Gomis | Mikel L. Forcada | Cristian García-Romero | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vít Suchomel | Antonio Toral | Tobias van der Werff | Jaume Zaragoza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.

pdf
The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild
Taja Kuzman | Peter Rupnik | Nikola Ljubešić
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650,000 words. Each document was manually annotated for genre with a new annotation schema that builds upon existing schemata, having primarily clarity of labels and inter-annotator agreement in mind. The dataset consists of various challenges related to web-based data, such as machine translated content, encoding errors, multiple contents presented in one document etc., enabling evaluation of classifiers in realistic conditions. The initial machine learning experiments on the dataset show that (1) pre-Transformer models are drastically less able to model the phenomena, with macro F1 metrics ranging around 0.22, while Transformer-based models achieve scores of around 0.58, and (2) multilingual Transformer models work as well on the task as the monolingual models that were previously proven to be superior to multilingual models on standard NLP tasks.

2019

pdf bib
Neural Machine Translation of Literary Texts from English to Slovene
Taja Kuzman | Špela Vintar | Mihael Arčan
Proceedings of the Qualities of Literary Machine Translation