2024
Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription
Alice Fedotova | Adriano Ferraresi | Maja Miličević Petrović | Alberto Barrón-Cedeño
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
This paper presents a novel pipeline for constructing multimodal and multilingual parallel corpora, with a focus on evaluating state-of-the-art ASR tools for verbatim transcription. Our findings indicate that current technologies can streamline corpus construction, with fine-tuning showing promising results in terms of transcription quality compared to out-of-the-box Whisper models. The lowest overall WER achieved for English was 0.180, using a fine-tuned Whisper-small model. As for Italian, the fine-tuned Whisper-small model obtained a WER of 0.201, lower than the baseline Whisper-small's WER of 0.219. While limitations remain, the updated pipeline is expected to drastically reduce the human effort involved.
2023
Hate Speech Detection in an Italian Incel Forum Using Bilingual Data for Pre-Training and Fine-Tuning
Paolo Gajo | Silvia Bernardini | Adriano Ferraresi | Alberto Barrón-Cedeño
Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023)
Return to the Source: Assessing Machine Translation Suitability
Francesco Fernicola | Silvia Bernardini | Federico Garcea | Adriano Ferraresi | Alberto Barrón-Cedeño
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
We approach the task of assessing the suitability of a source text for translation by transferring the knowledge from established MT evaluation metrics to a model able to predict MT quality a priori from the source text alone. To open the door to experiments in this regard, we start from reference English-German parallel corpora to build a corpus of 14,253 source text-quality score tuples. The tuples include four state-of-the-art metrics: cushLEPOR, BERTScore, COMET, and TransQuest. With this new resource at hand, we fine-tune XLM-RoBERTa, both in a single-task and a multi-task setting, to predict these evaluation scores from the source text alone. Results for this methodology are promising, with the single-task model able to approximate well-established MT evaluation and quality estimation metrics (without looking at the actual machine translations), achieving low RMSE values in the [0.1-0.2] range and Pearson correlation scores up to 0.688.
2019
MAGMATic: A Multi-domain Academic Gold Standard with Manual Annotation of Terminology for Machine Translation Evaluation
Randy Scansani | Luisa Bentivogli | Silvia Bernardini | Adriano Ferraresi
Proceedings of Machine Translation Summit XVII: Research Track
Do translator trainees trust machine translation? An experiment on post-editing and revision
Randy Scansani | Silvia Bernardini | Adriano Ferraresi | Luisa Bentivogli
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
2017
Enhancing Machine Translation of Academic Course Catalogues with Terminological Resources
Randy Scansani | Silvia Bernardini | Adriano Ferraresi | Federico Gaspari | Marcello Soffritti
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology
This paper describes an approach to translating course unit descriptions from Italian and German into English, using a phrase-based machine translation (MT) system. The genre is very prominent among those requiring translation by universities in European countries in which English is a non-native language. For each language combination, an in-domain bilingual corpus including course unit and degree program descriptions is used to train an MT engine, whose output is then compared to a baseline engine trained on the Europarl corpus. In a subsequent experiment, a bilingual terminology database is added to the training sets in both engines and its impact on the output quality is evaluated based on BLEU and post-editing score. Results suggest that the use of domain-specific corpora boosts the engines' quality for both language combinations, especially for German-English, whereas adding terminological resources does not seem to bring notable benefits.
2008
Introducing and evaluating ukWaC, a very large web-derived corpus of English
Adriano Ferraresi | Eros Zanchetta | Marco Baroni | Silvia Bernardini
Proceedings of the 4th Web as Corpus Workshop
In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.