MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, Jaume Zaragoza
Abstract
We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.
- Anthology ID:
- 2022.eamt-1.41
- Volume:
- Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
- Month:
- June
- Year:
- 2022
- Address:
- Ghent, Belgium
- Editors:
- Helena Moniz, Lieve Macken, Andrew Rufener, Loïc Barrault, Marta R. Costa-jussà, Christophe Declercq, Maarit Koponen, Ellie Kemp, Spyridon Pilos, Mikel L. Forcada, Carolina Scarton, Joachim Van den Bogaert, Joke Daems, Arda Tezcan, Bram Vanroy, Margot Fonteyne
- Venue:
- EAMT
- Publisher:
- European Association for Machine Translation
- Pages:
- 303–304
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.eamt-1.41/
- Cite (ACL):
- Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.
- Cite (Informal):
- MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages (Bañón et al., EAMT 2022)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.eamt-1.41.pdf