Almudena Carrillo


2024

SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet
Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Alicia Núñez Alcover | Tudor Mateiu | Mikel Forcada | Pedro Orzas | Almudena Carrillo | Giuseppe Nolasco | Noelia Listón
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

SmartBiC, an 18-month innovation project funded by the Spanish Government, aims to improve the full process of collecting, filtering and selecting in-domain parallel content for machine translation and language model tuning in industrial settings. Building on the state-of-the-art technology of Bitextor, the free/open-source harvester of parallel web corpora, SmartBiC develops a web-based application around it, including novel components such as a language- and domain-focused crawler and a domain-specific corpora selector. SmartBiC also addresses specific industrial use cases for individual components of the Bitextor pipeline, such as parallel data cleaning. Relevant improvements to the current Bitextor pipeline will be publicly released.