Comparing two acquisition systems for automatically building an English—Croatian parallel corpus from multilingual websites

Miquel Esplà-Gomis; Filip Klubička; Nikola Ljubešić; Sergio Ortiz-Rojas; Vassilis Papavassiliou; Prokopis Prokopidis

Abstract

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manually examined and the success rate was computed on the collection of pairs of documents detected by each setting. We compare the performance of the settings and the amount of different corpora detected by each setting. In addition, we describe the resource obtained, both by the settings and through the human evaluation, which has been released as a high-quality parallel corpus.

Anthology ID: L14-1437
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1252–1258
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/529_Paper.pdf
Cite (ACL): Miquel Esplà-Gomis, Filip Klubička, Nikola Ljubešić, Sergio Ortiz-Rojas, Vassilis Papavassiliou, and Prokopis Prokopidis. 2014. Comparing two acquisition systems for automatically building an English—Croatian parallel corpus from multilingual websites. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1252–1258, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Comparing two acquisition systems for automatically building an English—Croatian parallel corpus from multilingual websites (Esplà-Gomis et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/529_Paper.pdf