Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus

Starkaður Barkarson, Steinþór Steingrímsson


Abstract
We present ParIce, a new English-Icelandic parallel corpus. This is the first parallel corpus built for the purposes of language technology development and research for Icelandic, although some Icelandic texts can be found in various other multilingual parallel corpora. We map out which Icelandic texts are available for these purposes, collect aligned data and align other bilingual texts we acquired. We describe the alignment process and how we filter the data to weed out noise and bad alignments. In total we collected 43 million Icelandic words in 4.3 million aligned segment pairs, but after filtering, our corpus includes 38.8 million Icelandic words in 3.5 million segment pairs. We estimate that approximately 5% of the corpus data is noise or faulty alignments while more than 50% of the segments we deleted were faulty. We estimate that our filtering process reduced the number of faulty segments in the corpus by more than 60% while only reducing the number of good alignments by approximately 8%.
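As a rough check of the sizes quoted in the abstract, the sketch below computes what share of segment pairs and Icelandic words the filtering removed. It uses only the rounded figures given above, so the results are approximate; the variable names are illustrative and not taken from the paper.

```python
# Rounded figures quoted in the abstract, in millions.
pairs_before, pairs_after = 4.3, 3.5    # aligned segment pairs
words_before, words_after = 43.0, 38.8  # Icelandic words

pairs_removed = pairs_before - pairs_after
words_removed = words_before - words_after

# Share of the collected data that the filtering step discarded.
pair_removal_rate = pairs_removed / pairs_before
word_removal_rate = words_removed / words_before

print(f"segment pairs removed: {pairs_removed:.1f}M ({pair_removal_rate:.1%})")
print(f"Icelandic words removed: {words_removed:.1f}M ({word_removal_rate:.1%})")
```

On these rounded numbers, filtering removes a larger share of segment pairs than of words, which suggests that the discarded segments were shorter than average.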
Anthology ID:
W19-6115
Volume:
Proceedings of the 22nd Nordic Conference on Computational Linguistics
Month:
September–October
Year:
2019
Address:
Turku, Finland
Venue:
NoDaLiDa
Publisher:
Linköping University Electronic Press
Pages:
140–145
URL:
https://aclanthology.org/W19-6115
Cite (ACL):
Starkaður Barkarson and Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 140–145, Turku, Finland. Linköping University Electronic Press.
Cite (Informal):
Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus (Barkarson & Steingrímsson, NoDaLiDa 2019)
PDF:
https://preview.aclanthology.org/remove-xml-comments/W19-6115.pdf
Data
Tilde MODEL Corpus