Abstract
We present ParIce, a new English-Icelandic parallel corpus. This is the first parallel corpus built for the purposes of language technology development and research for Icelandic, although some Icelandic texts can be found in various other multilingual parallel corpora. We map out which Icelandic texts are available for these purposes, collect aligned data and align other bilingual texts we acquired. We describe the alignment process and how we filter the data to weed out noise and bad alignments. In total we collected 43 million Icelandic words in 4.3 million aligned segment pairs, but after filtering, our corpus includes 38.8 million Icelandic words in 3.5 million segment pairs. We estimate that approximately 5% of the corpus data is noise or faulty alignments while more than 50% of the segments we deleted were faulty. We estimate that our filtering process reduced the number of faulty segments in the corpus by more than 60% while only reducing the number of good alignments by approximately 8%.
- Anthology ID:
- W19-6115
- Volume:
- Proceedings of the 22nd Nordic Conference on Computational Linguistics
- Month:
- September–October
- Year:
- 2019
- Address:
- Turku, Finland
- Venue:
- NoDaLiDa
- Publisher:
- Linköping University Electronic Press
- Pages:
- 140–145
- URL:
- https://aclanthology.org/W19-6115
- Cite (ACL):
- Starkaður Barkarson and Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 140–145, Turku, Finland. Linköping University Electronic Press.
- Cite (Informal):
- Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus (Barkarson & Steingrímsson, NoDaLiDa 2019)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/W19-6115.pdf
- Data
- Tilde MODEL Corpus