Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches
Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, Andy Way
Abstract
Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English–Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.
- Anthology ID:
- 2021.bucc-1.3
- Volume:
- Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
- Month:
- September
- Year:
- 2021
- Address:
- Online (Virtual Mode)
- Venue:
- BUCC
- Publisher:
- INCOMA Ltd.
- Pages:
- 8–17
- URL:
- https://aclanthology.org/2021.bucc-1.3
- Cite (ACL):
- Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, and Andy Way. 2021. Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 8–17, Online (Virtual Mode). INCOMA Ltd.
- Cite (Informal):
- Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches (Steingrímsson et al., BUCC 2021)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2021.bucc-1.3.pdf
- Data:
- WikiMatrix