Abstract
Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called “pseudo-parallel” sentences from paired documents in two languages. In this paper, we outline some drawbacks with current methods that rely on an embedding similarity threshold, and propose a heuristic method in its place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote on sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate success with this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e. how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.

- Anthology ID:
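The majority-vote step described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation (which is available in the linked AlexJonesNLP/alt-bitexts repository): each of the three mining runs (original documents, x translated to y, y translated to x) is assumed to yield a set of candidate (source index, target index) sentence pairs, and a pair is kept only if at least two of the three runs retrieved it. All function and variable names here are invented for the example.

```python
# Hedged sketch of majority voting over three bitext-mining runs.
# A "run" is modeled as a set of (src_idx, tgt_idx) candidate pairs;
# a pair survives if it was mined in at least 2 of the 3 runs.
from collections import Counter


def majority_vote(run_original, run_x_to_y, run_y_to_x):
    """Return pairs retrieved by a majority (>= 2) of the three mining runs."""
    counts = Counter()
    for run in (run_original, run_x_to_y, run_y_to_x):
        counts.update(run)  # each run contributes at most one vote per pair
    return {pair for pair, votes in counts.items() if votes >= 2}


# Toy example: integer index pairs stand in for mined sentence pairs.
run_original = {(0, 0), (1, 2), (3, 3)}
run_x_to_y = {(0, 0), (1, 2), (4, 5)}
run_y_to_x = {(0, 0), (2, 2), (3, 3)}

print(sorted(majority_vote(run_original, run_x_to_y, run_y_to_x)))
# → [(0, 0), (1, 2), (3, 3)]
```

The design choice this illustrates is that the vote acts as a precision filter replacing a similarity threshold: a spurious pair produced by one translation direction is discarded unless it is corroborated by a second, independent mining condition.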
- 2021.bucc-1.7
- Volume:
- Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
- Month:
- September
- Year:
- 2021
- Address:
- Online (Virtual Mode)
- Venue:
- BUCC
- Publisher:
- INCOMA Ltd.
- Pages:
- 46–59
- URL:
- https://aclanthology.org/2021.bucc-1.7
- Cite (ACL):
- Alexander Jones and Derry Tanti Wijaya. 2021. Majority Voting with Bidirectional Pre-translation For Bitext Retrieval. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 46–59, Online (Virtual Mode). INCOMA Ltd.
- Cite (Informal):
- Majority Voting with Bidirectional Pre-translation For Bitext Retrieval (Jones & Wijaya, BUCC 2021)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2021.bucc-1.7.pdf
- Code
- AlexJonesNLP/alt-bitexts
- Data
- BUCC, Tatoeba