Unsupervised Parallel Sentence Extraction from Comparable Corpora
Viktor Hangya, Fabienne Braune, Yuliya Kalasouskaya, Alexander Fraser
Abstract
Mining parallel sentences from comparable corpora is of great interest for many downstream tasks. In the BUCC 2017 shared task, systems performed well by training on gold standard parallel sentences. However, we often want to mine parallel sentences without bilingual supervision. We present a simple approach relying on bilingual word embeddings trained in an unsupervised fashion. We incorporate orthographic similarity in order to handle words with similar surface forms. In addition, we propose a dynamic threshold method to decide if a candidate sentence pair is parallel, which eliminates the need to fine-tune a static value for different datasets. Since we do not employ any language-specific engineering, our approach is highly generic. We show that our approach is effective on three language pairs without the use of any bilingual signal, which is important because parallel sentence mining is most useful in low-resource scenarios.
- Anthology ID:
- 2018.iwslt-1.2
- Volume:
- Proceedings of the 15th International Conference on Spoken Language Translation
- Month:
- October 29-30
- Year:
- 2018
- Address:
- Brussels
- Editors:
- Marco Turchi, Jan Niehues, Marcello Federico
- Venue:
- IWSLT
- SIG:
- SIGSLT
- Publisher:
- International Conference on Spoken Language Translation
- Pages:
- 7–13
- URL:
- https://aclanthology.org/2018.iwslt-1.2
- Cite (ACL):
- Viktor Hangya, Fabienne Braune, Yuliya Kalasouskaya, and Alexander Fraser. 2018. Unsupervised Parallel Sentence Extraction from Comparable Corpora. In Proceedings of the 15th International Conference on Spoken Language Translation, pages 7–13, Brussels. International Conference on Spoken Language Translation.
- Cite (Informal):
- Unsupervised Parallel Sentence Extraction from Comparable Corpora (Hangya et al., IWSLT 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2018.iwslt-1.2.pdf
- Data
- BUCC