Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation

Viktor Hangya, Alexander Fraser


Abstract
Mining parallel sentences from comparable corpora is important. Most previous work relies on supervised systems, which are trained on parallel data, so their applicability is problematic in low-resource scenarios. Recent developments in building unsupervised bilingual word embeddings made it possible to mine parallel sentences based on cosine similarities of source and target language words. We show that relying only on this information is not enough, since sentences often have similar words but different meanings. We detect continuous parallel segments in sentence pair candidates and rely on them when mining parallel sentences. We show better mining accuracy on three language pairs in a standard shared task on artificial data. We also provide the first experiments showing that parallel sentences mined from real-life sources improve unsupervised MT. Our code is available; we hope it will be used to support low-resource MT research.
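The following is a minimal illustrative sketch, not the authors' implementation, of the idea summarized in the abstract: candidate sentence pairs are scored with unsupervised bilingual word embeddings (BWEs) via word-level cosine similarity, and in addition a longest contiguous run of confidently translated words stands in for the paper's continuous parallel segment detection. All function names, the similarity threshold, and the toy embeddings are assumptions for illustration only.

# Illustrative sketch (assumed names and threshold), Python with NumPy.
import numpy as np


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))


def word_similarities(src_tokens, tgt_tokens, src_emb, tgt_emb):
    # For each source token, the best cosine similarity to any target token.
    # src_emb / tgt_emb map tokens to vectors in a shared BWE space.
    sims = []
    for s in src_tokens:
        if s not in src_emb:
            sims.append(0.0)
            continue
        best = max(
            (cosine(src_emb[s], tgt_emb[t]) for t in tgt_tokens if t in tgt_emb),
            default=0.0,
        )
        sims.append(best)
    return sims


def average_similarity(sims):
    # Whole-sentence average similarity: the kind of signal the paper
    # argues is insufficient on its own.
    return sum(sims) / max(len(sims), 1)


def longest_confident_segment(sims, threshold=0.4):
    # Longest run of consecutive source words whose best translation
    # similarity exceeds a threshold (a rough stand-in for detecting
    # a continuous parallel segment). Returns a (start, end) token span.
    best_len, best_span, start = 0, (0, 0), None
    for i, s in enumerate(sims + [0.0]):  # sentinel flushes the final run
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_len, best_span = i - start, (start, i)
            start = None
    return best_span


if __name__ == "__main__":
    # Toy vectors standing in for real unsupervised BWEs.
    rng = np.random.default_rng(0)
    src_tokens = ["the", "cat", "sleeps", "today"]
    tgt_tokens = ["die", "katze", "schläft", "morgen"]
    src_emb = {w: rng.normal(size=50) for w in src_tokens}
    tgt_emb = {t: src_emb[s] + 0.05 * rng.normal(size=50)
               for s, t in zip(src_tokens[:3], tgt_tokens[:3])}
    tgt_emb["morgen"] = rng.normal(size=50)

    sims = word_similarities(src_tokens, tgt_tokens, src_emb, tgt_emb)
    print("average similarity:", round(average_similarity(sims), 3))
    print("confident segment span:", longest_confident_segment(sims))

In this toy run the first three source words have close target-side counterparts, so the detected segment covers them, while the unmatched final word is excluded; the actual system's segment detection and filtering criteria are described in the paper itself.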
Anthology ID:
P19-1118
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1224–1234
URL:
https://aclanthology.org/P19-1118
DOI:
10.18653/v1/P19-1118
Cite (ACL):
Viktor Hangya and Alexander Fraser. 2019. Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1224–1234, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation (Hangya & Fraser, ACL 2019)
PDF:
https://aclanthology.org/P19-1118.pdf
Code
hangyav/UnsupPSE
Data
BUCC, WMT 2014, WMT 2016