Parallel Sentence Mining by Constrained Decoding
Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, Faheem Kirefu
Abstract
We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.
- Anthology ID:
- 2020.acl-main.152
- Volume:
- Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2020
- Address:
- Online
- Editors:
- Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1672–1678
- URL:
- https://aclanthology.org/2020.acl-main.152
- DOI:
- 10.18653/v1/2020.acl-main.152
- Cite (ACL):
- Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, and Faheem Kirefu. 2020. Parallel Sentence Mining by Constrained Decoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1672–1678, Online. Association for Computational Linguistics.
- Cite (Informal):
- Parallel Sentence Mining by Constrained Decoding (Chen et al., ACL 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2020.acl-main.152.pdf
- Code:
- marian-nmt/marian-dev
- Data:
- BUCC
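
The abstract describes translating a sentence from one corpus while restricting the decoder to a prefix tree built on the other corpus. The sketch below illustrates that idea on a toy scale; it is not the authors' Marian implementation. The names `build_trie`, `constrained_beam_search`, and `toy_scorer` are hypothetical, and `toy_scorer` is an assumed stand-in for an NMT model's next-token log-probability given the source sentence and the target prefix.

```python
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "</s>"  # end-of-sentence marker stored in the trie


class TrieNode:
    """One node of a prefix tree over target-side tokens."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


def build_trie(sentences: Sequence[Sequence[str]]) -> TrieNode:
    """Insert every tokenized target sentence, terminated by EOS."""
    root = TrieNode()
    for sent in sentences:
        node = root
        for tok in list(sent) + [EOS]:
            node = node.children.setdefault(tok, TrieNode())
    return root


def constrained_beam_search(
    root: TrieNode,
    score_fn: Callable[[List[str], str], float],
    beam_size: int = 2,
    max_len: int = 50,
) -> List[Tuple[float, List[str]]]:
    """Beam search that only expands tokens allowed by the trie.

    Every finished hypothesis is therefore a sentence of the target
    corpus, and its summed log-score acts as a similarity score.
    """
    beams: List[Tuple[float, List[str], TrieNode]] = [(0.0, [], root)]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, List[str], TrieNode]] = []
        for logp, prefix, node in beams:
            # Only the trie's children are legal continuations.
            for tok, child in node.children.items():
                new_logp = logp + score_fn(prefix, tok)
                if tok == EOS:
                    finished.append((new_logp, prefix))
                else:
                    candidates.append((new_logp, prefix + [tok], child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished


def toy_scorer(preferred: Sequence[str]) -> Callable[[List[str], str], float]:
    """Assumed stand-in for an NMT model: favours tokens in `preferred`."""
    preferred_set = set(preferred) | {EOS}

    def score(prefix: List[str], tok: str) -> float:
        return -0.1 if tok in preferred_set else -3.0

    return score


# Toy target corpus; a real system would use subword-tokenized sentences.
target_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
]
trie = build_trie(target_corpus)
results = constrained_beam_search(trie, toy_scorer(["the", "cat", "sat"]))
best_score, best_sentence = results[0]
```

Because the beam keeps only a few prefixes per step, the search scores a small part of the target corpus instead of every source-target pair, which is how the constrained decoder approximates pairwise comparison. In this sketch the beam can also prune the best match early, so a larger `beam_size` trades speed for recall.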