Parallel Sentence Mining by Constrained Decoding

Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, Faheem Kirefu


Abstract
We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system can by itself serve as a sentence similarity scorer, and that it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.
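
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Marian API: it assumes whitespace-tokenized sentences, a plain nested-dict prefix tree, and a hypothetical `score_fn(prefix, token)` that stands in for the token log-probability a real translation model would supply. The helper `make_toy_scorer` and its tiny lexicon are invented purely so the example runs.

```python
import math
from typing import Callable, Dict, List, Tuple

END = "</s>"  # end-of-sentence marker stored in the prefix tree


def build_trie(sentences: List[List[str]]) -> Dict:
    """Build a nested-dict prefix tree over tokenized target-side sentences."""
    root: Dict = {}
    for tokens in sentences:
        node = root
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    trie: Dict,
    score_fn: Callable[[List[str], str], float],
    beam_size: int = 2,
    max_len: int = 50,
) -> List[Tuple[float, List[str]]]:
    """Beam search that only expands tokens permitted by the prefix tree.

    Every finished hypothesis is therefore a sentence from the target corpus,
    and its accumulated log-score acts as a similarity score.
    """
    beams = [(0.0, [], trie)]  # (log-score, tokens so far, current trie node)
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates = []
        for logp, toks, node in beams:
            for tok, child in node.items():  # only continuations in the trie
                new_logp = logp + score_fn(toks, tok)
                if tok == END:
                    finished.append((new_logp, toks))
                else:
                    candidates.append((new_logp, toks + [tok], child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda f: f[0], reverse=True)
    return finished


def make_toy_scorer(source: List[str], lexicon: Dict[str, str]):
    """Hypothetical stand-in for an NMT model: favors lexicon translations."""
    expected = {lexicon[s] for s in source if s in lexicon}

    def score(prefix: List[str], token: str) -> float:
        good = token in expected or token == END
        return math.log(0.9) if good else math.log(0.01)

    return score


# Toy target-side corpus and a single source sentence to match against it.
target_corpus = [s.split() for s in
                 ["the cat sleeps", "the dog barks", "a cat eats"]]
trie = build_trie(target_corpus)
scorer = make_toy_scorer(["le", "chat", "dort"],
                         {"le": "the", "chat": "cat", "dort": "sleeps"})
results = constrained_beam_search(trie, scorer, beam_size=2)
best_score, best_tokens = results[0]
```

In this sketch the search touches only the trie branches that survive in the beam, rather than scoring the source sentence against every target sentence, which is the sense in which constrained decoding approximates pairwise comparison.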
Anthology ID:
2020.acl-main.152
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1672–1678
URL:
https://aclanthology.org/2020.acl-main.152
DOI:
10.18653/v1/2020.acl-main.152
Cite (ACL):
Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, and Faheem Kirefu. 2020. Parallel Sentence Mining by Constrained Decoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1672–1678, Online. Association for Computational Linguistics.
Cite (Informal):
Parallel Sentence Mining by Constrained Decoding (Chen et al., ACL 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2020.acl-main.152.pdf
Video:
http://slideslive.com/38929223
Code:
marian-nmt/marian-dev
Data:
BUCC