Abstract
We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method achieves a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. It also improves downstream MT performance on web-scraped Sinhala–English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. Finally, it outperforms a comparable corpora method that uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.
- Anthology ID:
- 2020.emnlp-main.483
- Volume:
- Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5997–6007
- URL:
- https://aclanthology.org/2020.emnlp-main.483
- DOI:
- 10.18653/v1/2020.emnlp-main.483
- Cite (ACL):
- Brian Thompson and Philipp Koehn. 2020. Exploiting Sentence Order in Document Alignment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5997–6007, Online. Association for Computational Linguistics.
- Cite (Informal):
- Exploiting Sentence Order in Document Alignment (Thompson & Koehn, EMNLP 2020)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2020.emnlp-main.483.pdf
- Code:
- thompsonb/vecalign