Abstract
The performance of current sentence alignment tools varies according to the to-be-aligned texts. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments which are used as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
- Anthology ID:
- 2010.amta-papers.14
- Volume:
- Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers
- Month:
- October 31-November 4
- Year:
- 2010
- Address:
- Denver, Colorado, USA
- Venue:
- AMTA
- Publisher:
- Association for Machine Translation in the Americas
- URL:
- https://aclanthology.org/2010.amta-papers.14
- Cite (ACL):
- Rico Sennrich and Martin Volk. 2010. MT-based Sentence Alignment for OCR-generated Parallel Texts. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers, Denver, Colorado, USA. Association for Machine Translation in the Americas.
- Cite (Informal):
- MT-based Sentence Alignment for OCR-generated Parallel Texts (Sennrich & Volk, AMTA 2010)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2010.amta-papers.14.pdf
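
The anchor-point idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `bleu_like` is a simplified, add-one-smoothed stand-in for BLEU, and `find_anchors`, the mutual-best-match rule, the monotonic-chain filter, and the `threshold` value are all assumptions chosen for illustration. The input `translated_src` is assumed to hold machine translations of the source sentences into the target language.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(hyp, ref, max_n=2):
    """Simplified sentence-level BLEU-style score in [0, 1] (illustrative only)."""
    hyp_tok, ref_tok = hyp.lower().split(), ref.lower().split()
    if not hyp_tok or not ref_tok:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_tok, n), ngrams(ref_tok, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        total = sum(h.values())
        if n == 1 and overlap == 0:
            return 0.0  # no shared words at all
        # Add-one smoothing keeps short sentences from scoring zero.
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref_tok) / len(hyp_tok)))
    return brevity * math.exp(log_prec)


def find_anchors(translated_src, tgt, threshold=0.5):
    """Return monotonic 1-to-1 (source index, target index) anchor pairs."""
    scores = [[bleu_like(s, t) for t in tgt] for s in translated_src]
    candidates = []
    for i, row in enumerate(scores):
        if not row:
            continue
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(scores)), key=lambda k: scores[k][j])
        # Assumed rule: keep a pair only if each side is the other's best match.
        if row[j] >= threshold and best_i == i:
            candidates.append((i, j))
    # Keep the longest chain whose target indices increase with the source indices.
    chains = []
    for k, (i, j) in enumerate(candidates):
        prev = max((chains[m] for m in range(k) if candidates[m][1] < j),
                   key=len, default=[])
        chains.append(prev + [(i, j)])
    return max(chains, key=len, default=[])


translated_src = ["the cat sat on the mat", "it was raining heavily", "we went home"]
tgt = ["the cat sat on the mat", "some noise from ocr",
       "it was raining very heavily", "we went home"]
anchors = find_anchors(translated_src, tgt)
print(anchors)
```

Sentences that fall between two consecutive anchors would then be aligned in a second pass; the abstract mentions BLEU-based and length-based heuristics for this step, which the sketch does not attempt to reproduce.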