Does Sentence Segmentation Matter for Machine Translation?

Rachel Wicks; Matt Post

Does Sentence Segmentation Matter for Machine Translation?

Abstract

For the most part, NLP applications operate at the sentence level. Since sentences occur most naturally in documents, they must be extracted and segmented via the use of a segmenter, of which there are a handful of options. There has been some work evaluating the performance of segmenters on intrinsic metrics, that look at their ability to recover human-segmented sentence boundaries, but there has been no work looking at the effect of segmenters on downstream tasks. We ask the question, “does segmentation matter?” and attempt to answer it on the task of machine translation. We consider two settings: the application of segmenters to a black-box system whose training segmentation is mostly unknown, as well as the variation in performance when segmenters are applied to the training process, too. We find that the choice of segmenter largely does not matter, so long as its behavior is not one of extreme under- or over-segmentation. For such settings, we provide some qualitative analysis examining their harms, and point the way towards document-level processing.

Anthology ID:: 2022.wmt-1.78
Volume:: Proceedings of the Seventh Conference on Machine Translation (WMT)
Month:: December
Year:: 2022
Address:: Abu Dhabi, United Arab Emirates (Hybrid)
Editors:: Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, Marcos Zampieri
Venue:: WMT
SIG:: SIGMT
Publisher:: Association for Computational Linguistics
Note:
Pages:: 843–854
Language:
URL:: https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.wmt-1.78/
DOI:
Bibkey:
Cite (ACL):: Rachel Wicks and Matt Post. 2022. Does Sentence Segmentation Matter for Machine Translation?. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 843–854, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):: Does Sentence Segmentation Matter for Machine Translation? (Wicks & Post, WMT 2022)
Copy Citation:
PDF:: https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.wmt-1.78.pdf

PDF Cite Search Fix data