Norizo Sakaguchi


2024

pdf
Identifying Source Language Expressions for Pre-editing in Machine Translation
Norizo Sakaguchi | Yugo Murawaki | Chenhui Chu | Sadao Kurohashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Machine translation-mediated communication can benefit from pre-editing source language texts to ensure accurate transmission of intended meaning in the target language. The primary challenge lies in identifying source language expressions that pose difficulties in translation. In this paper, we hypothesize that such expressions tend to be distinctive features of texts originally written in the source language (native language) rather than translations generated from the target language into the source language (machine translation). To identify such expressions, we train a neural classifier to distinguish native language from machine translation, and subsequently isolate the expressions that contribute to the model’s prediction of native language. Our manual evaluation revealed that our method successfully identified characteristic expressions of the native language, despite the noise and the inherent nuances of the task. We also present case studies where we edit the identified expressions to improve translation quality.