@inproceedings{deoghare-etal-2025-refer,
title = "Refer to the Reference: Reference-focused Synthetic Automatic Post-Editing Data Generation",
author = "Deoghare, Sourabh and
Kanojia, Diptesh and
Bhattacharyya, Pushpak",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Di Eugenio, Barbara and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.344/",
pages = "5123--5135",
abstract = "A prevalent approach to synthetic APE data generation uses source (src) sentences in a parallel corpus to obtain translations (mt) through an MT system and treats corresponding reference (ref) sentences as post-edits (pe). While effective, due to independence between `mt' and `pe,' these translations do not adequately reflect errors to be corrected by a human post-editor. Thus, we introduce a novel and simple yet effective reference-focused synthetic APE data generation technique that uses `ref' instead of `src' sentences to obtain corrupted translations (mt{\_}new). The experimental results across English-German, English-Russian, English-Marathi, English-Hindi, and English-Tamil language pairs demonstrate the superior performance of APE systems trained using the newly generated synthetic data compared to those trained using existing synthetic data. Further, APE models trained using a balanced mix of existing and newly generated synthetic data achieve improvements of 0.37, 0.19, 1.01, 2.42, and 2.60 TER points, respectively. We will release the generated synthetic APE data."
}
Markdown (Informal)
[Refer to the Reference: Reference-focused Synthetic Automatic Post-Editing Data Generation](https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.344/) (Deoghare et al., COLING 2025)
ACL