Two Step Automatic Post Editing of Patent Machine Translation based on Pre-trained Encoder Models and LLMs

Kosei Buma, Takehito Utsuro, Masaaki Nagata


Abstract
We study automatic post-editing for patent translation, where accuracy and traceability are critical, and propose a two-step pipeline that combines a multilingual encoder for token-level error detection with an LLM for targeted correction. As no word-level annotations exist for Japanese–English patents, we create supervised data by injecting synthetic errors into parallel patent sentences and fine-tune mBERT, XLM-RoBERTa, and mDeBERTa as detectors. In the second stage, GPT-4o is prompted to revise translations either freely or under a restricted policy that allows edits only on detector-marked spans. For error detection, evaluation on synthetic errors shows that encoder-based detectors outperform LLMs in both F1 and MCC. For error correction, tests on synthetic, repetition, and omission datasets demonstrate statistically significant BLEU gains over LLM methods for synthetic and repetition errors, while omission errors remain challenging. Overall, pairing compact encoders with an LLM enables more accurate and controllable post-editing for key patent error types, reducing unnecessary rewrites via restricted edits. Future work will focus on strengthening omission modeling to better detect and correct missing content.
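Since the paper trains detectors on synthetic errors injected into parallel patent sentences, the idea can be illustrated with a minimal sketch. This is a hypothetical injection scheme (the function name, error types, and probabilities are assumptions, not the paper's actual procedure): tokens are randomly duplicated to simulate repetition errors or dropped to simulate omissions, and each surviving token gets a binary label marking whether it is part of an injected error span.

```python
import random

def inject_errors(tokens, p=0.3, seed=0):
    """Inject synthetic repetition/omission errors into a token list.

    Returns (noisy_tokens, labels), where label 1 marks an injected
    error token. Hypothetical sketch; the paper's scheme may differ.
    """
    rng = random.Random(seed)
    noisy, labels = [], []
    for tok in tokens:
        r = rng.random()
        if r < p / 2:
            # repetition error: duplicate the token; the copy is the error
            noisy += [tok, tok]
            labels += [0, 1]
        elif r < p:
            # omission error: drop the token entirely
            continue
        else:
            noisy.append(tok)
            labels.append(0)
    return noisy, labels
```

Note that omissions leave no token behind to label on the noisy side, which mirrors the paper's observation that omission errors are the hardest for token-level detectors to flag and for the LLM to repair.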
Anthology ID:
2025.ijcnlp-srw.19
Volume:
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Santosh T.y.s.s, Shuichiro Shimizu, Yifan Gong
Venue:
IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
218–231
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-srw.19/
Cite (ACL):
Kosei Buma, Takehito Utsuro, and Masaaki Nagata. 2025. Two Step Automatic Post Editing of Patent Machine Translation based on Pre-trained Encoder Models and LLMs. In The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 218–231, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
Two Step Automatic Post Editing of Patent Machine Translation based on Pre-trained Encoder Models and LLMs (Buma et al., IJCNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-srw.19.pdf