Tagged Span Annotation for Detecting Translation Errors in Reasoning LLMs

Taemin Yeom, Yonghyun Ryu, Yoonjung Choi, Jinyeong Bak


Abstract
We present the AIP team’s submission to the WMT 2025 Unified MT Evaluation Shared Task, focusing on the span-level error detection subtask. Our system emphasizes response format design to better harness the capabilities of OpenAI’s o3, the state-of-the-art reasoning LLM. To this end, we introduce Tagged Span Annotation (TSA), an annotation scheme designed to more accurately extract span-level information from the LLM. On our refined version of the WMT24 ESA dataset, our reference-free method achieves an F1 score of approximately 27 for character-level label prediction, outperforming the reference-based XCOMET-XXL, which scores approximately 17.
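As an illustration of what a character-level F1 for span prediction can look like, here is a minimal sketch. It assumes binary error/no-error labels per character, half-open `(start, end)` spans, and no severity weighting; the function names `char_labels` and `char_f1` are hypothetical, and the scoring used in the paper and the shared task may differ. The sketch returns a value between 0 and 1, whereas the scores quoted in the abstract appear to be on a 0 to 100 scale.

```python
def char_labels(length, spans):
    """Return a list of 0/1 labels, with 1 for characters inside any span.

    Spans are (start, end) pairs with an exclusive end (an assumption of
    this sketch, not something stated in the abstract).
    """
    labels = [0] * length
    for start, end in spans:
        for i in range(max(0, start), min(length, end)):
            labels[i] = 1
    return labels


def char_f1(length, gold_spans, pred_spans):
    """Character-level F1 between gold and predicted error spans."""
    gold = char_labels(length, gold_spans)
    pred = char_labels(length, pred_spans)
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        # Convention chosen for this sketch: no overlap (or no spans at all)
        # scores 0.0. Other conventions score the both-empty case as 1.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Gold error covers characters 0-3, prediction covers characters 2-5.
score = char_f1(10, [(0, 4)], [(2, 6)])
```

In this example two characters overlap, two are predicted but not gold, and two are gold but missed, so precision and recall are both 0.5.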
Anthology ID:
2025.wmt-1.62
Volume:
Proceedings of the Tenth Conference on Machine Translation
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
878–886
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.62/
Cite (ACL):
Taemin Yeom, Yonghyun Ryu, Yoonjung Choi, and Jinyeong Bak. 2025. Tagged Span Annotation for Detecting Translation Errors in Reasoning LLMs. In Proceedings of the Tenth Conference on Machine Translation, pages 878–886, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Tagged Span Annotation for Detecting Translation Errors in Reasoning LLMs (Yeom et al., WMT 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.62.pdf