2025
pdf
bib
abs
Structured Document Translation via Format Reinforcement Learning
Haiyue Song
|
Johannes Eschbach-Dymanus
|
Hour Kaing
|
Sumire Honda
|
Hideki Tanaka
|
Bianka Buschbeck
|
Masao Utiyama
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we propose StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
2023
pdf
bib
abs
Context-aware Neural Machine Translation for English-Japanese Business Scene Dialogues
Sumire Honda
|
Patrick Fernandes
|
Chrysoula Zerva
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Despite the remarkable advancements in machine translation, the current sentence-level paradigm faces challenges when dealing with highly-contextual languages like Japanese. In this paper, we explore how context-awareness can improve the performance of the current Neural Machine Translation (NMT) models for English-Japanese business dialogues translation, and what kind of context provides meaningful information to improve translation. As business dialogue involves complex discourse phenomena but offers scarce training resources, we adapted a pretrained mBART model, finetuning on multi-sentence dialogue data, which allows us to experiment with different contexts. We investigate the impact of larger context sizes and propose novel context tokens encoding extra-sentential information, such as speaker turn and scene type. We make use of Conditional Cross-Mutual Information (CXMI) to explore how much of the context the model uses and generalise CXMI to study the impact of the extra sentential context. Overall, we find that models leverage both preceding sentences and extra-sentential context (with CXMI increasing with context size) and we provide a more focused analysis on honorifics translation. Regarding translation quality, increased source-side context paired with scene and speaker information improves the model performance compared to previous work and our context-agnostic baselines, measured in BLEU and COMET metrics.
pdf
bib
abs
Noam Chomsky at SemEval-2023 Task 4: Hierarchical Similarity-aware Model for Human Value Detection
Sumire Honda
|
Sebastian Wilharm
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper presents a hierarchical similarity-aware approach for the SemEval-2023 task 4 human value detection behind arguments using SBERT. The approach takes similarity score as an additional source of information between the input arguments and the lower level of labels in a human value hierarchical dataset. Our similarity-aware model improved the similarity-agnostic baseline model, especially showing a significant increase in or the value categories with lowest scores by the baseline model.