Proceedings of Machine Translation Summit XX: Volume 1

Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, Sara Szoc (Editors)


Anthology ID: 2025.mtsummit-1
Month: June
Year: 2025
Address: Geneva, Switzerland
Venue: MTSummit
Publisher: European Association for Machine Translation
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1/
ISBN: 978-2-9701897-0-1
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.pdf

pdf bib
Proceedings of Machine Translation Summit XX: Volume 1
Pierrette Bouillon | Johanna Gerlach | Sabrina Girletti | Lise Volkart | Raphael Rubino | Rico Sennrich | Ana C. Farinha | Marco Gaido | Joke Daems | Dorothy Kenny | Helena Moniz | Sara Szoc

pdf bib
Robust, interpretable and efficient MT evaluation with fine-tuned metrics
Ricardo Rei

pdf bib
Direct Speech Translation in Constrained Contexts: the Simultaneous and Subtitling Scenarios
Sara Papi

pdf bib
Investigating Length Issues in Document-level Machine Translation
Ziqian Peng | Rachel Bawden | François Yvon

Transformer architectures are increasingly effective at processing and generating very long chunks of text, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousand tokens. We design and implement a new approach to precisely measure the effect of length increments on MT outputs. Our experiments with two representative architectures unambiguously show that (a) translation performance decreases with the length of the input text; (b) the position of sentences within the document matters, and translation quality is higher for sentences occurring earlier in a document. We further show that manipulating the distribution of document lengths and of positional embeddings only marginally mitigates such problems. Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.

pdf bib
Investigating the translation capabilities of Large Language Models trained on parallel data only
Javier García Gilabert | Carlos Escolano | Aleix Sant | Francesca De Luca Fornaciari | Audrey Mash | Xixian Liao | Maite Melero

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce Plume (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the role of vocabulary size, the impact of the different elements of the prompt, and their cross-lingual representation space. We find that larger vocabulary sizes improve zero-shot performance and that different layers specialize in distinct aspects of the prompt, such as language-specific tags. We further show that as the vocabulary size grows, a larger number of attention heads can be pruned with minimal loss in translation quality, achieving a reduction of over 64.7% in attention heads.

pdf bib
Improve Fluency Of Neural Machine Translation Using Large Language Models
Jianfei He | Wenbo Pan | Jijia Yang | Sen Peng | Xiaohua Jia

Large language models (LLMs) demonstrate significant capabilities in many natural language processing tasks. However, their performance in machine translation still lags behind models specially trained for machine translation with an encoder-decoder architecture. This paper investigates how to improve neural machine translation (NMT) with LLMs. Our proposal is based on an empirical insight that NMT produces less fluent output than human translation. We propose to use LLMs to enhance the fluency of NMT’s generation by integrating a language model at the target side. We use contrastive learning to constrain fluency so that it does not exceed that of the LLMs. Our experiments on three language pairs show that this method can improve the performance of NMT. Our empirical analysis further demonstrates that this method improves fluency at the target side. Our experiments also show that some straightforward post-processing methods using LLMs, such as re-ranking and refinement, are not effective.

pdf bib
Optimizing the Training Schedule of Multilingual NMT using Reinforcement Learning
Alexis Allemann | Àlex R. Atrio | Andrei Popescu-Belis

Multilingual NMT is a viable solution for translating low-resource languages (LRLs) when data from high-resource languages (HRLs) from the same language family is available. However, the training schedule, i.e. the order of presentation of languages, has an impact on the quality of such systems. Here, in a many-to-one translation setting, we propose to apply two algorithms that use reinforcement learning to optimize the training schedule of NMT: (1) Teacher-Student Curriculum Learning and (2) Deep Q Network. The former uses an exponentially smoothed estimate of the returns of each action based on the loss on monolingual or multilingual development subsets, while the latter estimates rewards using an additional neural network trained from the history of actions selected in different states of the system, together with the rewards received. On an 8-to-1 translation dataset with LRLs and HRLs, our second method improves BLEU and COMET scores with respect to both random selection of monolingual batches and shuffled multilingual batches, by adjusting the number of presentations of LRL vs. HRL batches.

pdf bib
Languages Transferred Within the Encoder: On Representation Transfer in Zero-Shot Multilingual Translation
Zhi Qu | Chenchen Ding | Taro Watanabe

Understanding representation transfer in multilingual neural machine translation (MNMT) can reveal the reason for the zero-shot translation deficiency. In this work, we systematically analyze the representational issue of MNMT models. We first introduce the identity pair, translating a sentence to itself, to address the lack of the base measure in multilingual investigations, as the identity pair can reflect the representation of a language within the model. Then, we demonstrate that the encoder transfers the source language to the representational subspace of the target language instead of the language-agnostic state. Thus, the zero-shot translation deficiency arises because the representation of a translation is entangled with other languages and not transferred to the target language effectively. Based on our findings, we propose two methods: 1) low-rank language-specific embedding at the encoder, and 2) language-specific contrastive learning of the representation at the decoder. The experimental results on Europarl-15, TED-19, and OPUS-100 datasets show that our methods substantially enhance the performance of zero-shot translations without sacrifices in supervised directions by improving language transfer capacity, thereby providing practical evidence to support our conclusions. Codes are available at https://github.com/zhiqu22/ZeroTrans.

pdf bib
Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs
Delu Kong | Lieve Macken

This study explores Machine Translationese (MTese) — the linguistic peculiarities of machine translation outputs — focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.
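
As a rough, generic illustration of the chi-square feature ranking step described above (not the authors' implementation; the feature matrix, labels and parameter values below are invented placeholders), such a selection could be sketched in Python with scikit-learn as follows:

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
import numpy as np

# Placeholder data: rows are documents, columns are non-negative stylometric
# feature counts (e.g. sentence-length bins, conjunction or bracket frequencies);
# labels: 0 = original Chinese, 1 = machine-translated output.
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(200, 50))
y = rng.integers(0, 2, size=200)

selector = SelectKBest(chi2, k=10)              # keep the 10 highest-ranked features
classifier = make_pipeline(selector, MultinomialNB())
classifier.fit(X, y)
print("Top chi-square scores:", sorted(selector.scores_, reverse=True)[:10])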

pdf bib
OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation
Paul McNamee | Kevin Duh | Cameron Carpenter | Ron Colaianni | Nolan King | Kenton Murray

We introduce OJ4OCRMT, an Optical Character Recognition (OCR) dataset for Machine Translation (MT). The dataset supports research on automatic extraction, recognition, and translation of text from document images. The Official Journal of the European Union (OJEU) is the official gazette for the EU. Tens of thousands of pages of legislative acts and regulatory notices are published annually, and parallel translations are available in each of the official languages. Due to its large size, high degree of multilinguality, and carefully produced human translations, the OJEU is a singular resource for language processing research. We have assembled a large collection of parallel pages from the OJEU and have created a dataset to support translation of document images. In this work we introduce the dataset, describe the design decisions we made, and report baseline performance figures for the translation task. It is our hope that this dataset will significantly add to the comparatively few resources presently available for evaluating OCR-MT systems.

pdf bib
Context-Aware or Context-Insensitive? Assessing LLMs’ Performance in Document-Level Translation
Wafaa Mohammed | Vlad Niculae

Large language models (LLMs) are increasingly strong contenders in machine translation. In this work, we focus on document-level translation, where some words cannot be translated without context from outside the sentence. Specifically, we investigate the ability of prominent LLMs to utilize the document context during translation through a perturbation analysis (analyzing models’ robustness to perturbed and randomized document context) and an attribution analysis (examining the contribution of relevant context to the translation). We conduct an extensive evaluation across nine LLMs from diverse model families and training paradigms, including translation-specialized LLMs, alongside two encoder-decoder transformer baselines. We find that LLMs’ improved document-translation performance compared to encoder-decoder models is not reflected in pronoun translation performance. Our analysis highlights the need for context-aware finetuning of LLMs, with a focus on relevant parts of the context, to improve their reliability for document-level translation.

pdf bib
Context-Aware Monolingual Evaluation of Machine Translation
Silvio Picinini | Sheila Castilho

This paper explores the potential of context-aware monolingual evaluation for assessing machine translation (MT) when no source is given for reference. To this end, we compare monolingual with bilingual evaluations (with source text), under two scenarios: the evaluation of a single MT system, and the comparative evaluation of pairwise MT systems. Four professional translators performed both monolingual and bilingual evaluations by assigning ratings and annotating errors, and providing feedback on their experience. Our findings suggest that context-aware monolingual evaluation achieves comparable outcomes to bilingual evaluations, and highlight the feasibility and potential of monolingual evaluation as an efficient approach to assessing MT.

pdf bib
Culture-aware machine translation: the case study of low-resource language pair Catalan-Chinese
Xixian Liao | Carlos Escolano | Audrey Mash | Francesca De Luca Fornaciari | Javier García Gilabert | Miguel Claramunt Argote | Ella Bohman | Maite Melero

High-quality machine translation requires datasets that not only ensure linguistic accuracy but also capture regional and cultural nuances. While many existing benchmarks, such as FLORES-200, rely on English as a pivot language, this approach can overlook the specificity of direct language pairs, particularly for underrepresented combinations like Catalan-Chinese. In this study, we demonstrate that even with a relatively small dataset of approximately 1,000 sentences, we can significantly improve MT localization. To this end, we introduce a dataset specifically designed to enhance Catalan-to-Chinese translation by prioritizing regionally and culturally specific topics. Unlike pivot-based datasets, our data source ensures a more faithful representation of Catalan linguistic and cultural elements, leading to more accurate translations of local terms and expressions. Using this dataset, we demonstrate better performance over the English-pivot FLORES-200 dev set and achieve competitive results on the FLORES-200 devtest set when evaluated with neural-based metrics. We release this dataset as both a human-preference resource and a benchmark for Catalan-Chinese translation. Additionally, we include Spanish translations for each sentence, facilitating extensions to Spanish-Chinese translation tasks.

pdf bib
Instruction-tuned Large Language Models for Machine Translation in the Medical Domain
Miguel Rios

Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics. Moreover, the instruction-tuned LLMs produce fewer errors compared to the baseline based on automatic error annotation.

pdf bib
Lingonberry Giraffe: Lexically-Sound Beam Search for Explainable Translation of Compound Words
Théo Salmenkivi-Friberg | Iikka Hauhio

We present a hybrid rule-based and neural method for translating Finnish compound words into English. We use a lightweight set of rules to split a Finnish word into its constituent parts and determine the possible translations of those words using a dictionary. We then use an NMT model to rank these alternatives to determine the final output. Since the number of translations that take into account different spellings, inflections, and word separators can be very large, we use beam search for the ranking when the number of translations is over a threshold. We find that our method is an improvement over using the same NMT model for end-to-end translation in both automatic and human evaluation. We conclude that our method retains the good qualities of rule-based translation, such as explainability and controllability, while keeping the rules lightweight.
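
As a toy illustration of the rule-based splitting and candidate enumeration such a hybrid pipeline relies on (the dictionary, words and enumeration below are our own simplified placeholders, not the authors' system, and the NMT re-ranking step is omitted):

from itertools import product

# Toy bilingual dictionary; a real system would use full lexicons with inflections.
DICTIONARY = {"puolukka": ["lingonberry"], "kirahvi": ["giraffe"]}

def split_compound(word, dictionary):
    # Naive single-cut split: keep cut points where both halves are dictionary words.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in dictionary and tail in dictionary:
            yield (head, tail)

def candidate_translations(word, dictionary):
    # Enumerate translation combinations; an NMT model would then rank these
    # (with beam search once the candidate list grows beyond a threshold).
    for parts in split_compound(word, dictionary):
        for combo in product(*(dictionary[p] for p in parts)):
            yield " ".join(combo)  # real systems also try hyphenated and closed forms

print(list(candidate_translations("puolukkakirahvi", DICTIONARY)))  # ['lingonberry giraffe']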

pdf bib
Testing LLMs’ Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT
Joachim Minder | Guillaume Wisniewski | Natalie Kübler

This study investigates the capabilities of large language models (LLMs), specifically ChatGPT, in annotating MT outputs based on an error typology. In contrast to previous work focusing mainly on general language, we explore ChatGPT’s ability to identify and categorise errors in specialised translations. By testing two different prompts and based on a customised error typology, we compare ChatGPT annotations with human expert evaluations of translations produced by DeepL and ChatGPT itself. The results show that, for translations generated by DeepL, recall and precision are quite high. However, the degree of accuracy in error categorisation depends on the prompt’s specific features and its level of detail, ChatGPT performing very well with a detailed prompt. When evaluating its own translations, ChatGPT achieves significantly poorer results, revealing limitations with self-assessment. These results highlight both the potential and the limitations of LLMs for translation evaluation, particularly in specialised domains. Our experiments pave the way for future research on open-source LLMs, which could produce annotations of comparable or even higher quality. In the future, we also aim to test the practical effectiveness of this automated evaluation in the context of translation training, particularly by optimising the process of human evaluation by teachers and by exploring the impact of annotations by LLMs on students’ post-editing and translation learning.

pdf bib
Name Consistency in LLM-based Machine Translation of Historical Texts
Dominic P. Fischer | Martin Volk

Large Language Models (LLMs) excel at translating 16th-century letters from Latin and Early New High German to modern English and German. While they perform well at translating well-known historical city names (e.g., Lutetia –> Paris), their ability to handle person names (e.g., Theodor Bibliander) or lesser-known toponyms (e.g., Augusta Vindelicorum –> Augsburg) remains unclear. This study investigates LLM-based translations of person and place names across various frequency bands in a corpus of 16th-century letters. Our results show that LLMs struggle with person names, achieving accuracies around 60%, but perform better with place names, reaching accuracies around 90%. We further demonstrate that including a translation suggestion for the proper noun in the prompt substantially boosts accuracy, yielding highly reliable results.

pdf bib
Non-autoregressive Modeling for Sign-gloss to Texts Translation
Fan Zhou | Tim Van de Cruys

Automatic sign language translation has seen significant advancements, driven by progress in computer vision and natural language processing. While end-to-end sign-to-text translation systems are available, many systems still rely on a gloss-based representation, an intermediate symbolic representation that functions as a bridge between sign language and its written counterpart. This paper focuses on the gloss-to-text (gloss2text) task, a key step in the sign-to-text translation pipeline, which has traditionally been addressed using autoregressive (AR) modeling approaches. In this study, we propose the use of non-autoregressive (NAR) modeling techniques, including the non-autoregressive Transformer (NAT) and diffusion models, tailored to the unique characteristics of gloss2text. Specifically, we introduce PointerLevT, a novel NAT-based model designed to enhance performance in this task. Our experiments demonstrate that NAR models achieve higher accuracy than pre-trained AR models with less data, while also matching the performance of fine-tuned AR models such as mBART. Furthermore, we evaluate inference speed and find that NAR models benefit from parallel generation, resulting in faster inference. However, they require more time to achieve an optimal balance between accuracy and speed, particularly in the multistep denoising process of diffusion models.

pdf bib
Exploring the Feasibility of Multilingual Grammatical Error Correction with a Single LLM up to 9B parameters: A Comparative Study of 17 Models
Dawid Wiśniewski | Antoni Solarski | Artur Nowakowski

Recent language models can successfully solve various language-related tasks, and many understand inputs stated in different languages. In this paper, we explore the performance of 17 popular models used to correct grammatical issues in texts written in English, German, Italian, and Swedish when using a single model to correct texts in all those languages. We analyze the outputs generated by these models, focusing on decreasing the number of grammatical errors while keeping the changes small. The conclusions drawn help us understand what problems occur among those models and which models can be recommended for multilingual grammatical error correction tasks. We list six models that improve grammatical correctness in all four languages and show that Gemma 9B is currently the best-performing one for the languages considered.

pdf bib
Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation - a Multilingual Perspective
Dawid Wiśniewski | Mikołaj Pokrywka | Zofia Rostek

Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and four aforementioned languages.
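
As a simplified illustration of how this kind of entity preservation can be checked automatically (the regular expressions and example sentences are our own placeholders, not the paper's dataset or evaluation code):

import re

# Rough patterns for a few of the entity categories discussed above.
ENTITY_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "url": r"https?://\S+",
    "iban": r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b",
}

def preserved_entities(source: str, translation: str) -> dict:
    """For each category, list source entities and whether they appear unchanged."""
    report = {}
    for name, pattern in ENTITY_PATTERNS.items():
        report[name] = [(e, e in translation) for e in re.findall(pattern, source)]
    return report

src = "Contact us at help@example.com or visit https://example.com/faq for details."
hyp = "Skontaktuj się z nami pod adresem help@example.com lub odwiedź https://example.com/faq."
print(preserved_entities(src, hyp))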

pdf bib
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation
Petra Barančíková | Ondřej Bojar

In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream tasks, emphasizing the need for more research into “operationalizable semantics” in sentence embeddings, or more in-depth downstream task datasets (here, translation evaluation).

pdf bib
Metaphors in Literary Machine Translation: Close but no cigar?
Alina Karakanta | Mayra Nas | Aletta G. Dorst

The translation of metaphorical language presents a challenge in Natural Language Processing as a result of its complexity and variability in terms of linguistic forms, communicative functions, and cultural embeddedness. This paper investigates the performance of different state-of-the-art Machine Translation (MT) systems and Large Language Models (LLMs) in metaphor translation in literary texts (English->Dutch), examining how metaphorical language is handled by the systems and the types of errors identified by human evaluators. While commercial MT systems perform better in terms of translation quality based on automatic metrics, the human evaluation demonstrates that open-source, literary-adapted NMT systems translate metaphors equally accurately. Still, the accuracy of metaphor translation ranges between 64% and 80%, with lexical and meaning errors being the most prominent. Our findings indicate that metaphors remain a challenge for MT systems and that adaptation to the literary domain is crucial for improving metaphor translation in literary texts.

pdf bib
Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations
Sheila Castilho | Zoe Fitzsimmons | Claire Holton | Aoife Mc Donagh

This study examines hallucinations in Large Language Model (LLM) translations into Irish, specifically focusing on instances where the models generate novel, non-existent words. We classify these hallucinations within verb and noun categories, identifying six distinct patterns among the latter. Additionally, we analyse whether these hallucinations adhere to Irish morphological rules and what linguistic tendencies they exhibit. Our findings show that while both GPT-4o and GPT-4o Mini produce similar types of hallucinations, the Mini model generates them at a significantly higher frequency. Beyond classification, the discussion raises speculative questions about the implications of these hallucinations for the Irish language. Rather than seeking definitive answers, we offer food for thought regarding the increasing use of LLMs and their potential role in shaping Irish vocabulary and linguistic evolution. We aim to prompt discussion on how such technologies might influence language over time, particularly in the context of low-resource, morphologically rich languages.

pdf bib
Patent Claim Translation via Continual Pre-training of Large Language Models with Parallel Data
Haruto Azami | Minato Kondo | Takehito Utsuro | Masaaki Nagata

Recent advancements in large language models (LLMs) have enabled their application across various domains. However, in the field of patent translation, Transformer encoder-decoder based models remain the standard approach, and the potential of LLMs for translation tasks has not been thoroughly explored. In this study, we conducted patent claim translation using an LLM fine-tuned with parallel data through continual pre-training and supervised fine-tuning, following the methodology proposed by Guo et al. (2024) and Kondo et al. (2024). Comparative evaluation against the Transformer encoder-decoder based translations revealed that the LLM achieved high scores for both BLEU and COMET. This demonstrated improvements in addressing issues such as omissions and repetitions. Nonetheless, hallucination errors, which were not observed in the traditional models, occurred in some cases and negatively affected the translation quality. This study highlights the promise of LLMs for patent translation while identifying the challenges that warrant further investigation.

pdf bib
The Devil is in the Details: Assessing the Effects of Machine-Translation on LLM Performance in Domain-Specific Texts
Javier Osorio | Afraa Alshammari | Naif Alatrush | Dagmar Heintze | Amber Converse | Sultan Alsarra | Latifur Khan | Patrick T. Brandt | Vito D’Orazio

Conflict scholars increasingly use computational tools to track violence and cooperation at a global scale. To study foreign locations, researchers often use machine translation (MT) tools, but rarely evaluate the quality of the MT output or its effects on Large Language Model (LLM) performance. Using a domain-specific multi-lingual parallel corpus, this study evaluates the quality of several MT tools for text in English, Arabic, and Spanish. Using ConfliBERT, a domain-specific LLM, the study evaluates the effect of MT texts on model performance, and finds that MT texts tend to yield better results than native texts. The MT quality assessment reveals considerable translation-induced distortions, reductions in vocabulary size and text specialization, and changes in syntactical structure. Regression analysis at the sentence-level reveals that such distortions, particularly reductions in general and domain vocabulary rarity, artificially boost LLM performance by simplifying the MT output. This finding cautions researchers and practitioners about uncritically relying on MT tools without considering MT-induced data loss.

pdf bib
Improving Japanese-English Patent Claim Translation with Clause Segmentation Models based on Word Alignment
Masato Nishimura | Kosei Buma | Takehito Utsuro | Masaaki Nagata

In patent documents, patent claims represent a particularly important section, as they define the scope of the claimed invention. However, due to the length and unique formatting of these sentences, neural machine translation (NMT) systems are prone to translation errors, such as omissions and repetitions. To address these challenges, this study proposes a translation method that first segments the source sentences into multiple shorter clauses using a clause segmentation model tailored to facilitate translation. These segmented clauses are then translated using a clause translation model specialized for clause-level translation. Finally, the translated clauses are rearranged and edited into the final translation using a reordering and editing model. In addition, this study proposes a method for constructing the clause-level parallel corpora required for training the clause segmentation and clause translation models. This method leverages word alignment tools to create clause-level data from sentence-level parallel corpora. Experimental results demonstrate that the proposed method achieves statistically significant improvements in BLEU scores compared to conventional NMT models. Furthermore, for sentences where conventional NMT models exhibit omissions and repetitions, the proposed method effectively suppresses these errors, enabling more accurate translations.

pdf bib
Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages
Yash Bhaskar | Ketaki Shetye | Vandan Mujadia | Dipti Misra Sharma | Parameswari Krishnamurthy

This study addresses the critical challenge of data scarcity in machine translation for Indian languages, particularly given their morphological complexity and limited parallel data. We investigate an effective strategy to maximize the utility of existing data by generating negative samples from positive training instances using a progressive perturbation approach. This is used for aligning the model with preferential data using Kahneman-Tversky Optimization (KTO). Comparing it against traditional Supervised Fine-Tuning (SFT), we demonstrate how generating negative samples and leveraging KTO enhances data efficiency. By creating rejected samples through progressively perturbed translations from the available dataset, we fine-tune the Llama 3.1 Instruct 8B model using QLoRA across 16 language directions, including English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, achieving significant gains in translation quality with an average BLEU increase of 1.84 to 2.47 and chrF increase of 2.85 to 4.01 compared to SFT for selected languages, while using the same positive training samples and under similar computational constraints. This highlights the potential of our negative sample generation strategy within KTO, especially in low-resource scenarios.

pdf bib
Leveraging Visual Scene Graph to Enhance Translation Quality in Multimodal Machine Translation
Ali Hatami | Mihael Arcan | Paul Buitelaar

Despite significant advancements in Multimodal Machine Translation, understanding and effectively utilising visual scenes within multimodal models remains a complex challenge. Extracting comprehensive and relevant visual features requires extensive and detailed input data to ensure the model accurately captures objects, their attributes, and relationships within a scene. In this paper, we explore using visual scene graphs extracted from images to enhance the performance of translation models. We investigate this approach for integrating Visual Scene Graph information into translation models, focusing on representing this information in a semantic structure rather than relying on raw image data. The performance of our approach was evaluated on the Multi30K dataset for English into German, French, and Czech translations using BLEU, chrF2, TER and COMET metrics. Our results demonstrate that utilising visual scene graph information improves translation performance. Using information on semantic structure can improve the multimodal baseline model, leading to better contextual understanding and translation accuracy.

pdf bib
Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication
Vicent Briva-Iglesias

The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with comparable translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT, integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

pdf bib
bytF: How Good Are Byte Level N-Gram F-Scores for Automatic Machine Translation Evaluation?
Raj Dabre | Kaing Hour | Haiyue Song

Recently, chrF and chrF++ have become the preferred metrics over BLEU for automatic n-gram evaluation of machine translation. Since they focus on character-level n-grams, they appear to have better correlations with human judgments for translation into morphologically rich languages compared to word-level metrics. However, for non-Latin languages with sub-character-level structures, we can go one step further, namely to bytes. To this end, we propose bytF to capture sub-character-level information, where we consider byte-level n-grams. Furthermore, we augment it to bytF+ and bytF++, where we consider character and word n-gram backoffs. On machine translation metric meta-evaluation datasets from English into 5 Indian languages, Chinese and Japanese, we show that bytF and its variants correlate comparably (minimum difference) or significantly better (maximum difference) with human judgments at the segment level than chrF and chrF++. We often observe that backing off to characters and words for bytF and to words for chrF does not yield the highest correlation with humans. Furthermore, we also observe that using default n-gram values often leads to scores with poorer correlations with humans, indicating the need for well-studied and tuned n-gram metrics for efficacy.
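
For readers unfamiliar with byte-level n-gram scoring, a minimal, generic sketch of such an F-score is given below (an illustration only, not the authors' bytF implementation; the n-gram order, beta value and backoff handling are simplified assumptions):

from collections import Counter

def byte_ngrams(text: str, n: int) -> Counter:
    """Count n-grams over the UTF-8 byte sequence of a string."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def byte_f_score(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average byte n-gram precision/recall up to max_n, combined as an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = byte_ngrams(hypothesis, n), byte_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0

print(byte_f_score("猫が座っていた", "猫が座っている"))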

pdf bib
Quality Estimation and Post-Editing Using LLMs For Indic Languages: How Good Is It?
Anushka Singh | Aarya Pakhale | Mitesh M. Khapra | Raj Dabre

Recently, there have been increasing efforts on Quality Estimation (QE) and Post-Editing (PE) using Large Language Models (LLMs) for Machine Translation (MT). However, the focus has mainly been on high-resource languages, and the approaches either rely on prompting or on combining existing QE models with LLMs, instead of single end-to-end systems. In this paper, we investigate the efficacy of end-to-end QE and PE systems for low-resource languages, taking 5 Indian languages as a use case. We augment existing QE data containing multidimensional quality metric (MQM) error annotations with explanations of errors and PEs with the help of proprietary LLMs (GPT-4), following which we fine-tune Gemma-2-9B, an open-source multilingual LLM, to perform QE and PE jointly. While our models attain QE capabilities competitive with or surpassing existing models in both referenceful and referenceless settings, we observe that they still struggle with PE. Further investigation reveals that this occurs because our models lack the ability to accurately identify fine-grained errors in the translation, despite being excellent indicators of overall quality. This opens up opportunities for research in end-to-end QE and PE for low-resource languages.

pdf bib
Revisiting Post-Editing for English-Chinese Machine Translation
Hari Venkatesan

Given the rapid strides in quality made by automated translation since the advent of Neural Machine Translation, questions regarding the need and role of Post-Editing (PE) may need revisiting. This paper discusses this in light of a survey of opinions from two cohorts of post-graduate students of translation. The responses indicate that the role of PE may need further elaboration in terms of aspects such as grammar, lexis and style, with lexis and style being the main sites requiring human intervention. Also, contrary to expectations, responses generally show marked hesitation in considering quasi-texts as final without PE even in case of disposable texts. The discussion here pertains to English-Chinese translation, but may resonate with other language pairs as well.

pdf bib
Is it AI or PE that worry translation professionals: results from a Human-Centered AI survey
Miguel A. Jiménez-Crespo | Stephanie A. Rodríguez

Translation technologies have historically been developed without substantial input from professionals (e.g. O’Brien 2012). Conversely, the emerging human-centered AI (HCAI) paradigm emphasizes the importance of including end-users in the “process of conceiving, designing, testing, deploying, and iterating” technologies (Vallor 2024: 17). Therefore, early research engagement with the attitudes, needs and opinions of professionals on AI implementation is essential, because incorporating them at later stages “results in issues and missed opportunities, which may be expensive to recover from due to the cost, time, resources, and energy spent” (Winslow and Garibay 2004: 123). To this end, this article presents a qualitative analysis of professional translators’ attitudes towards AI in the future, centered around the role of MT and post-editing (PE). The discussion draws on data collected from open-ended questions included in a larger survey on control and autonomy from a HCAI perspective, which were thematically coded and qualitatively examined. The thematic analysis indicates that predominant concerns regarding the future of the AI-driven translation industry still revolve around longstanding issues in the PE and MT literature, such as PE, translation quality, communicating with and educating LSPs, clients, users, and the broader public, and maintaining human control over the final product or creativity. This is explained to some extent by the relatively small rates of integration of AI technologies into translation workflows to date (e.g. ELIA 2024; Rivas Ginel et al. 2024; GALA 2024; Jimenez-Crespo 2024), and by the fact that professionals report using AI primarily for tasks related to translation, but not necessarily for post-editing the output of LLMs or NMT (Rivas Ginel and Moorkens 2025).

pdf bib
Prompt engineering in translation: How do student translators leverage GenAI tools for translation tasks
Jia Zhang | Xiaoyu Zhao | Stephen Doherty

GenAI, though not developed specifically for translation, has shown the potential to produce translations as good as, if not better than, those of contemporary neural machine translation systems. In the context of tertiary-level translator education, the integration of GenAI has renewed debate about curricula and pedagogy. Despite divergent opinions among educators, it is evident that translation students, like many other students, are using GenAI tools to facilitate translation tasks, just as they use MT tools. We thus argue for the benefits of guiding students in using GenAI in an informed, critical, and ethical manner. To provide insights for tailored curricula and pedagogy, it is important to investigate what students use GenAI for and how they use it. This study is among the first to investigate translation students’ prompting behaviours. For thematic and discourse analysis, we collected the prompts generated in GenAI tools by a representative sample of postgraduate student participants over eight months. The findings revealed that students had indeed used GenAI in various translation tasks, but that their prompting behaviours were intuitive and uninformed. Our findings suggest an urgent need for translation educators to consider students’ agency and critical engagement with GenAI tools.

pdf bib
Can postgraduate translation students identify machine-generated text?
Michael Farrell

Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies characteristic of synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated. The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.

pdf bib
MT or not MT? Do translation specialists know a machine-translated text when they see one?
Rudy Loock | Nathalie Moulard | Quentin Pacinella

In this article, we investigate translation specialists’ capacity to identify raw machine translation (MT) output in comparison with so-called “human” translations produced without any use of MT. Specifically, we measure this capacity via an online activity, based on different criteria: (i) degree of expertise (translation students vs. professionals with at least 5 years’ experience), (ii) MT engine (DeepL, Google Translate, Reverso, ChatGPT), and (iii) length of input (1-3 sentences). A complementary, qualitative analysis, based on participants’ feedback, provides interesting insight on how they discriminate between raw MT output and human translations.

pdf bib
The Challenge of Translating Culture-Specific Items: Evaluating MT and LLMs Compared to Human Translators
Bojana Budimir

We evaluate state-of-the-art Large Language Models (LLMs), ChatGPT-4o and Gemini 1.5 Flash, as well as Google Translate, by focusing on the translation of culture-specific items (CSIs) between an underrepresented language pair: the Flemish variant of Dutch and Serbian. Using a corpus derived from three Flemish novels, we analyze CSIs in three cultural domains: Material Culture, Proper Names, and Social Culture. Translation strategies are examined on a spectrum that goes from conservation to substitution. Quantitative analysis explores strategy distribution, while qualitative analysis investigates errors, linguistic accuracy, and cultural adaptation. Despite advancements, the models struggle to balance cultural nuances with understandability for the target readers. Gemini aligns most closely with human translation strategies, while Google Translate shows significant limitations. These findings underscore the challenges of translating CSIs—particularly Proper Names—in low-resource languages and offer insights for improving machine translation models.

pdf bib
Investigating the Integration of LLMs into Trainee Translators’ Practice and Learning: A Questionnaire-based Study on Translator-AI Interaction
Xindi Hao | Shuyin Zhang

In recent years, large language models (LLMs) have drawn significant attention from translators, including trainee translators, who are increasingly adopting LLMs in their translation practice and learning. Despite this growing interest, to the best of our knowledge, no LLM has yet been specifically designed for (trainee) translators. While numerous LLMs are available on the market, their potential for performing translation-related tasks is yet to be fully discovered. This highlights a pressing need for a tailored LLM translator guide, conceptualized as an aggregator or directory of multiple LLMs and designed to support trainee translators in selecting and navigating the most suitable models for different scenarios in their translation tasks. As an initial step towards the development of such a guide, this study aims to identify the scenarios in which trainee translators regularly use LLMs. It employs questionnaire-based research to examine the frequency of LLM usage by trainee translators, the average number of prompts, and their satisfaction with the performance of LLMs across the various scenarios identified. The findings give insight into when and where trainee translators might integrate LLMs into their workflows, identify the limitations of current LLMs in assisting translators’ work, and shed light on a future design for an LLM translator guide.

pdf bib
Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness
Siqi Liu | Guangrong Dai | Dechao Li

This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators’ perceptions. The study also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The interaction effects examined were not significant, suggesting that QE consistently improves MTPE efficiency across MT outputs of medium and high quality and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators’ evaluation of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder the post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators’ productivity.

pdf bib
Human- or machine-translated subtitles: Who can tell them apart?
Ekaterina Lapshinova-Koltunski | Sylvia Jaki | Maren Bolz | Merle Sauter

This contribution investigates whether machine-translated subtitles can be easily distinguished from human-translated ones. For this, we run an experiment using two versions of German subtitles for an English television series: (1) produced manually by professional subtitlers, and (2) translated automatically with a Large Language Model (LLM), i.e., GPT-4. Our participants were students of translation studies with varying experience in subtitling and the use of machine translation. We asked participants to guess whether the subtitles for a selection of video clips had been translated manually or automatically. Apart from analysing whether machine-translated subtitles are distinguishable from human-translated ones, we also look for indicators of the differences between human and machine translations. Our results show that although it is overall hard to differentiate between human and machine translations, there are some differences. Notably, the more experience the participants have with translation and subtitling, the better able they are to tell the two translation variants apart.

pdf bib
Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing
Antonio Castaldo | Sheila Castilho | Joss Moorkens | Johanna Monti

Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing (PE) LLM-generated translations significantly reduce editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators.

pdf bib
To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels
Kyo Gerrits | Ana Guerberof Arenas

This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this is the highest in HT and the lowest in MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research.

pdf bib
Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations
Yuri Balashov | Alex Balashov | Shiho Fukuda Koski

This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies for individual translators and language service providers with modest resources. The advent of advanced neural machine translation systems, large language models, and their integration into workflows via computer-assisted translation tools and translation management systems have reshaped the translation landscape. These advancements enable not only translation but also quality evaluation, error spotting, glossary generation, and adaptation to domain-specific needs, creating new technical opportunities for freelancers. In this series, we aim to empower translators with actionable methods to harness these advancements. Our approach emphasizes Translation Analytics, a suite of evaluation techniques traditionally reserved for large-scale industry applications but now becoming increasingly available for smaller-scale users. This first paper introduces a practical framework for adapting automatic evaluation metrics — such as BLEU, chrF, TER, and COMET — to freelancers’ needs. We illustrate the potential of these metrics using a trilingual corpus derived from a real-world project in the medical domain and provide statistical analysis correlating human evaluations with automatic scores. Our findings emphasize the importance of proactive engagement with emerging technologies to not only adapt but thrive in the evolving professional environment.
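
As a starting point for the kind of baseline evaluations discussed in the paper, a minimal sketch using the sacrebleu package is shown below (the segments are invented placeholders and the exact setup in the paper may differ; COMET requires the separate unbabel-comet package and a downloaded model, so it is omitted here):

from sacrebleu.metrics import BLEU, CHRF, TER

# Placeholder system outputs and references; in practice these would be read
# from the freelancer's own bilingual project files.
hypotheses = ["The patient was given 5 mg of the drug once daily."]
references = [["The patient received 5 mg of the medication once a day."]]

for metric in (BLEU(), CHRF(), TER()):
    result = metric.corpus_score(hypotheses, references)
    print(result, "|", metric.get_signature())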

pdf bib
ITALERT: Assessing the Quality of LLMs and NMT in Translating Italian Emergency Response Text
Maria Carmen Staiano | Lifeng Han | Johanna Monti | Francesca Chiusaroli

This paper presents the outcomes of an initial investigation into the performance of Large Language Models (LLMs) and Neural Machine Translation (NMT) systems in translating high-stakes messages. The research employed a novel bilingual corpus, ITALERT (Italian Emergency Response Text) and applied a human-centric post-editing based metric (HOPE) to assess translation quality systematically. The initial dataset contains eleven texts in Italian and their corresponding English translations, both extracted from the national communication campaign website of the Italian Civil Protection Department. The texts deal with eight crisis scenarios: flooding, earthquake, forest fire, volcanic eruption, tsunami, industrial accident, nuclear risk, and dam failure. The dataset has been carefully compiled to ensure usability and clarity for evaluating machine translation (MT) systems in crisis settings. Our findings show that current LLMs and NMT models, such as ChatGPT (OpenAI’s GPT-4o model) and Google MT, face limitations in translating emergency texts, particularly in maintaining the appropriate register, resolving context ambiguities, and managing domain-specific terminology.

pdf bib
Optimising ChatGPT for creativity in literary translation: A case study from English into Dutch, Chinese, Catalan and Spanish
Shuxiang Du | Ana Guerberof Arenas | Antonio Toral | Kyo Gerrits | Josep Marco Borillo

This study examines the variability of ChatGPT’s machine translation (MT) outputs across six different configurations in four languages, with a focus on creativity in a literary text. We evaluate GPT translations at different levels of text granularity, temperature settings and prompting strategies with a Creativity Score formula. We found that prompting ChatGPT with a minimal instruction yields the best creative translations, with “Translate the following text into [TG] creatively” at a temperature of 1.0 outperforming other configurations and DeepL in Spanish, Dutch, and Chinese. Nonetheless, ChatGPT consistently underperforms compared to human translation (HT). All the code and data are available at a repository URL that will be provided with the camera-ready version.
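
Purely as an illustration of the kind of minimal-instruction, temperature-1.0 prompting described above (the model name, client usage and source sentence are our own assumptions, not the study's actual setup):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
source_text = "The rain hammered on the sleepy old town all night."  # placeholder source

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder model name
    temperature=1.0,  # the temperature the study found best for creative output
    messages=[{
        "role": "user",
        "content": f"Translate the following text into Dutch creatively:\n\n{source_text}",
    }],
)
print(response.choices[0].message.content)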

pdf bib
Improving MT-enabled Triage Performance with Multiple MT Outputs
Marianna J. Martindale | Marine Carpuat

Recent advances in Machine Translation (MT) quality may motivate adoption in a variety of use cases, but the success of MT deployment depends not only on intrinsic model quality but on how well the model, as deployed, helps users meet the objectives of their use case. This work focuses on a specific triage use case, MT-enabled scanning in intelligence analysis. After describing the use case with its objectives and failure modes, we present a user study to establish a baseline performance level and measure the mitigating effects of a simple intervention, providing additional MT outputs. We find significant improvements in relevance judgment accuracy with outputs from two distinct neural MT models and significant improvements in relevant entity identification with the addition of a rule-based MT. Users also like seeing multiple MT outputs, making it an appealing way to improve MT-enabled scanning performance.

pdf bib
The GAMETRAPP project: Spanish scholars’ perspectives and attitudes towards neural machine translation and post-editing
Cristina Toledo-Báez | Luis Carlos Marín-Navarro

The GAMETRAPP project (2022-2025), funded by the Spanish Ministry of Science and Innovation and led by the University of Málaga, aims to introduce and promote post-editing (PE) practices of machine-translated research abstracts among Spanish scholars. To this aim, the GAMETRAPP project is developing a gamified environment —specifically, an escape room—integrated into a responsive web app. As part of the design of both the gamified environment and the web app, this paper presents the results of a questionnaire distributed to Spanish scholars in order to explore their perspectives and attitudes towards neural machine translation (NMT) and PE. A total of 253 responses were collected from scholars affiliated with 42 Spanish public universities. A two-stage participant selection process was applied: the analysis focuses on scholars who self-reported a CEFR level of C1 or C2 in English proficiency (n = 152), and, within this group, a comparison was conducted between scholars from linguistic disciplines (23%, n = 35) and those from non-linguistic disciplines (77%, n = 117). Statistically significant differences between these groups were identified using the Mann-Whitney U test in IBM SPSS. The results indicate a widespread and continued use of language technologies, particularly those related to NMT. However, only 34.2% of scholars from non-linguistic disciplines are familiar with PE as a concept, although 59.8% report that they do post-edit their scientific abstracts. Furthermore, 62.9% of scholars from linguistic disciplines and 47.9% from non-linguistic disciplines believe it is necessary to create an app that trains scholars in post-editing Spanish abstracts into English. Sentiment analysis conducted with Atlas.ti on the 29 qualitative responses to the open-ended question suggests overall neutral attitudes toward NMT and PE for both groups of scholars. In conclusion, while both groups engage with NMT tools, there is a clear need for training—especially among scholars from non-linguistic disciplines—to familiarize them with PE concepts and to help develop basic PE literacy skills.

pdf bib
Using Translation Techniques to Characterize MT Outputs
Sergi Alvarez-Vidal | Maria Do Campo | Christian Olalla-Soler | Pilar Sánchez-Gijón

While current NMT and GPT models improve fluency and context awareness, they struggle with creative texts, where figurative language and stylistic choices are crucial. Current evaluation methods fail to capture these nuances, which calls for a more descriptive approach. We propose a taxonomy based on translation techniques to assess machine-generated translations more comprehensively. The pilot study we conducted comparing human and machine-produced translations reveals that human translations employ a wider range of techniques, enhancing naturalness and cultural adaptation. NMT and GPT models, even with prompting, tend to simplify content and introduce accuracy errors. Our findings highlight the need for refined frameworks that consider stylistic and contextual accuracy, ultimately bridging the gap between human and machine translation performance.