Machine Translation Summit (2025)


Proceedings of Machine Translation Summit XX: Volume 1
Pierrette Bouillon | Johanna Gerlach | Sabrina Girletti | Lise Volkart | Raphael Rubino | Rico Sennrich | Ana C. Farinha | Marco Gaido | Joke Daems | Dorothy Kenny | Helena Moniz | Sara Szoc

Robust, interpretable and efficient MT evaluation with fine-tuned metrics
Ricardo Rei


Direct Speech Translation in Constrained Contexts: the Simultaneous and Subtitling Scenarios
Sara Papi


Investigating Length Issues in Document-level Machine Translation
Ziqian Peng | Rachel Bawden | François Yvon

Transformer architectures are increasingly effective at processing and generating very long chunks of text, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousand tokens. We design and implement a new approach to precisely measure the effect of length increments on MT outputs. Our experiments with two representative architectures unambiguously show that (a) translation performance decreases with the length of the input text; and (b) the position of sentences within the document matters, and translation quality is higher for sentences occurring earlier in a document. We further show that manipulating the distribution of document lengths and of positional embeddings only marginally mitigates such problems. Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.

Investigating the translation capabilities of Large Language Models trained on parallel data only
Javier García Gilabert | Carlos Escolano | Aleix Sant | Francesca De Luca Fornaciari | Audrey Mash | Xixian Liao | Maite Melero

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce Plume (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the role of vocabulary size, the impact of the different elements of the prompt, and their cross-lingual representation space. We find that larger vocabulary sizes improve zero-shot performance and that different layers specialize in distinct aspects of the prompt, such as language-specific tags. We further show that as the vocabulary size grows, a larger number of attention heads can be pruned with minimal loss in translation quality, achieving a reduction of over 64.7% in attention heads.

Improve Fluency Of Neural Machine Translation Using Large Language Models
Jianfei He | Wenbo Pan | Jijia Yang | Sen Peng | Xiaohua Jia

Large language models (LLMs) demonstrate significant capabilities in many natural language processing tasks. However, their performance in machine translation still lags behind models specially trained for machine translation with an encoder-decoder architecture. This paper investigates how to improve neural machine translation (NMT) with LLMs. Our proposal is based on the empirical insight that NMT produces less fluent output than human translation. We propose to use LLMs to enhance the fluency of NMT’s generation by integrating a language model at the target side. We use contrastive learning to constrain fluency so that it does not exceed that of the LLMs. Our experiments on three language pairs show that this method can improve the performance of NMT. Our empirical analysis further demonstrates that this method improves fluency at the target side. Our experiments also show that some straightforward post-processing methods using LLMs, such as re-ranking and refinement, are not effective.

Optimizing the Training Schedule of Multilingual NMT using Reinforcement Learning
Alexis Allemann | Àlex R. Atrio | Andrei Popescu-Belis

Multilingual NMT is a viable solution for translating low-resource languages (LRLs) when data from high-resource languages (HRLs) from the same language family is available. However, the training schedule, i.e., the order of presentation of languages, has an impact on the quality of such systems. Here, in a many-to-one translation setting, we propose to apply two algorithms that use reinforcement learning to optimize the training schedule of NMT: (1) Teacher-Student Curriculum Learning and (2) Deep Q Network. The former uses an exponentially smoothed estimate of the returns of each action based on the loss on monolingual or multilingual development subsets, while the latter estimates rewards using an additional neural network trained from the history of actions selected in different states of the system, together with the rewards received. On an 8-to-1 translation dataset with LRLs and HRLs, our second method improves BLEU and COMET scores with respect to both random selection of monolingual batches and shuffled multilingual batches, by adjusting the number of presentations of LRL vs. HRL batches.

Languages Transferred Within the Encoder: On Representation Transfer in Zero-Shot Multilingual Translation
Zhi Qu | Chenchen Ding | Taro Watanabe

Understanding representation transfer in multilingual neural machine translation (MNMT) can reveal the reason for the zero-shot translation deficiency. In this work, we systematically analyze the representational issue of MNMT models. We first introduce the identity pair, translating a sentence to itself, to address the lack of the base measure in multilingual investigations, as the identity pair can reflect the representation of a language within the model. Then, we demonstrate that the encoder transfers the source language to the representational subspace of the target language instead of the language-agnostic state. Thus, the zero-shot translation deficiency arises because the representation of a translation is entangled with other languages and not transferred to the target language effectively. Based on our findings, we propose two methods: 1) low-rank language-specific embedding at the encoder, and 2) language-specific contrastive learning of the representation at the decoder. The experimental results on Europarl-15, TED-19, and OPUS-100 datasets show that our methods substantially enhance the performance of zero-shot translations without sacrifices in supervised directions by improving language transfer capacity, thereby providing practical evidence to support our conclusions. Code is available at https://github.com/zhiqu22/ZeroTrans.

Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs
Delu Kong | Lieve Macken

This study explores Machine Translationese (MTese) — the linguistic peculiarities of machine translation outputs — focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.

OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation
Paul McNamee | Kevin Duh | Cameron Carpenter | Ron Colaianni | Nolan King | Kenton Murray

We introduce OJ4OCRMT, an Optical Character Recognition (OCR) dataset for Machine Translation (MT). The dataset supports research on automatic extraction, recognition, and translation of text from document images. The Official Journal of the European Union (OJEU) is the official gazette of the EU. Tens of thousands of pages of legislative acts and regulatory notices are published annually, and parallel translations are available in each of the official languages. Due to its large size, high degree of multilinguality, and carefully produced human translations, the OJEU is a singular resource for language processing research. We have assembled a large collection of parallel pages from the OJEU and have created a dataset to support translation of document images. In this work, we introduce the dataset, describe the design decisions we made, and report baseline performance figures for the translation task. It is our hope that this dataset will significantly add to the comparatively few resources presently available for evaluating OCR-MT systems.

Context-Aware or Context-Insensitive? Assessing LLMs’ Performance in Document-Level Translation
Wafaa Mohammed | Vlad Niculae

Large language models (LLMs) are increasingly strong contenders in machine translation. In this work, we focus on document-level translation, where some words cannot be translated without context from outside the sentence. Specifically, we investigate the ability of prominent LLMs to utilize the document context during translation through a perturbation analysis (analyzing models’ robustness to perturbed and randomized document context) and an attribution analysis (examining the contribution of relevant context to the translation). We conduct an extensive evaluation across nine LLMs from diverse model families and training paradigms, including translation-specialized LLMs, alongside two encoder-decoder transformer baselines. We find that LLMs’ improved document-translation performance compared to encoder-decoder models is not reflected in pronoun translation performance. Our analysis highlights the need for context-aware finetuning of LLMs with a focus on relevant parts of the context to improve their reliability for document-level translation.

Context-Aware Monolingual Evaluation of Machine Translation
Silvio Picinini | Sheila Castilho

This paper explores the potential of context-aware monolingual evaluation for assessing machine translation (MT) when no source is given for reference. To this end, we compare monolingual with bilingual evaluations (with source text), under two scenarios: the evaluation of a single MT system, and the comparative evaluation of pairwise MT systems. Four professional translators performed both monolingual and bilingual evaluations by assigning ratings and annotating errors, and providing feedback on their experience. Our findings suggest that context-aware monolingual evaluation achieves comparable outcomes to bilingual evaluations, and highlight the feasibility and potential of monolingual evaluation as an efficient approach to assessing MT.

Culture-aware machine translation: the case study of low-resource language pair Catalan-Chinese
Xixian Liao | Carlos Escolano | Audrey Mash | Francesca De Luca Fornaciari | Javier García Gilabert | Miguel Claramunt Argote | Ella Bohman | Maite Melero

High-quality machine translation requires datasets that not only ensure linguistic accuracy but also capture regional and cultural nuances. While many existing benchmarks, such as FLORES-200, rely on English as a pivot language, this approach can overlook the specificity of direct language pairs, particularly for underrepresented combinations like Catalan-Chinese. In this study, we demonstrate that even with a relatively small dataset of approximately 1,000 sentences, we can significantly improve MT localization. To this end, we introduce a dataset specifically designed to enhance Catalan-to-Chinese translation by prioritizing regionally and culturally specific topics. Unlike pivot-based datasets, our data source ensures a more faithful representation of Catalan linguistic and cultural elements, leading to more accurate translations of local terms and expressions. Using this dataset, we demonstrate better performance over the English-pivot FLORES-200 dev set and achieve competitive results on the FLORES-200 devtest set when evaluated with neural-based metrics. We release this dataset as both a human-preference resource and a benchmark for Catalan-Chinese translation. Additionally, we include Spanish translations for each sentence, facilitating extensions to Spanish-Chinese translation tasks.

Instruction-tuned Large Language Models for Machine Translation in the Medical Domain
Miguel Rios

Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics. Moreover, the instruction-tuned LLMs produce fewer errors compared to the baseline based on automatic error annotation.

Lingonberry Giraffe: Lexically-Sound Beam Search for Explainable Translation of Compound Words
Théo Salmenkivi-Friberg | Iikka Hauhio

We present a hybrid rule-based and neural method for translating Finnish compound words into English. We use a lightweight set of rules to split a Finnish word into its constituent parts and determine the possible translations of those words using a dictionary. We then use an NMT model to rank these alternatives to determine the final output. Since the number of translations that take into account different spellings, inflections, and word separators can be very large, we use beam search for the ranking when the number of translations is over a threshold. We find that our method is an improvement over using the same NMT model for end-to-end translation in both automatic and human evaluation. We conclude that our method retains the good qualities of rule-based translation such as explainability and controllability while keeping the rules lightweight.
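The ranking step described in this abstract can be illustrated with a generic beam search over per-constituent translation candidates. This is only a sketch: the paper scores alternatives with an actual NMT model, whereas the toy `score_fn` below (word length) and all names are illustrative assumptions.

```python
def beam_search_rank(candidate_lists, score_fn, beam_size=2):
    """Beam search over one candidate list per compound constituent.

    `score_fn` stands in for an NMT model's score of a candidate word;
    only the `beam_size` best partial translations survive each step.
    """
    beams = [([], 0.0)]  # (partial translation, cumulative score)
    for candidates in candidate_lists:
        expanded = [
            (partial + [cand], score + score_fn(cand))
            for partial, score in beams
            for cand in candidates
        ]
        # keep only the highest-scoring partial translations
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_size]
    return beams

# Toy usage: rank translations of a two-part compound, preferring longer words.
ranked = beam_search_rank([["rail", "railway"], ["road", "way"]], score_fn=len)
```

With a real NMT scorer, the same pruning keeps the search tractable even when spelling and inflection variants multiply the candidate space.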

Testing LLMs’ Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT
Joachim Minder | Guillaume Wisniewski | Natalie Kübler

This study investigates the capabilities of large language models (LLMs), specifically ChatGPT, in annotating MT outputs based on an error typology. In contrast to previous work focusing mainly on general language, we explore ChatGPT’s ability to identify and categorise errors in specialised translations. By testing two different prompts and based on a customised error typology, we compare ChatGPT annotations with human expert evaluations of translations produced by DeepL and ChatGPT itself. The results show that, for translations generated by DeepL, recall and precision are quite high. However, the degree of accuracy in error categorisation depends on the prompt’s specific features and its level of detail, with ChatGPT performing very well given a detailed prompt. When evaluating its own translations, ChatGPT achieves significantly poorer results, revealing limitations with self-assessment. These results highlight both the potential and the limitations of LLMs for translation evaluation, particularly in specialised domains. Our experiments pave the way for future research on open-source LLMs, which could produce annotations of comparable or even higher quality. In the future, we also aim to test the practical effectiveness of this automated evaluation in the context of translation training, particularly by optimising the process of human evaluation by teachers and by exploring the impact of annotations by LLMs on students’ post-editing and translation learning.

Name Consistency in LLM-based Machine Translation of Historical Texts
Dominic P. Fischer | Martin Volk

Large Language Models (LLMs) excel at translating 16th-century letters from Latin and Early New High German to modern English and German. While they perform well at translating well-known historical city names (e.g., Lutetia → Paris), their ability to handle person names (e.g., Theodor Bibliander) or lesser-known toponyms (e.g., Augusta Vindelicorum → Augsburg) remains unclear. This study investigates LLM-based translations of person and place names across various frequency bands in a corpus of 16th-century letters. Our results show that LLMs struggle with person names, achieving accuracies around 60%, but perform better with place names, reaching accuracies around 90%. We further demonstrate that including a translation suggestion for the proper noun in the prompt substantially boosts accuracy, yielding highly reliable results.

Non-autoregressive Modeling for Sign-gloss to Texts Translation
Fan Zhou | Tim Van de Cruys

Automatic sign language translation has seen significant advancements, driven by progress in computer vision and natural language processing. While end-to-end sign-to-text translation systems are available, many systems still rely on a gloss-based representation, an intermediate symbolic form that functions as a bridge between sign language and its written counterpart. This paper focuses on the gloss-to-text (gloss2text) task, a key step in the sign-to-text translation pipeline, which has traditionally been addressed using autoregressive (AR) modeling approaches. In this study, we propose the use of non-autoregressive (NAR) modeling techniques, including non-autoregressive Transformer (NAT) and diffusion models, tailored to the unique characteristics of gloss2text. Specifically, we introduce PointerLevT, a novel NAT-based model designed to enhance performance in this task. Our experiments demonstrate that NAR models achieve higher accuracy than pre-trained AR models with less data, while also matching the performance of fine-tuned AR models such as mBART. Furthermore, we evaluate inference speed and find that NAR models benefit from parallel generation, resulting in faster inference. However, they require more time to achieve an optimal balance between accuracy and speed, particularly in the multistep denoising process of diffusion models.

Exploring the Feasibility of Multilingual Grammatical Error Correction with a Single LLM up to 9B parameters: A Comparative Study of 17 Models
Dawid Wiśniewski | Antoni Solarski | Artur Nowakowski

Recent language models can successfully solve various language-related tasks, and many understand inputs in different languages. In this paper, we explore the performance of 17 popular models used to correct grammatical issues in texts written in English, German, Italian, and Swedish when using a single model to correct texts in all those languages. We analyze the outputs generated by these models, focusing on decreasing the number of grammatical errors while keeping the changes small. The conclusions drawn help us understand what problems occur among those models and which models can be recommended for multilingual grammatical error correction tasks. We list six models that improve grammatical correctness in all four languages and show that Gemma 9B is currently the best-performing one for the languages considered.

Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation - a Multilingual Perspective
Dawid Wiśniewski | Mikołaj Pokrywka | Zofia Rostek

Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many of the models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and the four aforementioned languages.

Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation
Petra Barančíková | Ondřej Bojar

In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream tasks, emphasizing the need for more research into “operationalizable semantics” in sentence embeddings, or for more in-depth downstream task datasets (here, translation evaluation).

Metaphors in Literary Machine Translation: Close but no cigar?
Alina Karakanta | Mayra Nas | Aletta G. Dorst

The translation of metaphorical language presents a challenge in Natural Language Processing as a result of its complexity and variability in terms of linguistic forms, communicative functions, and cultural embeddedness. This paper investigates the performance of different state-of-the-art Machine Translation (MT) systems and Large Language Models (LLMs) in metaphor translation in literary texts (English→Dutch), examining how metaphorical language is handled by the systems and the types of errors identified by human evaluators. While commercial MT systems perform better in terms of translation quality based on automatic metrics, the human evaluation demonstrates that open-source, literary-adapted NMT systems translate metaphors equally accurately. Still, the accuracy of metaphor translation ranges between 64% and 80%, with lexical and meaning errors being the most prominent. Our findings indicate that metaphors remain a challenge for MT systems and that adaptation to the literary domain is crucial for improving metaphor translation in literary texts.

Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations
Sheila Castilho | Zoe Fitzsimmons | Claire Holton | Aoife Mc Donagh

This study examines hallucinations in Large Language Model (LLM) translations into Irish, specifically focusing on instances where the models generate novel, non-existent words. We classify these hallucinations within verb and noun categories, identifying six distinct patterns among the latter. Additionally, we analyse whether these hallucinations adhere to Irish morphological rules and what linguistic tendencies they exhibit. Our findings show that while both GPT-4o and GPT-4o Mini produce similar types of hallucinations, the Mini model generates them at a significantly higher frequency. Beyond classification, the discussion raises speculative questions about the implications of these hallucinations for the Irish language. Rather than seeking definitive answers, we offer food for thought regarding the increasing use of LLMs and their potential role in shaping Irish vocabulary and linguistic evolution. We aim to prompt discussion on how such technologies might influence language over time, particularly in the context of low-resource, morphologically rich languages.

Patent Claim Translation via Continual Pre-training of Large Language Models with Parallel Data
Haruto Azami | Minato Kondo | Takehito Utsuro | Masaaki Nagata

Recent advancements in large language models (LLMs) have enabled their application across various domains. However, in the field of patent translation, Transformer encoder-decoder based models remain the standard approach, and the potential of LLMs for translation tasks has not been thoroughly explored. In this study, we conducted patent claim translation using an LLM fine-tuned with parallel data through continual pre-training and supervised fine-tuning, following the methodology proposed by Guo et al. (2024) and Kondo et al. (2024). Comparative evaluation against the Transformer encoder-decoder based translations revealed that the LLM achieved high scores for both BLEU and COMET. This demonstrated improvements in addressing issues such as omissions and repetitions. Nonetheless, hallucination errors, which were not observed in the traditional models, occurred in some cases and negatively affected the translation quality. This study highlights the promise of LLMs for patent translation while identifying the challenges that warrant further investigation.

The Devil is in the Details: Assessing the Effects of Machine-Translation on LLM Performance in Domain-Specific Texts
Javier Osorio | Afraa Alshammari | Naif Alatrush | Dagmar Heintze | Amber Converse | Sultan Alsarra | Latifur Khan | Patrick T. Brandt | Vito D’Orazio

Conflict scholars increasingly use computational tools to track violence and cooperation at a global scale. To study foreign locations, researchers often use machine translation (MT) tools, but rarely evaluate the quality of the MT output or its effects on Large Language Model (LLM) performance. Using a domain-specific multi-lingual parallel corpus, this study evaluates the quality of several MT tools for text in English, Arabic, and Spanish. Using ConfliBERT, a domain-specific LLM, the study evaluates the effect of MT texts on model performance, and finds that MT texts tend to yield better results than native texts. The MT quality assessment reveals considerable translation-induced distortions, reductions in vocabulary size and text specialization, and changes in syntactical structure. Regression analysis at the sentence-level reveals that such distortions, particularly reductions in general and domain vocabulary rarity, artificially boost LLM performance by simplifying the MT output. This finding cautions researchers and practitioners about uncritically relying on MT tools without considering MT-induced data loss.

Improving Japanese-English Patent Claim Translation with Clause Segmentation Models based on Word Alignment
Masato Nishimura | Kosei Buma | Takehito Utsuro | Masaaki Nagata

In patent documents, patent claims represent a particularly important section as they define the scope of the claims. However, due to the length and unique formatting of these sentences, neural machine translation (NMT) systems are prone to translation errors, such as omissions and repetitions. To address these challenges, this study proposes a translation method that first segments the source sentences into multiple shorter clauses using a clause segmentation model tailored to facilitate translation. These segmented clauses are then translated using a clause translation model specialized for clause-level translation. Finally, the translated clauses are rearranged and edited into the final translation using a reordering and editing model. In addition, this study proposes a method for constructing clause-level parallel corpora required for training the clause segmentation and clause translation models. This method leverages word alignment tools to create clause-level data from sentence-level parallel corpora. Experimental results demonstrate that the proposed method achieves statistically significant improvements in BLEU scores compared to conventional NMT models. Furthermore, for sentences where conventional NMT models exhibit omissions and repetitions, the proposed method effectively suppresses these errors, enabling more accurate translations.

Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages
Yash Bhaskar | Ketaki Shetye | Vandan Mujadia | Dipti Misra Sharma | Parameswari Krishnamurthy

This study addresses the critical challenge of data scarcity in machine translation for Indian languages, particularly given their morphological complexity and limited parallel data. We investigate an effective strategy to maximize the utility of existing data by generating negative samples from positive training instances using a progressive perturbation approach. This is used for aligning the model with preferential data using Kahneman-Tversky Optimization (KTO). Comparing it against traditional Supervised Fine-Tuning (SFT), we demonstrate how generating negative samples and leveraging KTO enhances data efficiency. By creating rejected samples through progressively perturbed translations from the available dataset, we fine-tune the Llama 3.1 Instruct 8B model using QLoRA across 16 language directions, including English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, achieving significant gains in translation quality with an average BLEU increase of 1.84 to 2.47 and CHRF increase of 2.85 to 4.01 compared to SFT for selected languages, while using the same positive training samples and under similar computational constraints. This highlights the potential of our negative sample generation strategy within KTO, especially in low-resource scenarios.
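The rejected-sample idea in this abstract can be sketched roughly as follows. This is not the authors' exact recipe: the token-dropping corruption, the function name, and the step count are all assumptions chosen to keep the sketch self-contained; a real recipe might also swap or substitute tokens.

```python
import random

def progressive_perturbations(tokens, steps=3, seed=0):
    """Return `steps` increasingly corrupted copies of a reference translation.

    Each step drops one more random token, so later samples drift farther
    from the reference; paired with the untouched reference, these serve
    as "rejected" examples for preference optimization such as KTO.
    """
    rng = random.Random(seed)
    current = list(tokens)
    rejected = []
    for _ in range(steps):
        if len(current) > 1:
            current.pop(rng.randrange(len(current)))  # one more corruption
        rejected.append(list(current))
    return rejected

# Toy usage: accepted = reference, rejected = progressively worse variants.
reference = "the cat sat on the mat".split()
rejected = progressive_perturbations(reference)
```

The appeal of this scheme is that it manufactures graded preference data from positive examples alone, with no extra parallel data.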

Leveraging Visual Scene Graph to Enhance Translation Quality in Multimodal Machine Translation
Ali Hatami | Mihael Arcan | Paul Buitelaar

Despite significant advancements in Multimodal Machine Translation, understanding and effectively utilising visual scenes within multimodal models remains a complex challenge. Extracting comprehensive and relevant visual features requires extensive and detailed input data to ensure the model accurately captures objects, their attributes, and relationships within a scene. In this paper, we explore using visual scene graphs extracted from images to enhance the performance of translation models. We investigate this approach for integrating Visual Scene Graph information into translation models, focusing on representing this information in a semantic structure rather than relying on raw image data. The performance of our approach was evaluated on the Multi30K dataset for English into German, French, and Czech translations using BLEU, chrF2, TER and COMET metrics. Our results demonstrate that utilising visual scene graph information improves translation performance. Using information on semantic structure can improve the multimodal baseline model, leading to better contextual understanding and translation accuracy.

pdf bib
Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication
Vicent Briva-Iglesias

The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain adaptability and contextual awareness, with translation quality comparable to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT and their integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

pdf bib
bytF: How Good Are Byte Level N-Gram F-Scores for Automatic Machine Translation Evaluation?
Raj Dabre | Kaing Hour | Haiyue Song

Recently, chrF and chrF++ have become preferred over BLEU for automatic n-gram evaluation of machine translation. Because they focus on character-level n-grams, they appear to correlate better with human judgments than word-level metrics when translating into morphologically rich languages. However, for non-Latin languages with sub-character-level structures, we can go one step further, namely to bytes. To this end, we propose bytF to capture sub-character-level information, where we consider byte-level n-grams. Furthermore, we augment it to bytF+ and bytF++, where we consider character and word n-gram backoffs. On machine translation metric meta-evaluation datasets from English into 5 Indian languages, Chinese, and Japanese, we show that bytF and its variants correlate comparably with (minimum difference) or significantly better than (maximum difference) chrF and chrF++ against human judgments at the segment level. We often observe that backing off to characters and words for bytF, and to words for chrF, does not yield the highest correlation with humans. Furthermore, we also observe that using default n-gram values often leads to scores with poorer correlations with humans, indicating the need for well-studied and tuned n-gram metrics for efficacy.
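The core idea, a chrF-style F-score computed over UTF-8 bytes rather than characters, can be sketched in a few lines. This is a minimal illustration in the spirit of bytF; the paper's exact formulation, default n-gram order, and beta may differ.

```python
from collections import Counter

def ngram_fscore(hyp_bytes: bytes, ref_bytes: bytes, max_n=4, beta=2.0):
    """Byte-level n-gram F-score sketch: average n-gram precision and
    recall over orders 1..max_n, combined with an F-beta (beta=2 favors
    recall, as in chrF). Parameters here are illustrative assumptions."""
    def ngrams(seq, n):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_bytes, n), ngrams(ref_bytes, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# For non-Latin scripts, one character expands to several UTF-8 bytes,
# so byte n-grams see sub-character structure:
score = ngram_fscore("猫が座った".encode("utf-8"), "猫が寝た".encode("utf-8"))
```

The character and word backoffs of bytF+ and bytF++ would combine this score with the analogous F-scores computed over character and word sequences.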

pdf bib
Quality Estimation and Post-Editing Using LLMs For Indic Languages: How Good Is It?
Anushka Singh | Aarya Pakhale | Mitesh M. Khapra | Raj Dabre

Recently, there have been increasing efforts on Quality Estimation (QE) and Post-Editing (PE) using Large Language Models (LLMs) for Machine Translation (MT). However, the focus has mainly been on high-resource languages, and the approaches either rely on prompting or combine existing QE models with LLMs, instead of single end-to-end systems. In this paper, we investigate the efficacy of end-to-end QE and PE systems for low-resource languages, taking 5 Indian languages as a use-case. We augment existing QE data containing Multidimensional Quality Metrics (MQM) error annotations with explanations of errors and PEs with the help of proprietary LLMs (GPT-4), following which we fine-tune Gemma-2-9B, an open-source multilingual LLM, to perform QE and PE jointly. While our models attain QE capabilities competitive with or surpassing existing models in both referenceful and referenceless settings, we observe that they still struggle with PE. Further investigation reveals that this occurs because our models lack the ability to accurately identify fine-grained errors in the translation, despite being excellent indicators of overall quality. This opens up opportunities for research in end-to-end QE and PE for low-resource languages.

pdf bib
Revisiting Post-Editing for English-Chinese Machine Translation
Hari Venkatesan

Given the rapid strides in quality made by automated translation since the advent of Neural Machine Translation, questions regarding the need and role of Post-Editing (PE) may need revisiting. This paper discusses this in light of a survey of opinions from two cohorts of post-graduate students of translation. The responses indicate that the role of PE may need further elaboration in terms of aspects such as grammar, lexis and style, with lexis and style being the main sites requiring human intervention. Also, contrary to expectations, responses generally show marked hesitation in considering quasi-texts as final without PE, even in the case of disposable texts. The discussion here pertains to English-Chinese translation, but may resonate with other language pairs as well.

pdf bib
Is it AI or PE that worry translation professionals: results from a Human-Centered AI survey
Miguel A. Jiménez-Crespo | Stephanie A. Rodríguez

Translation technologies have historically been developed without substantial input from professionals (e.g. O’Brien 2012). Conversely, the emerging human-centered AI (HCAI) paradigm emphasizes the importance of including end-users in the “process of conceiving, designing, testing, deploying, and iterating” technologies (Vallor 2024: 17). Therefore, early research engagement with the attitudes, needs and opinions of professionals on AI implementation is essential, because incorporating them at later stages “results in issues and missed opportunities, which may be expensive to recover from due to the cost, time, resources, and energy spent” (Winslow and Garibay 2004: 123). To this end, this article presents a qualitative analysis of professional translators’ attitudes towards AI in the future, centered around the role of MT and post-editing (PE). The discussion draws on data collected from open-ended questions included in a larger survey on control and autonomy from a HCAI perspective, which were thematically coded and qualitatively examined. The thematic analysis indicates that predominant concerns regarding the future of the AI-driven translation industry still revolve around longstanding issues in the PE and MT literature, such as PE, translation quality, communicating with and educating LSPs, clients, users, and the broader public, and maintaining human control over the final product or creativity. This is explained to some extent by the relatively low rates of integration of AI technologies into translation workflows to date (e.g. ELIA 2024; Rivas Ginel et al 2024; GALA 2024; Jimenez-Crespo 2024), or the fact that professionals report using AI primarily for tasks related to translation, but not necessarily to PE the output of LLMs or NMT (Rivas Ginel and Moorkens 2025).

pdf bib
Prompt engineering in translation: How do student translators leverage GenAI tools for translation tasks
Jia Zhang | Xiaoyu Zhao | Stephen Doherty

GenAI, though not developed specifically for translation, has shown the potential to produce translations as good as, if not better than, contemporary neural machine translation systems. In the context of tertiary-level translator education, the integration of GenAI has renewed debate over curricula and pedagogy. Despite divergent opinions among educators, it is evident that translation students, like many other students, are using GenAI tools to facilitate translation tasks, just as they use MT tools. We thus argue for the benefits of guiding students in using GenAI in an informed, critical, and ethical manner. To provide insights for tailored curriculum and pedagogy, it is instructive to investigate what students use GenAI for and how they use it. This study is among the first to investigate translation students’ prompting behaviours. For thematic and discourse analysis, we collected the prompts generated in GenAI tools by a representative sample of postgraduate student participants over eight months. The findings revealed that students had indeed used GenAI in various translation tasks, but their prompting behaviours were intuitive and uninformed. Our findings suggest an urgent need for translation educators to consider students’ agency and critical engagement with GenAI tools.

pdf bib
Can postgraduate translation students identify machine-generated text?
Michael Farrell

Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies characteristic of synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated. The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.

pdf bib
MT or not MT? Do translation specialists know a machine-translated text when they see one?
Rudy Loock | Nathalie Moulard | Quentin Pacinella

In this article, we investigate translation specialists’ capacity to identify raw machine translation (MT) output in comparison with so-called “human” translations produced without any use of MT. Specifically, we measure this capacity via an online activity, based on different criteria: (i) degree of expertise (translation students vs. professionals with at least 5 years’ experience), (ii) MT engine (DeepL, Google Translate, Reverso, ChatGPT), and (iii) length of input (1-3 sentences). A complementary, qualitative analysis, based on participants’ feedback, provides interesting insight into how they discriminate between raw MT output and human translations.

pdf bib
The Challenge of Translating Culture-Specific Items: Evaluating MT and LLMs Compared to Human Translators
Bojana Budimir

We evaluate state-of-the-art Large Language Models (LLMs), ChatGPT-4o and Gemini 1.5 Flash, as well as Google Translate, by focusing on the translation of culture-specific items (CSIs) between an underrepresented language pair: the Flemish variant of Dutch and Serbian. Using a corpus derived from three Flemish novels, we analyze CSIs in three cultural domains: Material Culture, Proper Names, and Social Culture. Translation strategies are examined on a spectrum that goes from conservation to substitution. Quantitative analysis explores strategy distribution, while qualitative analysis investigates errors, linguistic accuracy, and cultural adaptation. Despite advancements, models struggle to balance cultural nuances with understandability for the target readers. Gemini aligns most closely with human translation strategies, while Google Translate shows significant limitations. These findings underscore the challenges of translating CSIs, particularly Proper Names, in low-resource languages and offer insights for improving machine translation models.

pdf bib
Investigating the Integration of LLMs into Trainee Translators’ Practice and Learning: A Questionnaire-based Study on Translator-AI Interaction
Xindi Hao | Shuyin Zhang

In recent years, large language models (LLMs) have drawn significant attention from translators, including trainee translators, who are increasingly adopting LLMs in their translation practice and learning. Despite this growing interest, to the best of our knowledge, no LLM has yet been specifically designed for (trainee) translators. While numerous LLMs are available on the market, their potential for performing translation-related tasks is yet to be fully discovered. This highlights a pressing need for a tailored LLM translator guide, conceptualized as an aggregator or directory of multiple LLMs and designed to support trainee translators in selecting and navigating the most suitable models for different scenarios in their translation tasks. As an initial step towards the development of such a guide, this study aims to identify the scenarios in which trainee translators regularly use LLMs. It employs questionnaire-based research to examine the frequency of LLM usage by trainee translators, the average number of prompts, and their satisfaction with the performance of LLMs across the various scenarios identified. The findings give an insight into when and where trainee translators might integrate LLMs into their workflows, identify the limitations of current LLMs in assisting translators’ work, and shed light on a future design for an LLM translator guide.

pdf bib
Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness
Siqi Liu | Guangrong Dai | Dechao Li

This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators’ perceptions. The study also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The interaction effects examined were not significant, suggesting that QE consistently improves MTPE efficiency across MT outputs of medium and high quality and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators’ evaluation of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder the post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators’ productivity.

pdf bib
Human- or machine-translated subtitles: Who can tell them apart?
Ekaterina Lapshinova-Koltunski | Sylvia Jaki | Maren Bolz | Merle Sauter

This contribution investigates whether machine-translated subtitles can be easily distinguished from human-translated ones. For this, we run an experiment using two versions of German subtitles for an English television series: (1) produced manually by professional subtitlers, and (2) translated automatically with a Large Language Model (LLM), i.e., GPT-4. Our participants were students of translation studies with varying experience in subtitling and the use of machine translation. We asked participants to guess if the subtitles for a selection of video clips had been translated manually or automatically. Apart from analysing whether machine-translated subtitles are distinguishable from human-translated ones, we also seek indicators of the differences between human and machine translations. Our results show that although it is overall hard to differentiate between human and machine translations, there are some differences. Notably, the more experience the humans have with translation and subtitling, the more able they are to tell the two translation variants apart.

pdf bib
Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing
Antonio Castaldo | Sheila Castilho | Joss Moorkens | Johanna Monti

Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing (PE) LLM-generated translations significantly reduce editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators.

pdf bib
To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels
Kyo Gerrits | Ana Guerberof Arenas

This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this is the highest in HT and the lowest in MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research.

pdf bib
Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations
Yuri Balashov | Alex Balashov | Shiho Fukuda Koski

This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies for individual translators and language service providers with modest resources. The advent of advanced neural machine translation systems, large language models, and their integration into workflows via computer-assisted translation tools and translation management systems have reshaped the translation landscape. These advancements enable not only translation but also quality evaluation, error spotting, glossary generation, and adaptation to domain-specific needs, creating new technical opportunities for freelancers. In this series, we aim to empower translators with actionable methods to harness these advancements. Our approach emphasizes Translation Analytics, a suite of evaluation techniques traditionally reserved for large-scale industry applications but now becoming increasingly available for smaller-scale users. This first paper introduces a practical framework for adapting automatic evaluation metrics — such as BLEU, chrF, TER, and COMET — to freelancers’ needs. We illustrate the potential of these metrics using a trilingual corpus derived from a real-world project in the medical domain and provide statistical analysis correlating human evaluations with automatic scores. Our findings emphasize the importance of proactive engagement with emerging technologies to not only adapt but thrive in the evolving professional environment.
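The correlation between human evaluations and automatic scores that the paper reports can be computed with nothing beyond the standard library; a minimal sketch with hypothetical ratings and scores (the numbers below are illustrative, not the paper's data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists,
    e.g. human adequacy ratings vs. automatic metric scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: five segments rated by a human (1-5 adequacy)
# and scored by an automatic metric (e.g. a COMET-like score in 0-1).
human = [3.2, 4.5, 2.1, 4.8, 3.9]
metric = [0.61, 0.82, 0.40, 0.88, 0.70]
r = pearson(human, metric)
```

In practice a freelancer would obtain the metric column from off-the-shelf tooling (e.g. the sacrebleu package for BLEU, chrF, and TER) and the human column from their own segment-level judgments.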

pdf bib
ITALERT: Assessing the Quality of LLMs and NMT in Translating Italian Emergency Response Text
Maria Carmen Staiano | Lifeng Han | Johanna Monti | Francesca Chiusaroli

This paper presents the outcomes of an initial investigation into the performance of Large Language Models (LLMs) and Neural Machine Translation (NMT) systems in translating high-stakes messages. The research employed a novel bilingual corpus, ITALERT (Italian Emergency Response Text) and applied a human-centric post-editing based metric (HOPE) to assess translation quality systematically. The initial dataset contains eleven texts in Italian and their corresponding English translations, both extracted from the national communication campaign website of the Italian Civil Protection Department. The texts deal with eight crisis scenarios: flooding, earthquake, forest fire, volcanic eruption, tsunami, industrial accident, nuclear risk, and dam failure. The dataset has been carefully compiled to ensure usability and clarity for evaluating machine translation (MT) systems in crisis settings. Our findings show that current LLMs and NMT models, such as ChatGPT (OpenAI’s GPT-4o model) and Google MT, face limitations in translating emergency texts, particularly in maintaining the appropriate register, resolving context ambiguities, and managing domain-specific terminology.

pdf bib
Optimising ChatGPT for creativity in literary translation: A case study from English into Dutch, Chinese, Catalan and Spanish
Shuxiang Du | Ana Guerberof Arenas | Antonio Toral | Kyo Gerrits | Josep Marco Borillo

This study examines the variability of ChatGPT’s machine translation (MT) outputs across six different configurations in four languages, with a focus on creativity in a literary text. We evaluate GPT translations at different levels of text granularity, temperature settings and prompting strategies with a Creativity Score formula. We found that prompting ChatGPT with a minimal instruction yields the best creative translations, with “Translate the following text into [TG] creatively” at a temperature of 1.0 outperforming other configurations and DeepL in Spanish, Dutch, and Chinese. Nonetheless, ChatGPT consistently underperforms compared to human translation (HT). All the code and data are available at [repository URL to be provided with the camera-ready version].

pdf bib
Improving MT-enabled Triage Performance with Multiple MT Outputs
Marianna J. Martindale | Marine Carpuat

Recent advances in Machine Translation (MT) quality may motivate adoption in a variety of use cases, but the success of MT deployment depends not only on intrinsic model quality but on how well the model, as deployed, helps users meet the objectives of their use case. This work focuses on a specific triage use case, MT-enabled scanning in intelligence analysis. After describing the use case with its objectives and failure modes, we present a user study to establish a baseline performance level and measure the mitigating effects of a simple intervention, providing additional MT outputs. We find significant improvements in relevance judgment accuracy with outputs from two distinct neural MT models and significant improvements in relevant entity identification with the addition of a rule-based MT. Users also like seeing multiple MT outputs, making it an appealing way to improve MT-enabled scanning performance.

pdf bib
The GAMETRAPP project: Spanish scholars’ perspectives and attitudes towards neural machine translation and post-editing
Cristina Toledo-Báez | Luis Carlos Marín-Navarro

The GAMETRAPP project (2022-2025), funded by the Spanish Ministry of Science and Innovation and led by the University of Málaga, aims to introduce and promote post-editing (PE) practices of machine-translated research abstracts among Spanish scholars. To this aim, the GAMETRAPP project is developing a gamified environment, specifically an escape room, integrated into a responsive web app. As part of the design of both the gamified environment and the web app, this paper presents the results of a questionnaire distributed to Spanish scholars in order to explore their perspectives and attitudes towards neural machine translation (NMT) and PE. A total of 253 responses were collected from scholars affiliated with 42 Spanish public universities. A two-stage participant selection process was applied: the analysis focuses on scholars who self-reported a CEFR level of C1 or C2 in English proficiency (n = 152), and, within this group, a comparison was conducted between scholars from linguistic disciplines (23%, n = 35) and those from non-linguistic disciplines (77%, n = 117). Statistically significant differences between these groups were identified using the Mann-Whitney U test in IBM SPSS. The results indicate a widespread and continued use of language technologies, particularly those related to NMT. However, only 34.2% of scholars from non-linguistic disciplines are familiar with PE as a concept, although 59.8% report that they do post-edit their scientific abstracts. Furthermore, 62.9% of scholars from linguistic disciplines and 47.9% from non-linguistic disciplines believe it is necessary to create an app that trains scholars in post-editing Spanish abstracts into English. Sentiment analysis conducted with Atlas.ti on the 29 qualitative responses to the open-ended question suggests overall neutral attitudes toward NMT and PE for both groups of scholars.
In conclusion, while both groups engage with NMT tools, there is a clear need for training—especially among scholars from non-linguistic disciplines—to familiarize them with PE concepts and to help develop basic PE literacy skills.

pdf bib
Using Translation Techniques to Characterize MT Outputs
Sergi Alvarez-Vidal | Maria Do Campo | Christian Olalla-Soler | Pilar Sánchez-Gijón

While current NMT and GPT models improve fluency and context awareness, they struggle with creative texts, where figurative language and stylistic choices are crucial. Current evaluation methods fail to capture these nuances, which calls for a more descriptive approach. We propose a taxonomy based on translation techniques to assess machine-generated translations more comprehensively. The pilot study we conducted comparing human and machine-produced translations reveals that human translations employ a wider range of techniques, enhancing naturalness and cultural adaptation. NMT and GPT models, even with prompting, tend to simplify content and introduce accuracy errors. Our findings highlight the need for refined frameworks that consider stylistic and contextual accuracy, ultimately bridging the gap between human and machine translation performance.

up

pdf (full)
bib (full)
Proceedings of Machine Translation Summit XX: Volume 2

pdf bib
Proceedings of Machine Translation Summit XX: Volume 2
Pierrette Bouillon | Johanna Gerlach | Sabrina Girletti | Lise Volkart | Raphael Rubino | Rico Sennrich | Samuel Läubli | Martin Volk | Miquel Esplà-Gomis | Vincent Vandeghinste | Helena Moniz | Sara Szoc

pdf bib
Using AI Tools in Multimedia Localization Workflows: a Productivity Evaluation
Ashley Mondello | Romina Cini | Sahil Rasane | Alina Karakanta | Laura Casanellas

Multimedia localization workflows are inherently complex, and the demand for localized content continues to grow. This demand has attracted Language Service Providers (LSPs) to expand their activities into multimedia localization, offering subtitling and voice-over services. While a wide array of AI tools is available for these tasks, their value in increasing productivity in multimedia workflows for LSPs remains uncertain. This study evaluates the productivity, quality, cost, and time efficiency of three multimedia localization workflows, each incorporating varying levels of AI automation. Our findings indicate that workflows merely replacing human vendors with AI tools may result in quality degradation without justifying the productivity gains. In contrast, integrated workflows using specialized tools enhance productivity while maintaining quality, despite requiring additional training and adjustments to established practices.

pdf bib
Replacing the Irreplaceable: A Case Study on the Limitations of MT and AI Translation during the 2023 Gaza-Israel Conflict
Abeer Alfaify

Despite the remarkable development of artificial intelligence (AI) and machine translation (MT) in recent years, which has made them more efficient, less costly and easier to navigate, they still struggle to match the abilities of human translators. The limitations shown by AI and MT, which have been detected in various domain-specific texts and contexts, sustain the debate over whether they can fully replace human translators. Nevertheless, very few studies have examined the translation abilities of AI and MT during conflicts and high-stakes contexts. This paper explores some of these limitations that were detected during the 2023 Gaza-Israel conflict, illustrating significant examples from X (formerly Twitter). These examples showcase limitations in 1) translating cultural references, 2) avoiding critical errors in high-stakes context, 3) preventing bias and intervention, and 4) translating cursive handwriting. This is done through a combination of descriptive, comparative and experimental analysis methods, highlighting risks and implications associated with using these tools in such sensitive contexts, while contributing to the broader discussion on whether advances in AI and MT will diminish the need for human translators.

pdf bib
Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages
Andrei Popescu-Belis | Alexis Allemann | Teo Ferrari | Gopal Krishnamani

The popularity of automatic speech-to-speech translation for human conversations is growing, but the quality varies significantly depending on the language pair. In a context of community interpreting for low-resource languages, namely Turkish and Pashto to/from French, we collected fine-tuning and testing data, and compared systems using several automatic metrics (BLEU, COMET, and BLASER) and human assessments. The pipelines consist of automatic speech recognition, machine translation, and speech synthesis, with local models and cloud-based commercial ones. Some components have been fine-tuned on our data. We evaluated over 60 pipelines and determined the best one for each direction. We also found that the ranks of components are generally independent of the rest of the pipeline.

pdf bib
Arabizi vs LLMs: Can the Genie Understand the Language of Aladdin?
Perla Al Almaoui | Pierrette Bouillon | Simon Hengchen

In an era of rapid technological advancements, communication continues to evolve as new linguistic phenomena emerge. Among these is Arabizi, a hybrid form of Arabic that incorporates Latin characters and numbers to represent the spoken dialects of Arab communities. Arabizi is widely used on social media and allows people to communicate in an informal and dynamic way, but it poses significant challenges for machine translation due to its lack of formal structure and deeply embedded cultural nuances. This case study is motivated by a growing need to translate Arabizi for gisting purposes. It evaluates the capacity of different LLMs to decode and translate Arabizi, focusing on multiple Arabic dialects that have rarely been studied up until now. Using a combination of human evaluators and automatic metrics, this research project investigates the models’ performance in translating Arabizi into both Modern Standard Arabic and English. Key questions explored include which dialects are translated most effectively and whether translations into English surpass those into Arabic.

pdf bib
Cultural Transcreation in Asian Languages with Prompt-Based LLMs
Helena Wu | Beatriz Silva | Vera Cabarrão | Helena Moniz

This research explores Cultural Transcreation (CT) for East Asian languages, focusing primarily on Mandarin Chinese (ZH) and the customer service (CS) market. We combined Large Language Models (LLMs) with prompt engineering to develop a CT product that, aligned with the Augmented Translation concept, enhances multilingual CS communication, enables professionals to engage with their target audience effortlessly, and improves overall service quality. Through a series of preparatory steps, including guideline establishment, benchmark validation, iterative prompt refinement, and LLM testing, we integrated the CT product into the CS platform, assessed its performance, and refined prompts based on a pilot feedback. The results highlight its success in empowering agents, regardless of linguistic or cultural expertise, to bridge effective communication gaps through AI-assisted cultural rephrasing, thus achieving its market launch. Beyond CS, the study extends the concept of transcreation and prompt-based LLM applications to other fields, discussing its performance in the language conversion of website content and advertising.

pdf bib
A comparison of translation performance between DeepL and Supertext
Alex Flückiger | Chantal Amrhein | Tim Graf | Frédéric Odermatt | Martin Pömsl | Philippe Schläpfer | Florian Schottmann | Samuel Läubli

As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems – DeepL and Supertext – by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.

pdf bib
Leveraging LLMs for Cross-Locale Adaptation: a Workflow Proposal on Spanish Variants
Vera Senderowicz Guerra

Localization strategies can differ widely between languages, but the necessity and efficiency of maintaining distinct strategies for closely related variants of the same language is debatable. This paper explores the potential for unifying localization strategies across different Spanish locales, leveraging Large Language Models, prompting techniques, and specialized linguistic resources to perform cross-locale adaptations from a chosen baseline. In this study, we examine and develop vocabulary, terminology, grammar, and style transformation methods from Latin American into Mexican and Argentine Spanish. Our findings suggest that starting from a core translation and then following an automated adaptation process to unify localization strategies is feasible for diverse Spanish variants, regardless of the type of divergence each of them has from the baseline locale. However, although the need for human post-editing is then minimal compared to a fully ‘manual’ cross-locale adaptation, linguistic review remains crucial, particularly for editing style nuances.

pdf bib
SpeechT: Findings of the First Mentorship in Speech Translation
Yasmin Moslem | Juan Julián Cea Morán | Mariano Gonzalez-Gomez | Muhammad Hazim Al Farouq | Farah Abdou | Satarupa Deb

This work presents the details and findings of the first mentorship in speech translation (SpeechT), which took place in December 2024 and January 2025. To fulfil the mentorship requirements, the participants engaged in key activities, including data preparation, modelling, and advanced research. The participants explored data augmentation techniques and compared end-to-end and cascaded speech translation systems. The projects covered various languages other than English, including Arabic, Bengali, Galician, Indonesian, Japanese, and Spanish.

pdf bib
ZuBidasoa: Participatory Research for the Development of Linguistic Technologies Adapted to the Needs of Migrants in the Basque Country
Xabier Soto | Ander Egurtzegi | Maite Oronoz | Urtzi Etxeberria

Recent years have witnessed the development of advanced language technologies, including the use of audio and images as part of multimodal systems. However, these models are not adapted to the specific needs of migrants and Non-Governmental Organizations (NGOs) communicating in multilingual scenarios. In this project, we focus on the situation of migrants arriving in the Basque Country, near the western border between Spain and France. To identify migrants’ needs, we have met with several organisations helping them at different stages, including: sea rescue; primary care in refugee camps and in situ; assistance with asylum applications; other administrative issues; and human rights defence in retention centres. In these interviews, Darija has been identified as the most spoken language among the under-served ones. Considering this, we have started the development of a Machine Translation (MT) system between Basque and Darija (Moroccan Arabic), based on open-source corpora. In this paper, we present the description of the project and the main results of the participatory research developed in the initial stage.

pdf bib
Machine Translation to Inform Asylum Seekers: Intermediate Findings from the MaTIAS Project
Lieve Macken | Ella van Hest | Arda Tezcan | Michaël Lumingu | Katrijn Maryns | July De Wilde

We present key interim findings from the ongoing MaTIAS project, which focuses on developing a multilingual notification system for asylum reception centres in Belgium. This system integrates machine translation (MT) to enable staff to provide practical information to residents in their native language, thus fostering more effective communication. Our discussion focuses on three key aspects: the development of the multilingual messaging platform, the types of messages the system is designed to handle, and the evaluation of potential MT systems for integration.

pdf bib
CAT-GPT: A Skopos-Driven, LLM-Based Computer-Assisted Translation Tool
Paşa Abdullah Bayramoğlu

This paper introduces CAT-GPT, an innovative Computer-Assisted Translation (CAT) tool designed to address context-awareness and terminological consistency challenges often encountered in standard CAT workflows. Grounded in Skopos theory (Vermeer, 2014) and powered by a Large Language Model (LLM) backend, CAT-GPT integrates context-sensitive segmentation, automatically generated and adjustable translation instructions, and an advanced machine translation component. Comparative observations with a widely used CAT tool (e.g., Trados Studio) suggest that CAT-GPT reduces post-editing effort and improves text-level coherence, especially in specialized or domain-specific scenarios.

pdf bib
MTUOC server: integrating several NMT and LLMs into professional translation workflows
Antoni Oliver

In this paper, we present the latest version of MTUOC-server and MTUOC-multiserver, a robust tool capable of launching one or more translation servers. It supports a wide range of NMT systems and LLMs, both commercial and open-source, and is compatible with several communication protocols, broadening the range of tools it can work with. This server is a component of the MTUOC project and is distributed under a free license.

pdf bib
OPAL Enable: Revolutionizing Localization Through Advanced AI
Mara Nunziatini | Konstantinos Karageorgos | Aaron Schliem | Mikaela Grace

This paper discusses the capabilities and benefits of OPAL Enable, an advanced AI suite designed to modernize localization processes. The suite comprises Machine Translation, AI Post-Editing, and AI Quality Estimation tools, integrated into renowned translation management systems. The paper provides an in-depth analysis of these features, detailing their procedural order, and the time and cost savings they offer. It emphasizes the customization potential of OPAL Enable to meet client-specific requirements, increase scalability, and expedite workflows.

pdf bib
UniOr PET: An Online Platform for Translation Post-Editing
Antonio Castaldo | Sheila Castilho | Joss Moorkens | Johanna Monti

UniOr PET is a browser-based platform for machine translation post-editing and a modern successor to the original PET tool. It features a user-friendly interface that records detailed editing actions, including time spent, additions, and deletions. Fully compatible with PET, UniOr PET introduces two advanced timers for more precise tracking of editing time and computes widely used metrics such as hTER, BLEU, and ChrF, providing comprehensive insights into translation quality and post-editing productivity. Designed with translators and researchers in mind, UniOr PET combines the strengths of its predecessor with enhanced functionality for efficient and user-friendly post-editing projects.

pdf bib
FLORES+ Mayas: Generating Textual Resources to Foster the Development of Language Technologies for Mayan Languages
Andrés Lou | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena

A significant percentage of the population of Guatemala and Mexico belongs to various Mayan indigenous communities, for whom language barriers lead to social, economic, and digital exclusion. The Mayan languages spoken by these communities remain severely underrepresented in terms of digital resources, which prevents them from leveraging the latest advances in artificial intelligence. This project addresses that problem by means of: 1) the digitisation and release of multiple printed linguistic resources; 2) the development of a high-quality parallel machine translation (MT) evaluation corpus for six Mayan languages. In doing so, we are paving the way for the development of MT systems that will facilitate the access for Mayan speakers to essential services such as healthcare or legal aid. The resources are produced with the essential participation of indigenous communities, whereby native speakers provide the necessary translation services, QA, and linguistic expertise. The project is funded by the Google Academic Research Awards and carried out in collaboration with the Proyecto Lingüístico Francisco Marroquín Foundation in Guatemala.

pdf bib
ProMut: The Evolution of NMT Didactic Tools
Pilar Sánchez-Gijón | Gema Ramírez-Sánchez

Neural Machine Translation intensifies educational challenges in translation technologies. The MultiTraiNMT project developed MutNMT, an open-source, didactic platform for training and evaluating NMT systems. Building upon it, LT-LiDER introduces ProMut, which implements three main novel features: migration of the core NMT framework from JoeyNMT to MarianNMT; close integration with OPUS datasets, engines, and connectors; and the addition of a researcher profile for larger datasets and extended training and evaluation processes.

pdf bib
The BridgeAI Project
Helena Moniz | António Novais | Joana Lamego | Nuno André

This paper presents an updated overview of the ‘BridgeAI’ project, a science-for-policy initiative funded by the Portuguese Foundation for Science and Technology (FCT) and the Recovery and Resilience Programme. In its second stage of implementation, BridgeAI continues to build upon its original goals, working towards a strategy to align AI research, policy, regulatory frameworks, and practical application. The project provides Portugal with an evidence-based framework to implement the EU Artificial Intelligence (AI) Act (AIA), ensuring responsible AI innovation through multidisciplinary collaboration. BridgeAI connects academia, industry, public administration, and civil society to create actionable insights and regulatory recommendations. This paper details the project’s latest advancements, key recommendations, and future directions.

pdf bib
DeMINT: Automated Language Debriefing for English Learners via AI Chatbot Analysis of Meeting Transcripts
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz

The objective of the DeMINT project is to develop a conversational tutoring system aimed at enhancing non-native English speakers’ language skills through post-meeting analysis of the transcriptions of video conferences in which they have participated. This paper describes the model developed and the results obtained through a human evaluation conducted with learners of English as a second language.

pdf bib
GAMETRAPP project in progress: Designing a virtual escape room to enhance skills in research abstract post-editing
Cristina Toledo-Báez | Luis Carlos Marín-Navarro

The “App for post-editing neural machine translation using gamification” (GAMETRAPP) project (TED2021-129789B-I00), funded by the Spanish Ministry of Science and Innovation (2022–2025) and led by the University of Málaga, has been in progress for two and a half years. The project is developing a web application that incorporates a gamified environment, specifically a virtual escape room, to bring post-editing practice closer to scholars. This paper outlines the methodological process followed and provides a brief description of the virtual escape room.

pdf bib
AI4Culture platform: upskilling experts on multilingual / -modal tools
Tom Vanallemeersch | Sara Szoc | Marthe Lamote | Frederic Everaert | Eirini Kaldeli

The AI4Culture project, funded by the European Commission (2023-2025), developed a platform (https://ai4culture.eu) to educate cultural heritage (CH) professionals in AI technologies. Acting as an online capacity building hub, the platform describes openly labeled data sets and deployable and reusable tools applying AI technologies in tasks relevant to the CH sector. It also offers tutorials for tools and recipes for the combination of tools. In addition, the platform allows users to contribute their own resources. The resources described by project partners involve applications for optical or handwritten character recognition (OCR, HTR), generation and validation of subtitles, machine translation, image analysis, and semantic linking. The partners customized various tools to enhance the usability of interfaces and components. Here, we zoom in on the use case of correcting OCR/HTR output using various means (such as an unstructured manual transcription) to facilitate multilingual accessibility and create structured ground truth (text lines with image coordinates).

pdf bib
HPLT’s Second Data Release
Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Laurie Burchell | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Andrey Kutuzov | Veronika Laippala | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data that produces 1,275 pairs, as well as the documents containing all parallel sentences, to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

pdf bib
MaTOS: Machine Translation for Open Science
Rachel Bawden | Maud Bénard | José Cornejo Cárcamo | Nicolas Dahan | Manon Delorme | Mathilde Huguin | Natalie Kübler | Paul Lerner | Alexandra Mestivier | Joachim Minder | Jean-François Nominé | Ziqian Peng | Laurent Romary | Panagiotis Tsolakis | Lichao Zhu | François Yvon

This paper is a short presentation of MaTOS, a project focusing on the automatic translation of scholarly documents. Its main aims are threefold: (a) to develop resources (term lists and corpora) for high-quality machine translation; (b) to study methods for translating complete, structured documents in a cohesive and consistent manner; (c) to propose novel metrics to evaluate machine translation in technical domains. Publications and resources are available on the project web site: https://anr-matos.gihub.io.

pdf bib
Prompt-based Explainable Quality Estimation for English-Malayalam
Archchana Sindhujan | Diptesh Kanojia | Constantin Orăsan

The aim of this project was to curate data for the English-Malayalam language pair for the tasks of Quality Estimation (QE) and Automatic Post-Editing (APE) of Machine Translation. Whilst the primary aim of the project was to create a dataset for a low-resource language pair, we plan to use this dataset to investigate different zero-shot and few-shot prompting strategies including chain-of-thought, towards a unified explainable QE-APE framework.

pdf bib
MTxGames: Machine Translation Post-Editing in Video Game Translation - Findings on User Experience and Preliminary Results on Productivity
Judith Brenner

MTxGames is a doctoral research project examining three different translation modes with varying degrees of machine translation post-editing when translating video game texts. For realistic experimental conditions, data elicitation took place at the workplaces of professional game translators. In a mixed-methods approach, quantitative data was elicited through keylogging, eye-tracking, error annotation, and questionnaires as well as qualitative data through interviews. Aspects to be analyzed are translation productivity, cognitive effort, translation quality, and translators’ user experience.

pdf bib
Machine translation as support for epistemic capacities: Findings from the DECA project
Maarit Koponen | Nina Havumetsä | Juha Lång | Mary Nurminen

The DECA project consortium investigates epistemic capacities, defined as an individual’s access to reliable knowledge, their ability to participate in knowledge production, and society’s capacity to make informed, sustainable policy decisions. As a tool both for accessing information across language barriers and for producing multilingual information, machine translation also plays a potential role in supporting these epistemic capacities. In this paper, we present an overview of DECA’s research on two perspectives: 1) how migrants use machine translation to access information, and 2) how journalists use machine translation in their work.

pdf bib
Reverso Define: An AI-Powered Contextual Dictionary for Professionals
Quentin Pleplé | Théo Hoffenberg

We present Reverso Define, an innovative English dictionary designed to support translation professionals with AI-powered, context-aware definitions. Built using a hybrid approach combining Large Language Models and expert linguists, it offers precise definitions with special attention to multi-word expressions and domain-specific terminology. The system provides comprehensive coverage of technical domains relevant to professional translators while maintaining daily updates to address emerging terminology needs. It also provides indicative translations in 26 languages linked to each meaning, with variants within languages where appropriate, and links to Reverso Context, a range of contextual and corpus-based bilingual dictionaries, and Reverso Synonyms. We will show various ways to use it with concrete examples and give some insights into its design and creation process.

pdf bib
Reverso Documents, The New Generation Document Translation Platform
Théo Hoffenberg | Elodie Segrestan

Reverso Documents is a widely-adopted translation and post-editing platform that combines advanced machine translation with extensive document format support and layout preservation capabilities. The system features AI-based rephrasing, bilingual dictionaries, and translation memory integration, enabling both professional translators and general users to work efficiently with complex documents. Used by millions globally, it provides API access for workflow integration and batch processing. The upcoming 2025 release will introduce LLM-based translation with customizable settings, allowing for enhanced control over translation outputs while maintaining document structure and translation quality.

pdf bib
eSTÓR: Curating Irish Datasets for Machine Translation
Abigail Walsh | Órla Ní Loinsigh | Jane Adkins | Ornait O’Connell | Mark Andrade | Teresa Clifford | Federico Gaspari | Jane Dunne | Brian Davis

Minority languages such as Irish are massively under-resourced, particularly in terms of high-quality domain-relevant data, limiting the capabilities of machine translation (MT) engines, even those integrating large language models (LLMs). The eSTÓR project, described in this paper, focuses on the collection and curation of high-quality Irish text data for diverse domains.

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL)

pdf bib
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL)
María Isabel Rivas Ginel | Patrick Cadwell | Paolo Canavese | Silvia Hansen-Schirra | Martin Kappus | Anna Matamala | Will Noonan

pdf bib
Leveraging Large Language Models for Joint Linguistic and Technical Accessibility Improvement: A Case Study on University Webpages
Pierrette Bouillon | Johanna Gerlach | Raphael Rubino

The aim of the study presented in this paper is to investigate whether Large Language Models can be leveraged to translate French content from existing websites into B1-level simplified versions and to integrate them into an accessible HTML structure. We design a CMS-agnostic approach to webpage accessibility improvement based on prompt engineering and apply it to Geneva University webpages. We conduct several automatic and manual evaluations to measure the accessibility improvement reached by several LLMs with various prompts in a zero-shot setting. Results show that not all LLMs are suitable for the task, and a large disparity is observed among the results reached by different prompts. Manual evaluation carried out by a dyslexic crowd shows that some LLMs could produce more accessible websites and improve access to information.

pdf bib
How Artificial Intelligence can help in the Easy-to-Read Adaptation of Numerical Expressions in Spanish
Mari Carmen Suárez-Figueroa | Alejandro Muñoz-Navarro | Isam Diab

Numerical expressions, specifically fractions and percentages, can hinder reading comprehension for different groups of the population, including persons with cognitive disabilities. To facilitate reading comprehension, the Easy-to-Read (E2R) Methodology, created to achieve so-called cognitive accessibility, recommends avoiding fractions and percentages. If it is necessary to include them, an equivalent or an explanation should be provided. In order to help people who have difficulties in reading comprehension when dealing with fractions and percentages, we have developed an initial method for automatically adapting numerical expressions in Spanish. This method is based on (a) Artificial Intelligence (AI) methods and techniques and (b) the E2R guidelines and recommendations. In addition, the method has been implemented as a web application. To situate our research in the context of responsible AI, we followed the human-centred design approach called participatory design. In this regard, we involved people with cognitive disabilities in order to (a) reinforce the adaptations provided by E2R experts and included in our method, and (b) evaluate our application to automatically adapt numerical expressions following an E2R approach. Moreover, this method can be integrated into institutional procedures, such as those of university administrations and public organisations, to enhance the accessibility of official documents and educational materials.

pdf bib
Large Language Models Applied to Controlled Natural Languages in Communicating Diabetes Therapies
Federica Vezzani | Sara Vecchiato | Elena Frattolin

The aim of this exploratory study is to test the possibility of enhancing the quality of institutional communication related to diabetes self-treatment by switching from manual to prompt-based writing. The study proposes an investigation into the use of prompts applied to controlled natural language, particularly in Italian, French and English. Starting from a corpus of three comparable texts concerning the so-called Rule of 15, a reformulation is undertaken in accordance with the principles of controlled natural languages. Feedback will be gathered through a Likert scale questionnaire and a comprehension test administered to anonymous volunteers.

pdf bib
Simplifying Lithuanian text into Easy-to-Read language using large language models
Simona Kuoraitė | Valentas Gružauskas

This paper explores the task of simplifying Lithuanian text into Easy-to-Read language. Easy-to-Read language is text written in short, clear sentences and simple words, adapted for people with intellectual disabilities or limited language skills. The aim of this work is to investigate how the large language model Lt-Llama-2-7b-hf, pre-trained on Lithuanian language data, can be adapted to the task of simplifying Lithuanian texts into Easy-to-Read language. To achieve this goal, specialized datasets were developed to fine-tune the model, and experiments were carried out. The model was tested by presenting the texts in their original form and with a prompt adapted to the task. The results were evaluated using the SARI metric for assessing the quality of simplified texts and a qualitative evaluation of the large language model. The results show that the fine-tuned model sometimes simplifies text better than a non-fine-tuned model, but that a larger and more extensive dataset would be needed to achieve significant results, and that more research should be carried out on fine-tuning the model for this task.

pdf bib
ChatGPT and Mistral as a tool for intralingual translation into Easy French
Julia Degenhardt

FALC (Facile à Lire et à Comprendre) is a simplified variety of French designed to enhance text comprehensibility and accessibility. Despite its societal benefits, the availability of FALC texts remains limited due to the costly human translation process. This study explores the potential of LLMs, specifically ChatGPT and Mistral, as tools for automatic intralingual translation. The AI-generated translations of standard French texts on sexual health are compared to human-translated versions. Using a mixed-methods approach, the study evaluates content accuracy, readability, and syntactic complexity.

pdf bib
Simplifying healthcare communication: Evaluating AI-driven plain language editing of informed consent forms
Vicent Briva-Iglesias | Isabel Peñuelas Gil

Clear communication between patients and healthcare providers is crucial, particularly in informed consent forms (ICFs), which are often written in complex, technical language. This paper explores the effectiveness of generative artificial intelligence (AI) for simplifying ICFs into Plain Language (PL), aiming to enhance patient comprehension and informed decision-making. Using a corpus of 100 cancer-related ICFs, two distinct prompt engineering strategies (Simple AI Edit and Complex AI Edit) were evaluated through readability metrics: Flesch Reading Ease, Gunning Fog Index, and SMOG Index. Statistical analyses revealed statistically significant improvements in readability for AI-simplified texts compared to original documents. Interestingly, the Simple AI Edit strategy consistently outperformed the Complex AI Edit across all metrics. These findings suggest that minimalistic prompt strategies may be optimal, democratizing AI-driven text simplification in healthcare by requiring less expertise and resources. The study underscores the potential for AI to significantly improve patient-provider communication, highlighting future research directions for qualitative assessments and multilingual applications.

pdf bib
Translating Easy Language administrative texts: a quantitative analysis of DeepL’s performance from German into Italian using a bilingual corpus
Christiane Maaß | Chiara Fioravanti

This study evaluates the performance of DeepL as an AI-based translation engine, in translating German Easy Language Texts into Italian. The evaluation is based on a corpus of 26 German fact sheets and their Italian human translations. The results show that DeepL’s translations exhibit significant errors in terminology, accuracy, and language conventions. The machine-translated texts often lack consistency in terminology, and the use of technical or unfamiliar words is not adapted to the difficulty level of the target language. Furthermore, the translations tend to normalize the texts towards standard administrative language, making them less accessible. The study highlights the need for human post-editing to ensure both accuracy and suitability of the translated texts. The findings of this study will help identify where to prioritize post-editing efforts and facilitate comparisons with the results obtained from other artificial intelligence tools used for interlingual translation of Easy Language texts in the administrative domain.

pdf bib
Do professionally adapted texts follow existing Easy-to-Understand (E2U) language guidelines? A quantitative analysis of two professionally adapted corpora
Andreea Deleanu | Constantin Orăsan | Shenbin Qian | Anastasiia Bezobrazova | Sabine Braun

Easy-to-Understand (E2U) language varieties have been recognized by the UN Convention on the Rights of Persons with Disabilities as a means to prevent communicative exclusion of those facing cognitive barriers and guarantee the fundamental right to Accessible Communication. However, guidance on what makes language ‘easier to understand’ is still fragmented and vague, leading practitioners to rely on their individual expertise. For this reason, this article presents a quantitative corpus analysis to further understand which features of E2U language can more effectively improve verbal comprehension according to professional practice. This is achieved by analysing two parallel corpora of standard and professionally adapted E2U articles to identify adaptation practices implemented according to, in spite of, or in addition to official E2U guidelines (Deleanu et al., 2024). The results stemming from the corpus analysis provide insight into the most effective adaptation strategies for reducing complexity in verbal discourse. This article will present the methods and results of the corpus analysis.

pdf bib
Quantifying word complexity for Leichte Sprache: A computational metric and its psycholinguistic validation
Umesh Patil | Jesus Calvillo | Sol Lago | Anne-Kathrin Schumann

Leichte Sprache (Easy Language or Easy German) is a strongly simplified version of German geared toward a target group with limited language proficiency. In Germany, public bodies are required to provide information in Leichte Sprache. Unfortunately, Leichte Sprache rules are traditionally defined by non-linguists, they are not rooted in linguistic research, and they do not provide precise decision criteria or devices for measuring the complexity of linguistic structures (Bock and Pappert, 2023). For instance, one of the rules simply recommends the usage of simple rather than complex words. In this paper, we therefore propose a model to determine word complexity. We train an XGBoost model for classifying word complexity by leveraging word-level linguistic and corpus-level distributional features, frequency information from an in-house Leichte Sprache corpus, and human complexity annotations. We psycholinguistically validate our model by showing that it captures human word recognition times above and beyond traditional word-level predictors. Moreover, we discuss a number of practical applications of our classifier, such as the evaluation of AI-simplified text and detection of CEFR levels of words. To our knowledge, this is one of the first attempts to systematically quantify word complexity in the context of Leichte Sprache and to link it directly to real-time word processing.

pdf bib
Democracy Made Easy: Simplifying Complex Topics to Enable Democratic Participation
Nouran Khallaf | Stefan Bott | Carlo Eugeni | John O’Flaherty | Serge Sharoff | Horacio Saggion

Many people are excluded from democratic deliberation because the language used in this context may be too difficult for them to understand. Our iDEM project aims at lowering existing linguistic barriers in deliberative processes by developing technology to facilitate the translation of complicated text into easy-to-read formats that are more suitable for many people. In this paper, we describe classification experiments for detecting different types of difficulties that should be amended in order to make texts easier to understand. We focus on a lexical simplification system that achieves state-of-the-art results using a free, open-weight Large Language Model for the Romance languages covered by the iDEM project. Moreover, we introduce a sentence segmentation system that learns to split long sentences based on training data. We also describe the iDEM mobile app, which will make our technology available as a service for end-users in our target populations.


pdf (full)
bib (full)
Proceedings of the Third International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)

pdf bib
Proceedings of the Third International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)
Dimitar Shterionov | Mirella De Sisto | Bram Vanroy | Vincent Vandeghinste | Victoria Nyst | Myriam Vermeerbergen | Floris Roelofsen | Lisa Lepp | Irene Strasly

pdf bib
Pose-Based Sign Language Appearance Transfer
Amit Moryossef | Gerard Sant | Zifan Jiang

We introduce a method for transferring the signer’s appearance in sign language skeletal poses while preserving the sign content. Using estimated poses, we transfer the appearance of one signer to another, maintaining natural movements and transitions. This approach improves pose-based rendering and sign stitching while obfuscating identity. Our experiments show that while the method reduces signer identification accuracy, it slightly harms sign recognition performance, highlighting a tradeoff between privacy and utility.

pdf bib
Spontaneous Catalan Sign Language Recognition: Data Acquisition and Classification
Naiara Garmendia | Horacio Saggion | Euan McGill

This work presents the first investigation into Spontaneous Isolated Sign Language Recognition for Catalan Sign Language (LSC). Our work is grounded in a dataset of signs and their glosses derived from a corpus of spontaneous dialogues and monologues. The recognition model is based on a Multi-Scale Graph Convolutional network fitted to our data. Results are promising: several signs are recognized with a high level of accuracy, and the model achieves an average accuracy of 71% on the top 5 predicted classes out of a total of 105 available. An interactive interface with experimental results is also presented. The data and software are made available to the research community.

pdf bib
User Involvement in the Research and Development Life Cycle of Sign Language Machine Translation Systems
Lisa Lepp | Dimitar Shterionov | Mirella De Sisto

Machine translation (MT) has evolved rapidly over the last 70 years thanks to advances in processing technology and methodologies, as well as ever-increasing volumes of data. This trend is observed in the context of MT for spoken languages. However, when it comes to sign language (SL) translation technologies, progress is much slower; SLMT is still in its infancy, with limited applications. One of the main factors behind this setback is the lack of effective, respectful and fair user involvement across the different phases of the research and development of SLMT. We present a meta-review of 111 articles on SLMT from the perspective of user involvement. Our analysis investigates which users are involved and what tasks they assume in the first four phases of MT research: (i) problem definition, (ii) dataset construction, (iii) model design and training, and (iv) model validation and evaluation. We find that users have primarily been involved as data creators and monitors, as well as evaluators. We assess that effective co-creation, as defined in Lepp et al. (2025), has not been performed, and conclude with recommendations for improving the MT research and development landscape from a co-creative perspective.

pdf bib
PaSCo1: A Parallel Video-SiGML Swiss French Sign Language Corpus in Medical Domain
Bastien David | Pierrette Bouillon | Jonathan Mutal | Irene Strasly | Johanna Gerlach | Hervé Spechbach

This article introduces PaSCo1, a parallel sign language translation corpus developed as part of the BabelDr project, an automatic speech translation system for medical triage. PaSCo1 aims to make a set of medical data available in Swiss French Sign Language (LSF-CH) in the form of both videos signed by a human and their descriptions in the G-SiGML mark-up language. We describe the origins of the corpus within the BabelDr project, as well as the methodology used to create the videos and generate the G-SiGML code using the SiGLA platform. The resulting FAIR corpus comprises 2,031 medical questions and instructions in the form of videos and G-SiGML code.


pdf (full)
bib (full)
Proceedings of the Second Workshop on Creative-text Translation and Technology (CTT)

pdf bib
Proceedings of the Second Workshop on Creative-text Translation and Technology (CTT)
Bram Vanroy | Marie-Aude Lefer | Lieve Macken | Paola Ruffo | Ana Guerberof Arenas | Damien Hansen

pdf bib
The Role of Translation Workflows in Overcoming Translation Difficulties: A Comparative Analysis of Human and Machine Translation (Post-Editing) Approaches
Lieve Macken | Paola Ruffo | Joke Daems

This study investigates the impact of different translation workflows and underlying machine translation technologies on the translation strategies used in literary translations. We compare human translation, translation within a computer-assisted translation (CAT) tool, and machine translation post-editing (MTPE), alongside neural machine translation (NMT) and large language models (LLMs). Using three short stories translated from English into Dutch, we annotated translation difficulties and strategies employed to overcome them. Our analysis reveals differences in translation solutions across modalities, highlighting the influence of technology on the final translation. The findings suggest that while MTPE tends to produce more literal translations, human translators and CAT tools exhibit greater creativity and employ more non-literal translation strategies. Additionally, LLMs reduced the number of literal translation solutions compared to traditional NMT systems. While our study provides valuable insights, it is limited by the use of only three texts and a single language pair. Further research is needed to explore these dynamics across a broader range of texts and languages, to better understand the full impact of translation workflows and technologies on literary translation.

pdf bib
Does the perceived source of a translation (NMT vs. HT) impact student revision quality for news and literary texts?
Xiaoye Li | Joke Daems

With quality improvements in neural machine translation (NMT), scholars have argued that human translation revision and MT post-editing are becoming more alike, which would have implications for translator training. This study contributes to this growing body of work by exploring the ability of student translators (ZH-EN) to distinguish between NMT and human translation (HT) for news text and literary text and analyses how text type and student perceptions influence their subsequent revision process. We found that participants were reasonably adept at distinguishing between NMT and HT, particularly for literary texts. Participants’ revision quality was dependent on the text type as well as the perceived source of translation. The findings also highlight student translators’ limited competence in revision and post-editing, emphasizing the need to integrate NMT, revision, and post-editing into translation training programmes.

pdf bib
Effects of Domain-adapted Machine Translation on the Machine Translation User Experience of Video Game Translators
Judith Brenner | Julia Othlinghaus-Wulhorst

In this empirical study we examine three different translation modes with varying involvement of machine translation (MT) post-editing (PE) when translating video game texts. The three translation modes are translation from scratch without MT, full PE of MT output in a static way, and flexible PE as a combination of translation from scratch and post-editing of only those machine-translated sentences deemed useful by the translator. Data generation took place at the home offices of freelance game translators. In a mixed-methods approach, quantitative data was generated through keylogging, eye tracking, error annotation, and user experience questionnaires as well as qualitative data through interviews. Results show a negative perception of PE and suggest that translators’ user experience is positive when translating from scratch, neutral with a positive tendency when doing flexible PE of domain-adapted MT output and negative with static PE of generic MT output.

pdf bib
Fine-tuning and evaluation of NMT models for literary texts using RomCro v.2.0
Bojana Mikelenić | Antoni Oliver | Sergi Àlvarez Vidal

This paper explores the fine-tuning and evaluation of neural machine translation (NMT) models for literary texts using RomCro v.2.0, an expanded multilingual and multidirectional parallel corpus. RomCro v.2.0 is based on RomCro v.1.0 but includes additional literary works, as well as texts in Catalan, making it a valuable resource for improving MT in underrepresented language pairs. Given the challenges of literary translation, where style, narrative voice, and cultural nuances must be preserved, fine-tuning on high-quality domain-specific data is essential for enhancing MT performance. We fine-tune existing NMT models with RomCro v.2.0 and evaluate their performance for six different language combinations using automatic metrics, and for Spanish-Croatian and French-Catalan using manual evaluation. Results indicate that fine-tuned models outperform general-purpose systems, achieving greater fluency and stylistic coherence. These findings support the effectiveness of corpus-driven fine-tuning for literary translation and highlight the importance of curated, high-quality corpora.

pdf bib
Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children’s Literature Translation
Delu Kong | Lieve Macken

This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in children’s literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative-text-translation-specific (CTT-specific) feature set, which captures repetition, rhyme, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that, for generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of the 1-word-gram 一样, while NMTs and LLMs show significant variation in descriptive word usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in feature distribution, aligning more closely with HTs in stylistic characteristics and demonstrating the potential of LLMs in CLT.


pdf (full)
bib (full)
Proceedings of the 3rd Workshop on Gender-Inclusive Translation Technologies (GITT 2025)

pdf bib
Proceedings of the 3rd Workshop on Gender-Inclusive Translation Technologies (GITT 2025)
Janiça Hackenbuchner | Luisa Bentivogli | Joke Daems | Chiara Manna | Beatrice Savoldi | Eva Vanmassenhove

pdf bib
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation
Chiara Manna | Afra Alishahi | Frédéric Blain | Eva Vanmassenhove

While gender bias in modern Neural Machine Translation (NMT) systems has received much attention, the traditional evaluation metrics for these systems do not fully capture the extent to which models integrate contextual gender cues. We propose a novel evaluation metric called Minimal Pair Accuracy (MPA), which measures the reliance of models on gender cues for gender disambiguation. Evaluating a number of NMT models with this metric, we show that in most cases they ignore available gender cues in favour of a (statistically) stereotypical gender interpretation. We further show that in anti-stereotypical cases, these models tend to take male gender cues into account more consistently while ignoring the female cues. Finally, we analyze the attention head weights in the encoder component of these models and show that while all models encode gender information to some extent, male gender cues elicit a more diffused response compared to the more concentrated and specialized responses to female gender cues.

pdf bib
Gender Bias in English-to-Greek Machine Translation
Eleni Gkovedarou | Joke Daems | Luna De Bruyne

As the demand for inclusive language increases, concern has grown over the susceptibility of machine translation (MT) systems to reinforce gender stereotypes. This study investigates gender bias in two commercial MT systems, Google Translate and DeepL, focusing on the understudied English-to-Greek language pair. We address three aspects of gender bias: i) male bias, ii) occupational stereotyping, and iii) errors in anti-stereotypical translations. Additionally, we explore the potential of prompted GPT-4o as a bias mitigation tool that provides both gender-explicit and gender-neutral alternatives when necessary. To achieve this, we introduce GendEL, a manually crafted bilingual dataset of 240 gender-ambiguous and unambiguous sentences that feature stereotypical occupational nouns and adjectives. We find persistent gender bias in translations by both MT systems; while they perform well in cases where gender is explicitly defined, with DeepL outperforming both Google Translate and GPT-4o in feminine gender-unambiguous sentences, they are far from producing gender-inclusive or neutral translations when the gender is unspecified. GPT-4o shows promise, generating appropriate gendered and neutral alternatives for most ambiguous cases, though residual biases remain evident. As one of the first comprehensive studies on gender bias in English-to-Greek MT, we provide both our data and code at [github link].

pdf bib
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili | Beatrice Savoldi | Matteo Negri | Luisa Bentivogli

Gender-neutral translation (GNT) aims to avoid expressing the gender of human referents when the source text lacks explicit cues about the gender of those referents. Evaluating GNT automatically is particularly challenging, with current solutions being limited to monolingual classifiers. Such solutions are not ideal because they do not factor in the source sentence and require dedicated data and fine-tuning to scale to new languages. In this work, we address such limitations by investigating the use of large language models (LLMs) as evaluators of GNT. Specifically, we explore two prompting approaches: one in which LLMs generate sentence-level assessments only, and another—akin to a chain-of-thought approach—where they first produce detailed phrase-level annotations before a sentence-level judgment. Through extensive experiments on multiple languages with five models, both open and proprietary, we show that LLMs can serve as evaluators of GNT. Moreover, we find that prompting for phrase-level annotations before sentence-level assessments consistently improves the accuracy of all models, providing a better and more scalable alternative to current solutions.

pdf bib
Did I (she) or I (he) buy this? Or rather I (she/he)? Towards first-person gender neutral translation by LLMs
Maja Popović | Ekaterina Lapshinova-Koltunski | Anastasiia Göldner

This paper presents an analysis of gender in first-person mentions translated from English into two Slavic languages with the help of three LLMs and two different prompts. We explore whether LLMs are able to generate Amazon product reviews with gender-neutral first-person forms. Apart from the overall question about the ability to produce gender-neutral translations, we look into the impact of a prompt with a specific instruction intended to reduce the gender bias in LLM output translations. Our results show that although we are able to achieve a reduction in gender bias, our specific prompt also causes a number of errors. Analysing these emerging problems qualitatively, we formulate suggestions that could be helpful for developing better prompting strategies in future work on gender bias reduction.

pdf bib
Gender-Neutral Machine Translation Strategies in Practice
Hillary Dawkins | Isar Nejadgholi | Chi-Kiu Lo

Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender neutral translation using specific strategies, depending on the target language.

pdf bib
Gender-inclusive language and machine translation: from Spanish into Italian
Antonella Bove

Gender-inclusive language is a discursive practice that introduces new forms and strategies to make women and different non-binary gender identities more visible. Spanish uses gender doublets (los niños y las niñas, los/as candidatos/as), the neomorpheme -e, and typographic signs such as @ and x. Similarly, Italian employs gender doublets (i bambini e le bambine, i/le candidati/e), the schwa (ə) as a neomorpheme, and the asterisk (*) as a typographic sign. Strategies like gender doublets and the @ sign aim at making women visible from a binary perspective; the others are intended to give visibility to non-binary gender identities as well (Escandell-Vidal 2020, Giusti 2022). Without a clear and agreed standard, inclusive translation poses a significant challenge and a great social responsibility for translation professionals. Hence, it is crucial to study and evaluate the quality of the outputs generated by machine translation systems (Kornacki & Pietrzak 2025, Pfalzgraf 2024). This paper contributes to the understanding of this phenomenon by analyzing the interaction between artificial intelligence systems and Spanish inclusive strategies in translation into Italian within an augmented translation perspective (Kornacki & Pietrzak 2025). The methodology involved three main steps: data collection, annotation, and analysis. Academic texts originally written in Spanish were gathered, from which specific segments were extracted. Segment-level analysis allowed for the creation of a more diverse corpus. In total, 20 instances were collected for each inclusive language strategy examined: fully split forms, half-split forms, the neomorpheme -e, and the typographic signs @ and x. These segments were then translated using four artificial intelligence systems: two neural machine translation systems (DeepL and Google Translate) and two generative AI systems (ChatGPT and Gemini).

pdf bib
Evaluating Gender Bias in Dutch NLP: Insights from RobBERT-2023 and the HONEST Framework
Marie Dewulf

This study investigates gender bias in the Dutch RobBERT-2023 language model using an adapted version of the HONEST framework, which assesses harmful sentence completions. By translating and expanding HONEST templates to include non-binary and gender-neutral language, we systematically evaluate whether RobBERT-2023 exhibits biased or harmful outputs across gender identities. Our findings reveal that while the model’s overall bias score is relatively low, non-binary identities are disproportionately affected by derogatory language.


pdf (full)
bib (full)
Proceedings of the Eleventh Workshop on Patent and Scientific Literature Translation (PSLT 2025)

pdf bib
Proceedings of the Eleventh Workshop on Patent and Scientific Literature Translation (PSLT 2025)
Takashi Tsunakawa | Katsuhito Sudoh | Isao Goto

pdf bib
GenAIese - A Comprehensive Comparison of GPT-4o and DeepSeek-V3 for English-to-Chinese Academic Translation
Longhui Zou | Ke Li | Joshua Lamerton | Mehdi Mirzapour

This study investigates the translation performance of two large language models, GPT-4o and DeepSeek-V3, in translating English academic papers on language, culture, and literature into Chinese at the discourse level. Using a corpus of 11 academic texts totaling 3,498 sentences, we evaluated translation quality through automatic metrics (COMET-KIWI), lexical diversity indicators, and syntactic complexity measures. Our findings reveal an interesting contrast: while DeepSeek-V3 achieves higher overall quality scores, GPT-4o produces translations with consistently greater lexical richness (higher type-token ratio, standardized TTR, average sentence length, and word entropy) and syntactic complexity across all five measured metrics, namely the Incomplete Dependency Theory metric (IDT), the Dependency Locality Theory metric (DLT), the combined IDT+DLT metric, Left-Embeddedness (LE), and Nested Nouns Distance (NND). Particularly notable are GPT-4o’s higher scores on the Left-Embeddedness and Nested Nouns Distance metrics, which are specifically relevant to Chinese linguistic patterns. The divergence between automatic quality estimation and linguistic complexity metrics highlights the multifaceted nature of translation quality assessment.

pdf bib
Tailoring Machine Translation for Scientific Literature through Topic Filtering and Fuzzy Match Augmentation
Thomas Moerman | Tom Vanallemeersch | Sara Szoc | Arda Tezcan

To enhance the accessibility of scientific literature in multiple languages and facilitate the exchange of information among scholars and a wider audience, high-performing specialized machine translation (MT) engines are needed. However, this requires efficient filtering and the use of domain-specific data. In this study, we investigate whether translation quality improves when training data is increased through topic filtering and used more efficiently by exploiting fuzzy matches (i.e., translations similar to a given input; FMs). We apply these techniques both to sequence-to-sequence MT models and to off-the-shelf multilingual large language models (LLMs) in three scientific disciplines. Our results suggest that the combination of topic filtering and FM augmentation is an effective strategy for training neural machine translation (NMT) models from scratch, not only surpassing baseline NMT models but also delivering better translation performance than smaller LLMs (in terms of parameter count). Furthermore, we find that although FM augmentation through in-context learning generally improves LLM translation performance, limited domain-specific datasets can yield results comparable to those achieved with additional multi-domain datasets.