Vilém Zouhar


2025

Are Large Language Models for Education Reliable Across Languages?
Vansh Gupta | Sankalan Pal Chowdhury | Vilém Zouhar | Donya Rooein | Mrinmaya Sachan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. However, at least some models are able to more or less maintain their levels of performance across all languages. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.

Biased Tales: Cultural and Topic Bias in Generating Children’s Stories
Donya Rooein | Vilém Zouhar | Debora Nozza | Dirk Hovy
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists’ attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias in making creative AI use more equitable and diverse.

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Gabriele Sarti | Vilém Zouhar | Malvina Nissim | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Chenfei Xiong | Jingwei Ni | Yu Fan | Vilém Zouhar | Donya Rooein | Lorena Calvo-Bartolomé | Alexander Miserlis Hoyle | Zhijing Jin | Mrinmaya Sachan | Markus Leippold | Dirk Hovy | Mennatallah El-Assady | Elliott Ash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists users in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses demonstrate the effectiveness of Co-DETECT.

CafGa: Customizing Feature Attributions to Explain Language Models
Alan David Boyle | Furui Cheng | Vilém Zouhar | Mennatallah El-Assady
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Feature attribution methods, such as SHAP and LIME, explain machine learning model predictions by quantifying the influence of each input component. When applying feature attributions to explain language models, a basic question is defining the interpretable components. Traditional feature attribution methods commonly treat individual words as atomic units. This is highly computationally inefficient for long-form text and fails to capture semantic information that spans multiple words. To address this, we present CafGa, an interactive tool for generating and evaluating feature attribution explanations at customizable granularities. CafGa supports customized segmentation with user interaction and visualizes the deletion and insertion curves for explanation assessments. Through a user study involving participants of various expertise, we confirm CafGa’s usefulness, particularly among LLM practitioners. Explanations created using CafGa were also perceived as more useful compared to those generated by two fully automatic baseline methods: PartitionSHAP and MExGen, suggesting the effectiveness of the system.

Findings of the IWSLT 2025 Evaluation Campaign
Idris Abdulmumin | Victor Agostinelli | Tanel Alumäe | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Fethi Bougares | Roldano Cattoni | Mauro Cettolo | Lizhong Chen | William Chen | Raj Dabre | Yannick Estève | Marcello Federico | Mark Fishel | Marco Gaido | Dávid Javorský | Marek Kasztelnik | Fortuné Kponou | Mateusz Krubiński | Tsz Kin Lam | Danni Liu | Evgeny Matusov | Chandresh Kumar Maurya | John P. McCrae | Salima Mdhaffar | Yasmin Moslem | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Atul Kr. Ojha | John E. Ortega | Sara Papi | Pavel Pecina | Peter Polák | Piotr Połeć | Ashwin Sankar | Beatrice Savoldi | Nivedita Sethiya | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Brian Thompson | Marco Turchi | Alex Waibel | Patrick Wilken | Rodolfo Zevallos | Vilém Zouhar | Maike Züfle
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper presents the outcomes of the shared tasks conducted at the 22nd International Workshop on Spoken Language Translation (IWSLT). The workshop addressed seven critical challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, model compression, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks garnered significant participation, with 32 teams submitting their runs. The field’s growing importance is reflected in the increasing diversity of shared task organizers and contributors to this overview paper, representing a balanced mix of industrial and academic institutions. This broad participation demonstrates the rising prominence of spoken language translation in both research and practical applications.

A Bayesian Optimization Approach to Machine Translation Reranking
Julius Cheng | Maike Züfle | Vilém Zouhar | Andreas Vlachos
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Reranking, or scoring a list of prediction candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate, remains a simple and effective method for improving prediction quality. However, reranking with high-quality scoring models can add substantial computational cost to the translation pipeline, which we address in this work by framing list reranking as a Bayesian optimization (BayesOpt) problem over the candidate list, where unknown scores are modeled with a Gaussian process. This algorithm scores candidates iteratively, choosing next candidates by balancing between exploration, choosing to score those that differ from candidates already scored, and exploitation, choosing to score those that resemble high-scoring candidates. This procedure finds high-scoring candidates while scoring only a fraction of the candidate list; given candidate lists of 200 random samples (before deduplication), our method achieves the same CometKiwi score using only 70 scoring evaluations on average compared to scoring a random subset of 180 candidates. We also propose multi-fidelity BayesOpt for list reranking, where scores obtained from a noisier but cheaper proxy scoring model are incorporated into the search process. We show that well-trained distilled proxy scorers can further improve the performance of BayesOpt.

AI-Assisted Human Evaluation of Machine Translation
Vilém Zouhar | Tom Kocmi | Mrinmaya Sachan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi | Ekaterina Artemova | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Konstantin Dranch | Anton Dvorkovich | Sergey Dukanov | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Marzena Karpinska | Philipp Koehn | Howard Lakougna | Jessica Lundin | Christof Monz | Kenton Murray | Masaaki Nagata | Stefano Perrella | Lorenzo Proietti | Martin Popel | Maja Popović | Parker Riley | Mariya Shmatova | Steinthór Steingrímsson | Lisa Yankovskaya | Vilém Zouhar
Proceedings of the Tenth Conference on Machine Translation

This paper presents the results of the General Machine Translation Task organized as part of the 2025 Conference on Machine Translation (WMT). Participants were invited to build systems for any of 30 language pairs. For half of these pairs, we conducted a human evaluation on test sets spanning four to five different domains. We evaluated 60 systems in total: 36 submitted by participants and 24 for which we collected translations from large language models (LLMs) and popular online translation providers. This year, we focused on creating challenging test sets by developing a difficulty sampling technique and using more complex source data. We evaluated system outputs with professional annotators using the Error Span Annotation (ESA) protocol, except for two language pairs, for which we used Multidimensional Quality Metrics (MQM) instead. We continued the trend of increasingly moving towards document-level translation, providing the source texts as whole documents containing multiple paragraphs.

Findings of the WMT25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation
Tom Kocmi | Sweta Agrawal | Ekaterina Artemova | Eleftherios Avramidis | Eleftheria Briakou | Pinzhen Chen | Marzieh Fadaee | Markus Freitag | Roman Grundkiewicz | Yupeng Hou | Philipp Koehn | Julia Kreutzer | Saab Mansour | Stefano Perrella | Lorenzo Proietti | Parker Riley | Eduardo Sánchez | Patricia Schmidtova | Mariya Shmatova | Vilém Zouhar
Proceedings of the Tenth Conference on Machine Translation

The WMT25 Multilingual Instruction Shared Task (MIST) introduces a benchmark to evaluate large language models (LLMs) across 30 languages. The benchmark covers five types of problems: machine translation, linguistic reasoning, open-ended generation, cross-lingual summarization, and LLM-as-a-judge. We provide automatic evaluation and collect human annotations, which highlight the limitations of automatic evaluation and allow further research into metric meta-evaluation. We run a diverse set of open- and closed-weight LLMs on our benchmark, providing a broad assessment of the multilingual capabilities of current LLMs. Results highlight substantial variation across sub-tasks and languages, revealing persistent challenges in reasoning, cross-lingual generation, and evaluation reliability. This work establishes a standardized framework for measuring future progress in multilingual LLM development.

Findings of the WMT25 Shared Task on Automated Translation Evaluation Systems: Linguistic Diversity is Challenging and References Still Help
Alon Lavie | Greg Hanneman | Sweta Agrawal | Diptesh Kanojia | Chi-Kiu Lo | Vilém Zouhar | Frederic Blain | Chrysoula Zerva | Eleftherios Avramidis | Sourabh Deoghare | Archchana Sindhujan | Jiayi Wang | David Ifeoluwa Adelani | Brian Thompson | Tom Kocmi | Markus Freitag | Daniel Deutsch
Proceedings of the Tenth Conference on Machine Translation

The WMT25 Shared Task on Automated Translation Evaluation Systems evaluates metrics and quality estimation systems that assess the quality of language translation systems. This task unifies and consolidates the separate WMT shared tasks on Machine Translation Evaluation Metrics and Quality Estimation from previous years. Our primary goal is to encourage the development and assessment of new state-of-the-art translation quality evaluation systems. The shared task this year consisted of three subtasks: (1) segment-level quality score prediction, (2) span-level translation error annotation, and (3) quality-informed segment-level error correction. The evaluation data for the shared task were provided by the General MT shared task and were complemented by “challenge sets” from both the organizers and participants. Task 1 results indicate the strong performance of large LLMs at the system level, while reference-based baseline metrics outperform LLMs at the segment level. Task 2 results indicate that accurate error detection and balancing precision and recall are persistent challenges. Task 3 results show that minimal editing is challenging even when informed by quality indicators. Robustness across the broad diversity of languages remains a major challenge across all three subtasks.

Findings of the WMT25 Terminology Translation Task: Terminology is Useful Especially for Good MTs
Kirill Semenov | Xu Huang | Vilém Zouhar | Nathaniel Berger | Dawei Zhu | Arturo Oncevay | Pinzhen Chen
Proceedings of the Tenth Conference on Machine Translation

The WMT25 Terminology Translation Task releases new resources in high-stakes domains and investigates the capabilities of translation systems to accurately and consistently translate specialized terms. This year, we feature new domain and language coverage over previous editions, introducing two distinct tracks: (1) sentence-level translation in the information technology domain for English→German, English→Russian, and English→Spanish, and (2) document-level translation in the finance domain for English↔Traditional Chinese with a document-level one-to-many dictionary. Participants are challenged to translate texts under three modes: no terminology, proper terminology, and random terminology, allowing for a causal analysis of terminology utility. Evaluation combines overall quality, terminology accuracy, and terminology consistency. This shared task attracted broad participation, with 13 teams submitting 20 systems in Track 1 and 4 teams participating in Track 2. The results show that providing proper terminology consistently boosts both overall translation quality and term accuracy, whereas reliance on random terminology yields smaller gains. Despite the near-saturation of sentence-level benchmarks, document-level finance translation still falls short, indicating an urgent need for long-form evaluation and more robust metrics tailored to professional domains.

COMET-poly: Machine Translation Metric Grounded in Other Candidates
Maike Züfle | Vilém Zouhar | Tu Anh Dinh | Felipe Maia Polo | Jan Niehues | Mrinmaya Sachan
Proceedings of the Tenth Conference on Machine Translation

Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall’s tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall’s tau-b correlation). We release our models publicly.

2024

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Tom Kocmi | Vilém Zouhar | Christian Federmann | Matt Post
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the “dynamic range” of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask “what point difference x in metric y is required between two systems for humans to notice?”. We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values with regard to test set size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
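
The pairwise system accuracy mentioned in this abstract can be illustrated with a minimal sketch. It assumes that accuracy is the fraction of system pairs for which the metric and the human scores agree on which system is better, and it uses a simple minimum-delta threshold rather than whatever bucketing the paper applies. The function name, the `min_delta` parameter, and the toy scores are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores, min_delta=0.0):
    """Fraction of system pairs ranked the same way by metric and humans.

    Only pairs whose absolute metric difference is at least `min_delta`
    are counted (hypothetical thresholding, for illustration only).
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(metric_scores), 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        if abs(metric_delta) < min_delta:
            continue  # pair is too close under the metric to be considered
        total += 1
        if metric_delta * human_delta > 0:
            agree += 1  # both deltas have the same sign
    return agree / total if total else float("nan")


# Toy scores for three hypothetical systems.
metric = {"A": 30.0, "B": 31.0, "C": 35.0}
human = {"A": 70.0, "B": 69.0, "C": 80.0}
print(pairwise_accuracy(metric, human))
print(pairwise_accuracy(metric, human, min_delta=2.0))
```

In this toy example the metric disagrees with humans only on the close pair A/B, so accuracy is 2/3 over all pairs and 1.0 once pairs with a metric delta below 2 points are excluded, which mirrors the idea that larger deltas are more trustworthy.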

How to Engage your Readers? Generating Guiding Questions to Promote Active Reading
Peng Cui | Vilém Zouhar | Xiaoyu Zhang | Mrinmaya Sachan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Using questions in written text is an effective strategy to enhance readability. However, what makes an active reading question good, what the linguistic role of these questions is, and what their impact on human reading is remain understudied. We introduce GuidingQ, a dataset of 10K in-text questions from textbooks and scientific articles. By analyzing the dataset, we present a comprehensive understanding of the use, distribution, and linguistic characteristics of these questions. Then, we explore various approaches to generate such questions using language models. Our results highlight the importance of capturing inter-question relationships and the challenge of question position identification in generating these questions. Finally, we conduct a human study to understand the implication of such questions on reading comprehension. We find that the generated questions are of high quality and are almost as effective as human-written questions in terms of improving readers’ memorization and comprehension.

Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
Vilém Zouhar | Shuoyang Ding | Anna Currey | Tatyana Badeka | Jenyuan Wang | Brian Thompson
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce a new, extensive multidimensional quality metrics (MQM) annotated dataset covering 11 language pairs in the biomedical domain. We use this dataset to investigate whether machine translation (MT) metrics which are fine-tuned on human-generated MT quality judgements are robust to domain shifts between training and inference. We find that fine-tuned metrics exhibit a substantial performance drop in the unseen domain scenario relative to both metrics that rely on the surface form and pre-trained metrics that are not fine-tuned on MT quality judgments.

Distributional Properties of Subword Regularization
Marco Cognetta | Vilém Zouhar | Naoaki Okazaki
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
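
To make the idea of uniformly sampling tokenizations concrete, here is a minimal sketch based on a generic dynamic-programming approach. It is not the algorithm from the paper, whose details are not given in the abstract. It assumes a tokenization is any split of a word into pieces that all belong to a given vocabulary; the function names and the toy vocabulary are hypothetical.

```python
import random


def count_segmentations(word, vocab):
    """ways[i] = number of ways to split word[i:] into vocabulary pieces."""
    n = len(word)
    ways = [0] * (n + 1)
    ways[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                ways[i] += ways[j]
    return ways


def sample_uniform(word, vocab, rng=random):
    """Draw one segmentation of `word` uniformly at random."""
    ways = count_segmentations(word, vocab)
    if ways[0] == 0:
        raise ValueError("word cannot be segmented with this vocabulary")
    pieces = []
    i = 0
    while i < len(word):
        # Candidate end positions that still allow completing the word.
        ends = [j for j in range(i + 1, len(word) + 1)
                if word[i:j] in vocab and ways[j] > 0]
        # Weighting each choice by the number of completions makes every
        # full segmentation equally likely (probability 1 / ways[0]).
        j = rng.choices(ends, weights=[ways[k] for k in ends], k=1)[0]
        pieces.append(word[i:j])
        i = j
    return pieces


toy_vocab = {"a", "b", "ab"}  # hypothetical vocabulary
print(count_segmentations("abab", toy_vocab)[0])
print(sample_uniform("abab", toy_vocab, random.Random(0)))
```

The contrast with dropout-style variants is that here the choice at each position is weighted by how many complete tokenizations remain reachable, so no single tokenization is favoured.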

Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar | Ondřej Bojar
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

PWESuite: Phonetic Word Embeddings and Tasks They Facilitate
Vilém Zouhar | Kalvin Chang | Chenxuan Cui | Nate B. Carlson | Nathaniel Romney Robinson | Mrinmaya Sachan | David R. Mortensen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research.

Two Counterexamples to Tokenization and the Noiseless Channel
Marco Cognetta | Vilém Zouhar | Sangwhan Moon | Naoaki Okazaki
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In Tokenization and the Noiseless Channel (Zouhar et al., 2023), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen. The Rényi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase Rényi efficiency while decreasing the downstream model performance. These counterexamples expose cases where Rényi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.

Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet
Tom Kocmi | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Anton Dvorkovich | Christian Federmann | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Marzena Karpinska | Philipp Koehn | Benjamin Marie | Christof Monz | Kenton Murray | Masaaki Nagata | Martin Popel | Maja Popović | Mariya Shmatova | Steinthór Steingrímsson | Vilém Zouhar
Proceedings of the Ninth Conference on Machine Translation

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).

Pitfalls and Outlooks in Using COMET
Vilém Zouhar | Pinzhen Chen | Tsz Kin Lam | Nikita Moghe | Barry Haddow
Proceedings of the Ninth Conference on Machine Translation

The COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable between papers or even technical setups and we put forward our perspective on fixing each issue. Furthermore, we release the sacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.

Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
Tom Kocmi | Vilém Zouhar | Eleftherios Avramidis | Roman Grundkiewicz | Marzena Karpinska | Maja Popović | Mrinmaya Sachan | Mariya Shmatova
Proceedings of the Ninth Conference on Machine Translation

High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

2023

Tokenization and the Noiseless Channel
Vilém Zouhar | Clara Meister | Juan Gastaldi | Li Du | Mrinmaya Sachan | Ryan Cotterell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) that low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82 in comparison to just -0.30 for compressed length.
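
The efficiency notion described in this abstract can be sketched in a few lines. This is a minimal illustration, assuming efficiency is the entropy of the empirical unigram distribution divided by the maximum entropy, taken here as the log of the number of observed token types (the paper may normalize by the full vocabulary size instead). The function names and the value alpha = 2.0 used in the example are illustrative assumptions, not settings from the paper.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha):
    """Rényi entropy in nats; alpha = 1 falls back to Shannon entropy."""
    probs = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1 - alpha)


def efficiency(tokens, alpha=1.0):
    """Entropy of the unigram distribution divided by the maximum entropy."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if len(probs) < 2:
        return float("nan")  # maximum entropy is zero, ratio is undefined
    return renyi_entropy(probs, alpha) / math.log(len(probs))


# Toy token streams: one uniform, one dominated by a single token.
uniform = ["a", "b", "c", "d"]
skewed = ["a"] * 97 + ["b", "c", "d"]
for name, stream in [("uniform", uniform), ("skewed", skewed)]:
    print(name, efficiency(stream, alpha=1.0), efficiency(stream, alpha=2.0))
```

A uniform distribution reaches efficiency 1 for any alpha, while for the skewed stream the Rényi variant with alpha above 1 drops further than the Shannon variant, reflecting the heavier penalty on very high-frequency tokens.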

Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar | Shehzaad Dhuliawala | Wangchunshu Zhou | Nico Daheim | Tom Kocmi | Yuchen Eleanor Jiang | Mrinmaya Sachan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME) where one predicts the automated metric scores also without the reference. We show that even without access to the reference, our model can estimate automated metrics (ρ = 60% for BLEU, ρ = 51% for other metrics) at the sentence-level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training from scratch (ρ = 20%).

pdf bib
A Diachronic Perspective on User Trust in AI under Uncertainty
Shehzaad Dhuliawala | Vilém Zouhar | Mennatallah El-Assady | Mrinmaya Sachan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In human-AI collaboration, users typically form a mental model of the AI system, which captures the user’s beliefs about when the system performs well and when it does not. The construction of this mental model is guided by both the system’s veracity and the system output presented to the user, e.g., the system’s confidence and an explanation for the prediction. However, modern NLP systems are seldom calibrated and are often confidently incorrect about their predictions, which violates users’ mental model and erodes their trust. In this work, we design a study where users bet on the correctness of an NLP system, and use it to study the evolution of user trust as a response to these trust-eroding events and how the user trust is rebuilt as a function of time after these events. We find that even a few highly inaccurate confidence estimation instances are enough to damage users’ trust in the system and performance, which does not easily recover over time. We further find that users are more forgiving to the NLP system if it is unconfidently correct rather than confidently incorrect, even though, from a game-theoretic perspective, their payoff is equivalent. Finally, we find that each user can entertain multiple mental models of the system based on the type of the question. These results highlight the importance of confidence calibration in developing user-centered NLP applications to avoid damaging user trust and compromising the collaboration performance.

pdf bib
Revisiting Automated Topic Model Evaluation with Large Language Models
Dominik Stammbach | Vilém Zouhar | Alexander Hoyle | Mrinmaya Sachan | Elliott Ash
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Topic models help us make sense of large text collections. Automatically evaluating their output and determining the optimal number of topics are both longstanding challenges, with no effective automated solutions to date. This paper proposes using large language models (LLMs) for these tasks. We find that LLMs appropriately assess the resulting topics, correlating more strongly with human judgments than existing automated metrics. However, the setup of the evaluation task is crucial: LLMs perform better on coherence ratings of word sets than on intrusion detection. We find that LLMs can also guide us towards a reasonable number of topics. In actual applications, topic models are typically used to answer a research question related to a collection of texts. We can incorporate this research question in the prompt to the LLM, which helps estimate the optimal number of topics.

pdf bib
Enhancing Textbooks with Visuals from the Web for Improved Learning
Janvijay Singh | Vilém Zouhar | Mrinmaya Sachan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Textbooks are one of the main mediums for delivering high-quality education to students. In particular, explanatory and illustrative visuals play a key role in retention, comprehension and general transfer of knowledge. However, many textbooks lack these interesting visuals to support student learning. In this paper, we investigate the effectiveness of vision-language models to automatically enhance textbooks with images from the web. We collect a dataset of e-textbooks in the math, science, social science and business domains. We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks, which we frame as a matching optimization problem. Through a crowd-sourced evaluation, we verify that (1) while the original textbook images are rated higher, automatically assigned ones are not far behind, and (2) the precise formulation of the optimization problem matters. We release the dataset of textbooks with an associated image bank to inspire further research in this intersectional area of computer vision and NLP for education.

pdf bib
A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar | Clara Meister | Juan Gastaldi | Li Du | Tim Vieira | Mrinmaya Sachan | Ryan Cotterell
Findings of the Association for Computational Linguistics: ACL 2023

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma}(1 - e^{-\sigma})$-approximation of an optimal merge sequence, where $\sigma$ is the total backward curvature with respect to the optimal merge sequence. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $O(NM)$ to $O(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
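For readers unfamiliar with the greedy procedure this abstract refers to, the following is a minimal sketch of the textbook version of BPE on a single character sequence: repeatedly merge the most frequent adjacent pair. It is not the paper's formalization or its faster implementation. The function names are hypothetical, and the tie-breaking rule (first pair seen wins) is an assumption.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def greedy_bpe(text, num_merges):
    """Naive greedy BPE: each of the (at most) num_merges steps rescans the sequence."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        # Counts adjacent pairs, including overlapping ones such as in "aaa".
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Assumption: ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        seq = apply_merge(seq, best)
    return seq, merges

tokens, merges = greedy_bpe("abababcab", num_merges=2)
print(tokens)  # ['abab', 'ab', 'c', 'ab']
print(merges)  # [('a', 'b'), ('ab', 'ab')]
```

Each merge step rescans the whole sequence, which is where the O(NM) cost of the naive version comes from.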

pdf bib
Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies
Kirill Semenov | Vilém Zouhar | Tom Kocmi | Dongdong Zhang | Wangchunshu Zhou | Yuchen Eleanor Jiang
Proceedings of the Eighth Conference on Machine Translation

The WMT 2023 Terminology Shared Task investigates progress in machine translation of texts with specialized vocabulary. The participants were given the source text and segment-level terminology dictionaries for three language pairs: Chinese→English, English→Czech, and German→English. We evaluate 21 submissions from 7 teams on two main criteria: general translation quality and the effectiveness of translating specialized terminology. Systems took varied approaches: incorporating terminology at inference time or weakly supervised training that uses terminology access. While incorporating terminology dictionaries leads to improvement in the translation quality, incorporating an equal amount of information from the reference leads to similar results. This challenges the position of terminologies being the crux of meaning in translation; it can also be explained by inadequate metrics which are not terminology-centric.

2022

bib
Machine Translate: Open resources and community
Cecilia OL Yalangozian | Vilém Zouhar | Adam Bittlingmayer
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)

Machine Translate is a non-profit organization on a mission to make machine translation more accessible to more people. As the field of machine translation continues to grow, the project builds open resources and a community for developers, buyers and translators. The project is ruled by three values: quality, openness and accessibility. Content is open-source and welcomes open-contribution. It is kept up-to-date, and its information is presented in a clear and well-organized format. Machine Translate aims to be accessible to people from many backgrounds and, ultimately, also non-English speakers. The project covers everything about machine translation, from products to research, from development to theory, and from history to news. The topics are very diverse, and the writing is focused on concepts rather than on mathematical details.

pdf bib
Sentence Ambiguity, Grammaticality and Complexity Probes
Sunit Bhattacharya | Vilém Zouhar | Ondrej Bojar
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity. We present results of automatic classification of these traits and compare their viability and patterns across representation types. We demonstrate that template-based datasets with surface-level artifacts should not be used for probing, that careful comparisons with baselines should be made, and that t-SNE plots should not be used to determine the presence of a feature among dense vector representations. We also show how features might be highly localized in the layers for these models and get lost in the upper layers.

pdf bib
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)
Antoine Bosselut | Xiang Li | Bill Yuchen Lin | Vered Shwartz | Bodhisattwa Prasad Majumder | Yash Kumar Lal | Rachel Rudinger | Xiang Ren | Niket Tandon | Vilém Zouhar
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

pdf bib
Knowledge Base Index Compression via Dimensionality and Precision Reduction
Vilém Zouhar | Marius Mosbach | Miaoran Zhang | Dietrich Klakow
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

Recently, neural-network-based approaches to knowledge-intensive NLP tasks, such as question answering, have started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB), which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality reduction (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing, and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100× compression with 75% and (2) 24× compression with 92% of the original retrieval performance.
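The pre-processing and precision-reduction steps mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes 1-bit precision means keeping only the sign of each centered, normalized dimension; the abstract does not specify the exact binarization scheme, the PCA step is omitted, and all function names are hypothetical.

```python
import math

def center_and_normalize(vectors):
    """Subtract the per-dimension mean, then scale each vector to unit L2 norm."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    out = []
    for v in vectors:
        centered = [x - m for x, m in zip(v, mean)]
        norm = math.sqrt(sum(x * x for x in centered)) or 1.0  # avoid dividing by zero
        out.append([x / norm for x in centered])
    return out

def to_one_bit(vector):
    """Assumed scheme: keep only the sign of each dimension, packed into an int bitmask."""
    bits = 0
    for d, x in enumerate(vector):
        if x > 0:
            bits |= 1 << d
    return bits

def hamming_similarity(a, b, dim):
    """Number of dimensions on which two bitmasks agree."""
    return dim - bin(a ^ b).count("1")

index = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
codes = [to_one_bit(v) for v in center_and_normalize(index)]
print(codes)
print(hamming_similarity(codes[0], codes[1], dim=3))
```

Storing 1 bit instead of a 32-bit float per dimension is a 32× reduction on its own, before any dimensionality reduction is applied.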

2021

pdf bib
Neural Machine Translation Quality and Post-Editing Performance
Vilém Zouhar | Martin Popel | Ondřej Bojar | Aleš Tamchyna
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and has also been adopted by most translation companies. Through an experimental study involving over 30 professional translators for English→Czech translation, we examine the relationship between NMT performance and post-editing time and quality. Across all models, we found that better MT systems indeed lead to fewer changes in the sentences in this industry setting. The relation between system quality and post-editing time is, however, not straightforward and, contrary to the results on phrase-based MT, BLEU is definitely not a stable predictor of the time or final output quality.

pdf bib
Backtranslation Feedback Improves User Confidence in MT, Not Quality
Vilém Zouhar | Michal Novák | Matúš Žilinec | Ondřej Bojar | Mateo Obregón | Robin L. Hill | Frédéric Blain | Marina Fomicheva | Lucia Specia | Lisa Yankovskaya
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.

pdf bib
Sampling and Filtering of Neural Machine Translation Distillation Data
Vilém Zouhar
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In most neural machine translation distillation or stealing scenarios, the highest-scoring hypothesis of the target model (teacher) is used to train a new model (student). If reference translations are also available, then better hypotheses (with respect to the references) can be oversampled and poor hypotheses either removed or undersampled. This paper explores the sampling method landscape (pruning, hypothesis oversampling and undersampling, deduplication and their combination) with English to Czech and English to German MT models using standard MT evaluation metrics. We show that careful oversampling and combination with the original data leads to better performance when compared to training only on the original or synthesized data or their direct combination.

2020

pdf bib
Outbound Translation User Interface Ptakopět: A Pilot Study
Vilém Zouhar | Ondřej Bojar
Proceedings of the Twelfth Language Resources and Evaluation Conference

It is not uncommon for Internet users to have to produce a text in a foreign language they have very little knowledge of, leaving them unable to verify the translation quality. We call the task “outbound translation” and explore it by introducing an open-source modular system Ptakopět. Its main purpose is to inspect human interaction with MT systems enhanced with additional subsystems, such as backward translation and quality estimation. We follow up with an experiment on (Czech) human annotators tasked to produce questions in a language they do not speak (German), with the help of Ptakopět. We focus on three real-world use cases (communication with IT support, describing administrative issues and asking encyclopedic questions) from which we gain insight into different strategies users take when faced with outbound translation tasks. Round-trip translation is known to be unreliable for evaluating MT systems, but our experimental evaluation documents that it works very well for users, at least on MT systems of mid-range quality.

pdf bib
WMT20 Document-Level Markable Error Exploration
Vilém Zouhar | Tereza Vojtěchová | Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation

Even though sentence-centric metrics are used widely in machine translation evaluation, document-level performance is at least equally important for professional usage. In this paper, we bring attention to detailed document-level evaluation focused on markables (expressions bearing most of the document meaning) and the negative impact of various markable error phenomena on the translation. For an annotation experiment of two phases, we chose Czech and English documents translated by systems submitted to the WMT20 News Translation Task. These documents are from the News, Audit and Lease domains. We show that the quality and also the kind of errors vary significantly among the domains. This systematic variance is in contrast to the automatic evaluation results. We inspect which specific markables are problematic for MT systems and conclude with an analysis of the effect of markable error types on the MT performance measured by humans and automatic evaluation tools.