Alon Lavie

Also published as: A. Lavie


2024

pdf
Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
John Mendonça | Isabel Trancoso | Alon Lavie
Findings of the Association for Computational Linguistics: EMNLP 2024

Although human evaluation remains the gold standard for open-domain dialogue evaluation, the growing popularity of automated evaluation using Large Language Models (LLMs) has also extended to dialogue. However, most frameworks leverage benchmarks that assess older chatbots on aspects such as fluency and relevance, which are not reflective of the challenges associated with contemporary models. In fact, a qualitative analysis of Soda (Kim et al., 2023), a GPT-3.5 generated dialogue dataset, suggests that current chatbots may exhibit several recurring issues related to coherence and commonsense knowledge, but generally produce highly fluent and relevant responses. Noting the aforementioned limitations, this paper introduces Soda-Eval, an annotated dataset based on Soda that covers over 120K turn-level assessments across 10K dialogues, where the annotations were generated by GPT-4. Using Soda-Eval as a benchmark, we then study the performance of several open-access instruction-tuned LLMs, finding that dialogue evaluation remains challenging. Fine-tuning these models improves performance over few-shot inference, in terms of both correlation and explanation quality.

pdf bib
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
John Mendonça | Alon Lavie | Isabel Trancoso
Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)

Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks, and together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.

pdf
ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
John Mendonca | Isabel Trancoso | Alon Lavie
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Despite being heralded as the new standard for dialogue evaluation, the closed-source nature of GPT-4 poses challenges for the community. Motivated by the need for lightweight, open-source, and multilingual dialogue evaluators, this paper introduces GenResCoh (Generated Responses targeting Coherence). GenResCoh is a novel LLM-generated dataset comprising over 130k negative and positive responses and accompanying explanations, seeded from XDailyDialog and XPersona and covering English, French, German, Italian, and Chinese. Leveraging GenResCoh, we propose ECoh (Evaluation of Coherence), a family of evaluators trained to assess response coherence across multiple languages. Experimental results demonstrate that ECoh achieves multilingual detection capabilities superior to the teacher model (GPT-3.5-Turbo) on GenResCoh, despite being based on a much smaller architecture. Furthermore, the explanations provided by ECoh closely align in quality with those generated by the teacher model.

pdf bib
Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
Markus Freitag | Nitika Mathur | Daniel Deutsch | Chi-Kiu Lo | Eleftherios Avramidis | Ricardo Rei | Brian Thompson | Frederic Blain | Tom Kocmi | Jiayi Wang | David Ifeoluwa Adelani | Marianna Buchicchio | Chrysoula Zerva | Alon Lavie
Proceedings of the Ninth Conference on Machine Translation

The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system and segment levels. We present an extensive analysis of how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm last year’s findings: fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.
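
The pairwise accuracy criterion mentioned above can be made concrete with a small sketch. This is a simplified reading of the meta-evaluation (tie handling and score polarity vary across setups), with fabricated scores:

```python
from itertools import combinations

def pairwise_accuracy(human_scores, metric_scores):
    """Fraction of system pairs that the metric orders the same way as
    the human (MQM) scores; ties count as disagreements in this sketch."""
    agree, total = 0, 0
    for a, b in combinations(human_scores, 2):
        h = human_scores[a] - human_scores[b]
        m = metric_scores[a] - metric_scores[b]
        total += 1
        if h * m > 0:
            agree += 1
    return agree / total

# Fabricated scores for three systems (MQM here: closer to 0 = better):
human = {"sysA": -2.1, "sysB": -3.4, "sysC": -1.0}
metric = {"sysA": 0.82, "sysB": 0.75, "sysC": 0.88}
print(pairwise_accuracy(human, metric))  # 1.0: all three pairs agree
```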

2023

pdf
The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics
Ricardo Rei | Nuno M. Guerreiro | Marcos Treviso | Luisa Coheur | Alon Lavie | André Martins
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, “black boxes” returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: https://github.com/Unbabel/COMET/tree/explainable-metrics
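
As a rough illustration of the kind of explainability method studied here, the sketch below computes gradient-norm token saliency on a toy regressor. The toy model is an assumption made for self-containedness, not the actual COMET architecture:

```python
import torch
import torch.nn as nn

# Toy stand-in for a sentence-level metric: mean-pool token embeddings,
# then map to a single quality score. The real metrics are fine-tuned
# Transformers, but the saliency recipe is the same.
class ToyMetric(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, token_embeddings):          # (seq_len, dim)
        return self.score(token_embeddings.mean(dim=0))

model = ToyMetric()
emb = torch.randn(5, 16, requires_grad=True)      # five "tokens"
model(emb).squeeze().backward()

# Gradient-norm saliency: tokens whose embedding gradient is larger
# contribute more to the predicted score.
print(emb.grad.norm(dim=1))
```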

pdf
Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation
John Mendonça | Patrícia Pereira | Helena Moniz | Joao Paulo Carvalho | Alon Lavie | Isabel Trancoso
Proceedings of The Eleventh Dialog System Technology Challenge

Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of the DSTC11 Track 4 “Automatic Evaluation Metrics for Open-Domain Dialogue Systems”, proving the evaluation capabilities of prompted LLMs.
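
A minimal sketch of the prompting paradigm the framework builds on, with a hypothetical `call_llm` client and an illustrative prompt template (not the paper's exact wording):

```python
import re

# `call_llm` is a hypothetical stand-in for any chat-completion client,
# and the template below is illustrative only.
PROMPT = (
    "Rate the overall quality of the response in the following dialogue "
    "on a scale from 1 (very bad) to 5 (very good). Answer with a single "
    "number.\n\nContext:\n{context}\n\nResponse:\n{response}\n\nScore:"
)

def score_response(context: str, response: str, call_llm) -> float:
    reply = call_llm(PROMPT.format(context=context, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"could not parse a score from {reply!r}")
    return float(match.group())

# Stub LLM so the sketch runs without any API:
print(score_response("A: How are you?", "B: Great, thanks!", lambda p: "4"))
```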

pdf
Quality Fit for Purpose: Building Business Critical Errors Test Suites
Mariana Cabeça | Marianna Buchicchio | Madalena Gonçalves | Christine Maroti | João Godinho | Pedro Coelho | Helena Moniz | Alon Lavie
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper illustrates a new methodology based on Test Suites (Avramidis et al., 2018) with focus on Business Critical Errors (BCEs) (Stewart et al., 2022) to evaluate the output of Machine Translation (MT) and Quality Estimation (QE) systems. We demonstrate the value of relying on semi-automatic evaluation done through scalable BCE-focused Test Suites to monitor both MT and QE systems’ performance for 8 language pairs (LPs) and a total of 4 error categories. This approach allows us not only to track the impact of new features and implementations in a real business environment, but also to identify strengths and weaknesses in models regarding different error types, and thus to determine what to improve next.

pdf bib
Dialogue Quality and Emotion Annotations for Customer Support Conversations
John Mendonca | Patrícia Pereira | Miguel Menezes | Vera Cabarrão | Ana C Farinha | Helena Moniz | Alon Lavie | Isabel Trancoso
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Task-oriented conversational datasets often lack topic variability and linguistic diversity. However, with the advent of Large Language Models (LLMs) pretrained on extensive, multilingual and diverse text data, these limitations seem to have been overcome. Nevertheless, their generalisability to different languages and domains in dialogue applications remains uncertain without benchmarking datasets. This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations. By performing annotations that take into consideration the complete instances that compose a conversation, one can form a broader perspective of the dialogue as a whole. Furthermore, the dataset provides a unique and valuable resource for the development of text classification models. To this end, we present benchmarks for Emotion Recognition and Dialogue Quality Estimation and show that further research is needed to leverage these models in a production setting.

pdf
Towards Multilingual Automatic Open-Domain Dialogue Evaluation
John Mendonca | Alon Lavie | Isabel Trancoso
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The main limiting factor in the development of robust multilingual open-domain dialogue evaluation metrics is the lack of multilingual data and the limited availability of open-sourced multilingual dialogue systems. In this work, we propose a workaround for this lack of data by leveraging a strong multilingual pretrained encoder-based Language Model and augmenting existing English dialogue data using Machine Translation. We empirically show that the naive approach of finetuning a pretrained multilingual encoder model with translated data is insufficient to outperform the strong baseline of finetuning a multilingual model with only source data. Instead, the best approach consists of carefully curating the translated data using MT Quality Estimation metrics, excluding low-quality translations that hinder performance.
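
A minimal sketch of the curation step described above, with fabricated QE scores and an assumed threshold (the paper selects the cut-off empirically):

```python
# Keep only translated examples whose MT Quality Estimation score clears
# a threshold. All scores and the threshold value here are made up.
translated = [
    {"text": "Bonjour, comment ça va ?", "qe_score": 0.84},
    {"text": "Je adore le pain tu es",   "qe_score": 0.31},  # low-quality MT
    {"text": "Merci pour votre aide.",   "qe_score": 0.91},
]

QE_THRESHOLD = 0.6
curated = [ex for ex in translated if ex["qe_score"] >= QE_THRESHOLD]
print(len(curated))  # 2 examples survive the filter
```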

pdf
Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent
Markus Freitag | Nitika Mathur | Chi-kiu Lo | Eleftherios Avramidis | Ricardo Rei | Brian Thompson | Tom Kocmi | Frederic Blain | Daniel Deutsch | Craig Stewart | Chrysoula Zerva | Sheila Castilho | Alon Lavie | George Foster
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year’s success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks. We present an extensive analysis of how well metrics perform on three language pairs: Chinese-English and Hebrew-English at the sentence level, and English-German at the paragraph level. The results strongly confirm last year’s findings: neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate the bad-reference issues we observed this year for some language pairs. Finally, we also study the connections between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics.

2022


Business Critical Errors: A Framework for Adaptive Quality Feedback
Craig A Stewart | Madalena Gonçalves | Marianna Buchicchio | Alon Lavie
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)

Frameworks such as Multidimensional Quality Metrics (MQM) provide detailed feedback on translation quality and can pinpoint concrete linguistic errors. The quality of a translation is, however, also closely tied to its utility in a particular use case. Many customers have highly subjective expectations of translation quality. Features such as register, discourse style and brand consistency can be difficult to accommodate given a broadly applied translation solution. In this presentation we will introduce the concept of Business Critical Errors (BCE). Adapted from MQM, the BCE framework provides a perspective on translation quality that allows us to be reactive and adaptive to expectation whilst also maintaining consistency in our translation evaluation. We will demonstrate tooling used at Unbabel that allows us to evaluate the performance of our MT models on BCE using specialized test suites as well as the ability of our AI evaluation models to successfully capture BCE information.

pdf
Searching for COMETINHO: The Little Metric That Could
Ricardo Rei | Ana C Farinha | José G.C. de Souza | Pedro G. Ramos | André F.T. Martins | Luisa Coheur | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

In recent years, several neural fine-tuned machine translation evaluation metrics such as COMET and BLEURT have been proposed. These metrics achieve much higher correlations with human judgments than lexical overlap metrics at the cost of computational efficiency and simplicity, limiting their applications to scenarios in which one has to score thousands of translation hypotheses (e.g. scoring multiple systems or Minimum Bayes Risk decoding). In this paper, we explore optimization techniques, pruning, and knowledge distillation to create more compact and faster COMET versions. Our results show that just by optimizing the code through the use of caching and length batching we can reduce inference time between 39% and 65% when scoring multiple systems. Also, we show that pruning COMET can lead to a 21% model reduction without affecting the model’s accuracy beyond 0.01 Kendall tau correlation. Furthermore, we present DISTIL-COMET, a lightweight distilled version that is 80% smaller and 2.128x faster while attaining a performance close to the original model and above strong baselines such as BERTSCORE and PRISM.
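
Length batching, one of the code optimizations cited above, can be sketched as follows (a common realisation of the trick; the exact COMET implementation may differ):

```python
def length_batches(segments, batch_size=4):
    """Group segments of similar length so each batch needs little
    padding; returns batches of indices into the original list, so
    per-batch scores can be scattered back to the input order."""
    order = sorted(range(len(segments)), key=lambda i: len(segments[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

segments = ["a short one", "tiny", "a considerably longer segment here",
            "mid length text", "x"]
for batch in length_batches(segments, batch_size=2):
    print([segments[i] for i in batch])
```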

pdf
Agent and User-Generated Content and its Impact on Customer Support MT
Madalena Gonçalves | Marianna Buchicchio | Craig Stewart | Helena Moniz | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper illustrates a new evaluation framework developed at Unbabel for measuring the quality of source language text and its effect on both Machine Translation (MT) and Human Post-Edition (PE) performed by non-professional post-editors. We examine both agent and user-generated content from the Customer Support domain and propose that differentiating the two is crucial to obtaining high quality translation output. Furthermore, we present results of initial experimentation with a new evaluation typology based on the Multidimensional Quality Metrics (MQM) framework (Lommel et al., 2014), specifically tailored toward the evaluation of source language text. We show how the MQM framework can be adapted to assess errors of monolingual source texts and demonstrate how very specific source errors propagate to the MT and PE targets. Finally, we illustrate how MT systems are not robust enough to handle very specific source noise in the context of Customer Support data.

pdf
A Case Study on the Importance of Named Entities in a Machine Translation Pipeline for Customer Support Content
Miguel Menezes | Vera Cabarrão | Pedro Mota | Helena Moniz | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper describes the research developed at Unbabel, a Portuguese machine translation start-up that combines MT with human post-edition and focuses strictly on customer service content. We aim to contribute to furthering MT quality and good practices by exposing the importance of having a continuously-in-development, robust Named Entity Recognition system compliant with the General Data Protection Regulation (GDPR). Moreover, we have tested semiautomatic strategies that support and enhance the creation of Named Entity gold standards to allow a more seamless implementation of multilingual Named Entity Recognition systems. The project described in this paper is the result of work shared between Unbabel's linguists and Unbabel's AI engineering team, matured over a year. The project should also be taken as a statement of multidisciplinarity, proving and validating the much-needed articulation between the different scientific fields that compose and characterize the area of Natural Language Processing (NLP).

pdf
QualityAdapt: an Automatic Dialogue Quality Estimation Framework
John Mendonca | Alon Lavie | Isabel Trancoso
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Despite considerable advances in open-domain neural dialogue systems, their evaluation remains a bottleneck. Several automated metrics have been proposed to evaluate these systems; however, they mostly focus on a single notion of quality, or, when they do combine several sub-metrics, they are computationally expensive. This paper attempts to solve the latter: QualityAdapt leverages the Adapter framework for the task of Dialogue Quality Estimation. Using well-defined semi-supervised tasks, we train adapters for different subqualities and score generated responses with AdapterFusion. This compositionality provides an easy-to-adapt metric for the task at hand that incorporates multiple subqualities. It also reduces computational costs, as individual predictions of all subqualities are obtained in a single forward pass. This approach achieves comparable results to state-of-the-art metrics on several datasets, whilst keeping the previously mentioned advantages.
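
For readers unfamiliar with adapters, the sketch below shows a standard bottleneck adapter block (in the style of Houlsby et al., 2019), the building block such a framework inserts into a frozen encoder; the per-subquality training and AdapterFusion step are not shown:

```python
import torch
import torch.nn as nn

# Minimal bottleneck adapter: down-projection, nonlinearity,
# up-projection, and a residual connection. QualityAdapt trains one
# adapter per subquality and fuses them, which this sketch omits.
class Adapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

x = torch.randn(2, 10, 768)   # (batch, sequence, hidden)
print(Adapter()(x).shape)     # torch.Size([2, 10, 768])
```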

pdf bib
Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | Eleftherios Avramidis | Tom Kocmi | George Foster | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, ecommerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages, among other things: (i) expert-based evaluation is more reliable, (ii) we extended the pool of translations by 5 additional translations based on MBR decoding or rescoring which are challenging for current metrics. In addition, we initiated a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and show again that overlap metrics like BLEU, spBLEU or chrF correlate poorly with human ratings. The results also reveal that neural-based metrics are remarkably robust across different domains and challenges.

pdf
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
Ricardo Rei | José G. C. de Souza | Duarte Alves | Chrysoula Zerva | Ana C Farinha | Taisiya Glushkova | Alon Lavie | Luisa Coheur | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics Shared Task. Our primary submission – dubbed COMET-22 – is an ensemble between a COMET estimator model trained with Direct Assessments and a newly proposed multitask model trained to predict sentence-level scores along with OK/BAD word-level tags derived from Multidimensional Quality Metrics error annotations. These models are ensembled together using a hyper-parameter search that weights different features extracted from both evaluation models and combines them into a single score. For the reference-free evaluation, we present CometKiwi. Similarly to our primary submission, CometKiwi is an ensemble between two models: a traditional predictor-estimator model inspired by OpenKiwi, and our new multitask model trained on Multidimensional Quality Metrics, which can also be used without references. Both our submissions show improved correlations compared to state-of-the-art metrics from last year, as well as increased robustness to critical errors.
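
The weight-search idea behind the ensemble can be illustrated with a toy sketch (all scores fabricated; the real search spans many more features than a single mixing weight):

```python
import numpy as np
from scipy.stats import spearmanr

# Fabricated development data: two model scores that noisily track the
# human judgments; we search for the mixing weight with the best
# Spearman correlation.
rng = np.random.default_rng(0)
human = rng.normal(size=200)
score_a = human + rng.normal(scale=0.8, size=200)   # estimator model
score_b = human + rng.normal(scale=0.6, size=200)   # multitask model

best_w, best_rho = 0.0, -1.0
for w in np.linspace(0.0, 1.0, 101):
    rho = spearmanr(w * score_a + (1 - w) * score_b, human).correlation
    if rho > best_rho:
        best_w, best_rho = w, rho
print(best_w, round(best_rho, 3))
```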

pdf
CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task
Ricardo Rei | Marcos Treviso | Nuno M. Guerreiro | Chrysoula Zerva | Ana C Farinha | Christine Maroti | José G. C. de Souza | Taisiya Glushkova | Duarte Alves | Luisa Coheur | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated in all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.

2021

pdf
MT-Telescope: An interactive platform for contrastive evaluation of MT systems
Ricardo Rei | Ana C Farinha | Craig Stewart | Luisa Coheur | Alon Lavie
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

We present MT-Telescope, a visualization platform designed to facilitate comparative analysis of the output quality of two Machine Translation (MT) systems. While automated MT evaluation metrics are commonly used to evaluate MT systems at a corpus-level, our platform supports fine-grained segment-level analysis and interactive visualisations that expose the fundamental differences in the performance of the compared systems. MT-Telescope also supports dynamic corpus filtering to enable focused analysis of specific phenomena such as translation of named entities, handling of terminology, and the impact of input segment length on translation quality. Furthermore, the platform provides a bootstrapped t-test for statistical significance as a means of evaluating the rigor of the resulting system ranking. MT-Telescope is open source, written in Python, and is built around a user-friendly and dynamic web interface. Complementing other existing tools, our platform is designed to facilitate and promote the broader adoption of more rigorous analysis practices in the evaluation of MT quality.
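
The flavor of significance testing involved can be approximated with a standard paired bootstrap over segment scores (a Koehn-style sketch under assumed toy data; MT-Telescope's bootstrapped t-test may differ in detail):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A's mean
    segment score beats system B's; values near 1.0 suggest the
    observed difference is unlikely to be resampling noise."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

a = [0.71, 0.64, 0.80, 0.55, 0.69, 0.73]   # toy segment scores, system A
b = [0.66, 0.61, 0.78, 0.57, 0.62, 0.70]   # toy segment scores, system B
print(paired_bootstrap(a, b))
```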

pdf
Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | George Foster | Alon Lavie | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation

This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics on two different domains: news and TED talks. All metrics were evaluated on how well they correlate at the system- and segment-level with human ratings. Contrary to previous years’ editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation has been shown to be more reliable, (ii) we were able to evaluate all metrics on two different domains using translations of the same MT systems, (iii) we added 5 additional translations coming from the same system during system development. In addition, we designed three challenge sets that evaluate the robustness of all automatic metrics. We present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. We further show the impact of different reference translations on reference-based metrics and compare our expert-based MQM annotation with the DA scores acquired by WMT.

pdf
Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task
Ricardo Rei | Ana C Farinha | Chrysoula Zerva | Daan van Stigt | Craig Stewart | Pedro Ramos | Taisiya Glushkova | André F. T. Martins | Alon Lavie
Proceedings of the Sixth Conference on Machine Translation

In this paper, we present the joint contribution of Unbabel and IST to the WMT 2021 Metrics Shared Task. With this year’s focus on Multidimensional Quality Metric (MQM) as the ground-truth human assessment, our aim was to steer COMET towards higher correlations with MQM. We do so by first pre-training on Direct Assessments and then fine-tuning on z-normalized MQM scores. In our experiments we also show that reference-free COMET models are becoming competitive with reference-based models, even outperforming the best COMET model from 2020 on this year’s development data. Additionally, we present COMETinho, a lightweight COMET model that is 19x faster on CPU than the original model, while also achieving state-of-the-art correlations with MQM. Finally, in the “QE as a metric” track, we also participated with a QE model trained using the OpenKiwi framework leveraging MQM scores and word-level annotations.
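
The z-normalization of MQM scores used as the fine-tuning target is straightforward; a minimal sketch (whether the statistics are computed globally or per annotator is not specified here):

```python
import statistics

def z_normalize(scores):
    """Map raw MQM scores to zero mean and unit variance, the target
    space used when fine-tuning on MQM in this submission."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

print(z_normalize([-5.0, -1.0, 0.0, -2.0]))
```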

2020


COMET - Deploying a New State-of-the-art MT Evaluation Metric in Production
Craig Stewart | Ricardo Rei | Catarina Farinha | Alon Lavie
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

pdf
COMET: A Neural Framework for MT Evaluation
Ricardo Rei | Craig Stewart | Ana C Farinha | Alon Lavie
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality. To showcase our framework, we train three models with different types of human judgements: Direct Assessments, Human-mediated Translation Edit Rate and Multidimensional Quality Metric. Our models achieve new state-of-the-art performance on the WMT 2019 Metrics shared task and demonstrate robustness to high-performing systems.

pdf
Unbabel’s Participation in the WMT20 Metrics Shared Task
Ricardo Rei | Craig Stewart | Ana C Farinha | Alon Lavie
Proceedings of the Fifth Conference on Machine Translation

We present the contribution of the Unbabel team to the WMT 2020 Shared Task on Metrics. We intend to participate on the segment-level, document-level and system-level tracks on all language pairs, as well as the “QE as a Metric” track. Accordingly, we illustrate results of our models in these tracks with reference to test sets from the previous year. Our submissions build upon the recently proposed COMET framework: we train several estimator models to regress on different human-generated quality scores and a novel ranking model trained on relative ranks obtained from Direct Assessments. We also propose a simple technique for converting segment-level predictions into a document-level score. Overall, our systems achieve strong results for all language pairs on previous test sets and in many cases set a new state-of-the-art.
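
One natural construction for the segment-to-document conversion is a length-weighted average of segment predictions; the sketch below is an assumption for illustration, not necessarily the paper's exact technique:

```python
def document_score(segment_scores, segment_lengths):
    """Pool segment-level predictions into one document score, weighting
    each segment by its length in tokens."""
    total = sum(segment_lengths)
    return sum(s * l for s, l in zip(segment_scores, segment_lengths)) / total

print(document_score([0.9, 0.4, 0.7], [12, 3, 20]))  # ~0.74
```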

2016

pdf
Synthesizing Compound Words for Machine Translation
Austin Matthews | Eva Schlinger | Alon Lavie | Chris Dyer
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf
Humor Recognition and Humor Anchor Extraction
Diyi Yang | Alon Lavie | Chris Dyer | Eduard Hovy
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf
Cognitive demand and cognitive effort in post-editing
Isabel Lacruz | Michael Denkowski | Alon Lavie
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

The pause to word ratio, the number of pauses per word in a post-edited MT segment, is an indicator of cognitive effort in post-editing (Lacruz and Shreve, 2014). We investigate how low the pause threshold can reasonably be taken, and we propose that 300 ms is a good choice, as pioneered by Schilperoord (1996). We then seek to identify a good measure of the cognitive demand imposed by MT output on the post-editor, as opposed to the cognitive effort actually exerted by the post-editor during post-editing. Measuring cognitive demand is closely related to measuring MT utility, the MT quality as perceived by the post-editor. HTER, an extrinsic edit to word ratio that does not necessarily correspond to actual edits per word performed by the post-editor, is a well-established measure of MT quality, but it does not comprehensively capture cognitive demand (Koponen, 2012). We investigate intrinsic measures of MT quality, and so of cognitive demand, through edited-error to word metrics. We find that the transfer-error to word ratio predicts cognitive effort better than mechanical-error to word ratio (Koby and Champe, 2013). We identify specific categories of cognitively challenging MT errors whose error to word ratios correlate well with cognitive effort.
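
The pause to word ratio itself is easy to compute from a keystroke log; a minimal sketch using the 300 ms threshold proposed above (the toy timestamps are fabricated):

```python
def pause_to_word_ratio(event_times_ms, num_words, threshold_ms=300):
    """Pauses per word for one post-edited segment: a pause is any gap
    between consecutive logged events at or above the threshold
    (300 ms, following Schilperoord, 1996)."""
    gaps = (b - a for a, b in zip(event_times_ms, event_times_ms[1:]))
    pauses = sum(1 for g in gaps if g >= threshold_ms)
    return pauses / num_words

# Toy keystroke log (timestamps in milliseconds) for a 5-word segment:
events = [0, 120, 180, 600, 650, 1400, 1450, 1500]
print(pause_to_word_ratio(events, num_words=5))  # 2 pauses / 5 words = 0.4
```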

pdf
Real time adaptive machine translation: cdec and TransCenter
Michael Denkowski | Alon Lavie | Isabel Lacruz | Chris Dyer
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

cdec Realtime and TransCenter provide an end-to-end experimental setup for machine translation post-editing research. Realtime provides a framework for building adaptive MT systems that learn from post-editor feedback while TransCenter incorporates a web-based translation interface that connects users to these systems and logs post-editing activity. This combination allows the straightforward deployment of MT systems specifically for post-editing and analysis of translator productivity when working with adaptive systems. Both toolkits are freely available under open source licenses.

pdf
Learning from Post-Editing: Online Model Adaptation for Statistical Machine Translation
Michael Denkowski | Chris Dyer | Alon Lavie
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization
Jonathan H. Clark | Chris Dyer | Alon Lavie
Transactions of the Association for Computational Linguistics, Volume 2

Linear models, which support efficient learning and inference, are the workhorses of statistical machine translation; however, linear decision rules are less attractive from a modeling perspective. In this work, we introduce a technique for learning arbitrary, rule-local, non-linear feature transforms that improve model expressivity, but do not sacrifice the efficient inference and learning associated with linear models. To demonstrate the value of our technique, we discard the customary log transform of lexical probabilities and drop the phrasal translation probability in favor of raw counts. We observe that our algorithm learns a variation of a log transform that leads to better translation quality compared to the explicit log transform. We conclude that non-linear responses play an important role in SMT, an observation that we hope will inform the efforts of feature engineers.
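
The discretization idea can be sketched simply: replace a real-valued feature with an indicator feature for the bin its value falls into, letting the learner assign each bin its own weight (a minimal sketch; the paper's rule-local transforms and structured regularization are not shown):

```python
import math

def discretize(name, value, bin_width=0.5):
    """Replace a real-valued feature with an indicator feature for the
    bin containing its value; weighting bins independently yields a
    piecewise-constant, hence non-linear, response."""
    bin_index = math.floor(value / bin_width)
    return {f"{name}[{bin_index}]": 1.0}

# E.g. a raw lexical count feature instead of its customary log transform:
print(discretize("lex_count", 7.0))   # {'lex_count[14]': 1.0}
print(discretize("lex_count", 7.3))   # same bin: {'lex_count[14]': 1.0}
```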

pdf
Real Time Adaptive Machine Translation for Post-Editing with cdec and TransCenter
Michael Denkowski | Alon Lavie | Isabel Lacruz | Chris Dyer
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

pdf
The CMU Machine Translation Systems at WMT 2014
Austin Matthews | Waleed Ammar | Archna Bhatia | Weston Feely | Greg Hanneman | Eva Schlinger | Swabha Swayamdipta | Yulia Tsvetkov | Alon Lavie | Chris Dyer
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Meteor Universal: Language Specific Translation Evaluation for Any Target Language
Michael Denkowski | Alon Lavie
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Domain and Dialect Adaptation for Machine Translation into Egyptian Arabic
Serena Jeblee | Weston Feely | Houda Bouamor | Alon Lavie | Nizar Habash | Kemal Oflazer
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

2013

pdf bib
Analyzing and Predicting MT Utility and Post-Editing Productivity in Enterprise-scale Translation Projects
Alon Lavie | Olga Beregovaya | Michael Denkowski | David Clarke
Proceedings of Machine Translation Summit XIV: User track

pdf
Improving Syntax-Augmented Machine Translation by Coarsening the Label Set
Greg Hanneman | Alon Lavie
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Grouping Language Model Boundary Words to Speed K–Best Extraction from Hypergraphs
Kenneth Heafield | Philipp Koehn | Alon Lavie
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
The CMU Machine Translation Systems at WMT 2013: Syntax, Synthetic Translation Options, and Pseudo-References
Waleed Ammar | Victor Chahuneau | Michael Denkowski | Greg Hanneman | Wang Ling | Austin Matthews | Kenton Murray | Nicola Segall | Alon Lavie | Chris Dyer
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

pdf
One System, Many Domains: Open-Domain Statistical Machine Translation via Feature Augmentation
Jonathan Clark | Alon Lavie | Chris Dyer
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we introduce a simple technique for incorporating domain information into a statistical machine translation system that significantly improves translation quality when test data comes from multiple domains. Our approach augments (conjoins) standard translation model and language model features with domain indicator features and requires only minimal modifications to the optimization and decoding procedures. We evaluate our method on two language pairs with varying numbers of domains, and observe significant improvements of up to 1.0 BLEU.
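
The feature augmentation idea can be sketched in a few lines: each base feature is conjoined with a domain indicator while the shared version is kept, so the learner can weight general and domain-specific evidence separately (feature names below are illustrative):

```python
def augment_with_domain(features, domain):
    """Conjoin each base feature with a domain indicator while keeping
    the shared base feature."""
    augmented = dict(features)
    for name, value in features.items():
        augmented[f"{name}&domain={domain}"] = value
    return augmented

base = {"tm_log_prob": -1.2, "lm_log_prob": -3.4}
print(augment_with_domain(base, "news"))
```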

pdf
Challenges in Predicting Machine Translation Utility for Human Post-Editors
Michael Denkowski | Alon Lavie
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

As machine translation quality continues to improve, the idea of using MT to assist human translators becomes increasingly attractive. In this work, we discuss and provide empirical evidence of the challenges faced when adapting traditional MT systems to provide automatic translations for human post-editors to correct. We discuss the differences between this task and traditional adequacy-based tasks and the challenges that arise when using automatic metrics to predict the amount of effort required to post-edit translations. A series of experiments simulating a real-world localization scenario shows that current metrics under-perform on this task, even when tuned to maximize correlation with expert translator judgments, illustrating the need to rethink traditional MT pipelines when addressing the challenges of this translation task.

pdf
Language Model Rest Costs and Space-Efficient Storage
Kenneth Heafield | Philipp Koehn | Alon Lavie
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf
The CMU-Avenue French-English Translation System
Michael Denkowski | Greg Hanneman | Alon Lavie
Proceedings of the Seventh Workshop on Statistical Machine Translation

2011

pdf bib
Evaluating the Output of Machine Translation Systems
Alon Lavie
Proceedings of Machine Translation Summit XIII: Tutorial Abstracts

This half-day tutorial provides a broad overview of how to evaluate translations that are produced by machine translation systems. The range of issues covered includes a broad survey of both human evaluation measures and commonly-used automated metrics, and a review of how these are used for various types of evaluation tasks, such as assessing the translation quality of MT-translated sentences, comparing the performance of alternative MT systems, or measuring the productivity gains of incorporating MT into translation workflows.

pdf
Unsupervised Word Alignment with Arbitrary Features
Chris Dyer | Jonathan H. Clark | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Jonathan H. Clark | Chris Dyer | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Automatic Category Label Coarsening for Syntax-Based Machine Translation
Greg Hanneman | Alon Lavie
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
A General-Purpose Rule Extractor for SCFG-Based Machine Translation
Greg Hanneman | Michelle Burroughs | Alon Lavie
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems
Michael Denkowski | Alon Lavie
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf
CMU System Combination in WMT 2011
Kenneth Heafield | Alon Lavie
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf
CMU Syntax-Based Machine Translation at WMT 2011
Greg Hanneman | Alon Lavie
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf
The Impact of Arabic Morphological Segmentation on Broad-coverage English-to-Arabic Statistical Machine Translation
Hassan Al-Haj | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the full spectrum of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine the effects on system performance. Our results show a difference of 2.61 BLEU points between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of a PBSMT system in a large-data scenario. We also show that a simple segmentation scheme can perform as well as the best and most complicated segmentation schemes. Finally, we report results on a wide set of techniques for recombining the segmented Arabic output.

pdf
Using Variable Decoding Weight for Language Model in Statistical Machine Translation
Behrang Mohit | Rebecca Hwa | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper investigates varying the decoder weight of the language model (LM) when translating different parts of a sentence. We determine the condition under which the LM weight should be adapted. We find that a better translation can be achieved by varying the LM weight when decoding the most problematic spot in a sentence, which we refer to as a difficult segment. Two adaptation strategies are proposed and compared through experiments. We find that adapting a different LM weight for every difficult segment resulted in the largest improvement in translation quality.

pdf
Choosing the Right Evaluation for Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human Judgment Tasks
Michael Denkowski | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper examines the motivation, design, and practical results of several types of human evaluation tasks for machine translation. In addition to considering annotator performance and task informativeness over multiple evaluations, we explore the practicality of tuning automatic evaluation metrics to each judgment type in a comprehensive experiment using the METEOR-NEXT metric. We present results showing clear advantages of tuning to certain types of judgments and discuss causes of inconsistency when tuning to various judgment data, as well as sources of difficulty in the human evaluation tasks themselves.

pdf
Voting on N-grams for Machine Translation System Combination
Kenneth Heafield | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

System combination exploits differences between machine translation systems to form a combined translation from several system outputs. Core to this process are features that reward n-gram matches between a candidate combination and each system output. Systems differ in performance at the n-gram level despite similar overall scores. We therefore advocate a new feature formulation: for each system and each small n, a feature counts n-gram matches between the system and candidate. We show post-evaluation improvement of 6.67 BLEU over the best system on NIST MT09 Arabic-English test data. Compared to a baseline system combination scheme from WMT 2009, we show improvement in the range of 1 BLEU point.
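
The proposed feature formulation reduces to counting, for each system and each small n, the n-gram overlap between that system's output and the candidate combination; a minimal sketch with toy sentences:

```python
def ngram_matches(candidate, system_output, n):
    """Count the candidate's n-grams that also occur in one system's
    output; one such count per (system, n) becomes a feature."""
    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    sys_ngrams = set(ngrams(system_output.split(), n))
    return sum(1 for g in ngrams(candidate.split(), n) if g in sys_ngrams)

cand = "the cat sat on the mat"
sys1 = "a cat sat on a mat"
print([ngram_matches(cand, sys1, n) for n in (1, 2, 3)])  # [4, 2, 1]
```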

pdf
Machine Translation between Hebrew and Arabic: Needs, Challenges and Preliminary Solutions
Reshef Shilon | Nizar Habash | Alon Lavie | Shuly Wintner
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

Hebrew and Arabic are related but mutually incomprehensible languages with complex morphology and scarce parallel corpora. Machine translation between the two languages is therefore interesting and challenging. We discuss similarities and differences between Hebrew and Arabic, the benefits and challenges that they induce, respectively, and their implications for machine translation. We highlight the shortcomings of using English as a pivot language and advocate a direct, transfer-based and linguistically-informed (but still statistical, and hence scalable) approach. We report preliminary results of such a system that we are currently developing.

pdf
Evaluating the Output of Machine Translation Systems
Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Tutorials

pdf
LoonyBin: Keeping Language Technologists Sane through Automated Management of Experimental (Hyper)Workflows
Jonathan H. Clark | Alon Lavie
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Many contemporary language technology systems are characterized by long pipelines of tools with complex dependencies. Too often, these workflows are implemented by ad hoc scripts; or, worse, tools are run manually, making experiments difficult to reproduce. These practices are difficult to maintain in the face of rapidly evolving workflows while they also fail to expose and record important details about intermediate data. Further complicating these systems are hyperparameters, which often cannot be directly optimized by conventional methods, requiring users to determine which combination of values is best via trial and error. We describe LoonyBin, an open-source tool that addresses these issues by providing: 1) a visual interface for the user to create and modify workflows; 2) a well-defined mechanism for tracking metadata and provenance; 3) a script generator that compiles visual workflows into shell scripts; and 4) a new workflow representation we call a HyperWorkflow, which intuitively and succinctly encodes small experimental variations within a larger workflow.

pdf
Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level
Michael Denkowski | Alon Lavie
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf
Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk
Michael Denkowski | Alon Lavie
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf
Turker-Assisted Paraphrasing for English-Arabic Machine Translation
Michael Denkowski | Hassan Al-Haj | Alon Lavie
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf
Improved Features and Grammar Selection for Syntax-Based MT
Greg Hanneman | Jonathan Clark | Alon Lavie
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
CMU Multi-Engine Machine Translation for WMT 2010
Kenneth Heafield | Alon Lavie
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
METEOR-NEXT and the METEOR Paraphrase Tables: Improved Evaluation Support for Five Target Languages
Michael Denkowski | Alon Lavie
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf bib
Extraction of Syntactic Translation Models from Parallel Data using Syntax from Source and Target Languages
Vamshi Ambati | Alon Lavie | Jaime Carbonell
Proceedings of Machine Translation Summit XII: Posters

pdf
Machine Translation System Combination with Flexible Word Ordering
Kenneth Heafield | Greg Hanneman | Alon Lavie
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf
An Improved Statistical Transfer System for French-English Machine Translation
Greg Hanneman | Vamshi Ambati | Jonathan H. Clark | Alok Parlikar | Alon Lavie
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Decoding with Syntactic and Non-Syntactic Phrases in a Syntax-Based Machine Translation System
Greg Hanneman | Alon Lavie
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

2008

pdf bib
Improving Syntax-Driven Translation Models by Re-structuring Divergent and Nonisomorphic Parse Tree Structures
Vamshi Ambati | Alon Lavie
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. This is primarily due to the highly restrictive space of constituent segmentations that the trees on two sides introduce, which adversely affects the recall of the resulting translation models. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments. In this paper we explore the issue of lexical coverage of the translation models learned in both of these scenarios. We specifically look at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. We then propose a novel technique for restructuring target parse trees, which generates highly isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. We evaluate the translation models learned from these restructured trees and show that they are significantly better than those learned using trees on both sides and trees on one side.

pdf
Linguistic Structure and Bilingual Informants Help Induce Machine Translation of Lesser-Resourced Languages
Christian Monson | Ariadna Font Llitjós | Vamshi Ambati | Lori Levin | Alon Lavie | Alison Alvarez | Roberto Aranovich | Jaime Carbonell | Robert Frederking | Erik Peterson | Katharina Probst
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Producing machine translation (MT) for the many minority languages in the world is a serious challenge. Minority languages typically have few resources for building MT systems. For many minority languages there is little machine-readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, our research programs on minority language MT have focused on leveraging to the maximum extent two resources that are available for minority languages: linguistic structure and bilingual informants. All natural languages contain linguistic structure. And although the details of that linguistic structure vary from language to language, language universals such as context-free syntactic structure and the paradigmatic structure of inflectional morphology allow us to learn the specific details of a minority language. Similarly, most minority languages possess speakers who are bilingual with the major language of the area. This paper discusses our efforts to utilize linguistic structure and the translation information that bilingual informants can provide in three sub-areas of our rapid-development MT program: morphology induction, syntactic transfer rule learning, and refinement of imperfect learned rules.

pdf
Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output
Abhaya Agarwal | Alon Lavie
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Statistical Transfer Systems for French-English and German-English Machine Translation
Greg Hanneman | Edmund Huber | Abhaya Agarwal | Vamshi Ambati | Alok Parlikar | Erik Peterson | Alon Lavie
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Syntax-Driven Learning of Sub-Sentential Translation Equivalents and Translation Rules from Parsed Parallel Corpora
Alon Lavie | Alok Parlikar | Vamshi Ambati
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)

pdf
Evaluating an Agglutinative Segmentation Model for ParaMor
Christian Monson | Alon Lavie | Jaime Carbonell | Lori Levin
Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology

2007

pdf
Improving transfer-based MT systems with automatic refinements
Ariadna Font Llitjós | Jaime Carbonell | Alon Lavie
Proceedings of Machine Translation Summit XI: Papers

pdf
Experiments with a noun-phrase driven statistical machine translation system
Sanjika Hewavitharana | Alon Lavie | Stephan Vogel
Proceedings of Machine Translation Summit XI: Papers

pdf
High-accuracy Annotation and Parsing of CHILDES Transcripts
Kenji Sagae | Eric Davis | Alon Lavie | Brian MacWhinney | Shuly Wintner
Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition

pdf
METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments
Alon Lavie | Abhaya Agarwal
Proceedings of the Second Workshop on Statistical Machine Translation

pdf
Cross Lingual and Semantic Retrieval for Cultural Heritage Appreciation
Idan Szpektor | Ido Dagan | Alon Lavie | Danny Shacham | Shuly Wintner
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007).

pdf
ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis
Christian Monson | Jaime Carbonell | Alon Lavie | Lori Levin
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology

2006

pdf
Parser Combination by Reparsing
Kenji Sagae | Alon Lavie
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

pdf
A Best-First Probabilistic Shift-Reduce Parser
Kenji Sagae | Alon Lavie
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf
A framework for interactive and automatic refinement of transfer-based machine translation
Ariadna Font Llitjós | Jaime G. Carbonell | Alon Lavie
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf
Multi-engine machine translation guided by explicit word matching
Shyamsundar Jayaraman | Alon Lavie
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf
BLANC: Learning Evaluation Metrics for MT
Lucian Lita | Monica Rogati | Alon Lavie
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

pdf
Automatic Measurement of Syntactic Development in Child Language
Kenji Sagae | Alon Lavie | Brian MacWhinney
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

pdf
Multi-Engine Machine Translation Guided by Explicit Word Matching
Shyamsundar Jayaraman | Alon Lavie
Proceedings of the ACL Interactive Poster and Demonstration Sessions

pdf bib
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
Jade Goldstein | Alon Lavie | Chin-Yew Lin | Clare Voss
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization

pdf
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee | Alon Lavie
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization

pdf
A Classifier-Based Parser with Linear Run-Time Complexity
Kenji Sagae | Alon Lavie
Proceedings of the Ninth International Workshop on Parsing Technology

2004

pdf
The significance of recall in automatic metrics for MT evaluation
Alon Lavie | Kenji Sagae | Shyamsundar Jayaraman
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected given that BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision- and recall-based metrics, but also to BLEU and NIST.
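
The recall-weighted score the abstract argues for reduces to a single weighted harmonic mean. A minimal Python sketch, with an illustrative recall weight rather than the paper's tuned value:

    def weighted_f(precision: float, recall: float, alpha: float = 0.9) -> float:
        """Harmonic mean of unigram precision and recall, with weight alpha
        on recall. alpha = 0.5 gives the balanced F1 measure; the abstract's
        finding is that correlation with human judgments is highest when
        alpha is close to 1 (the 0.9 default here is illustrative)."""
        if precision == 0.0 or recall == 0.0:
            return 0.0
        return (precision * recall) / (alpha * precision + (1.0 - alpha) * recall)

    print(weighted_f(0.8, 0.6, alpha=0.5))   # 0.686: balanced F1
    print(weighted_f(0.8, 0.6, alpha=0.9))   # 0.615: pulled toward recall

As alpha approaches 1 the score approaches plain recall, which is the regime the experiments favor.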

pdf
A structurally diverse minimal corpus for eliciting structural mappings between languages
Katharina Probst | Alon Lavie
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

We describe an approach to creating a small but diverse corpus in English that can be used to elicit information about any target language. The focus of the corpus is on structural information. The resulting bilingual corpus can then be used for natural language processing tasks such as inferring transfer mappings for Machine Translation. The corpus is sufficiently small that a bilingual user can translate and word-align it within a matter of hours. We describe how the corpus is created and how its structural diversity is ensured. We then argue that it is not necessary to introduce a large amount of redundancy into the corpus. This is shown by creating an increasingly redundant corpus and observing that the information gained converges as redundancy increases.

pdf
A trainable transfer-based MT approach for languages with limited resources
Alon Lavie | Katharina Probst | Erik Peterson | Stephan Vogel | Lori Levin | Ariadna Font-Llitjos | Jaime Carbonell
Proceedings of the 9th EAMT Workshop: Broadening horizons of machine translation and its applications

pdf bib
Rapid prototyping of a transfer-based Hebrew-to-English machine translation system
Alon Lavie | Erik Peterson | Katharina Probst | Shuly Wintner | Yaniv Eytani
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

pdf
Data Collection and Analysis of Mapudungun Morphology for Spelling Correction
Christian Monson | Lori Levin | Rodolfo Vega | Ralf Brown | Ariadna Font Llitjos | Alon Lavie | Jaime Carbonell | Eliseo Cañulef | Rosendo Huisca
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
Adding Syntactic Annotations to Transcripts of Parent-Child Dialogs
Kenji Sagae | Brian MacWhinney | Alon Lavie
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
Unsupervised Induction of Natural Language Morphology Inflection Classes
Christian Monson | Alon Lavie | Jaime Carbonell | Lori Levin
Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology

2003

pdf
Speechalator: Two-Way Speech-to-Speech Translation in Your Hand
Alex Waibel | Ahmed Badran | Alan W. Black | Robert Frederking | Donna Gates | Alon Lavie | Lori Levin | Kevin Lenzo | Laura Mayfield Tomokiyo | Juergen Reichert | Tanja Schultz | Dorcas Wallace | Monika Woszczyna | Jing Zhang
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations

pdf
Domain Specific Speech Acts for Spoken Language Translation
Lori Levin | Chad Langley | Alon Lavie | Donna Gates | Dorcas Wallace | Kay Peterson
Proceedings of the Fourth SIGdial Workshop of Discourse and Dialogue

pdf
Parsing Domain Actions with Phrase-Level Grammars and Memory-Based Learners
Chad Langley | Alon Lavie
Proceedings of the Eighth International Conference on Parsing Technologies

In this paper, we describe an approach to analysis for spoken language translation that combines phrase-level grammar-based parsing and automatic domain action classification. The job of the analyzer is to transform utterances into a shallow semantic task-oriented interlingua representation. The goal of our hybrid approach is to provide accurate real-time analyses and to improve robustness and portability to new domains and languages.
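
A hypothetical sketch of that two-stage division of labor, with a toy regex "grammar" and a trivial classifier standing in for the paper's phrase-level grammars and memory-based learners (all rules, labels, and names below are invented):

    import re

    # Stage 1: toy phrase grammar; regex patterns stand in for grammar rules.
    PHRASE_RULES = {
        "time": re.compile(r"\bon \w+day\b"),
        "price": re.compile(r"\b\d+ (dollars|euros)\b"),
    }

    def parse_phrases(utterance: str) -> dict:
        """Shallow phrase-level parsing of domain arguments."""
        return {label: m.group(0)
                for label, rule in PHRASE_RULES.items()
                if (m := rule.search(utterance))}

    def classify_domain_action(utterance: str) -> str:
        """Stage 2: stand-in for a trained domain-action classifier."""
        return "request-information" if "?" in utterance else "give-information"

    def analyze(utterance: str) -> dict:
        """Combine both stages into a shallow task-oriented representation."""
        return {"domain_action": classify_domain_action(utterance),
                "arguments": parse_phrases(utterance)}

    print(analyze("is the room available on monday?"))
    # {'domain_action': 'request-information', 'arguments': {'time': 'on monday'}}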

pdf
Combining Rule-based and Data-driven Techniques for Grammatical Relation Extraction in Spoken Language
Kenji Sagae | Alon Lavie
Proceedings of the Eighth International Conference on Parsing Technologies

We investigate an aspect of the relationship between parsing and corpus-based methods in NLP that has received relatively little attention: coverage augmentation in rule-based parsers. In the specific task of determining grammatical relations (such as subjects and objects) in transcribed spoken language, we show that a combination of rule-based and corpus-based approaches, where a rule-based system is used as the teacher (or an automatic data annotator) to a corpus-based system, outperforms either system in isolation.
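
A hypothetical sketch of that teacher-student arrangement: the rule-based side annotates raw text for free, and a memory-based learner trains on those labels. Both components below are toys, not the paper's parser or classifier:

    from collections import Counter, defaultdict

    def teacher(tokens: list) -> list:
        """Toy stand-in for the rule-based parser: label the word before
        the verb as SUBJ and the word after it as OBJ."""
        labels = ["-"] * len(tokens)
        if "eats" in tokens:
            v = tokens.index("eats")
            if v > 0:
                labels[v - 1] = "SUBJ"
            if v + 1 < len(tokens):
                labels[v + 1] = "OBJ"
        return labels

    def train_student(corpus: list) -> dict:
        """Memory-based learner: remember the label each word most often
        received from the teacher's automatic annotations."""
        memory = defaultdict(Counter)
        for tokens in corpus:
            for tok, lab in zip(tokens, teacher(tokens)):
                memory[tok][lab] += 1
        return {tok: counts.most_common(1)[0][0] for tok, counts in memory.items()}

    model = train_student([["mary", "eats", "apples"], ["john", "eats", "bread"]])
    print(model["apples"])   # OBJ, learned entirely from teacher output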

2002

pdf bib
Automatic rule learning for resource-limited MT
Jaime Carbonell | Katharina Probst | Erik Peterson | Christian Monson | Alon Lavie | Ralf Brown | Lori Levin
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

Machine Translation of minority languages presents unique challenges, including the paucity of bilingual training data and the unavailability of linguistically-trained speakers. This paper focuses on a machine learning approach to transfer-based MT, where data in the form of translations and lexical alignments are elicited from bilingual speakers, and a seeded version-space learning algorithm formulates and refines transfer rules. A rule-generalization lattice is defined based on LFG-style f-structures, permitting generalization operators in the search for the most general rules consistent with the elicited data. The paper presents these methods and illustrates them with examples.
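
The rule-generalization idea can be caricatured in a few lines: read a maximally specific transfer rule off one elicited example, then apply a generalization operator that lifts literal words to their categories. The lexicon, rule format, and operator below are invented for illustration; the paper's rules are LFG-style f-structures searched through a version space:

    LEXICON = {"house": "N", "casa": "N", "white": "ADJ", "blanca": "ADJ"}

    def specific_rule(src_words: list, tgt_words: list) -> tuple:
        """Most specific transfer rule consistent with one aligned example."""
        return (tuple(src_words), tuple(tgt_words))

    def generalize(rule: tuple) -> tuple:
        """One generalization step: replace literal words with categories."""
        src, tgt = rule
        lift = lambda w: LEXICON.get(w, w)
        return (tuple(map(lift, src)), tuple(map(lift, tgt)))

    seed = specific_rule(["white", "house"], ["casa", "blanca"])
    print(generalize(seed))   # (('ADJ', 'N'), ('N', 'ADJ')): the order flip
                              # now applies beyond the seed sentence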

pdf
The NESPOLE! speech-to-speech translation system
Alon Lavie | Lori Levin | Robert Frederking | Fabio Pianesi
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: System Descriptions

NESPOLE! is a speech-to-speech machine translation research system designed to provide fully functional speech-to-speech capabilities within real-world settings of common users involved in e-commerce applications. The project is funded jointly by the European Commission and the US NSF. The NESPOLE! system uses a client-server architecture to allow a common user, who is browsing web-pages on the internet, to connect seamlessly in real-time to an agent of the service provider, using a video-conferencing channel and with speech-to-speech translation services mediating the conversation. Shared web pages and annotated images, supported via a Whiteboard application, are available to enhance the communication.

pdf
Rapid adaptive development of semantic analysis grammars
Alicia Tribble | Alon Lavie | Lori Levin
Proceedings of the 9th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Spoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers
Chad Langley | Alon Lavie | Lori Levin | Dorcas Wallace | Donna Gates | Kay Peterson
Proceedings of the ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems

pdf
Balancing Expressiveness and Simplicity in an Interlingua for Task Based Dialogue
Lori Levin | Donna Gates | Dorcas Pianta | Roldano Cattoni | Nadia Mana | Kay Peterson | Alon Lavie | Fabio Pianesi
Proceedings of the ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems

pdf
A Multi-Perspective Evaluation of the NESPOLE! Speech-to-Speech Translation System
Alon Lavie | Florian Metze | Roldano Cattoni | Erica Costantini
Proceedings of the ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems

2001

pdf
Pre-processing of bilingual corpora for Mandarin-English EBMT
Ying Zhang | Ralf Brown | Robert Frederking | Alon Lavie
Proceedings of Machine Translation Summit VIII

Pre-processing of bilingual corpora plays an important role in Example-Based Machine Translation (EBMT) and Statistical-Based Machine Translation (SBMT). For our Mandarin-English EBMT system, pre-processing includes segmentation for Mandarin, bracketing for English, and building a statistical dictionary from the corpora. We used the Mandarin segmenter from the Linguistic Data Consortium (LDC), which uses dynamic programming with a frequency dictionary to segment the text. Although the frequency dictionary is large, it does not completely cover the corpora. In this paper, we describe the work we have done to improve the segmentation for Mandarin and the bracketing process for English to increase the length of English phrases. A statistical dictionary is built from the aligned bilingual corpus and used as feedback to re-segment and re-bracket the corpus. The process iterates several times to achieve better results. The final results of the pre-processing are a segmented and bracketed aligned bilingual corpus and a statistical dictionary. We achieved positive results: the average length of Chinese terms increased by about 60% and that of English phrases by about 10%, and the coverage of the statistical dictionary increased by about 30%.
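
A hypothetical sketch of that feedback loop, Mandarin side only (the English bracketing half is omitted): each pass segments with the current term list, and adjacent tokens that co-occur often are promoted to new dictionary terms for the next pass. The greedy segmenter and promotion rule are invented stand-ins for the LDC segmenter and the statistical dictionary:

    from collections import Counter

    def segment(sentence: str, terms: set) -> list:
        """Greedy longest-match segmentation; unknown spans fall back
        to single characters."""
        out, i = [], 0
        while i < len(sentence):
            for j in range(len(sentence), i, -1):
                if sentence[i:j] in terms or j == i + 1:
                    out.append(sentence[i:j])
                    i = j
                    break
        return out

    def preprocess(corpus: list, iterations: int = 3) -> tuple:
        terms: set = set()
        for _ in range(iterations):
            segmented = [segment(s, terms) for s in corpus]
            # Feedback: frequent adjacent-token pairs become new terms,
            # so the next pass produces longer segments.
            bigrams = Counter(a + b for toks in segmented
                              for a, b in zip(toks, toks[1:]))
            terms |= {t for t, c in bigrams.items() if c >= 2}
        return segmented, terms

    # ASCII strings stand in for Mandarin character sequences.
    print(preprocess(["ABAB", "ABC"]))
    # ([['AB', 'AB'], ['AB', 'C']], {'AB'}): average segment length grows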

pdf
Design and implementation of controlled elicitation for machine translation of low-density languages
Katharina Probst | Ralf Brown | Jaime Carbonell | Alon Lavie | Lori Levin | Erik Peterson
Workshop on MT2010: Towards a Road Map for MT

NICE is a machine translation project for low-density languages. We are building a tool that will elicit a controlled corpus from a bilingual speaker who is not an expert in linguistics. The corpus is intended to cover major typological phenomena, as it is designed to work for any language. Using implicational universals, we strive to minimize the number of sentences that each informant has to translate. From the elicited sentences, we learn transfer rules with a version space algorithm. Our vision for MT in the future is one in which systems can be quickly trained for new languages by native speakers, so that speakers of minor languages can participate in education, health care, government, and the internet without having to give up their languages.

pdf
Architecture and Design Considerations in NESPOLE!: a Speech Translation System for E-commerce Applications
Alon Lavie | Chad Langley | Alex Waibel | Fabio Pianesi | Gianni Lazzari | Paolo Coletti | Loredana Taddei | Franco Balducci
Proceedings of the First International Conference on Human Language Technology Research

pdf
Domain Portability in Speech-to-Speech Translation
Alon Lavie | Lori Levin | Tanja Schultz | Chad Langley | Benjamin Han | Alicia Tribble | Donna Gates | Dorcas Wallace | Kay Peterson
Proceedings of the First International Conference on Human Language Technology Research

pdf
Parsing the CHILDES Database: Methodology and Lessons Learned
Kenji Sagae | Alon Lavie | Brian MacWhinney
Proceedings of the Seventh International Workshop on Parsing Technologies

2000

pdf
Optimal Ambiguity Packing in Context-free Parsers with Interleaved Unification
Alon Lavie | Carolyn Penstein Rosé
Proceedings of the Sixth International Workshop on Parsing Technologies

Ambiguity packing is a well known technique for enhancing the efficiency of context-free parsers. However, in the case of unification-augmented context-free parsers where parsing is interleaved with feature unification, the propagation of feature structures imposes difficulties on the ability of the parser to effectively perform ambiguity packing. We demonstrate that a clever heuristic for prioritizing the execution order of grammar rules and parsing actions can achieve a high level of ambiguity packing that is provably optimal. We present empirical evaluations of the proposed technique, performed with both a Generalized LR parser and a chart parser, that demonstrate its effectiveness.
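
An invented fragment illustrating what ambiguity packing buys: derivations of the same category over the same span share one chart entry, so later rules (and the unification they trigger) apply once per packed node rather than once per derivation. The paper's actual contribution, the ordering heuristic that guarantees every derivation is packed before the node is consumed, is not reproduced here:

    from collections import defaultdict

    chart = defaultdict(list)   # (category, start, end) -> packed derivations

    def add_edge(category: str, start: int, end: int, derivation: tuple):
        key = (category, start, end)
        chart[key].append(derivation)   # pack rather than duplicate the node
        return key                      # consumers reference the packed node

    # Two PP-attachment readings over the same span pack into one VP node:
    add_edge("VP", 1, 7, ("V", "NP", "PP"))
    add_edge("VP", 1, 7, ("V", "NP"))   # here the NP itself contains the PP
    print(len(chart[("VP", 1, 7)]))     # 2 readings, 1 shared chart entry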

pdf
Lessons Learned from a Task-based Evaluation of Speech-to-Speech Machine Translation
Lori Levin | Boris Bartlog | Ariadna Font Llitjos | Donna Gates | Alon Lavie | Dorcas Wallace | Taro Watanabe | Monika Woszczyna
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf
Shallow Discourse Genre Annotation in CallHome Spanish
Klaus Ries | Lori Levin | Liza Valle | Alon Lavie | Alex Waibel
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf
Evaluation of a Practical Interlingua for Task-Oriented Dialogue
Lori Levin | Donna Gates | Alon Lavie | Fabio Pianesi | Dorcas Wallace | Taro Watanabe
NAACL-ANLP 2000 Workshop: Applied Interlinguas: Practical Applications of Interlingual Approaches to NLP

1999

pdf
Tagging of Speech Acts and Dialogue Games in Spanish Call Home
Lori Levin | Klaus Ries | Ann Thyme-Gobbel | Alon Lavie
Towards Standards and Tools for Discourse Tagging

1998

pdf bib
A modular approach to spoken language translation for large domains
Monika Woszczyna | Matthew Broadhead | Donna Gates | Marsal Gavaldá | Alon Lavie | Lori Levin | Alex Waibel
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers

The MT engine of the JANUS speech-to-speech translation system is designed around four main principles: 1) an interlingua approach that allows the efficient addition of new languages, 2) the use of semantic grammars that yield low cost high quality translations for limited domains, 3) modular grammars that support easy expansion into new domains, and 4) efficient integration of multiple grammars using multi-domain parse lattices and domain re-scoring. Within the framework of the C-STAR-II speech-to-speech translation effort, these principles are tested against the challenge of providing translation for a number of domains and language pairs with the additional restriction of a common interchange format.
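
A hypothetical sketch of principle 4: each domain grammar proposes candidate analyses, all candidates enter a common pool (standing in for the multi-domain parse lattice), and a domain re-scorer selects among them. The toy grammars, scores, and priors are invented:

    def parse_all_domains(utterance: str, grammars: dict) -> list:
        """Collect candidate analyses from every domain grammar."""
        return [(domain, analysis, score)
                for domain, grammar in grammars.items()
                for analysis, score in grammar(utterance)]

    def rescore(candidates: list, domain_prior: dict) -> tuple:
        """Re-weight each candidate by a per-domain prior and keep the best."""
        return max(candidates, key=lambda c: c[2] * domain_prior.get(c[0], 0.1))

    travel = lambda u: [("request-room", 0.8)] if "room" in u else []
    scheduling = lambda u: [("suggest-time", 0.6)] if "meet" in u else []
    cands = parse_all_domains("can we meet in the conference room",
                              {"travel": travel, "scheduling": scheduling})
    print(rescore(cands, {"travel": 0.3, "scheduling": 0.7}))
    # ('scheduling', 'suggest-time', 0.6): the domain prior overturns
    # the higher raw parse score of the travel reading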

1997

pdf
An Efficient Distribution of Labor in a Two Stage Robust Interpretation Process
Carolyn Penstein Rose | Alon Lavie
Second Conference on Empirical Methods in Natural Language Processing

pdf
Expanding the Domain of a Multi-lingual Speech-to-Speech Translation System
Alon Lavie | Lori Levin | Puming Zhan | Maite Taboada | Donna Gates | Mirella Lapata | Cortis Clark | Matthew Broadhead | Alex Waibel
Spoken Language Translation

1996

pdf
JANUS: multi-lingual translation of spontaneous speech in limited domain
Alon Lavie | Lori Levin | Alex Waibel | Donna Gates | Marsal Gavalda | Laura Mayfield
Conference of the Association for Machine Translation in the Americas

pdf
Multi-lingual Translation of Spontaneously Spoken Language in a Limited Domain
Alon Lavie | Donna Gates | Marsal Gavalda | Laura Mayfield | Alex Waibel | Lori Levin
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

1995

pdf
Using Context in Machine Translation of Spoken Language
Lori Levin | Oren Glickman | Yan Qu | Carolyn P. Rose | Donna Gates | Alon Lavie | Alex Waibel | Carol Van Ess-Dykema
Proceedings of the Sixth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

1994

pdf
An Integrated Heuristic Scheme for Partial Parse Evaluation
Alon Lavie
32nd Annual Meeting of the Association for Computational Linguistics

1993

pdf
GLR* – An Efficient Noise-skipping Parsing Algorithm For Context Free Grammars
Alon Lavie | Masaru Tomita
Proceedings of the Third International Workshop on Parsing Technologies

This paper describes GLR*, a parser that can parse any input sentence by ignoring unrecognizable parts of the sentence. In case the standard parsing procedure fails to parse an input sentence, the parser nondeterministically skips some word(s) in the sentence, and returns the parse with the fewest skipped words. Therefore, the parser will return some parse(s) for any input sentence, unless no part of the sentence can be recognized at all. The problem can be defined in the following way: given a context-free grammar G and a sentence S, find and parse S' – the largest subset of words of S such that S' ∈ L(G). The algorithm described in this paper is a modification of the Generalized LR (Tomita) parsing algorithm [Tomita, 1986]. The parser accommodates the skipping of words by allowing shift operations to be performed from inactive state nodes of the Graph Structured Stack. A heuristic similar to beam search makes the algorithm computationally tractable. There have been several other approaches to the problem of robust parsing, most of which are special-purpose algorithms [Carbonell and Hayes, 1984], [Ward, 1991], and others. Because our approach is a modification to a standard context-free parsing algorithm, all the techniques and grammars developed for the standard parser can be applied as they are. Also, in case the input sentence is by itself grammatical, our parser behaves exactly as the standard GLR parser. The modified parser, GLR*, has been implemented and integrated with the latest version of the Generalized LR Parser/Compiler [Tomita et al., 1988], [Tomita, 1990]. We discuss an application of the GLR* parser to spontaneous speech understanding and present some preliminary tests on the utility of the GLR* parser in such settings.
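
The problem definition in the abstract admits a brute-force reference implementation: try skip sets of increasing size and test each surviving subsequence for membership with CYK. GLR* itself does not work this way (it extends the GLR parser with shifts from inactive stack nodes and prunes with a beam); the sketch below, over an invented toy CNF grammar, only pins down the specification:

    from itertools import combinations

    # Toy CNF grammar: S -> NP VP; NP -> 'dogs' | 'cats'; VP -> 'sleep' | 'bark'
    UNARY = {"dogs": {"NP"}, "cats": {"NP"}, "sleep": {"VP"}, "bark": {"VP"}}
    BINARY = {("NP", "VP"): {"S"}}

    def cyk_accepts(words: list) -> bool:
        """Standard CYK membership test for the toy CNF grammar."""
        n = len(words)
        if n == 0:
            return False
        table = [[set() for _ in range(n + 1)] for _ in range(n)]
        for i, w in enumerate(words):
            table[i][i + 1] = set(UNARY.get(w, ()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for a in table[i][k]:
                        for b in table[k][j]:
                            table[i][j] |= BINARY.get((a, b), set())
        return "S" in table[0][n]

    def glr_star_spec(words: list) -> tuple:
        """Return a parsable subsequence with the fewest words skipped."""
        for skips in range(len(words) + 1):
            for skip_set in combinations(range(len(words)), skips):
                kept = [w for i, w in enumerate(words) if i not in skip_set]
                if cyk_accepts(kept):
                    return kept, skips
        return None, len(words)

    print(glr_star_spec("the dogs uh sleep".split()))   # (['dogs', 'sleep'], 2)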

pdf
Recent Advances in Janus: A Speech Translation System
M. Woszczyna | N. Coccaro | A. Eisele | A. Lavie | A. McNair | T. Polzin | I. Rogina | C. P. Rose | T. Sloboda | M. Tomita | J. Tsutsumi | N. Aoki-Waibel | A. Waibel | W. Ward
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993
