Francisco Casacuberta

Also published as: F. Casacuberta

2023

pdf abs
Segment-based Interactive Machine Translation at a Character Level
Angel Navarro | Miguel Domingo | Francisco Casacuberta
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

To produce high quality translations, human translators need to review and correct machine translation hypothesis in what it is known as post-editing. In order to reduce the human effort of this process, interactive machine translation proposed a collaborative framework in which human and machine work together to generate the translations. Among the many protocols proposed throughout the years, the segment-based one established a paradigm in which the post-editor was allowed to validate correct word sequences from a translation hypothesis and introduced a word correction to help the system improve the next hypothesis. In this work we propose an extension to this protocol: instead of having to the type the complete word correction, the system will complete the user’s correction while they are typing. We evaluated our proposal under a simulated environment, achieving a significant reduction of the human effort.

pdf abs
Exploring Multilingual Pretrained Machine Translation Models for Interactive Translation
Angel Navarro | Francisco Casacuberta
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

Pre-trained large language models (LLM) constitute very important tools in many artificial intelligence applications. In this work, we explore the use of these models in interactive machine translation environments. In particular, we have chosen mBART (multilingual Bidirectional and Auto-Regressive Transformer) as one of these LLMs. The system enables users to refine the translation output interactively by providing feedback. The system utilizes a two-step process, where the NMT (Neural Machine Translation) model generates a preliminary translation in the first step, and the user performs one correction in the second step–repeating the process until the sentence is correctly translated. We assessed the performance of both mBART and the fine-tuned version by comparing them to a state-of-the-art machine translation model on a benchmark dataset regarding user effort, WSR (Word Stroke Ratio), and MAR (Mouse Action Ratio). The experimental results indicate that all the models performed comparably, suggesting that mBART is a viable option for an interactive machine translation environment, as it eliminates the need to train a model from scratch for this particular task. The implications of this finding extend to the development of new machine translation models for interactive environments, as it indicates that novel pre-trained models exhibit state-of-the-art performance in this domain, highlighting the potential benefits of adapting these models to specific needs.

This paper presents the overview of the second Word-Level autocompletion (WLAC) shared task for computer-aided translation, which aims to automatically complete a target word given a translation context including a human typed character sequence. We largely adhere to the settings of the previous round of the shared task, but with two main differences: 1) The typed character sequence is obtained from the typing process of human translators to demonstrate system performance under real-world scenarios when preparing some type of testing examples; 2) We conduct a thorough analysis on the results of the submitted systems from three perspectives. From the experimental results, we observe that translation tasks are helpful to improve the performance of WLAC models. Additionally, our further analysis shows that the semantic error accounts for a significant portion of all errors, and thus it would be promising to take this type of errors into account in future.

pdf abs
PRHLT’s Submission to WLAC 2023
Angel Navarro | Miguel Domingo | Francisco Casacuberta
Proceedings of the Eighth Conference on Machine Translation

This paper describes our submission to the Word-Level AutoCompletion shared task of WMT23. We participated in the English–German and German–English categories. We extended our last year segment-based interactive machine translation approach to address its weakness when no context is available. Additionally, we fine-tune the pre-trained mT5 large language model to be used for autocompletion.

2022

pdf abs
On the Effectiveness of Quasi Character-Level Models for Machine Translation
Salvador Carrión | Francisco Casacuberta
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Neural Machine Translation (NMT) models often use subword-level vocabularies to deal with rare or unknown words. Although some studies have shown the effectiveness of purely character-based models, these approaches have resulted in highly expensive models in computational terms. In this work, we explore the benefits of quasi-character-level models for very low-resource languages and their ability to mitigate the effects of the catastrophic forgetting problem. First, we conduct an empirical study on the efficacy of these models, as a function of the vocabulary and training set size, for a range of languages, domains, and architectures. Next, we study the ability of these models to mitigate the effects of catastrophic forgetting in machine translation. Our work suggests that quasi-character-level models have practically the same generalization capabilities as character-based models but at lower computational costs. Furthermore, they appear to help achieve greater consistency between domains than standard subword-level models, although the catastrophic forgetting problem is not mitigated.

pdf abs
Limitations and Challenges of Unsupervised Cross-lingual Pre-training
Martín Quesada Zaragoza | Francisco Casacuberta
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Cross-lingual alignment methods for monolingual language representations have received notable attention in recent years. However, their use in machine translation pre-training remains scarce. This work tries to shed light on the effects of some of the factors that play a role in cross-lingual pre-training, both for cross-lingual mappings and their integration in supervised neural models. The results show that unsupervised cross-lingual methods are effective at inducing alignment even for distant languages and they benefit noticeably from subword information. However, we find that their effectiveness as pre-training models in machine translation is severely limited due to their cross-lingual signal being easily distorted by the principal network during training. Moreover, the learned bilingual projection is too restrictive to allow said network to learn properly when the embedding weights are frozen.

pdf abs
Few-Shot Regularization to Tackle Catastrophic Forgetting in Multilingual Machine Translation
Salvador Carrión | Francisco Casacuberta
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Increasing the number of tasks supported by a machine learning model without forgetting previously learned tasks is the goal of any lifelong learning system. In this work, we study how to mitigate the effects of the catastrophic forgetting problem to sequentially train a multilingual neural machine translation model using minimal past information. First, we describe the catastrophic forgetting phenomenon as a function of the number of tasks learned (language pairs) and the ratios of past data used during the learning of the new task. Next, we explore the importance of applying oversampling strategies for scenarios where only minimal amounts of past data are available. Finally, we derive a new loss function that minimizes the forgetting of previously learned tasks by actively re-weighting past samples and penalizing weights that deviate too much from the original model. Our work suggests that by using minimal amounts of past data and a simple regularization function, we can significantly mitigate the effects of the catastrophic forgetting phenomenon without increasing the computational costs.

Recent years have witnessed rapid advancements in machine translation, but the state-of-the-art machine translation system still can not satisfy the high requirements in some rigorous translation scenarios. Computer-aided translation (CAT) provides a promising solution to yield a high-quality translation with a guarantee. Unfortunately, due to the lack of popular benchmarks, the research on CAT is not well developed compared with machine translation. In this year, we hold a new shared task called Word-level AutoCompletion (WLAC) for CAT in WMT. Specifically, we introduce some resources to train a WLAC model, and particularly we collect data from CAT systems as a part of test data for this shared task. In addition, we employ both automatic and human evaluations to measure the performance of the submitted systems, and our final evaluation results reveal some findings for the WLAC task.

pdf abs
PRHLT’s Submission to WLAC 2022
Angel Navarro | Miguel Domingo | Francisco Casacuberta
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes our submission to the Word-Level AutoCompletion shared task of WMT22. We participated in the English–German and German–English categories. We proposed a segment-based interactive machine translation approach whose central core is a machine translation (MT) model which generates a complete translation from the context provided by the task. From there, we obtain the word which corresponds to the autocompletion. With this approach, we aim to show that it is possible to use the MT models in the autocompletion task by simply performing minor changes at the decoding step, obtaining satisfactory results.

2021

pdf abs
Introducing Mouse Actions into Interactive-Predictive Neural Machine Translation
Ángel Navarro | Francisco Casacuberta
Proceedings of Machine Translation Summit XVIII: Research Track

The quality of the translations generated by Machine Translation (MT) systems has highly improved through the years and but we are still far away to obtain fully automatic high-quality translations. To generate them and translators make use of Computer-Assisted Translation (CAT) tools and among which we find the Interactive-Predictive Machine Translation (IPMT) systems. In this paper and we use bandit feedback as the main and only information needed to generate new predictions that correct the previous translations. The application of bandit feedback reduces significantly the number of words that the translator need to type in an IPMT session. In conclusion and the use of this technique saves useful time and effort to translators and its performance improves with the future advances in MT and so we recommend its application in the actuals IPMT systems.

2020

In the translation industry, human experts usually supervise and post-edit machine translation hypotheses. Adaptive neural machine translation systems, able to incrementally update the underlying models under an online learning regime, have been proven to be useful to improve the efficiency of this workflow. However, this incremental adaptation is somewhat unstable, and it may lead to undesirable side effects. One of them is the sporadic appearance of made-up words, as a byproduct of an erroneous application of subword segmentation techniques. In this work, we extend previous studies on on-the-fly adaptation of neural machine translation systems. We perform a user study involving professional, experienced post-editors, delving deeper on the aforementioned problems. Results show that adaptive systems were able to learn how to generate the correct translation for task-specific terms, resulting in an improvement of the user’s productivity. We also observed a close similitude, in terms of morphology, between made-up words and the words that were expected.

pdf abs
NICE: Neural Integrated Custom Engines
Daniel Marín Buj | Daniel Ibáñez García | Zuzanna Parcheta | Francisco Casacuberta
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In this paper, we present a machine translation system implemented by the Translation Centre for the Bodies of the European Union (CdT). The main goal of this project is to create domain-specific machine translation engines in order to support machine translation services and applications to the Translation Centre’s clients. In this article, we explain the entire implementation process of NICE: Neural Integrated Custom Engines. We describe the problems identified and the solutions provided, and present the final results for different language pairs. Finally, we describe the work that will be done on this project in the future.

2019

pdf abs
Demonstration of a Neural Machine Translation System with Online Learning for Translators
Miguel Domingo | Mercedes García-Martínez | Amando Estela Pastor | Laurent Bié | Alexander Helle | Álvaro Peris | Francisco Casacuberta | Manuel Herranz Pérez
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present a demonstration of our system, which implements online learning for neural machine translation in a production environment. These techniques allow the system to continuously learn from the corrections provided by the translators. We implemented an end-to-end platform integrating our machine translation servers to one of the most common user interfaces for professional translators: SDL Trados Studio. We pretend to save post-editing effort as the machine is continuously learning from its mistakes and adapting the models to a specific domain or user style.

pdf abs
A Neural, Interactive-predictive System for Multimodal Sequence to Sequence Tasks
Álvaro Peris | Francisco Casacuberta
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present a demonstration of a neural interactive-predictive system for tackling multimodal sequence to sequence tasks. The system generates text predictions to different sequence to sequence tasks: machine translation, image and video captioning. These predictions are revised by a human agent, who introduces corrections in the form of characters. The system reacts to each correction, providing alternative hypotheses, compelling with the feedback provided by the user. The final objective is to reduce the human effort required during this correction process. This system is implemented following a client-server architecture. For accessing the system, we developed a website, which communicates with the neural model, hosted in a local server. From this website, the different tasks can be tackled following the interactive–predictive framework. We open-source all the code developed for building this system. The demonstration in hosted in http://casmacat.prhlt.upv.es/interactive-seq2seq.

pdf abs
Filtering of Noisy Parallel Corpora Based on Hypothesis Generation
Zuzanna Parcheta | Germán Sanchis-Trilles | Francisco Casacuberta
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

The filtering task of noisy parallel corpora in WMT2019 aims to challenge participants to create filtering methods to be useful for training machine translation systems. In this work, we introduce a noisy parallel corpora filtering system based on generating hypotheses by means of a translation model. We train translation models in both language pairs: Nepali–English and Sinhala–English using provided parallel corpora. We select the training subset for three language pairs (Nepali, Sinhala and Hindi to English) jointly using bilingual cross-entropy selection to create the best possible translation model for both language pairs. Once the translation models are trained, we translate the noisy corpora and generate a hypothesis for each sentence pair. We compute the smoothed BLEU score between the target sentence and generated hypothesis. In addition, we apply several rules to discard very noisy or inadequate sentences which can lower the translation score. These heuristics are based on sentence length, source and target similarity and source language detection. We compare our results with the baseline published on the shared task website, which uses the Zipporah model, over which we achieve significant improvements in one of the conditions in the shared task. The designed filtering system is domain independent and all experiments are conducted using neural machine translation.

2018

pdf abs
Are Automatic Metrics Robust and Reliable in Specific Machine Translation Tasks?
Mara Chinea-Rios | Alvaro Peris | Francisco Casacuberta
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We present a comparison of automatic metrics against human evaluations of translation quality in several scenarios which were unexplored up to now. Our experimentation was conducted on translation hypotheses that were problematic for the automatic metrics, as the results greatly diverged from one metric to another. We also compared three different translation technologies. Our evaluation shows that in most cases, the metrics capture the human criteria. However, we face failures of the automatic metrics when applied to some domains and systems. Interestingly, we find that automatic metrics applied to the neural machine translation hypotheses provide the most reliable results. Finally, we provide some advice when dealing with these problematic domains.

pdf abs
Creating the best development corpus for Statistical Machine Translation systems
Mara Chinea-Rios | Germán Sanchis-Trilles | Francisco Casacuberta
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We propose and study three different novel approaches for tackling the problem of development set selection in Statistical Machine Translation. We focus on a scenario where a machine translation system is leveraged for translating a specific test set, without further data from the domain at hand. Such test set stems from a real application of machine translation, where the texts of a specific e-commerce were to be translated. For developing our development-set selection techniques, we first conducted experiments in a controlled scenario, where labelled data from different domains was available, and evaluated the techniques both with classification and translation quality metrics. Then, the bestperforming techniques were evaluated on the e-commerce data at hand, yielding consistent improvements across two language directions.

pdf abs
Spelling Normalization of Historical Documents by Using a Machine Translation Approach
Miguel Domingo | Francisco Casacuberta
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

The lack of a spelling convention in historical documents makes their orthography to change depending on the author and the time period in which each document was written. This represents a problem for the preservation of the cultural heritage, which strives to create a digital text version of a historical document. With the aim of solving this problem, we propose three approaches—based on statistical, neural and character-based machine translation—to adapt the document’s spelling to modern standards. We tested these approaches in different scenarios, obtaining very encouraging results.

pdf abs
Data selection for NMT using Infrequent n-gram Recovery
Zuzanna Parcheta | Germán Sanchis-Trilles | Francisco Casacuberta
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Neural Machine Translation (NMT) has achieved promising results comparable with Phrase-Based Statistical Machine Translation (PBSMT). However, to train a neural translation engine, much more powerful machines are required than those required to develop translation engines based on PBSMT. One solution to reduce the training cost of NMT systems is the reduction of the training corpus through data selection (DS) techniques. There are many DS techniques applied in PBSMT which bring good results. In this work, we show that the data selection technique based on infrequent n-gram occurrence described in (Gasco ́ et al., 2012) commonly used for PBSMT systems also works well for NMT systems. We focus our work on selecting data according to specific corpora using the previously mentioned technique. The specific-domain corpora used for our experiments are IT domain and medical domain. The DS technique significantly reduces the execution time required to train the model between 87% and 93%. Also, it improves translation quality by up to 2.8 BLEU points. The improvements are obtained with just a small fraction of the data that accounts for between 6% and 20% of the total data.

pdf abs
A Machine Translation Approach for Modernizing Historical Documents Using Backtranslation
Miguel Domingo | Francisco Casacuberta
Proceedings of the 15th International Conference on Spoken Language Translation

Human language evolves with the passage of time. This makes historical documents to be hard to comprehend by contemporary people and, thus, limits their accessibility to scholars specialized in the time period in which a certain document was written. Modernization aims at breaking this language barrier and increase the accessibility of historical documents to a broader audience. To do so, it generates a new version of a historical document, written in the modern version of the document’s original language. In this work, we propose several machine translation approaches for modernizing historical documents. We tested these approaches in different scenarios, obtaining very encouraging results.

pdf abs
Active Learning for Interactive Neural Machine Translation of Data Streams
Álvaro Peris | Francisco Casacuberta
Proceedings of the 22nd Conference on Computational Natural Language Learning

We study the application of active learning techniques to the translation of unbounded data streams via interactive neural machine translation. The main idea is to select, from an unbounded stream of source sentences, those worth to be supervised by a human agent. The user will interactively translate those samples. Once validated, these data is useful for adapting the neural machine translation model. We propose two novel methods for selecting the samples to be validated. We exploit the information from the attention mechanism of a neural machine translation system. Our experiments show that the inclusion of active learning techniques into this pipeline allows to reduce the effort required during the process, while increasing the quality of the translation system. Moreover, it enables to balance the human effort required for achieving a certain translation quality. Moreover, our neural system outperforms classical approaches by a large margin.

2017

pdf
Adapting Neural Machine Translation with Parallel Synthetic Data
Mara Chinea-Ríos | Álvaro Peris | Francisco Casacuberta
Proceedings of the Second Conference on Machine Translation

2016

pdf
Beyond Prefix-Based Interactive Translation Prediction
Jesús González-Rubio | Daniel Ortiz-Martínez | Francisco Casacuberta | José Miguel Benedi Ruiz
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf
Interactive-Predictive Translation Based on Multiple Word-Segments
Miguel Domingo | Alvaro Peris | Francisco Casacuberta
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

2014

This paper describes a pilot study with a computed-assisted translation workbench aiming at testing the integration of online and active learning features. We investigate the effect of these features on translation productivity, using interactive translation prediction (ITP) as a baseline. User activity data were collected from five beta testers using key-logging and eye-tracking. User feedback was also collected at the end of the experiments in the form of retrospective think-aloud protocols. We found that OL performs better than ITP, especially in terms of translation speed. In addition, AL provides better translation quality than ITP for the same levels of user effort. We plan to incorporate these features in the final version of the workbench.

Efficient wordgraph for interactive translation prediction
Germán Sanchis-Trilles | Daniel Ortiz-Martínez | Francisco Casacuberta
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

pdf
CASMACAT: cognitive analysis and statistical methods for advanced computer aided translation
Philipp Koehn | Michael Carl | Francisco Casacuberta | Eva Marcos
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

pdf
The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation
Daniel Ortiz-Martínez | Francisco Casacuberta
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Inference of Phrase-Based Translation Models via Minimum Description Length
Jesús González-Rubio | Francisco Casacuberta
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

This paper describes the field trial and subsequent evaluation of a post-editing workbench which is currently under development in the EU-funded CasMaCat project. Based on user evaluations of the initial prototype of the workbench, this second prototype of the workbench includes a number of interactive features designed to improve productivity and user satisfaction. Using CasMaCat’s own facilities for logging keystrokes and eye tracking, data were collected from nine post-editors in a professional setting. These data were then used to investigate the effects of the interactive features on productivity, quality, user satisfaction and cognitive load as reflected in the post-editors gaze activity. These quantitative results are combined with the qualitative results derived from user questionnaires and interviews conducted with all the participants.

pdf abs
Online optimisation of log-linear weights in interactive machine translation
Mara Chinea Rios | Germán Sanchis-Trilles | Daniel Ortiz-Martínez | Francisco Casacuberta
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Whenever the quality provided by a machine translation system is not enough, a human expert is required to correct the sentences provided by the machine translation system. In such a setup, it is crucial that the system is able to learn from the errors that have already been corrected. In this paper, we analyse the applicability of discriminative ridge regression for learning the log-linear weights of a state-of-the-art machine translation system underlying an interactive machine translation framework, with encouraging results.

2013

pdf abs
Improving the minimum Bayes’ risk combination of machine translation systems
Jesús González-Rubio | Francisco Casacuberta
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

We investigate the problem of combining the outputs of different translation systems into a minimum Bayes’ risk consensus translation. We explore different risk formulations based on the BLEU score, and provide a dynamic programming decoding algorithm for each of them. In our experiments, these algorithms generated consensus translations with better risk, and more efficiently, than previous proposals.

pdf abs
Emprical study of a two-step approach to estimate translation quality
Jesús González-Rubio | J. Ramón Navarro-Cerdán | Francisco Casacuberta
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

We present a method to estimate the quality of automatic translations when reference translations are not available. Quality estimation is addressed as a two-step regression problem where multiple features are combined to predict a quality score. Given a set of features, we aim at automatically extracting the variables that better explain translation quality, and use them to predict the quality score. The soundness of our approach is assessed by the encouraging results obtained in an exhaustive experimentation with several feature sets. Moreover, the studied approach is highly-scalable allowing us to employ hundreds of features to predict translation quality.

pdf bib
CASMACAT: Cognitive Analysis and Statistical Methods for Advanced Computer Aided Translation
Philipp Koehn | Michael Carl | Francisco Casacuberta | Eva Marcos
Proceedings of Machine Translation Summit XIV: European projects

pdf
Interactive Machine Translation using Hierarchical Translation Models
Jesús González-Rubio | Daniel Ortiz-Martínez | José-Miguel Benedí | Francisco Casacuberta
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

This paper presents the submissions of the PRHLT group for the evaluation campaign of the International Workshop on Spoken Language Translation. We focus on the development of reliable translation systems between syntactically different languages (DIALOG task) and on the efficient training of SMT models in resource-rich scenarios (TALK task).

pdf
Log-linear weight optimisation via Bayesian Adaptation in Statistical Machine Translation
Germán Sanchis-Trilles | Francisco Casacuberta
Coling 2010: Posters

pdf abs
Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT
Jesús González-Rubio | Jorge Civera | Alfons Juan | Francisco Casacuberta
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Currently, a great effort is being carried out in the digitalisation of large historical document collections for preservation purposes. The documents in these collections are usually written in ancient languages, such as Latin or Greek, which limits the access of the general public to their content due to the language barrier. Therefore, digital libraries aim not only at storing raw images of digitalised documents, but also to annotate them with their corresponding text transcriptions and translations into modern languages. Unfortunately, ancient languages have at their disposal scarce electronic resources to be exploited by natural language processing techniques. This paper describes the compilation process of a novel Latin-Catalan parallel corpus as a new task for statistical machine translation (SMT). Preliminary experimental results are also reported using a state-of-the-art phrase-based SMT system. The results presented in this work reveal the complexity of the task and its challenging, but interesting nature for future development.

pdf
Online Learning for Interactive Statistical Machine Translation
Daniel Ortiz-Martínez | Ismael García-Varea | Francisco Casacuberta
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf
Balancing User Effort and Translation Error in Interactive Machine Translation via Confidence Measures
Jesús González-Rubio | Daniel Ortiz-Martínez | Francisco Casacuberta
Proceedings of the ACL 2010 Conference Short Papers

2009

pdf
Statistical Post-Editing of a Rule-Based Machine Translation System
Antonio-L. Lagarda | Vicent Alabau | Francisco Casacuberta | Roberto Silva | Enrique Díaz-de-Liaño
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf
Interactive Machine Translation Based on Partial Statistical Phrase-based Alignments
Daniel Ortiz-Martínez | Ismael García-Varea | Francisco Casacuberta
Proceedings of the International Conference RANLP-2009

pdf
GREAT: A Finite-State Machine Translation Toolkit Implementing a Grammatical Inference Approach for Transducer Inference (GIATI)
Jorge González | Francisco Casacuberta
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

2008

pdf
A finite-state framework for log-linear models in machine translation
Jorge González | Francisco Casacuberta
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf
A novel alignment model inspired on IBM Model 1
Jesús González-Rubio | Germán Sanchis-Trilles | Alfons Juan | Francisco Casacuberta
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf
Applying boosting to statistical machine translation
Antonio L. Lagarda | Francisco Casacuberta
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf
Phrase-level alignment generation using a smoothed loglinear phrase-based statistical alignment model
Daniel Ortiz-Martínez | Ismael García-Varea | Francisco Casacuberta
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf
Improving Interactive Machine Translation via Mouse Actions
Germán Sanchis-Trilles | Daniel Ortiz-Martínez | Jorge Civera | Francisco Casacuberta | Enrique Vidal | Hieu Hoang
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib abs
A comparison of linguistically and statistically enhanced models for speech-to-speech machine translation
Alicia Pérez | Víctor Guijarrubia | Raquel Justo | M. Inés Torres | Francisco Casacuberta
Proceedings of the Fourth International Workshop on Spoken Language Translation

The goal of this work is to improve current translation models by taking into account additional knowledge sources such as semantically motivated segmentation or statistical categorization. Specifically, two different approaches are discussed. On the one hand, phrase-based approach, and on the other hand, categorization. For both approaches, both statistical and linguistic alternatives are explored. As for translation framework, finite-state transducers are considered. These are versatile models that can be easily integrated on-the-fly with acoustic models for speech translation purposes. In what the experimental framework concerns, all the models presented were evaluated and compared taking confidence intervals into account.

pdf abs
Using word posterior probabilities in lattice translation
Vicente Alabau | Alberto Sanchis | Francisco Casacuberta
Proceedings of the Fourth International Workshop on Spoken Language Translation

In this paper we describe the statistical machine translation system developed at ITI/UPV, which aims especially at speech recognition and statistical machine translation integration, for the evaluation campaign of the International Workshop on Spoken Language Translation (2007). The system we have developed takes advantage of an improved word lattice representation that uses word posterior probabilities. These word posterior probabilities are then added as a feature to a log-linear model. This model includes a stochastic finite-state transducer which allows an easy lattice integration. Furthermore, it provides a statistical phrase-based reordering model that is able to perform local reorderings of the output. We have tested this model on the Italian-English corpus, for clean text, 1-best ASR and lattice ASR inputs. The results and conclusions of such experiments are reported at the end of this paper.

pdf bib
Improving speech-to-speech translation using word posterior probabilities
Vicente Alabau | Alberto Sanchis | Francisco Casacuberta
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Combining translation models in statistical machine translation
Jesús Andrés-Ferrer | Ismael Garcia-Varea | Francisco Casacuberta
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Reordering via n-best lists for Spanish-Basque translation
Germán Sanchis | Francisco Casacuberta
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
An Integrated Architecture for Speech-Input Multi-Target Machine Translation
Alicia Pérez | M. Teresa González | M. Inés Torres | Francisco Casacuberta
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf
Speech-Input Multi-Target Machine Translation
Alicia Pérez | M. Teresa González | M. Inés Torres | Francisco Casacuberta
Proceedings of the Second Workshop on Statistical Machine Translation

2006

pdf
Statistical Phrase-Based Models for Interactive Computer-Assisted Translation
Jesús Tomás | Francisco Casacuberta
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
Generalized Stack Decoding Algorithms for Statistical Machine Translation
Daniel Ortiz Martínez | Ismael García Varea | Francisco Casacuberta
Proceedings on the Workshop on Statistical Machine Translation

2005

pdf abs
Thot: a Toolkit To Train Phrase-based Statistical Translation Models
Daniel Ortiz-Martínez | Ismael García-Varea | Francisco Casacuberta
Proceedings of Machine Translation Summit X: Papers

In this paper, we present the Thot toolkit, a set of tools to train phrase-based models for statistical machine translation, which is publicly available as open source software. The toolkit obtains phrase-based models from word-based alignment models; to our knowledge, this functionality has not been offered by any publicly available toolkit. The Thot toolkit also implements a new way for estimating phrase models, this allows to obtain more complete phrase models than the methods described in the literature, including a segmentation length submodel. The toolkit output can be given in different formats in order to be used by other statistical machine translation tools like Pharaoh, which is a beam search decoder for phrase-based alignment models which was used in order to perform translation experiments with the generated models. Additionally, the Thot toolkit can be used to obtain the best alignment between a sentence pair at phrase level.

2004

pdf
Machine Translation with Inferred Stochastic Finite-State Transducers
Francisco Casacuberta | Enrique Vidal
Computational Linguistics, Volume 30, Number 2, June 2004

pdf abs
Translation Memories Enrichment by Statistical Bilingual Segmentation
Francisco Nevado | Francisco Casacuberta | Josu Landa
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A majority of Machine Aided Translation systems are based on comparisons between a source sentence and reference sentences stored in Translation Memories (TMs). The translation search is done by looking for sentences in a database which are similar to the source sentence. TMs have two basic limitations: the dependency on the repetition of complete sentences and the high cost of building a TM. As human translators do not only remember sentences from their preceding translations, but they also decompose the sentence to be translated and work with smaller units, it would be desirable to enrich the TM database with smaller translation units. This enrichment should also be automatic in order not to increase the cost of building a TM. We propose the application of two automatic bilingual segmentation techniques based on statistical translation methods in order to create new, shorter bilingual segments to be included in a TM database. An evaluation of the two techniques is carried out for a bilingual Basque-Spanish task.

2003

pdf
Adapting finite-state translation to the TransType2 project
Elsa Cubel | Jorge González | Antonio Lagarda | Francisco Casacuberta | Alfons Juan | Enrique Vidal
EAMT Workshop: Improving MT through other language technology tools: resources and tools for building MT

pdf abs
On the use of statistical machine-translation techniques within a memory-based translation system (AMETRA)
Daniel Ortíz | Ismael García-Varea | Francisco Casacuberta | Antonio Lagarda | Jorge González
Proceedings of Machine Translation Summit IX: Papers

The goal of the AMETRA project is to make a computer-assisted translation tool from the Spanish language to the Basque language under the memory-based translation framework. The system is based on a large collection of bilingual word-segments. These segments are obtained using linguistic or statistical techniques from a Spanish-Basque bilingual corpus consisting of sentences extracted from the Basque Country’s of£cial government record. One of the tasks within the global information document of the AMETRA project is to study the combination of well-known statistical techniques for the translation of short sequences and techniques for memory-based translation. In this paper, we address the problem of constructing a statistical module to deal with the task of translating segments. The task undertaken in the AMETRA project is compared with other existing translation tasks, This study includes the results of some preliminary experiments we have carried out using well-known statistical machine translation tools and techniques.

pdf
Parallel Corpora Segmentation Using Anchor Words
Francisco Nevado | Francisco Casacuberta | Enrique Vidal
Proceedings of the 7th International EAMT workshop on MT and other language technology tools, Improving MT through other language technology tools, Resource and tools for building MT at EACL 2003

pdf
A Quantitative Method for Machine Translation Evaluation
Jesús Tomás | Josep Àngel Mas | Francisco Casacuberta
Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?

2002

pdf abs
Efficient integration of maximum entropy lexicon models within the training of statistical alignment models
Ismael García-Varea | Franz J. Och | Hermann Ney | Francisco Casacuberta
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

Maximum entropy (ME) models have been successfully applied to many natural language problems. In this paper, we show how to integrate ME models efficiently within a maximum likelihood training scheme of statistical machine translation models. Specifically, we define a set of context-dependent ME lexicon models and we present how to perform an efficient training of these ME models within the conventional expectation-maximization (EM) training of statistical translation models. Experimental results are also given in order to demonstrate how these ME models improve the results obtained with the traditional translation models. The results are presented by means of alignment quality comparing the resulting alignments with manually annotated reference alignments.

pdf
Improving Alignment Quality in Statistical Machine Translation Using Context-dependent Maximum Entropy Models
Ismael García Varea | Franz J. Och | Hermann Ney | Francisco Casacuberta
COLING 2002: The 19th International Conference on Computational Linguistics

pdf
Architectures for Speech-to-Speech Translation Using Finite-state Models
Francisco Casacuberta | Enrique Vidal | Juan Miguel Vilar
Proceedings of the ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems

2001

pdf abs
Search algorithms for statistical machine translation based on dynamic programming and pruning techniques
Ismael García-Varea | Francisco Casacuberta
Proceedings of Machine Translation Summit VIII

The increasing interest in the statistical approach to Machine Translation is due to the development of effective algorithms for training the probabilistic models proposed so far. However, one of the open problems with statistical machine translation is the design of efficient algorithms for translating a given input string. For some interesting models, only (good) approximate solutions can be found. Recently, a dynamic programming-like algorithm for the IBM-Model 2 has been proposed which is based on an iterative process of refinement solutions. A new dynamic programming-like algorithm is proposed here to deal with more complex IBM models (models 3 to 5). The computational cost of the algorithm is reduced by using an alignment-based pruning technique. Experimental results with the so-called “Tourist Task” are also presented.

A finite-state, rule-based morphological analyser is presented here, within the framework of machine translation system TAVAL. This morphological analyser introduces specific features which are particularly useful for translation, such as the detection and morphological tagging of word groups that act as a single lexical unit for translation purposes. The case where words in one such group are not strictly contiguous is also covered. A brief description of the Spanish-to-Catalan and Catalan-to-Spanish translation system TAVAL is given in the paper.

pdf abs
Monotone statistical translation using word groups
Jesús Tomás | Francisco Casacuberta
Proceedings of Machine Translation Summit VIII

A new system for statistical natural language translation for languages with similar grammar is introduced. Specifically, it can be used with Romanic Languages, such as French, Spanish or Catalan. The statistical translation uses two sources of information: a language model and a translation model. The language model used is a standard trigram model. A new approach is defined in the translation model. The two main properties of the translation model are: the translation probabilities are computed between groups of words and the alignment between those groups is monotone. That is, the order between the word groups in the source sentence is conserved in the target sentence. Once, the translation model has been defined, we present an algorithm to infer its parameters from training samples. The translation process is carried out with an efficient algorithm based on stack-decoding. Finally, we present some translation results from Catalan to Spanish and compare our model with other conventional models.

pdf
Refined Lexicon Models for Statistical Machine Translation using a Maximum Entropy Approach
Ismael García-Varea | Franz J. Och | Hermann Ney | Francisco Casacuberta
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics