2021
Online Learning over Time in Adaptive Neural Machine Translation
Thierry Etchegoyhen | David Ponce | Harritxu Gete | Victor Ruiz
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Adaptive Machine Translation aims to dynamically incorporate user feedback to improve translation quality. In a post-editing scenario, user corrections of machine translation output are thus continuously incorporated into translation models, reducing or eliminating repetitive error editing and increasing the usefulness of automated translation. In neural machine translation, this goal may be achieved via online learning approaches, where network parameters are updated based on each new sample. This type of adaptation typically requires higher learning rates, which can affect the quality of the models over time. Alternatively, less aggressive online learning setups may preserve model stability, at the cost of reduced adaptation to user-generated corrections. In this work, we evaluate different online learning configurations over time, measuring their impact on user-generated samples, as well as on separate in-domain and out-of-domain datasets. Results in two different domains indicate that mixed approaches combining online learning with periodic batch fine-tuning might be needed to balance the benefits of online learning with model stability.
2020
To Case or not to case: Evaluating Casing Methods for Neural Machine Translation
Thierry Etchegoyhen | Harritxu Gete
Proceedings of the 12th Language Resources and Evaluation Conference
We present a comparative evaluation of casing methods for Neural Machine Translation, to help establish an optimal pre- and post-processing methodology. We trained and compared system variants on data prepared with the main casing methods available, namely translation of raw data without case normalisation, lowercasing with recasing, truecasing, case factors and inline casing. Machine translation models were prepared on WMT 2017 English-German and English-Turkish datasets, for all translation directions, and the evaluation includes reference metric results as well as a targeted analysis of case preservation accuracy. Inline casing, where case information is marked along lowercased words in the training data, proved to be the optimal approach overall in these experiments.
Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation
Thierry Etchegoyhen | Harritxu Gete
Proceedings of the 12th Language Resources and Evaluation Conference
We present the results of a case study in the exploitation of comparable corpora for Neural Machine Translation. A large comparable corpus for Basque-Spanish was prepared on the basis of independently-produced news by the Basque public broadcaster EiTB, and we discuss the impact of various techniques for exploiting the original data in order to determine optimal variants of the corpus. In particular, we show that filtering in terms of alignment thresholds and length-difference outliers has a significant impact on translation quality. The impact of tags identifying comparable data in the training datasets is also evaluated, with results indicating that this technique might help the models discriminate noisy information, in the form of informational imbalance between aligned sentences. The final corpus was prepared according to the experimental results and is made available to the scientific community for research purposes.
2018
Using Discourse Information for Education with a Spanish-Chinese Parallel Corpus
Shuyuan Cao | Harritxu Gete
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)