David Ponce
2023
Unsupervised Subtitle Segmentation with Masked Language Models
David Ponce | Thierry Etchegoyhen | Victor Ruiz
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We describe a novel unsupervised approach to subtitle segmentation, based on pretrained masked language models, where line endings and subtitle breaks are predicted according to the likelihood of punctuation occurring at candidate segmentation points. Our approach obtained competitive results in terms of segmentation accuracy across metrics, while also fully preserving the original text and complying with length constraints. Although supervised models trained on in-domain data and with access to source audio information can provide better segmentation accuracy, our approach is highly portable across languages and domains and may constitute a robust off-the-shelf solution for subtitle segmentation.
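As a rough illustration of the punctuation-likelihood idea in this abstract, the sketch below masks each candidate segmentation point and sums the probability mass a pretrained masked language model assigns to punctuation tokens at that position; the model name, punctuation set, and scoring function are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of punctuation-likelihood scoring, assuming an
# off-the-shelf masked LM; not the authors' code.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # assumed; any masked LM would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

# Assumed punctuation inventory; tokens missing from the vocab map to UNK.
PUNCT = [".", ",", ";", ":", "!", "?"]
punct_ids = [tokenizer.convert_tokens_to_ids(p) for p in PUNCT]

def break_score(words, i):
    """Probability mass the masked LM puts on punctuation right after
    words[i], used here as a segmentation-point score."""
    text = " ".join(words[: i + 1]) + f" {tokenizer.mask_token} " + " ".join(words[i + 1 :])
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return logits.softmax(-1)[punct_ids].sum().item()

words = "we describe a novel unsupervised approach to subtitle segmentation".split()
scores = [(i, break_score(words, i)) for i in range(len(words) - 1)]
print(max(scores, key=lambda s: s[1]))  # most punctuation-like candidate break
```

A full segmenter would then pick the highest-scoring candidates subject to the subtitle length constraints the abstract mentions; that selection step is omitted here.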
2022
TANDO: A Corpus for Document-level Machine Translation
Harritxu Gete | Thierry Etchegoyhen | David Ponce | Gorka Labaka | Nora Aranberri | Ander Corral | Xabier Saralegi | Igor Ellakuria | Maite Martin
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Document-level Neural Machine Translation aims to increase the quality of neural translation models by taking into account contextual information. Properly modelling information beyond the sentence level can result in improved machine translation output in terms of coherence, cohesion and consistency. Suitable corpora for context-level modelling are necessary to both train and evaluate context-aware systems, but are still relatively scarce. In this work we describe TANDO, a document-level corpus for the under-resourced Basque-Spanish language pair, which we share with the scientific community. The corpus is composed of parallel data from three different domains and has been prepared with context-level information. Additionally, the corpus includes contrastive test sets for fine-grained evaluations of gender and register contextual phenomena on both source and target language sides. To establish the usefulness of the corpus, we trained and evaluated baseline Transformer models and context-aware variants based on context concatenation. Our results indicate that the corpus is suitable for fine-grained evaluation of document-level machine translation systems.
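The context-aware baselines mentioned above rely on context concatenation, where a standard Transformer sees the previous sentence(s) joined to the current one. The following minimal sketch illustrates that data preparation step; the `<BRK>` separator token and the one-sentence context window are assumptions, not the TANDO preprocessing itself.

```python
# A minimal sketch of context concatenation for document-level NMT,
# assuming a <BRK> separator token between context and current sentence.
BREAK = "<BRK>"  # assumed separator; the actual corpus setup may differ

def concat_context(doc_sentences, n_context=1):
    """Prepend each sentence with its n previous sentences, joined by a
    break token, so a sentence-level Transformer can exploit context."""
    out = []
    for i, sent in enumerate(doc_sentences):
        ctx = doc_sentences[max(0, i - n_context): i]
        out.append(f" {BREAK} ".join(ctx + [sent]))
    return out

doc = ["Nora left early.", "She said she was tired."]
print(concat_context(doc))
# ['Nora left early.', 'Nora left early. <BRK> She said she was tired.']
```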
2021
Online Learning over Time in Adaptive Neural Machine Translation
Thierry Etchegoyhen | David Ponce | Harritxu Gete | Victor Ruiz
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Adaptive Machine Translation purports to dynamically include user feedback to improve translation quality. In a post-editing scenario, user corrections of machine translation output are thus continuously incorporated into translation models, reducing or eliminating repetitive error editing and increasing the usefulness of automated translation. In neural machine translation, this goal may be achieved via online learning approaches, where network parameters are updated based on each new sample. This type of adaptation typically requires higher learning rates, which can affect the quality of the models over time. Alternatively, less aggressive online learning setups may preserve model stability, at the cost of reduced adaptation to user-generated corrections. In this work, we evaluate different online learning configurations over time, measuring their impact on user-generated samples, as well as on separate in-domain and out-of-domain datasets. Results in two different domains indicate that mixed approaches combining online learning with periodic batch fine-tuning might be needed to balance the benefits of online learning with model stability.
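The mixed strategy this abstract points to can be sketched as one gradient step per incoming post-edited sample, plus a periodic fine-tuning pass over the accumulated samples to restore stability. Everything below (the optimizers, learning rates, and update schedule) is an assumption for illustration, not the paper's configuration.

```python
# A minimal sketch of online learning with periodic batch fine-tuning,
# assuming SGD and a simple replay buffer; not the paper's exact setup.
import torch

def online_with_periodic_batch(model, loss_fn, stream, replay, every=100,
                               online_lr=1e-4, batch_lr=1e-5):
    """One gradient step per new (source, reference) sample, plus a
    periodic fine-tuning pass over a replay buffer of past samples."""
    online_opt = torch.optim.SGD(model.parameters(), lr=online_lr)
    batch_opt = torch.optim.SGD(model.parameters(), lr=batch_lr)
    for step, (src, ref) in enumerate(stream, 1):
        loss = loss_fn(model(src), ref)  # adapt to the new sample
        online_opt.zero_grad()
        loss.backward()
        online_opt.step()
        replay.append((src, ref))
        if step % every == 0:            # periodic batch fine-tuning
            for b_src, b_ref in replay:
                b_loss = loss_fn(model(b_src), b_ref)
                batch_opt.zero_grad()
                b_loss.backward()
                batch_opt.step()
    return model

# Toy usage with a linear "model" standing in for an NMT system.
model = torch.nn.Linear(4, 4)
stream = [(torch.randn(2, 4), torch.randn(2, 4)) for _ in range(300)]
online_with_periodic_batch(model, torch.nn.functional.mse_loss, stream, replay=[])
```

The higher online rate drives per-sample adaptation, while the lower batch rate in the periodic pass counteracts the drift the abstract attributes to aggressive online updates.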