2020
pdf
abs
OpusTools and Parallel Corpus Diagnostics
Mikko Aulamo
|
Umut Sulubacak
|
Sami Virpioja
|
Jörg Tiedemann
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.
pdf
MT for Subtitling: Investigating professional translators’ user experience and feedback
Maarit Koponen
|
Umut Sulubacak
|
Kaisa Vitikainen
|
Jörg Tiedemann
Proceedings of 1st Workshop on Post-Editing in Modern-Day Translation
pdf
abs
MT for subtitling: User evaluation of post-editing productivity
Maarit Koponen
|
Umut Sulubacak
|
Kaisa Vitikainen
|
Jörg Tiedemann
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
This paper presents a user evaluation of machine translation and post-editing for TV subtitles. Based on a process study where 12 professional subtitlers translated and post-edited subtitles, we compare effort in terms of task time and number of keystrokes. We also discuss examples of specific subtitling features like condensation, and how these features may have affected the post-editing results. In addition to overall MT quality, segmentation and timing of the subtitles are found to be important issues to be addressed in future work.
pdf
abs
The University of Helsinki Submission to the IWSLT2020 Offline Speech Translation Task
Raúl Vázquez
|
Mikko Aulamo
|
Umut Sulubacak
|
Jörg Tiedemann
Proceedings of the 17th International Conference on Spoken Language Translation
This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but also allows for knowledge transfer across modalities.
2019
pdf
abs
Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches
Talha Çolakoğlu
|
Umut Sulubacak
|
Ahmet Cüneyd Tantuğ
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token level pipeline of modules, heavily dependent on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.
pdf
abs
The University of Helsinki Submissions to the WMT19 News Translation Task
Aarne Talman
|
Umut Sulubacak
|
Raúl Vázquez
|
Yves Scherrer
|
Sami Virpioja
|
Alessandro Raganato
|
Arvi Hurskainen
|
Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
In this paper we present the University of Helsinki submissions to the WMT 2019 shared news translation task in three language pairs: English-German, English-Finnish and Finnish-English. This year we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German we trained sentence-level transformer models and compared different document-level translation approaches. For Finnish-English and English-Finnish we focused on different segmentation approaches and we also included a rule-based system for English-Finnish.
pdf
abs
The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task
Raúl Vázquez
|
Umut Sulubacak
|
Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
2018
pdf
abs
The MeMAD Submission to the IWSLT 2018 Speech Translation Task
Umut Sulubacak
|
Jörg Tiedemann
|
Aku Rouhe
|
Stig-Arne Grönroos
|
Mikko Kurimo
Proceedings of the 15th International Conference on Spoken Language Translation
This paper describes the MeMAD project entry to the IWSLT Speech Translation Shared Task, addressing the translation of English audio into German text. Between the pipeline and end-to-end model tracks, we participated only in the former, with three contrastive systems. We also tried the latter, but were not able to finish our end-to-end model in time. All of our systems start by transcribing the audio into text through an automatic speech recognition (ASR) model trained on the TED-LIUM English Speech Recognition Corpus (TED-LIUM). Afterwards, we feed the transcripts into English-German text-based neural machine translation (NMT) models. Our systems employ three different translation models trained on separate training sets compiled from the English-German part of the TED Speech Translation Corpus (TED-TRANS) and the OPENSUBTITLES2018 section of the OPUS collection. In this paper, we also describe the experiments leading up to our final systems. Our experiments indicate that using OPENSUBTITLES2018 in training significantly improves translation performance. We also experimented with various pre- and postprocessing routines for the NMT module, but we did not have much success with these. Our best-scoring system attains a BLEU score of 16.45 on the test set for this year’s task.
pdf
abs
The MeMAD Submission to the WMT18 Multimodal Translation Task
Stig-Arne Grönroos
|
Benoit Huet
|
Mikko Kurimo
|
Jorma Laaksonen
|
Bernard Merialdo
|
Phu Pham
|
Mats Sjöberg
|
Umut Sulubacak
|
Jörg Tiedemann
|
Raphael Troncy
|
Raúl Vázquez
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
This paper describes the MeMAD project entry to the WMT Multimodal Machine Translation Shared Task. We propose adapting the Transformer neural machine translation (NMT) architecture to a multi-modal setting. In this paper, we also describe the preliminary experiments with text-only translation systems leading us up to this choice. We have the top scoring system for both English-to-German and English-to-French, according to the automatic metrics for flickr18. Our experiments show that the effect of the visual features in our system is small. Our largest gains come from the quality of the underlying text-only NMT system. We find that appropriate use of additional data is effective.
2017
pdf
bib
abs
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman
|
Martin Popel
|
Milan Straka
|
Jan Hajič
|
Joakim Nivre
|
Filip Ginter
|
Juhani Luotolahti
|
Sampo Pyysalo
|
Slav Petrov
|
Martin Potthast
|
Francis Tyers
|
Elena Badmaeva
|
Memduh Gokirmak
|
Anna Nedoluzhko
|
Silvie Cinková
|
Jan Hajič jr.
|
Jaroslava Hlaváčová
|
Václava Kettnerová
|
Zdeňka Urešová
|
Jenna Kanerva
|
Stina Ojala
|
Anna Missilä
|
Christopher D. Manning
|
Sebastian Schuster
|
Siva Reddy
|
Dima Taji
|
Nizar Habash
|
Herman Leung
|
Marie-Catherine de Marneffe
|
Manuela Sanguinetti
|
Maria Simi
|
Hiroshi Kanayama
|
Valeria de Paiva
|
Kira Droganova
|
Héctor Martínez Alonso
|
Çağrı Çöltekin
|
Umut Sulubacak
|
Hans Uszkoreit
|
Vivien Macketanz
|
Aljoscha Burchardt
|
Kim Harris
|
Katrin Marheinecke
|
Georg Rehm
|
Tolga Kayadelen
|
Mohammed Attia
|
Ali Elkahky
|
Zhuoran Yu
|
Emily Pitler
|
Saran Lertpradit
|
Michael Mandl
|
Jesse Kirchner
|
Hector Fernandez Alcalde
|
Jana Strnadová
|
Esha Banerjee
|
Ruli Manurung
|
Antonio Stella
|
Atsuko Shimada
|
Sookyoung Kwak
|
Gustavo Mendonça
|
Tatiana Lando
|
Rattima Nitisaroj
|
Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
2016
pdf
abs
Universal Dependencies for Turkish
Umut Sulubacak
|
Memduh Gokirmak
|
Francis Tyers
|
Çağrı Çöltekin
|
Joakim Nivre
|
Gülşen Eryiğit
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.
2015
pdf
Annotation and Extraction of Multiword Expressions in Turkish Treebanks
Gülşen Eryiğit
|
Kübra Adali
|
Dilara Torunoğlu-Selamet
|
Umut Sulubacak
|
Tuğba Pamay
Proceedings of the 11th Workshop on Multiword Expressions
pdf
The Annotation Process of the ITU Web Treebank
Tuğba Pamay
|
Umut Sulubacak
|
Dilara Torunoğlu-Selamet
|
Gülşen Eryiğit
Proceedings of the 9th Linguistic Annotation Workshop
2013
pdf
Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank
Umut Sulubacak
|
Gülşen Eryiğit
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages