Jörg Tiedemann

Also published as: Joerg Tiedemann, Jorg Tiedemann

2021

NLI Data Sanity Check: Assessing the Effect of Data Corruption on Model Performance
Aarne Talman | Marianna Apidianaki | Stergios Chatzikyriakidis | Jörg Tiedemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Pre-trained neural language models give high performance on natural language inference (NLI) tasks. But whether they actually understand the meaning of the processed sequences is still unclear. We propose a new diagnostics test suite that makes it possible to assess whether a dataset constitutes a good testbed for evaluating the models’ meaning understanding capabilities. We specifically apply controlled corruption transformations to widely used benchmarks (MNLI and ANLI), which involve removing entire word classes and often lead to nonsensical sentence pairs. If model accuracy on the corrupted data remains high, then the dataset is likely to contain statistical biases and artefacts that guide prediction. Conversely, a large decrease in model accuracy indicates that the original dataset provides a proper challenge to the models’ reasoning capabilities. Hence, our proposed controls can serve as a crash test for developing high quality data for NLI tasks.
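
As an illustration of the corruption idea in this abstract, the following minimal sketch removes one word class from both sides of an NLI pair. It assumes the tokens are already tagged with coarse part-of-speech labels; the tags, the example sentences and the helper name `remove_word_class` are hypothetical and are not taken from the paper, which does not specify its tooling here.

```python
from typing import List, Tuple

# (word, coarse POS tag); the tags below are hand-assigned for illustration only.
TaggedToken = Tuple[str, str]


def remove_word_class(tokens: List[TaggedToken], pos: str) -> str:
    """Drop every token whose tag equals `pos` and rejoin the remaining words."""
    return " ".join(word for word, tag in tokens if tag != pos)


premise = [("A", "DET"), ("man", "NOUN"), ("plays", "VERB"),
           ("a", "DET"), ("guitar", "NOUN")]
hypothesis = [("A", "DET"), ("person", "NOUN"), ("makes", "VERB"),
              ("music", "NOUN")]

# Corrupt both sentences by removing all nouns.
corrupted_pair = (remove_word_class(premise, "NOUN"),
                  remove_word_class(hypothesis, "NOUN"))
print(corrupted_pair)
```

If a model still predicts the original label for such a pair, the prediction is unlikely to rest on the meaning of the sentences.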

Boosting Neural Machine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation
Mikko Aulamo | Sami Virpioja | Yves Scherrer | Jörg Tiedemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We consider a low-resource translation task from Finnish into Northern Sámi. Collecting all available parallel data between the languages, we obtain around 30,000 sentence pairs. However, there exists a significantly larger monolingual Northern Sámi corpus, as well as a rule-based machine translation (RBMT) system between the languages. To make the best use of the monolingual data in a neural machine translation (NMT) system, we use the backtranslation approach to create synthetic parallel data from it using both NMT and RBMT systems. Evaluating the results on an in-domain test set and a small out-of-domain set, we find that the RBMT backtranslation outperforms NMT backtranslation clearly for the out-of-domain test set, but also slightly for the in-domain data, for which the NMT backtranslation model provided clearly better BLEU scores than the RBMT. In addition, combining both backtranslated data sets improves the RBMT approach only for the in-domain test set. This suggests that the RBMT system provides general-domain knowledge that cannot be found in the relatively small parallel training data.

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Paola Merlo | Jorg Tiedemann | Reut Tsarfaty
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The Helsinki submission to the AmericasNLP shared task
Raúl Vázquez | Yves Scherrer | Sami Virpioja | Jörg Tiedemann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached the first rank on all language pairs in track 1, and first rank on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.

Towards a balanced annotated Low Saxon dataset for diachronic investigation of dialectal variation
Janine Siewert | Yves Scherrer | Jörg Tiedemann
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

Creating an Aligned Russian Text Simplification Dataset from Language Learner Data
Anna Dmitrieva | Jörg Tiedemann
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Parallel language corpora where regular texts are aligned with their simplified versions can be used in both natural language processing and theoretical linguistic studies. They are essential for the task of automatic text simplification, but can also provide valuable insights into the characteristics that make texts more accessible and reveal strategies that human experts use to simplify texts. Today, there exist a few parallel datasets for English and Simple English, but many other languages lack such data. In this paper we describe our work on creating an aligned Russian-Simple Russian dataset composed of Russian literature texts adapted for learners of Russian as a foreign language. This will be the first parallel dataset in this domain, and one of the first Simple Russian datasets in general.

On the differences between BERT and MT encoder spaces and how to address them in translation tasks
Raúl Vázquez | Hande Celikkanat | Mathias Creutz | Jörg Tiedemann
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Various studies show that pretrained language models such as BERT cannot straightforwardly replace encoders in neural machine translation despite their enormous success in other tasks. This is even more astonishing considering the similarities between the architectures. This paper sheds some light on the embedding spaces they create, using average cosine similarity, contextuality metrics and measures for representational similarity for comparison, revealing that BERT and NMT encoder representations look significantly different from one another. In order to address this issue, we propose a supervised transformation from one into the other using explicit alignment and fine-tuning. Our results demonstrate the need for such a transformation to improve the applicability of BERT in MT.

An Empirical Investigation of Word Alignment Supervision for Zero-Shot Multilingual Neural Machine Translation
Alessandro Raganato | Raúl Vázquez | Mathias Creutz | Jörg Tiedemann
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Zero-shot translation is a fascinating feature of Multilingual Neural Machine Translation (MNMT) systems. These MNMT models are usually trained on English-centric data, i.e. English either as the source or target language, and with a language label prepended to the input indicating the target language. However, recent work has highlighted several flaws of these models in zero-shot scenarios where language labels are ignored and the wrong language is generated or different runs show highly unstable results. In this paper, we investigate the benefits of an explicit alignment to language labels in Transformer-based MNMT models in the zero-shot context, by jointly training one cross attention head with word alignment supervision to stress the focus on the target language label. We compare and evaluate several MNMT systems on three multilingual MT benchmarks of different sizes, showing that simply supervising one cross attention head to focus both on word alignments and language labels reduces the bias towards translating into the wrong language, improving the zero-shot performance overall. Moreover, as an additional advantage, we find that our alignment supervision leads to more stable results across different training runs.

Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

2020

MT for Subtitling: Investigating professional translators’ user experience and feedback
Maarit Koponen | Umut Sulubacak | Kaisa Vitikainen | Jörg Tiedemann
Proceedings of 1st Workshop on Post-Editing in Modern-Day Translation

A Systematic Study of Inner-Attention-Based Sentence Representations in Multilingual Neural Machine Translation
Raúl Vázquez | Alessandro Raganato | Mathias Creutz | Jörg Tiedemann
Computational Linguistics, Volume 46, Issue 2 - June 2020

Neural machine translation has considerably improved the quality of automatic translations by learning good representations of input sentences. In this article, we explore a multilingual translation model capable of producing fixed-size sentence representations by incorporating an intermediate crosslingual shared layer, which we refer to as attention bridge. This layer exploits the semantics from each language and develops into a language-agnostic meaning representation that can be efficiently used for transfer learning. We systematically study the impact of the size of the attention bridge and the effect of including additional languages in the model. In contrast to related previous work, we demonstrate that there is no conflict between translation performance and the use of sentence representations in downstream tasks. In particular, we show that larger intermediate layers not only improve translation quality, especially for long sentences, but also push the accuracy of trainable classification tasks. Nevertheless, shorter representations lead to increased compression that is beneficial in non-trainable similarity tasks. Similarly, we show that trainable downstream tasks benefit from multilingual models, whereas additional language signals do not improve performance in non-trainable benchmarks. This is an important insight that helps to properly design models for specific applications. Finally, we also include an in-depth analysis of the proposed attention bridge and its ability to encode linguistic properties. We carefully analyze the information that is captured by individual attention heads and identify interesting patterns that explain the performance of specific settings in linguistic probing tasks.

The MUCOW word sense disambiguation test suite at WMT 2020
Yves Scherrer | Alessandro Raganato | Jörg Tiedemann
Proceedings of the Fifth Conference on Machine Translation

This paper reports on our participation with the MUCOW test suite at the WMT 2020 news translation task. We introduced MUCOW at WMT 2019 to measure the ability of MT systems to perform word sense disambiguation (WSD), i.e., to translate an ambiguous word with its correct sense. MUCOW is created automatically using existing resources, and the evaluation process is also entirely automated. We evaluate all participating systems of the language pairs English -> Czech, English -> German, and English -> Russian and compare the results with those obtained at WMT 2019. While current NMT systems are fairly good at handling ambiguous source words, we could not identify any substantial progress - at least to the extent that it is measurable by the MUCOW method - in that area over the last year.

The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT
Jörg Tiedemann
Proceedings of the Fifth Conference on Machine Translation

This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages and tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World’s languages. Using the package it is possible to work on realistic low-resource scenarios avoiding artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages with systematic language and script annotation and data splits to extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline models for individual language pairs and selected language groups.

MT for subtitling: User evaluation of post-editing productivity
Maarit Koponen | Umut Sulubacak | Kaisa Vitikainen | Jörg Tiedemann
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents a user evaluation of machine translation and post-editing for TV subtitles. Based on a process study where 12 professional subtitlers translated and post-edited subtitles, we compare effort in terms of task time and number of keystrokes. We also discuss examples of specific subtitling features like condensation, and how these features may have affected the post-editing results. In addition to overall MT quality, segmentation and timing of the subtitles are found to be important issues to be addressed in future work.

OPUS-MT – Building open translation services for the World
Jörg Tiedemann | Santhosh Thottingal
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents OPUS-MT, a project that focuses on the development of free resources and tools for machine translation. The current status is a repository of over 1,000 pre-trained neural machine translation models that are ready to be launched in on-line translation services. For this we also provide open source implementations of web applications that can run efficiently on average desktop hardware with a straightforward setup and installation.

LT@Helsinki at SemEval-2020 Task 12: Multilingual or Language-specific BERT?
Marc Pàmies | Emily Öhman | Kaisla Kajava | Jörg Tiedemann
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the different models submitted by the LT@Helsinki team for the SemEval 2020 Shared Task 12. Our team participated in sub-tasks A and C, titled offensive language identification and offense target identification, respectively. In both cases we used the so-called Bidirectional Encoder Representations from Transformers (BERT), a model pre-trained by Google and fine-tuned by us on the OLID and SOLID datasets. The results show that offensive tweet classification is one of several language-based tasks where BERT can achieve state-of-the-art results.

Controlling the Imprint of Passivization and Negation in Contextualized Representations
Hande Celikkanat | Sami Virpioja | Jörg Tiedemann | Marianna Apidianaki
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Contextualized word representations encode rich information about syntax and semantics, alongside specificities of each context of use. While contextual variation does not always reflect actual meaning shifts, it can still reduce the similarity of embeddings for word instances having the same meaning. We explore the imprint of two specific linguistic alternations, namely passivization and negation, on the representations generated by neural models trained with two different objectives: masked language modeling and translation. Our exploration methodology is inspired by an approach previously proposed for removing societal biases from word vectors. We show that passivization and negation leave their traces on the representations, and that neutralizing this information leads to more similar embeddings for words that should preserve their meaning in the transformation. We also find clear differences in how the respective features generalize across datasets.

Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Alessandro Raganato | Yves Scherrer | Jörg Tiedemann
Findings of the Association for Computational Linguistics: EMNLP 2020

Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different parts of the input. However, recent works have shown that most attention heads learn simple, and often redundant, positional patterns. In this paper, we propose to replace all but one attention head of each encoder layer with simple fixed – non-learnable – attentive patterns that are solely based on position and do not require any external knowledge. Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality and even increases BLEU scores by up to 3 points in low-resource scenarios.
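
To make the notion of a fixed, position-only attention pattern concrete, here is a small sketch of one such pattern (each position attends to the previous token) applied to a toy sequence. The pattern choice, the function names and the toy values are assumptions for illustration; the abstract does not list the exact patterns used in the paper.

```python
from typing import List

Matrix = List[List[float]]


def previous_token_pattern(n: int) -> Matrix:
    """Fixed attention weights: row i puts all mass on position i-1.

    Position 0 has no predecessor, so in this sketch it attends to itself.
    Nothing here is learned; the weights depend only on position.
    """
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)]
            for i in range(n)]


def apply_attention(weights: Matrix, values: Matrix) -> Matrix:
    """Return, for each query position, the weighted sum of the value vectors."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]


# Three positions with two-dimensional value vectors (toy numbers).
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attended = apply_attention(previous_token_pattern(3), values)
print(attended)
```

Because the weight matrix is computed from positions alone, such a head adds no attention parameters to train, which is the property the abstract exploits.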

XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection
Emily Öhman | Marc Pàmies | Kaisla Kajava | Jörg Tiedemann
Proceedings of the 28th International Conference on Computational Linguistics

We introduce XED, a multilingual fine-grained emotion dataset. The dataset consists of human-annotated Finnish (25k) and English sentences (30k), as well as projected annotations for 30 additional languages, providing new resources for many low-resource languages. We use Plutchik’s core emotions to annotate the dataset with the addition of neutral to create a multilabel multiclass dataset. The dataset is carefully evaluated using language-specific BERT models and SVMs to show that XED performs on par with other similar datasets and is therefore a useful tool for sentiment analysis and emotion detection.

The University of Helsinki Submission to the IWSLT2020 Offline SpeechTranslation Task
Raúl Vázquez | Mikko Aulamo | Umut Sulubacak | Jörg Tiedemann
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but allows for knowledge-transfer across modalities.

An Evaluation Benchmark for Testing the Word Sense Disambiguation Capabilities of Machine Translation Systems
Alessandro Raganato | Yves Scherrer | Jörg Tiedemann
Proceedings of the 12th Language Resources and Evaluation Conference

Lexical ambiguity is one of the many challenging linguistic phenomena involved in translation, i.e., translating an ambiguous word with its correct sense. In this respect, previous work has shown that the translation quality of neural machine translation systems can be improved by explicitly modeling the senses of ambiguous words. Recently, several evaluation test sets have been proposed to measure the word sense disambiguation (WSD) capability of machine translation systems. However, to date, these evaluation test sets do not include any training data that would provide a fair setup measuring the sense distributions present within the training data itself. In this paper, we present an evaluation benchmark on WSD for machine translation for 10 language pairs, comprising training data with known sense distributions. Our approach for the construction of the benchmark builds upon the wide-coverage multilingual sense inventory of BabelNet, the multilingual neural parsing pipeline TurkuNLP, and the OPUS collection of translated texts from the web. The test suite is available at http://github.com/Helsinki-NLP/MuCoW.

OpusTools and Parallel Corpus Diagnostics
Mikko Aulamo | Umut Sulubacak | Sami Virpioja | Jörg Tiedemann
Proceedings of the 12th Language Resources and Evaluation Conference

This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.

This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages for the general purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services making it possible to work with highly sensitive data without compromising security concerns.

Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

LSDC - A comprehensive dataset for Low Saxon Dialect Classification
Janine Siewert | Yves Scherrer | Martijn Wieling | Jörg Tiedemann
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We present a new comprehensive dataset for the unstandardised West-Germanic language Low Saxon covering the last two centuries, the majority of modern dialects and various genres, which will be made openly available in connection with the final version of this paper. Since so far no such comprehensive dataset of contemporary Low Saxon exists, this provides a great contribution to NLP research on this language. We also test the use of this dataset for dialect classification by training a few baseline models comparing statistical and neural approaches. The performance of these models shows that in spite of an imbalance in the amount of data per dialect, enough features can be learned for a relatively high classification accuracy.

OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
Mikko Aulamo | Sami Virpioja | Jörg Tiedemann
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the effectiveness of OpusFilter on the example of a Finnish-English news translation task based on noisy web-crawled training data. Applying our tool leads to improved translation quality while significantly reducing the size of the training data, also clearly outperforming an alternative ranking given in the crawled data set. Furthermore, we show the ability of OpusFilter to perform data selection for domain adaptation.

2019

Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Shervin Malmasi | Nikola Ljubešić | Jörg Tiedemann | Ahmed Ali
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

Measuring Semantic Abstraction of Multilingual NMT with Paraphrase Recognition and Generation Tasks
Jörg Tiedemann | Yves Scherrer
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones. We test this hypothesis by measuring the perplexity of such models when applied to paraphrases of the source language. The intuition is that an encoder produces better representations if a decoder is capable of recognizing synonymous sentences in the same language even though the model is never trained for that task. In our setup, we add 16 different auxiliary languages to a bidirectional bilingual baseline model (English-French) and test it with in-domain and out-of-domain paraphrases in English. The results show that the perplexity is significantly reduced in each of the cases, indicating that meaning can be grounded in translation. This is further supported by a study on paraphrase generation that we also include at the end of the paper.
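
The perplexity comparison described above can be sketched as follows. This is the standard definition of perplexity over per-token log-probabilities; the two probability lists are invented numbers standing in for the scores a bilingual and a multilingual model might assign to the same paraphrase, not results from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities for one paraphrase under two models.
bilingual = [math.log(p) for p in (0.10, 0.20, 0.05, 0.10)]
multilingual = [math.log(p) for p in (0.20, 0.40, 0.10, 0.20)]

print(perplexity(bilingual), perplexity(multilingual))
```

A lower value means the model finds the paraphrase more likely as an output for the given source sentence, which is the signal the abstract uses as evidence of stronger semantic abstraction.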

Revisiting NMT for Normalization of Early English Letters
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only less frequent deviant forms remain unnormalized. This paper discusses different methods for improving the normalization of these deviant forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.

An Evaluation of Language-Agnostic Inner-Attention-Based Representations in Machine Translation
Alessandro Raganato | Raúl Vázquez | Mathias Creutz | Jörg Tiedemann
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

In this paper, we explore a multilingual translation model with a cross-lingually shared layer that can be used as fixed-size sentence representation in different downstream tasks. We systematically study the impact of the size of the shared layer and the effect of including additional languages in the model. In contrast to related previous work, we demonstrate that the performance in translation does correlate with trainable downstream tasks. In particular, we show that larger intermediate layers not only improve translation quality, especially for long sentences, but also push the accuracy of trainable classification tasks. On the other hand, shorter representations lead to increased compression that is beneficial in non-trainable similarity tasks. We hypothesize that the training procedure on the downstream task enables the model to identify the encoded information that is useful for the specific task whereas non-trainable benchmarks can be confused by other types of information also encoded in the representation of a sentence.

Multilingual NMT with a Language-Independent Attention Bridge
Raúl Vázquez | Alessandro Raganato | Jörg Tiedemann | Mathias Creutz
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

In this paper, we propose an architecture for machine translation (MT) capable of obtaining multilingual sentence representations by incorporating an intermediate attention bridge that is shared across all languages. We train the model with language-specific encoders and decoders that are connected through an inner-attention layer on the encoder side. The attention bridge exploits the semantics from each language for translation and develops into a language-agnostic meaning representation that can efficiently be used for transfer learning. We present a new framework for the efficient development of multilingual neural machine translation (NMT) using this model and scheduled training. We have tested the approach in a systematic way with a multi-parallel data set. The model achieves substantial improvements over strong bilingual models and performs well for zero-shot translation, which demonstrates its ability of abstraction and transfer learning.

In this paper we present the University of Helsinki submissions to the WMT 2019 shared news translation task in three language pairs: English-German, English-Finnish and Finnish-English. This year we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German we trained both sentence-level transformer models as well as compared different document-level translation approaches. For Finnish-English and English-Finnish we focused on different segmentation approaches and we also included a rule-based system for English-Finnish.

The MuCoW Test Suite at WMT 2019: Automatically Harvested Multilingual Contrastive Word Sense Disambiguation Test Sets for Machine Translation
Alessandro Raganato | Yves Scherrer | Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

Supervised Neural Machine Translation (NMT) systems currently achieve impressive translation quality for many language pairs. One of the key features of a correct translation is the ability to perform word sense disambiguation (WSD), i.e., to translate an ambiguous word with its correct sense. Existing evaluation benchmarks on WSD capabilities of translation systems rely heavily on manual work and cover only a few language pairs and a few word types. We present MuCoW, a multilingual contrastive test suite that covers 16 language pairs with more than 200 thousand contrastive sentence pairs, automatically built from word-aligned parallel corpora and the wide-coverage multilingual sense inventory of BabelNet. We evaluate the quality of the ambiguity lexicons and of the resulting test suite on all submissions from 9 language pairs presented in the WMT19 news shared translation task, plus 5 other language pairs using NMT pretrained models. The MuCoW test suite is available at http://github.com/Helsinki-NLP/MuCoW.

The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task
Raúl Vázquez | Umut Sulubacak | Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
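
The two-step strategy in this abstract (hard filters first, then a weighted score over features) might look roughly like the sketch below. The particular filters, the features, and the hand-set `WEIGHTS` and `BIAS` are hypothetical stand-ins: the paper trains a classification model to obtain the weights, and its actual filter set is not given in the abstract.

```python
import math
from typing import List

# Hypothetical hand-set parameters; a trained classifier would supply these.
WEIGHTS = [2.0, 1.0]
BIAS = -1.5


def passes_filters(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Step 1: reject empty segments and pairs with a very skewed length ratio."""
    if not src.strip() or not tgt.strip():
        return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def features(src: str, tgt: str) -> List[float]:
    """Two toy features: length balance in [0, 1] and a 'target is not a copy' flag."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    length_balance = min(n_src, n_tgt) / max(n_src, n_tgt)
    not_copy = 0.0 if src.strip().lower() == tgt.strip().lower() else 1.0
    return [length_balance, not_copy]


def score(src: str, tgt: str) -> float:
    """Step 2: logistic score over weighted features; filtered-out pairs get 0.0."""
    if not passes_filters(src, tgt):
        return 0.0
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features(src, tgt)))
    return 1.0 / (1.0 + math.exp(-z))


print(score("hello world", "hei maailma"), score("hello world", "hello world"))
```

Pairs rejected in the first step never reach the scorer, so the weighted model only has to rank segments that already look plausible.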

pdf bib abs
Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations
Aarne Talman | Antti Suni | Hande Celikkanat | Sofoklis Kakouros | Jörg Tiedemann | Martti Vainio
Proceedings of the 22nd Nordic Conference on Computational Linguistics

In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models ranging from feature-based classifiers to neural network systems for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models will be made publicly available.

pdf bib abs
The OPUS Resource Repository: An Open Package for Creating Parallel Corpora and Machine Translation Services
Mikko Aulamo | Jörg Tiedemann
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper presents a flexible and powerful system for creating parallel corpora and for running neural machine translation services. Our package provides a scalable data repository backend that offers transparent data pre-processing pipelines and automatic alignment procedures that facilitate the compilation of extensive parallel data sets from a variety of sources. Moreover, we develop a web-based interface that constitutes an intuitive frontend for end-users of the platform. The whole system can easily be distributed over virtual machines and implements a sophisticated permission system with secure connections and a flexible database for storing arbitrary metadata. Furthermore, we also provide an interface for neural machine translation that can run as a service on virtual machines, which also incorporates a connection to the data repository software.

pdf bib abs
Analysing concatenation approaches to document-level NMT in two different domains
Yves Scherrer | Jörg Tiedemann | Sharid Loáiciga
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

In this paper, we investigate how different aspects of discourse context affect the performance of recent neural MT systems. We describe two popular datasets covering news and movie subtitles and we provide a thorough analysis of the distribution of various document-level features in their domains. Furthermore, we train a set of context-aware MT models on both datasets and propose a comparative evaluation scheme that contrasts coherent context with artificially scrambled documents and absent context, arguing that the impact of discourse-aware MT models will become visible in this way. Our results show that the models are indeed affected by the manipulation of the test data, providing a different view on document-level translation quality than absolute sentence-level scores.

pdf bib abs
What Do Language Representations Really Represent?
Johannes Bjerva | Robert Östling | Maria Han Veiga | Jörg Tiedemann | Isabelle Augenstein
Computational Linguistics, Volume 45, Issue 2 - June 2019

A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, whereas genetic relationships (a convenient benchmark used for evaluation in previous work) appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

2018

pdf bib
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks: two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) – and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

pdf bib abs
Normalizing Early English Letters to Present-day English Spelling
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated. All of the methods have their own strengths in word normalization. This calls for finding ways of combining the results from these methods to leverage their individual strengths.

pdf bib abs
An Analysis of Encoder Representations in Transformer-Based Machine Translation
Alessandro Raganato | Jörg Tiedemann
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

The attention mechanism is a successful technique in modern NLP, especially in tasks like machine translation. The recently proposed network architecture of the Transformer is based entirely on attention mechanisms and achieves new state-of-the-art results in neural machine translation, outperforming other sequence-to-sequence models. However, so far not much is known about the internal properties of the model and the representations it learns to achieve that performance. To study this question, we investigate the information that is learned by the attention mechanism in Transformer models with different translation quality. We assess the representations of the encoder by extracting dependency relations based on self-attention weights, perform four probing tasks to study the amount of syntactic and semantic information captured, and test attention in a transfer learning scenario. Our analysis sheds light on the relative strengths and weaknesses of the various encoder representations. We observe that specific attention heads mark syntactic dependency relations, and we can also confirm that lower layers tend to learn more about syntax while higher layers tend to encode more semantics.

pdf bib abs
Creating a Dataset for Multilingual Fine-grained Emotion-detection Using Gamification-based Annotation
Emily Öhman | Kaisla Kajava | Jörg Tiedemann | Timo Honkela
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This paper introduces a gamified framework for fine-grained sentiment analysis and emotion detection. We present a flexible tool, Sentimentator, that can be used for efficient annotation based on crowd sourcing and a self-perpetuating gold standard. We also present a novel dataset with multi-dimensional annotations of emotions and sentiments in movie subtitles that enables research on sentiment preservation across languages and the creation of robust multilingual emotion detection tools. The tools and datasets are public and open-source and can easily be extended and applied for various purposes.

pdf bib abs
The University of Helsinki submissions to the WMT18 news task
Alessandro Raganato | Yves Scherrer | Tommi Nieminen | Arvi Hurskainen | Jörg Tiedemann
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the University of Helsinki’s submissions to the WMT18 shared news translation task for English-Finnish and English-Estonian, in both directions. This year, our main submissions employ a novel neural architecture, the Transformer, using the open-source OpenNMT framework. Our experiments couple domain labeling and fine-tuned multilingual models with shared vocabularies between the source and target language, using the provided parallel data of the shared task and additional back-translations. Finally, we compare, for the English-to-Finnish case, the effectiveness of different machine translation architectures, ranging from a rule-based approach to our best neural model, analyzing the output and highlighting future research.

This paper describes the MeMAD project entry to the WMT Multimodal Machine Translation Shared Task. We propose adapting the Transformer neural machine translation (NMT) architecture to a multi-modal setting. In this paper, we also describe the preliminary experiments with text-only translation systems leading us up to this choice. We have the top scoring system for both English-to-German and English-to-French, according to the automatic metrics for flickr18. Our experiments show that the effect of the visual features in our system is small. Our largest gains come from the quality of the underlying text-only NMT system. We find that appropriate use of additional data is effective.

pdf bib abs
The MeMAD Submission to the IWSLT 2018 Speech Translation Task
Umut Sulubacak | Jörg Tiedemann | Aku Rouhe | Stig-Arne Grönroos | Mikko Kurimo
Proceedings of the 15th International Conference on Spoken Language Translation

This paper describes the MeMAD project entry to the IWSLT Speech Translation Shared Task, addressing the translation of English audio into German text. Between the pipeline and end-to-end model tracks, we participated only in the former, with three contrastive systems. We also tried the latter, but were not able to finish our end-to-end model in time. All of our systems start by transcribing the audio into text through an automatic speech recognition (ASR) model trained on the TED-LIUM English Speech Recognition Corpus (TED-LIUM). Afterwards, we feed the transcripts into English-German text-based neural machine translation (NMT) models. Our systems employ three different translation models trained on separate training sets compiled from the English-German part of the TED Speech Translation Corpus (TED-TRANS) and the OPENSUBTITLES2018 section of the OPUS collection. In this paper, we also describe the experiments leading up to our final systems. Our experiments indicate that using OPENSUBTITLES2018 in training significantly improves translation performance. We also experimented with various pre- and postprocessing routines for the NMT module, but we did not have much success with these. Our best-scoring system attains a BLEU score of 16.45 on the test set for this year’s task.

pdf bib
OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora
Pierre Lison | Jörg Tiedemann | Milen Kouylekov
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF
Yan Shao | Christian Hardmeier | Jörg Tiedemann | Joakim Nivre
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and lower-than-character level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger on CTB5, CTB9 and UD Chinese. The experimental results indicate that our model is accurate and robust across datasets of different sizes, genres and annotation schemes. We obtain state-of-the-art performance on CTB5, achieving 94.38 F1-score for joint segmentation and POS tagging.

pdf bib
Proceedings of the 21st Nordic Conference on Computational Linguistics
Jörg Tiedemann | Nina Tahmasebi
Proceedings of the 21st Nordic Conference on Computational Linguistics

pdf bib
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Preslav Nakov | Marcos Zampieri | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

pdf bib abs
Cross-lingual dependency parsing for closely related languages - Helsinki’s submission to VarDial 2017
Jörg Tiedemann
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper describes the submission from the University of Helsinki to the shared task on cross-lingual dependency parsing at VarDial 2017. We present work on annotation projection and treebank translation that gave good results for all three target languages in the test set. In particular, Slovak seems to work well with information coming from the Czech treebank, which is in line with related work. The attachment scores for cross-lingual models even surpass the fully supervised models trained on the target language treebank. Croatian is the most difficult language in the test set and the improvements over the baseline are rather modest. Norwegian works best with information coming from Swedish whereas Danish contributes surprisingly little.

pdf bib
Rule-based Machine translation from English to Finnish
Arvi Hurskainen | Jörg Tiedemann
Proceedings of the Second Conference on Machine Translation

pdf bib
The Helsinki Neural Machine Translation System
Robert Östling | Yves Scherrer | Jörg Tiedemann | Gongbo Tang | Tommi Nieminen
Proceedings of the Second Conference on Machine Translation

pdf bib
Proceedings of the Third Workshop on Discourse in Machine Translation
Bonnie Webber | Andrei Popescu-Belis | Jörg Tiedemann
Proceedings of the Third Workshop on Discourse in Machine Translation

We describe the design, the setup, and the evaluation results of the DiscoMT 2017 shared task on cross-lingual pronoun prediction. The task asked participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provided a lemmatized target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. The aim of the task was to predict, for each target-language pronoun placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the entire document. We offered four subtasks, each for a different language pair and translation direction: English-to-French, English-to-German, German-to-English, and Spanish-to-English. Five teams participated in the shared task, making submissions for all language pairs. The evaluation results show that most participating teams outperformed two strong n-gram-based language model-based baseline systems by a sizable margin.

pdf bib abs
Neural Machine Translation with Extended Context
Jörg Tiedemann | Yves Scherrer
Proceedings of the Third Workshop on Discourse in Machine Translation

We investigate the use of extended context in attention-based neural machine translation. We base our experiments on translated movie subtitles and discuss the effect of increasing the segments beyond single translation units. We study the use of extended source language context as well as bilingual context extensions. The models learn to distinguish between information from different segments and are surprisingly robust with respect to translation quality. In this pilot study, we observe interesting cross-sentential attention patterns that improve textual coherence in translation at least in some selected cases.

pdf bib abs
Continuous multilinguality with language vectors
Robert Östling | Jörg Tiedemann
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Most existing models for multilingual natural language processing (NLP) treat language as a discrete category, and make predictions for either one language or the other. In contrast, we propose using continuous vector representations of language. We show that these can be learned efficiently with a character-based neural language model, and used to improve inference about language varieties not seen during training. In experiments with 1303 Bible translations into 990 different languages, we empirically explore the capacity of multilingual language models, and also show that the language vectors capture genetic relationships between languages.

2016

pdf bib
Phrase-Based SMT for Finnish with More Data, Better Models and Alternative Alignment and Translation Tools
Jörg Tiedemann | Fabienne Cap | Jenna Kanerva | Filip Ginter | Sara Stymne | Robert Östling | Marion Weller-Di Marco
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
A Linear Baseline Classifier for Cross-Lingual Pronoun Prediction
Jörg Tiedemann
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Climbing Mont BLEU: The Strange World of Reachable High-BLEU Translations
Aaron Smith | Christian Hardmeier | Joerg Tiedemann
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf bib abs
Tagging Ingush - Language Technology For Low-Resource Languages Using Resources From Linguistic Field Work
Jörg Tiedemann | Johanna Nichols | Ronald Sprouse
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

This paper presents on-going work on creating NLP tools for under-resourced languages from very sparse training data coming from linguistic field work. In this work, we focus on Ingush, a Nakh-Daghestanian language spoken by about 300,000 people in the Russian republics Ingushetia and Chechnya. We present work on morphosyntactic taggers trained on transcribed and linguistically analyzed recordings and dependency parsers using English glosses to project annotation for creating synthetic treebanks. Our preliminary results are promising, supporting the goal of bootstrapping efficient NLP tools with limited or no task-specific annotated data resources available.

pdf bib abs
The Challenges of Multi-dimensional Sentiment Analysis Across Languages
Emily Öhman | Timo Honkela | Jörg Tiedemann
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

This paper outlines a pilot study on multi-dimensional and multilingual sentiment analysis of social media content. We use parallel corpora of movie subtitles as a proxy for colloquial language in social media channels and a multilingual emotion lexicon for fine-grained sentiment analyses. Parallel data sets make it possible to study the preservation of sentiments and emotions in translation and our assessment reveals that the lexical approach shows great inter-language agreement. However, our manual evaluation also suggests that the use of purely lexical methods is limited and further studies are necessary to pinpoint the cross-lingual differences and to develop better sentiment classifiers.

pdf bib
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Preslav Nakov | Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

pdf bib abs
Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task
Shervin Malmasi | Marcos Zampieri | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial’2016 workshop at COLING’2016. The challenge offered two subtasks: subtask 1 focused on the identification of very similar languages and language varieties in newswire texts, whereas subtask 2 dealt with Arabic dialect identification in speech transcripts. A total of 37 teams registered to participate in the task, 24 teams submitted test results, and 20 teams also wrote system description papers. High-order character n-grams were the most successful feature, and the best classification approaches included traditional supervised learning methods such as SVM, logistic regression, and language models, while deep learning approaches did not perform very well.

pdf bib
OPUS – parallel corpora for everyone
Jörg Tiedemann
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf bib abs
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
Pierre Lison | Jörg Tiedemann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

pdf bib abs
Finding Alternative Translations in a Large Corpus of Movie Subtitle
Jörg Tiedemann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

OpenSubtitles.org provides a large collection of user contributed subtitles in various languages for movies and TV programs. Subtitle translations are valuable resources for cross-lingual studies and machine translation research. A less explored feature of the collection is the inclusion of alternative translations, which can be very useful for training paraphrase systems or collecting multi-reference test suites for machine translation. However, differences in translation may also be due to misspellings, incomplete or corrupt data files, or wrongly aligned subtitles. This paper reports our efforts in recognising and classifying alternative subtitle translations with language independent techniques. We use time-based alignment with lexical re-synchronisation techniques and BLEU score filters and sort alternative translations into categories using edit distance metrics and heuristic rules. Our approach produces large numbers of sentence-aligned translation alternatives for over 50 languages provided via the OPUS corpus collection.

2015

pdf bib
Improving the Cross-Lingual Projection of Syntactic Dependencies
Jörg Tiedemann
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf bib
Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels
Jörg Tiedemann
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf bib
Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation
Christian Hardmeier | Preslav Nakov | Sara Stymne | Jörg Tiedemann | Yannick Versley | Mauro Cettolo
Proceedings of the Second Workshop on Discourse in Machine Translation

pdf bib
Part-of-Speech Driven Cross-Lingual Pronoun Prediction with Feed-Forward Neural Networks
Jimmy Callin | Christian Hardmeier | Jörg Tiedemann
Proceedings of the Second Workshop on Discourse in Machine Translation

pdf bib
Baseline Models for Pronoun Prediction and Pronoun-Aware Translation
Jörg Tiedemann
Proceedings of the Second Workshop on Discourse in Machine Translation

pdf bib
Morphological Segmentation and OPUS for Finnish-English Machine Translation
Jörg Tiedemann | Filip Ginter | Jenna Kanerva
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Boosting English-Chinese Machine Transliteration via High Quality Alignment and Multilingual Resources
Yan Shao | Jörg Tiedemann | Joakim Nivre
Proceedings of the Fifth Named Entity Workshop

pdf bib
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects
Preslav Nakov | Marcos Zampieri | Petya Osenova | Liling Tan | Cristina Vertan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf bib
Overview of the DSL Shared Task 2015
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann | Preslav Nakov
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

2014

pdf bib abs
ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT
Liane Guillou | Christian Hardmeier | Aaron Smith | Jörg Tiedemann | Bonnie Webber
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present ParCor, a parallel corpus of texts in which pronoun coreference ― reduced coreference in which pronouns are used as referring expressions ― has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.

pdf bib abs
Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus
Raivis Skadiņš | Jörg Tiedemann | Roberts Rozis | Daiga Deksne
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The European Union is a great source of high quality documents with translations into several languages. Parallel corpora from its publications are frequently used in various tasks, machine translation in particular. A source that has not systematically been explored yet is the EU Bookshop, an online service and archive of publications from various European institutions. The service contains a large body of publications in the 24 official languages of the EU. This paper describes our efforts in collecting those publications and converting them to a format that is useful for natural language processing, in particular statistical machine translation. We report our procedure of crawling the website and various pre-processing steps that were necessary to clean up the data after the conversion from the original PDF files. Furthermore, we demonstrate the use of this dataset in training SMT models for English, French, German, Spanish, and Latvian.

pdf bib
Treebank Translation for Cross-Lingual Parser Induction
Jörg Tiedemann | Željko Agić | Joakim Nivre
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

pdf bib
Anaphora Models and Reordering for Phrase-Based SMT
Christian Hardmeier | Sara Stymne | Jörg Tiedemann | Aaron Smith | Joakim Nivre
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Estimating Word Alignment Quality for SMT Reordering Tasks
Sara Stymne | Jörg Tiedemann | Joakim Nivre
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Word’s Vector Representations meet Machine Translation
Eva Martínez Garcia | Jörg Tiedemann | Cristina España-Bonet | Lluís Màrquez
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
Željko Agić | Jörg Tiedemann | Danijela Merkler | Simon Krek | Kaja Dobrovoljc | Sara Može
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

pdf bib
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf bib
A Report on the DSL Shared Task 2014
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf bib
Rediscovering Annotation Projection for Cross-Lingual Parser Induction
Jörg Tiedemann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction
Christian Hardmeier | Jörg Tiedemann | Joakim Nivre
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets
Jörg Tiedemann | Preslav Nakov
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Tunable Distortion Limits and Corpus Cleaning for SMT
Sara Stymne | Christian Hardmeier | Jörg Tiedemann | Joakim Nivre
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Workshop on Discourse in Machine Translation
Bonnie Webber | Andrei Popescu-Belis | Katja Markert | Jörg Tiedemann
Proceedings of the Workshop on Discourse in Machine Translation

pdf bib
Feature Weight Optimization for Discourse-Level SMT
Sara Stymne | Christian Hardmeier | Jörg Tiedemann | Joakim Nivre
Proceedings of the Workshop on Discourse in Machine Translation

pdf bib
Experiences in Building the Let’s MT! Portal on Amazon EC2
Jörg Tiedemann
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

pdf bib
Statistical Machine Translation with Readability Constraints
Sara Stymne | Jörg Tiedemann | Christian Hardmeier | Joakim Nivre
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

pdf bib
Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
Christian Hardmeier | Sara Stymne | Jörg Tiedemann | Joakim Nivre
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf bib
Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
Preslav Nakov | Jörg Tiedemann
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
Andrejs Vasiļjevs | Raivis Skadiņš | Jörg Tiedemann
Proceedings of the ACL 2012 System Demonstrations

pdf bib
Document-Wide Decoding for Phrase-Based Statistical Machine Translation
Christian Hardmeier | Joakim Nivre | Jörg Tiedemann
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Tree Kernels for Machine Translation Quality Estimation
Christian Hardmeier | Joakim Nivre | Jörg Tiedemann
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Efficient Discrimination Between Closely Related Languages
Jörg Tiedemann | Nikola Ljubešić
Proceedings of COLING 2012

pdf bib
Character-Based Pivot Translation for Under-Resourced Languages and Domains
Jörg Tiedemann
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib abs
A Distributed Resource Repository for Cloud-Based Machine Translation
Jörg Tiedemann | Dorte Haltrup Hansen | Lene Offersgaard | Sussi Olsen | Matthias Zumpe
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present the architecture of a distributed resource repository developed for collecting training data for building customized statistical machine translation systems. The repository is designed for the cloud-based translation service integrated in the Let'sMT! platform which is about to be launched to the public. The system includes important features such as automatic import and alignment of textual documents in a variety of formats, a flexible database for meta-information using modern key-value stores and a grid-based backend for running off-line processes. The entire system is very modular and supports highly distributed setups to enable a maximum of flexibility and scalability. The system uses secure connections and includes an effective permission management to ensure data integrity. In this paper, we also take a closer look at the task of sentence alignment. The process of alignment is extremely important for the success of translation models trained on the platform. Alignment decisions significantly influence the quality of SMT engines.

pdf bib abs
Parallel Data, Tools and Interfaces in OPUS
Jörg Tiedemann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.

pdf bib abs
Large aligned treebanks for syntax-based machine translation
Gideon Kotzé | Vincent Vandeghinste | Scott Martens | Jörg Tiedemann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present evaluation scores of both the nonterminal constituent alignments and the MT system itself, and in the latter case, compare them with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.

2011

pdf bib
LetsMT!: Cloud-Based Platform for Building User Tailored Machine Translation Engines
Andrejs Vasiļjevs | Raivis Skadiņš | Jörg Tiedemann
Proceedings of Machine Translation Summit XIII: System Presentations

pdf bib
The Uppsala-FBK systems at WMT 2011
Christian Hardmeier | Jörg Tiedemann | Markus Saers | Marcello Federico | Prashant Mathur
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora
Kiril Simov | Petya Osenova | Jörg Tiedemann | Radovan Garabik
Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora

2010

pdf bib abs
Lingua-Align: An Experimental Toolbox for Automatic Tree-to-Tree Alignment
Jörg Tiedemann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we present an experimental toolbox for automatic tree-to-tree alignment based on a binary classification model. The aligner implements a recurrent architecture for structural prediction using history features and a sequential classification procedure. The discriminative base classifier uses a log-linear model in the current setup which enables simple integration of various features extracted from the data. The Lingua-Align toolbox provides a flexible framework for feature extraction including contextual properties and implements several alignment inference procedures. Various settings and constraints can be controlled via a simple frontend or called from external scripts. Lingua-Align supports different treebank formats and includes additional tools for conversion and evaluation. In our experiments we can show that our tree aligner produces results with high quality and outperforms unsupervised techniques proposed otherwise. It also integrates well with another existing tool for manual tree alignment which makes it possible to quickly integrate additional training material and to run semi-automatic alignment strategies.

pdf bib
English to Bangla Phrase-Based Machine Translation
Zahurul Islam | Jörg Tiedemann | Andreas Eisele
Proceedings of the 14th Annual conference of the European Association for Machine Translation

pdf bib
To Cache or Not To Cache? Experiments with Adaptive Models in Statistical Machine Translation
Jörg Tiedemann
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing
Hal Daumé III | Tejaswini Deoskar | David McClosky | Barbara Plank | Jörg Tiedemann
Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing

pdf bib
Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache
Jörg Tiedemann
Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing

pdf bib
Finding Medical Term Variations using Parallel Corpora and Distributional Similarity
Lonneke van der Plas | Jörg Tiedemann
Proceedings of the 6th Workshop on Ontologies and Lexical Resources

2009

pdf bib
Evidence-Based Word Alignment
Jörg Tiedemann
Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning

pdf bib
A Discriminative Approach to Tree Alignment
Jörg Tiedemann | Gideon Kotzé
Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning

pdf bib
Character-Based PSMT for Closely Related Languages
Jörg Tiedemann
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
Translating Questions for Cross-Lingual QA
Jörg Tiedemann
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2008

pdf bib abs
Synchronizing Translated Movie Subtitles
Jörg Tiedemann
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper addresses the problem of synchronizing movie subtitles, which is necessary to improve alignment quality when building a parallel corpus out of translated subtitles. In particular, synchronization is done on the basis of aligned anchor points. Previous studies have shown that cognate filters are useful for the identification of such points. However, this restricts the approach to related languages with similar alphabets. Here, we propose a dictionary-based approach using automatic word alignment. We can show an improvement in alignment quality even for related languages compared to the cognate-based approach.

pdf bib
Simple is Best: Experiments with Different Document Segmentation Strategies for Passage Retrieval
Jörg Tiedemann | Jori Mur
Coling 2008: Proceedings of the 2nd workshop on Information Retrieval for Question Answering

pdf bib
Using Lexico-Semantic Information for Query Expansion in Passage Retrieval for Question Answering
Lonneke van der Plas | Jörg Tiedemann
Coling 2008: Proceedings of the 2nd workshop on Information Retrieval for Question Answering

2006

pdf bib abs
ISA & ICA - Two Web Interfaces for Interactive Alignment of Bitexts
Jörg Tiedemann
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

ISA and ICA are two web interfaces for interactive alignment of parallel texts. ISA provides an interface for automatic and manual sentence alignment. It includes cognate filters and uses structural markup to improve automatic alignment and provides intuitive tools for editing them. Alignment results can be saved to disk or sent via e-mail. ICA provides an interface to the clue aligner from the Uplug toolbox. It allows one to set various parameters and visualizes alignment results in a two-dimensional matrix. Word alignments can be edited and saved to disk.

pdf bib
Identifying idiomatic expressions using automatic word-alignment
Begoña Villada Moirón | Jörg Tiedemann
Proceedings of the Workshop on Multi-word-expressions in a multilingual context

pdf bib
Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
Lonneke van der Plas | Jörg Tiedemann
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
Integrating Linguistic Knowledge in Passage Retrieval for Question Answering
Jörg Tiedemann
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

pdf bib
Word to word alignment strategies
Jörg Tiedemann
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib abs
The OPUS Corpus - Parallel and Free: http://logos.uio.no/opus
Jörg Tiedemann | Lars Nygaard
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The OPUS corpus is a growing collection of translated documents collected from the internet. The current version contains about 30 million words in 60 languages. The entire corpus is sentence aligned and it also contains linguistic markup for certain languages.

pdf bib
MT Goes Farming: Comparing Two Machine Translation Approaches on a New Domain
Per Weijnitz | Eva Forsbom | Ebba Gustavii | Eva Pettersson | Jörg Tiedemann
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
