Josef van Genabith

Also published as: J. Van Genabith, Josef Van Genabith

2023

pdf abs
Exploring Paracrawl for Document-level Neural Machine Translation
Yusser Al Ghussin | Jingyi Zhang | Josef van Genabith
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in realworld translation systems mainly due to the lack of large-scale general-domain training data for document-level NMT. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet and contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages) and therefore previous works only used Paracrawl for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents from TED, News and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can help context-aware pronoun translation.

pdf abs
Enriching WayunaikiSpanish Neural Machine Translation with Linguistic Information
Nora Graichen | Josef Van Genabith | Cristina Espaa-bonet
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

We present the first neural machine translation system for the low-resource language pair WayunaikiSpanish and explore strategies to inject linguistic knowledge into the model to improve translation quality. We explore a wide range of methods and combine complementary approaches. Results indicate that incorporating linguistic information through linguistically motivated subword segmentation, factored models, and pretrained embeddings helps the system to generate improved translations, with the segmentation contributing the most.In order to evaluate translation quality in a general domain and go beyond the available religious domain data, we gather and make publicly available a new test set and supplementary material.Although translation quality as measured with automatic metrics is low, we hope these resources will facilitate and support further research on Wayunaiki.

pdf abs
Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?
Sonal Sannigrahi | Josef van Genabith | Cristina España-Bonet
Findings of the Association for Computational Linguistics: EACL 2023

Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art resultsin various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models to encode documents fully, they are in general limited to English and few other high-resourced languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare input token number truncation, sentence averaging as well as some simple windowing and in some cases new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks

2022

pdf abs
Spatio-temporal Sign Language Representation and Translation
Yasser Hamidullah | Josef Van Genabith | Cristina España-bonet
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved 5 ± 1 BLEU points on the development set, but the performance on the test dropped to 0.11 ± 0.06 BLEU points.

pdf abs
Chop and Change: Anaphora Resolution in Instructional Cooking Videos
Cennet Oguz | Ivana Kruijff-Korbayova | Emmanuel Vincent | Pascal Denis | Josef van Genabith
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Linguistic ambiguities arising from changes in entities in action flows are a key challenge in instructional cooking videos. In particular, temporally evolving entities present rich and to date understudied challenges for anaphora resolution. For example “oil” mixed with “salt” is later referred to as a “mixture”. In this paper we propose novel annotation guidelines to annotate recipes for the anaphora resolution task, reflecting change in entities. Moreover, we present experimental results for end-to-end multimodal anaphora resolution with the new annotation scheme and propose the use of temporal features for performance improvement.

pdf bib abs
Exploiting Social Media Content for Self-Supervised Style Transfer
Dana Ruiter | Thomas Kleinbauer | Cristina España-Bonet | Josef van Genabith | Dietrich Klakow
Proceedings of the Tenth International Workshop on Natural Language Processing for Social Media

Recent research on style transfer takes inspiration from unsupervised neural machine translation (UNMT), learning from large amounts of non-parallel data by exploiting cycle consistency loss, back-translation, and denoising autoencoders. By contrast, the use of selfsupervised NMT (SSNMT), which leverages (near) parallel instances hidden in non-parallel data more efficiently than UNMT, has not yet been explored for style transfer. In this paper we present a novel Self-Supervised Style Transfer (3ST) model, which augments SSNMT with UNMT methods in order to identify and efficiently exploit supervisory signals in non-parallel social media posts. We compare 3ST with state-of-the-art (SOTA) style transfer models across civil rephrasing, formality and polarity tasks. We show that 3ST is able to balance the three major objectives (fluency, content preservation, attribute transfer accuracy) the best, outperforming SOTA models on averaged performance across their tested tasks in automatic and human evaluation.

pdf abs
Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum
Niyati Bafna | Josef van Genabith | Cristina España-Bonet | Zdeněk Žabokrtský
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora designed for low and extremely low resource scenarios, based on combining noisy semantic signals from joint bilingual spaces with orthographic cues modelling sound change. We apply our method to the North Indian dialect continuum, containing several dozens of dialects and languages spoken by more than 100 million people. Many of these languages are zero-resource and therefore natural language processing for them is non-existent. We first collect monolingual data for 26 Indic languages, 16 of which were previously zero-resource, and perform exploratory character, lexical and subword cross-lingual alignment experiments for the first time at this scale on this dialect continuum. We create bilingual evaluation lexicons against Hindi for 20 of the languages. We then apply our cognate identification method on the data, and show that our method outperforms both traditional orthography baselines as well as EM-style learnt edit distance matrices. To the best of our knowledge, this is the first work to combine traditional orthographic cues with noisy bilingual embeddings to tackle unsupervised cognate detection in a (truly) low-resource setup, showing that even noisy bilingual embeddings can act as good guides for this task. We release our multilingual dialect corpus, called HinDialect, as well as our scripts for evaluation data collection and cognate induction.

2021

pdf abs
Learning Hard Retrieval Decoder Attention for Transformers
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong
Findings of the Association for Computational Linguistics: EMNLP 2021

The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention where an attention head only attends to one token in the sentence rather than all tokens. The matrix multiplication between attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that our hard retrieval attention mechanism is 1.43 times faster in decoding, while preserving translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.

pdf abs
Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages
Dana Ruiter | Dietrich Klakow | Josef van Genabith | Cristina España-Bonet
Proceedings of Machine Translation Summit XVIII: Research Track

For most language combinations and parallel data is either scarce or simply unavailable. To address this and unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising and while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To this date and the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that including UMT techniques into SSNMT significantly outperforms SSNMT (up to +4.3 BLEU and af2en) as well as statistical (+50.8 BLEU) and hybrid UMT (+51.5 BLEU) baselines on related and distantly-related and unrelated language pairs.

pdf abs
Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification
Daria Pylypenko | Kwabena Amponsah-Kaakyire | Koel Dutta Chowdhury | Josef van Genabith | Cristina España-Bonet
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Traditional hand-crafted linguistically-informed features have often been used for distinguishing between translated and original non-translated texts. By contrast, to date, neural architectures without manual feature engineering have been less explored for this task. In this work, we (i) compare the traditional feature-engineering-based approach to the feature-learning-based one and (ii) analyse the neural architectures in order to investigate how well the hand-crafted features explain the variance in the neural models’ predictions. We use pre-trained neural word embeddings, as well as several end-to-end neural architectures in both monolingual and multilingual settings and compare them to feature-engineering-based SVM classifiers. We show that (i) neural architectures outperform other approaches by more than 20 accuracy points, with the BERT-based model performing the best in both the monolingual and multilingual settings; (ii) while many individual hand-crafted translationese features correlate with neural model predictions, feature importance analysis shows that the most important features for neural and classical architectures differ; and (iii) our multilingual experiments provide empirical evidence for translationese universals across languages.

pdf abs
Investigating the Helpfulness of Word-Level Quality Estimation for Post-Editing Machine Translation Output
Raksha Shenoy | Nico Herbig | Antonio Krüger | Josef van Genabith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Compared to fully manual translation, post-editing (PE) machine translation (MT) output can save time and reduce errors. Automatic word-level quality estimation (QE) aims to predict the correctness of words in MT output and holds great promise to aid PE by flagging problematic output. Quality of QE is crucial, as incorrect QE might lead to translators missing errors or wasting time on already correct MT output. Achieving accurate automatic word-level QE is very hard, and it is currently not known (i) at what quality threshold QE is actually beginning to be useful for human PE, and (ii), how to best present word-level QE information to translators. In particular, should word-level QE visualization indicate uncertainty of the QE model or not? In this paper, we address both research questions with real and simulated word-level QE, visualizations, and user studies, where time, subjective ratings, and quality of the final translations are assessed. Results show that current word-level QE models are not yet good enough to support PE. Instead, quality levels of > 80% F1 are required. For helpful quality levels, a visualization reflecting the uncertainty of the QE model is preferred. Our analysis further shows that speed gains achieved through QE are not merely a result of blindly trusting the QE system, but that the quality of the final translations also improves. The threshold results from the paper establish a quality goal for future word-level QE research.

pdf abs
TransIns: Document Translation with Markup Reinsertion
Jörg Steffen | Josef van Genabith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

For many use cases, it is required that MT does not just translate raw text, but complex formatted documents (e.g. websites, slides, spreadsheets) and the result of the translation should reflect the formatting. This is challenging, as markup can be nested, apply to spans contiguous in source but non-contiguous in target etc. Here we present TransIns, a system for non-plain text document translation that builds on the Okapi framework and MT models trained with Marian NMT. We develop, implement and evaluate different strategies for reinserting markup into translated sentences using token alignments between source and target sentences. We propose a simple and effective strategy that compiles down all markup to single source tokens and transfers them to aligned target tokens. A first evaluation shows that this strategy yields highly accurate markup in the translated documents that outperforms the markup quality found in documents translated with popular translation services. We release TransIns under the MIT License as open-source software on https://github.com/DFKI-MLT/TransIns. An online demonstrator is available at https://transins.dfki.de.

pdf abs
Multi-Head Highly Parallelized LSTM Decoder for Neural Machine Translation
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong | Meng Zhang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

One of the reasons Transformer translation models are popular is that self-attention networks for context modelling can be easily parallelized at sequence level. However, the computational complexity of a self-attention network is O(n²), increasing quadratically with sequence length. By contrast, the complexity of LSTM-based approaches is only O(n). In practice, however, LSTMs are much slower to train than self-attention networks as they cannot be parallelized at sequence level: to model context, the current LSTM state relies on the full LSTM computation of the preceding state. This has to be computed n times for a sequence of length n. The linear transformations involved in the LSTM gate and state computations are the major cost factors in this. To enable sequence-level parallelization of LSTMs, we approximate full LSTM context modelling by computing hidden states and gates with the current input and a simple bag-of-words representation of the preceding tokens context. This allows us to compute each input step efficiently in parallel, avoiding the formerly costly sequential linear transformations. We then connect the outputs of each parallel step with computationally cheap element-wise computations. We call this the Highly Parallelized LSTM. To further constrain the number of LSTM parameters, we compute several small HPLSTMs in parallel like multi-head attention in the Transformer. The experiments show that our MHPLSTM decoder achieves significant BLEU improvements, while being even slightly faster than the self-attention network in training, and much faster than the standard LSTM.

pdf abs
A Bidirectional Transformer Based Alignment Model for Unsupervised Word Alignment
Jingyi Zhang | Josef van Genabith
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Word alignment and machine translation are two closely related tasks. Neural translation models, such as RNN-based and Transformer models, employ a target-to-source attention mechanism which can provide rough word alignments, but with a rather low accuracy. High-quality word alignment can help neural machine translation in many different ways, such as missing word detection, annotation transfer and lexicon injection. Existing methods for learning word alignment include statistical word aligners (e.g. GIZA++) and recently neural word alignment models. This paper presents a bidirectional Transformer based alignment (BTBA) model for unsupervised learning of the word alignment task. Our BTBA model predicts the current target word by attending the source context and both left-side and right-side target context to produce accurate target-to-source attention (alignment). We further fine-tune the target-to-source attention in the BTBA model to obtain better alignments using a full context based optimization method and self-supervised training. We test our method on three word alignment tasks and show that our method outperforms both previous neural word alignment approaches and the popular statistical word aligner GIZA++.

pdf abs
Mid-Air Hand Gestures for Post-Editing of Machine Translation
Rashad Albo Jamara | Nico Herbig | Antonio Krüger | Josef van Genabith
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To translate large volumes of text in a globally connected world, more and more translators are integrating machine translation (MT) and post-editing (PE) into their translation workflows to generate publishable quality translations. While this process has been shown to save time and reduce errors, the task of translation is changing from mostly text production from scratch to fixing errors within useful but partly incorrect MT output. This is affecting the interface design of translation tools, where better support for text editing tasks is required. Here, we present the first study that investigates the usefulness of mid-air hand gestures in combination with the keyboard (GK) for text editing in PE of MT. Guided by a gesture elicitation study with 14 freelance translators, we develop a prototype supporting mid-air hand gestures for cursor placement, text selection, deletion, and reordering. These gestures combined with the keyboard facilitate all editing types required for PE. An evaluation of the prototype shows that the average editing duration of GK is only slightly slower than the standard mouse and keyboard (MK), even though participants are very familiar with the latter, and relative novices to the former. Furthermore, the qualitative analysis shows positive attitudes towards hand gestures for PE, especially when manipulating single words.

pdf abs
Modeling Task-Aware MIMO Cardinality for Efficient Multilingual Neural Machine Translation
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Neural machine translation has achieved great success in bilingual settings, as well as in multilingual settings. With the increase of the number of languages, multilingual systems tend to underperform their bilingual counterparts. Model capacity has been found crucial for massively multilingual NMT to support language pairs with varying typological characteristics. Previous work increases the modeling capacity by deepening or widening the Transformer. However, modeling cardinality based on aggregating a set of transformations with the same topology has been proven more effective than going deeper or wider when increasing capacity. In this paper, we propose to efficiently increase the capacity for multilingual NMT by increasing the cardinality. Unlike previous work which feeds the same input to several transformations and merges their outputs into one, we present a Multi-Input-Multi-Output (MIMO) architecture that allows each transformation of the block to have its own input. We also present a task-aware attention mechanism to learn to selectively utilize individual transformations from a set of transformations for different translation directions. Our model surpasses previous work and establishes a new state-of-the-art on the large scale OPUS-100 corpus while being 1.31 times as fast.

pdf bib
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)
Thoudam Doren Singh | Cristina España i Bonet | Sivaji Bandyopadhyay | Josef van Genabith
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

pdf abs
Tracing Source Language Interference in Translation with Graph-Isomorphism Measures
Koel Dutta Chowdhury | Cristina España-Bonet | Josef van Genabith
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Previous research has used linguistic features to show that translations exhibit traces of source language interference and that phylogenetic trees between languages can be reconstructed from the results of translations into the same language. Recent research has shown that instances of translationese (source language interference) can even be detected in embedding spaces, comparing embeddings spaces of original language data with embedding spaces resulting from translations into the same language, using a simple Eigenvector-based divergence from isomorphism measure. To date, it remains an open question whether alternative graph-isomorphism measures can produce better results. In this paper, we (i) explore Gromov-Hausdorff distance, (ii) present a novel spectral version of the Eigenvector-based method, and (iii) evaluate all approaches against a broad linguistic typological database (URIEL). We show that language distances resulting from our spectral isomorphism approaches can reproduce genetic trees on a par with previous work without requiring any explicit linguistic information and that the results can be extended to non-Indo-European languages. Finally, we show that the methods are robust under a variety of modeling conditions.

pdf abs
Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers
Hongfei Xu | Josef van Genabith | Qiuhui Liu | Deyi Xiong
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently in terms of probing-based approaches. Previous work focuses on using or probing source linguistic features in the encoder. To date, the way word translation evolves in Transformer layers has not yet been investigated. Naively, one might assume that encoder layers capture source information while decoder layers translate. In this work, we show that this is not quite the case: translation already happens progressively in encoder layers and even in the input embeddings. More surprisingly, we find that some of the lower decoder layers do not actually do that much decoding. We show all of this in terms of a probing approach where we project representations of the layer analyzed to the final trained and frozen classifier level of the Transformer decoder to measure word translation accuracy. Our findings motivate and explain a Transformer configuration change: if translation already happens in the encoder layers, perhaps we can increase the number of encoder layers, while decreasing the number of decoder layers, boosting decoding speed, without loss in translation quality? Our experiments show that this is indeed the case: we can increase speed by up to a factor 2.3 with small gains in translation quality, while an 18-4 deep encoder configuration boosts translation quality by +1.42 BLEU (En-De) at a speed-up of 1.4.

pdf bib
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
Yuri Bizzoni | Elke Teich | Cristina España-Bonet | Josef van Genabith
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

pdf bib
Do not Rely on Relay Translations: Multilingual Parallel Direct Europarl
Kwabena Amponsah-Kaakyire | Daria Pylypenko | Cristina España-Bonet | Josef van Genabith
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

2020

pdf abs
English to Manipuri and Mizo Post-Editing Effort and its Impact on Low Resource Machine Translation
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay | Mihaela Vela | Josef van Genabith
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

We present the first study on the post-editing (PE) effort required to build a parallel dataset for English-Manipuri and English-Mizo, in the context of a project on creating data for machine translation (MT). English source text from a local daily newspaper are machine translated into Manipuri and Mizo using PBSMT systems built in-house. A Computer Assisted Translation (CAT) tool is used to record the time, keystroke and other indicators to measure PE effort in terms of temporal and technical effort. A positive correlation between the technical effort and the number of function words is seen for English-Manipuri and English-Mizo but a negative correlation between the technical effort and the number of noun words for English-Mizo. However, average time spent per token in PE English-Mizo text is negatively correlated with the temporal effort. The main reason for these results are due to (i) English and Mizo using the same script, while Manipuri uses a different script and (ii) the agglutinative nature of Manipuri. Further, we check the impact of training a MT system in an incremental approach, by including the post-edited dataset as additional training data. The result shows an increase in HBLEU of up to 4.6 for English-Manipuri.

pdf abs
UdS-DFKI@WMT20: Unsupervised MT and Very Low Resource Supervised MT for German-Upper Sorbian
Sourav Dutta | Jesujoba Alabi | Saptarashmi Bandyopadhyay | Dana Ruiter | Josef van Genabith
Proceedings of the Fifth Conference on Machine Translation

This paper describes the UdS-DFKI submission to the shared task for unsupervised machine translation (MT) and very low-resource supervised MT between German (de) and Upper Sorbian (hsb) at the Fifth Conference of Machine Translation (WMT20). We submit systems for both the supervised and unsupervised tracks. Apart from various experimental approaches like bitext mining, model pre-training, and iterative back-translation, we employ a factored machine translation approach on a small BPE vocabulary.

pdf
Improving the Multi-Modal Post-Editing (MMPE) CAT Environment based on Professional Translators’ Feedback
Nico Herbig | Santanu Pal | Tim Düwel | Raksha Shenoy | Antonio Krüger | Josef van Genabith
Proceedings of 1st Workshop on Post-Editing in Modern-Day Translation

In automatic post-editing (APE) it makes sense to condition post-editing (pe) decisions on both the source (src) and the machine translated text (mt) as input. This has led to multi-encoder based neural APE approaches. A research challenge now is the search for architectures that best support the capture, preparation and provision of src and mt information and its integration with pe decisions. In this paper we present an efficient multi-encoder based APE model, called transference. Unlike previous approaches, it (i) uses a transformer encoder block for src, (ii) followed by a decoder block, but without masking for self-attention on mt, which effectively acts as second encoder combining src –> mt, and (iii) feeds this representation into a final decoder block generating pe. Our model outperforms the best performing systems by 1 BLEU point on the WMT 2016, 2017, and 2018 English–German APE shared tasks (PBSMT and NMT). Furthermore, the results of our model on the WMT 2019 APE task using NMT data shows a comparable performance to the state-of-the-art system. The inference time of our model is similar to the vanilla transformer-based NMT system although our model deals with two separate encoders. We further investigate the importance of our newly introduced second encoder and find that a too small amount of layers does hurt the performance, while reducing the number of layers of the decoder does not matter much.

pdf abs
Understanding Translationese in Multi-view Embedding Spaces
Koel Dutta Chowdhury | Cristina España-Bonet | Josef van Genabith
Proceedings of the 28th International Conference on Computational Linguistics

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data — words, parts of speech, semantic tags and synsets — to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.

pdf abs
Learning Source Phrase Representations for Neural Machine Translation
Hongfei Xu | Josef van Genabith | Deyi Xiong | Qiuhui Liu | Jingyi Zhang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The Transformer translation model (Vaswani et al., 2017) based on a multi-head attention mechanism can be computed effectively in parallel and has significantly pushed forward the performance of Neural Machine Translation (NMT). Though intuitively the attentional network can connect distant words via shorter network paths than RNNs, empirical analysis demonstrates that it still has difficulty in fully capturing long-distance dependencies (Tang et al., 2018). Considering that modeling phrases instead of words has significantly improved the Statistical Machine Translation (SMT) approach through the use of larger translation blocks (“phrases”) and its reordering ability, modeling NMT at phrase level is an intuitive proposal to help the model capture long-distance relationships. In this paper, we first propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations. In addition, we incorporate the generated phrase representations into the Transformer translation model to enhance its ability to capture long-distance relationships. In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline, which shows the effectiveness of our approach. Our approach helps Transformer Base models perform at the level of Transformer Big models, and even significantly better for long sentences, but with substantially fewer parameters and training steps. The fact that phrase representations help even in the big setting further supports our conjecture that they make a valuable contribution to long-distance relations.

pdf abs
Lipschitz Constrained Parameter Initialization for Deep Transformers
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong | Jingyi Zhang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The Transformer translation model employs residual connection and layer normalization to ease the optimization difficulties caused by its multi-layer encoder/decoder structure. Previous research shows that even with residual connection and layer normalization, deep Transformers still have difficulty in training, and particularly Transformer models with more than 12 encoder/decoder layers fail to converge. In this paper, we first empirically demonstrate that a simple modification made in the official implementation, which changes the computation order of residual connection and layer normalization, can significantly ease the optimization of deep Transformers. We then compare the subtle differences in computation order in considerable detail, and present a parameter initialization method that leverages the Lipschitz constraint on the initialization of Transformer parameters that effectively ensures training convergence. In contrast to findings in previous research we further demonstrate that with Lipschitz parameter initialization, deep Transformers with the original computation order can converge, and obtain significant BLEU improvements with up to 24 layers. In contrast to previous research which focuses on deep encoders, our approach additionally enables Transformers to also benefit from deep decoders.

Current advances in machine translation (MT) increase the need for translators to switch from traditional translation to post-editing (PE) of machine-translated text, a process that saves time and reduces errors. This affects the design of translation interfaces, as the task changes from mainly generating text to correcting errors within otherwise helpful translation proposals. Since this paradigm shift offers potential for modalities other than mouse and keyboard, we present MMPE, the first prototype to combine traditional input modes with pen, touch, and speech modalities for PE of MT. The results of an evaluation with professional translators suggest that pen and touch interaction are suitable for deletion and reordering tasks, while they are of limited use for longer insertions. On the other hand, speech and multi-modal combinations of select & speech are considered suitable for replacements and insertions but offer less potential for deletion and reordering. Overall, participants were enthusiastic about the new modalities and saw them as good extensions to mouse & keyboard, but not as a complete substitute.

pdf abs
Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change
Hongfei Xu | Josef van Genabith | Deyi Xiong | Qiuhui Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The choice of hyper-parameters affects the performance of neural models. While much previous research (Sutskever et al., 2013; Duchi et al., 2011; Kingma and Ba, 2015) focuses on accelerating convergence and reducing the effects of the learning rate, comparatively few papers concentrate on the effect of batch size. In this paper, we analyze how increasing batch size affects gradient direction, and propose to evaluate the stability of gradients with their angle change. Based on our observations, the angle change of gradient direction first tends to stabilize (i.e. gradually decrease) while accumulating mini-batches, and then starts to fluctuate. We propose to automatically and dynamically determine batch sizes by accumulating gradients of mini-batches and performing an optimization step at just the time when the direction of gradients starts to fluctuate. To improve the efficiency of our approach for large models, we propose a sampling approach to select gradients of parameters sensitive to the batch size. Our approach dynamically determines proper and efficient batch sizes during training. In our experiments on the WMT 14 English to German and English to French tasks, our approach improves the Transformer with a fixed 25k batch size by +0.73 and +0.82 BLEU respectively.

pdf abs
MMPE: A Multi-Modal Interface using Handwriting, Touch Reordering, and Speech Commands for Post-Editing Machine Translation
Nico Herbig | Santanu Pal | Tim Düwel | Kalliopi Meladaki | Mahsa Monshizadeh | Vladislav Hnatovskiy | Antonio Krüger | Josef van Genabith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

The shift from traditional translation to post-editing (PE) of machine-translated (MT) text can save time and reduce errors, but it also affects the design of translation interfaces, as the task changes from mainly generating text to correcting errors within otherwise helpful translation proposals. Since this paradigm shift offers potential for modalities other than mouse and keyboard, we present MMPE, the first prototype to combine traditional input modes with pen, touch, and speech modalities for PE of MT. Users can directly cross out or hand-write new text, drag and drop words for reordering, or use spoken commands to update the text in place. All text manipulations are logged in an easily interpretable format to simplify subsequent translation process research. The results of an evaluation with professional translators suggest that pen and touch interaction are suitable for deletion and reordering tasks, while speech and multi-modal combinations of select & speech are considered suitable for replacements and insertions. Overall, experiment participants were enthusiastic about the new modalities and saw them as useful extensions to mouse & keyboard, but not as a complete substitute.

pdf abs
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
Dana Ruiter | Josef van Genabith | Cristina España-Bonet
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without it having been told to do so, the model self-selects samples of increasing (i) complexity and (ii) task-relevance in combination with (iii) performing a denoising curriculum. We observe that the dynamics of the mutual-supervision signals of both system internal representation types are vital for the extraction and translation performance. We show that in terms of the Gunning-Fog Readability index, SSNMT starts extracting and learning from Wikipedia data suitable for high school students and quickly moves towards content suitable for first year undergraduate students.

pdf abs
Translation Quality Estimation by Jointly Learning to Score and Rank
Jingyi Zhang | Josef van Genabith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The translation quality estimation (QE) task, particularly the QE as a Metric task, aims to evaluate the general quality of a translation based on the translation and the source sentence without using reference translations. Supervised learning of this QE task requires human evaluation of translation quality as training data. Human evaluation of translation quality can be performed in different ways, including assigning an absolute score to a translation or ranking different translations. In order to make use of different types of human evaluation data for supervised learning, we present a multi-task learning QE model that jointly learns two tasks: score a translation and rank two translations. Our QE model exploits cross-lingual sentence embeddings from pre-trained multilingual language models. We obtain new state-of-the-art results on the WMT 2019 QE as a Metric task and outperform sentBLEU on the WMT 2019 Metrics task.

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

pdf abs
Language Data Sharing in European Public Services – Overcoming Obstacles and Creating Sustainable Data Sharing Infrastructures
Lilli Smal | Andrea Lösch | Josef van Genabith | Maria Giagkou | Thierry Declerck | Stephan Busemann
Proceedings of the Twelfth Language Resources and Evaluation Conference

Data is key in training modern language technologies. In this paper, we summarise the findings of the first pan-European study on obstacles to sharing language data across 29 EU Member States and CEF-affiliated countries carried out under the ELRC White Paper action on Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe. Why Language Data Matters. We present the methodology of the study, the obstacles identified and report on recommendations on how to overcome those. The obstacles are classified into (1) lack of appreciation of the value of language data, (2) structural challenges, (3) disposition towards CAT tools and lack of digital skills, (4) inadequate language data management practices, (5) limited access to outsourced translations, and (6) legal concerns. Recommendations are grouped into addressing the European/national policy level, and the organisational/institutional level.

pdf abs
How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech
Yuri Bizzoni | Tom S Juzek | Cristina España-Bonet | Koel Dutta Chowdhury | Josef van Genabith | Elke Teich
Proceedings of the 17th International Conference on Spoken Language Translation

Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we – (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs machine) rather than to the data (written vs spoken).

2019

pdf bib abs
Analysing Coreference in Transformer Outputs
Ekaterina Lapshinova-Koltunski | Cristina España-Bonet | Josef van Genabith
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

pdf abs
Self-Supervised Neural Machine Translation
Dana Ruiter | Cristina España-Bonet | Josef van Genabith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a simple new method where an emergent NMT system is used for simultaneously selecting training data and learning internal NMT representations. This is done in a self-supervised way without parallel data, in such a way that both tasks enhance each other during training. The method is language independent, introduces no additional hyper-parameters, and achieves BLEU scores of 29.21 (en2fr) and 27.36 (fr2en) on newstest2014 using English and French Wikipedia data for training.

pdf abs
JU-Saarland Submission to the WMT2019 English–Gujarati Translation Shared Task
Riktim Mondal | Shankha Raj Nayek | Aditya Chowdhury | Santanu Pal | Sudip Kumar Naskar | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University in the WMT 2019 news translation shared task for English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using Recurrent neural network (RNN) based neural machine translation (NMT) system which follows attention mechanism. Given the fact that the two languages belong to different language families and there is not enough parallel data for this language pair, building a high quality NMT system for this language pair is a difficult task. We produced synthetic data through back-translation from available monolingual data. We report the translation quality of our English–Gujarati and Gujarati–English NMT systems trained at word, byte-pair and character encoding levels where RNN at word level is considered as the baseline and used for comparison purpose. Our English–Gujarati system ranked in the second position in the shared task.

pdf abs
DFKI-NMT Submission to the WMT19 News Translation Task
Jingyi Zhang | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the DFKI-NMT submission to the WMT19 News translation task. We participated in both English-to-German and German-to-English directions. We trained Transformer models and adopted various techniques for effectively training our models, including data selection, back-translation and in-domain fine-tuning. We give a detailed analysis of the performance of our system.

pdf abs
USAAR-DFKI – The Transference Architecture for English–German Automatic Post-Editing
Santanu Pal | Hongfei Xu | Nico Herbig | Antonio Krüger | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper we present an English–German Automatic Post-Editing (APE) system called transference, submitted to the APE Task organized at WMT 2019. Our transference model is based on a multi-encoder transformer architecture. Unlike previous approaches, it (i) uses a transformer encoder block for src, (ii) followed by a transformer decoder block, but without masking, for self-attention on mt, which effectively acts as second encoder combining src –> mt, and (iii) feeds this representation into a final decoder block generating pe. Our model improves over the raw black-box neural machine translation system by 0.9 and 1.0 absolute BLEU points on the WMT 2019 APE development and test set. Our submission ranked 3rd, however compared to the two top systems, performance differences are not statistically significant.

pdf abs
UdS Submission for the WMT 19 Automatic Post-Editing Task
Hongfei Xu | Qiuhui Liu | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper, we describe our submission to the English-German APE shared task at WMT 2019. We utilize and adapt an NMT architecture originally developed for exploiting context information to APE, implement this in our own transformer model and explore joint training of the APE task with a de-noising encoder.

pdf abs
UDS–DFKI Submission to the WMT2019 Czech–Polish Similar Language Translation Shared Task
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper we present the UDS-DFKI system submitted to the Similar Language Translation shared task at WMT 2019. The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish. Participants could choose to participate in any of these three tracks and submit system outputs in any translation direction. We report the results obtained by our system in translating from Czech to Polish and comment on the impact of out-of-domain test data in the performance of our system. UDS-DFKI achieved competitive performance ranking second among ten teams in Czech to Polish translation.

pdf bib
Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation
Mihaela Vela | Santanu Pal | Marcos Zampieri | Sudip Naskar | Josef van Genabith
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

pdf
How Robust Are Character-Based Word Embeddings in Tagging and MT Against Wrod Scramlbing or Randdm Nouse?
Georg Heigold | Stalin Varanasi | Günter Neumann | Josef van Genabith
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Code-Mixing (CM) is the phenomenon of alternating between two or more languages which is prevalent in bi- and multi-lingual communities. Most NLP applications today are still designed with the assumption of a single interaction language and are most likely to break given a CM utterance with multiple languages mixed at a morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research which supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages - Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English) which belong to two language families (Indo-Aryan and Dravidian). We share the details of our data collection process, techniques which were used to avoid inducing lexical bias amongst the crowd workers and other CM specific linguistic properties of the dataset. Our final dataset, which is available freely for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We discuss the techniques used by the participants for the first edition of this ongoing challenge.

pdf abs
A Transformer-Based Multi-Source Automatic Post-Editing System
Santanu Pal | Nico Herbig | Antonio Krüger | Josef van Genabith
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents our English–German Automatic Post-Editing (APE) system submitted to the APE Task organized at WMT 2018 (Chatterjee et al., 2018). The proposed model is an extension of the transformer architecture: two separate self-attention-based encoders encode the machine translation output (mt) and the source (src), followed by a joint encoder that attends over a combination of these two encoded sequences (encsrc and encmt) for generating the post-edited sentence. We compare this multi-source architecture (i.e, {src, mt} → pe) to a monolingual transformer (i.e., mt → pe) model and an ensemble combining the multi-source {src, mt} → pe and single-source mt → pe models. For both the PBSMT and the NMT task, the ensemble yields the best results, followed by the multi-source model and last the single-source approach. Our best model, the ensemble, achieves a BLEU score of 66.16 and 74.22 for the PBSMT and NMT task, respectively.

2017

pdf abs
Predicting the Law Area and Decisions of French Supreme Court Cases
Octavia-Maria Şulea | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court. We also investigate the influence of the time period in which a ruling was made over the textual form of the case description and the extent to which it is necessary to mask the judge’s motivation for a ruling to emulate a real-world test scenario. We report results of 96% f1 score in predicting a case ruling, 90% f1 score in predicting the law area of a case, and 75.9% f1 score in estimating the time span when a ruling has been issued using a linear Support Vector Machine (SVM) classifier trained on lexical features.

pdf abs
An Extensive Empirical Evaluation of Character-Based Morphological Tagging for 14 Languages
Georg Heigold | Guenter Neumann | Josef van Genabith
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets. Character-based approaches are attractive as they can handle rarely- and unseen words gracefully. We evaluate on 14 languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except for English and French, where we match the state-of-the-art. We compare two architectures for computing character-based word vectors using recurrent (RNN) and convolutional (CNN) nets. We show that the CNN based approach performs slightly worse and less consistently than the RNN based approach. Small but systematic gains are observed when combining the two architectures by ensembling.

pdf abs
Neural Automatic Post-Editing Using Prior Alignment and Reranking
Santanu Pal | Sudip Kumar Naskar | Mihaela Vela | Qun Liu | Josef van Genabith
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a second-stage machine translation (MT) system based on a neural machine translation (NMT) approach to automatic post-editing (APE) that improves the translation quality provided by a first-stage MT system. Our APE system (APE_Sym) is an extended version of an attention based NMT model with bilingual symmetry employing bidirectional models, mt–pe and pe–mt. APE translations produced by our system show statistically significant improvements over the first-stage MT, phrase-based APE and the best reported score on the WMT 2016 APE dataset by a previous neural APE system. Re-ranking (APE_Rerank) of the n-best translations from the phrase-based APE and APE_Sym systems provides further substantial improvements over the symmetric neural APE model. Human evaluation confirms that the APE_Rerank generated PE translations improve on the previous best neural APE system at WMT 2016.

Web debates play an important role in enabling broad participation of constituencies in social, political and economic decision-taking. However, it is challenging to organize, structure, and navigate a vast number of diverse argumentations and comments collected from many participants over a long time period. In this paper we demonstrate Common Round, a next generation platform for large-scale web debates, which provides functions for eliciting the semantic content and structures from the contributions of participants. In particular, Common Round applies language technologies for the extraction of semantic essence from textual input, aggregation of the formulated opinions and arguments. The platform also provides a cross-lingual access to debates using machine translation.

pdf abs
The Effect of Error Rate in Artificially Generated Data for Automatic Preposition and Determiner Correction
Fraser Bowen | Jon Dehdari | Josef van Genabith
Proceedings of the 3rd Workshop on Noisy User-generated Text

In this research we investigate the impact of mismatches in the density and type of error between training and test data on a neural system correcting preposition and determiner errors. We use synthetically produced training data to control error density and type, and “real” error data for testing. Our results show it is possible to combine error types, although prepositions and determiners behave differently in terms of how much error should be artificially introduced into the training data in order to get the best results.

pdf abs
Massively Multilingual Neural Grapheme-to-Phoneme Conversion
Ben Peters | Jon Dehdari | Josef van Genabith
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

Grapheme-to-phoneme conversion (g2p) is necessary for text-to-speech and automatic speech recognition systems. Most g2p systems are monolingual: they require language-specific data or handcrafting of rules. Such systems are difficult to extend to low resource languages, for which data and handcrafted rules are not available. As an alternative, we present a neural sequence-to-sequence approach to g2p which is trained on spelling–pronunciation pairs in hundreds of languages. The system shares a single encoder and decoder across all languages, allowing it to utilize the intrinsic similarities between different writing systems. We show an 11% improvement in phoneme error rate over an approach based on adapting high-resource monolingual g2p models to low-resource languages. Our model is also much more compact relative to previous approaches.

pdf bib abs
Going beyond zero-shot MT: combining phonological, morphological and semantic factors. The UdS-DFKI System at IWSLT 2017
Cristina España-Bonet | Josef van Genabith
Proceedings of the 14th International Conference on Spoken Language Translation

This paper describes the UdS-DFKI participation to the multilingual task of the IWSLT Evaluation 2017. Our approach is based on factored multilingual neural translation systems following the small data and zero-shot training conditions. Our systems are designed to fully exploit multilinguality by including factors that increase the number of common elements among languages such as phonetic coarse encodings and synsets, besides shallow part-of-speech tags, stems and lemmas. Document level information is also considered by including the topic of every document. This approach improves a baseline without any additional factor for all the language pairs and even allows beyond-zero-shot translation. That is, the translation from unseen languages is possible thanks to the common elements —especially synsets in our models— among languages.

2016

pdf abs
CATaLog Online: Porting a Post-editing Tool to the Web
Santanu Pal | Marcos Zampieri | Sudip Kumar Naskar | Tapas Nayak | Mihaela Vela | Josef van Genabith
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents CATaLog online, a new web-based MT and TM post-editing tool. CATaLog online is a freeware software that can be used through a web browser and it requires only a simple registration. The tool features a number of editing and log functions similar to the desktop version of CATaLog enhanced with several new features that we describe in detail in this paper. CATaLog online is designed to allow users to post-edit both translation memory segments as well as machine translation output. The tool provides a complete set of log information currently not available in most commercial CAT tools. Log information can be used both for project management purposes as well as for the study of the translation process and translator’s productivity.

pdf abs
Fostering the Next Generation of European Language Technology: Recent Developments ― Emerging Initiatives ― Challenges and Opportunities
Georg Rehm | Jan Hajič | Josef van Genabith | Andrejs Vasiljevs
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

META-NET is a European network of excellence, founded in 2010, that consists of 60 research centres in 34 European countries. One of the key visions and goals of META-NET is a truly multilingual Europe, which is substantially supported and realised through language technologies. In this article we provide an overview of recent developments around the multilingual Europe topic, we also describe recent and upcoming events as well as recent and upcoming strategy papers. Furthermore, we provide overviews of two new emerging initiatives, the CEF.AT and ELRC activity on the one hand and the Cracking the Language Barrier federation on the other. The paper closes with several suggested next steps in order to address the current challenges and to open up new opportunities.

pdf
SAARSHEFF at SemEval-2016 Task 1: Semantic Textual Similarity with Machine Translation Evaluation Metrics and (eXtreme) Boosted Tree Ensembles
Liling Tan | Carolina Scarton | Lucia Specia | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
WOLVESAAR at SemEval-2016 Task 1: Replicating the Success of Monolingual Word Alignment and Neural Embeddings for Semantic Textual Similarity
Hannah Bechara | Rohit Gupta | Liling Tan | Constantin Orăsan | Ruslan Mitkov | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
MacSaar at SemEval-2016 Task 11: Zipfian and Character Features for ComplexWord Identification
Marcos Zampieri | Liling Tan | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
USAAR at SemEval-2016 Task 13: Hyponym Endocentricity
Liling Tan | Francis Bond | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf abs
Modeling Diachronic Change in Scientific Writing with Information Density
Raphael Rubino | Stefania Degaetano-Ortlieb | Elke Teich | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Previous linguistic research on scientific writing has shown that language use in the scientific domain varies considerably in register and style over time. In this paper we investigate the introduction of information theory inspired features to study long term diachronic change on three levels: lexis, part-of-speech and syntax. Our approach is based on distinguishing between sentences from 19th and 20th century scientific abstracts using supervised classification models. To the best of our knowledge, the introduction of information theoretic features to this task is novel. We show that these features outperform more traditional features, such as token or character n-grams, while leading to more compact models. We present a detailed analysis of feature informativeness in order to gain a better understanding of diachronic change on different linguistic levels.

pdf abs
Multi-Engine and Multi-Alignment Based Automatic Post-Editing and its Impact on Translation Productivity
Santanu Pal | Sudip Kumar Naskar | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper we combine two strands of machine translation (MT) research: automatic post-editing (APE) and multi-engine (system combination) MT. APE systems learn a target-language-side second stage MT system from the data produced by human corrected output of a first stage MT system, to improve the output of the first stage MT in what is essentially a sequential MT system combination architecture. At the same time, there is a rich research literature on parallel MT system combination where the same input is fed to multiple engines and the best output is selected or smaller sections of the outputs are combined to obtain improved translation output. In the paper we show that parallel system combination in the APE stage of a sequential MT-APE combination yields substantial translation improvements both measured in terms of automatic evaluation metrics as well as in terms of productivity improvements measured in a post-editing experiment. We also show that system combination on the level of APE alignments yields further improvements. Overall our APE system yields statistically significant improvement of 5.9% relative BLEU over a strong baseline (English–Italian Google MT) and 21.76% productivity increase in a human post-editing experiment with professional translators.

pdf abs
CATaLog Online: A Web-based CAT Tool for Distributed Translation with Data Capture for APE and Translation Process Research
Santanu Pal | Sudip Kumar Naskar | Marcos Zampieri | Tapas Nayak | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a free web-based CAT tool called CATaLog Online which provides a novel and user-friendly online CAT environment for post-editors/translators. The goal is to support distributed translation, reduce post-editing time and effort, improve the post-editing experience and capture data for incremental MT/APE (automatic post-editing) and translation process research. The tool supports individual as well as batch mode file translation and provides translations from three engines – translation memory (TM), MT and APE. TM suggestions are color coded to accelerate the post-editing task. The users can integrate their personal TM/MT outputs. The tool remotely monitors and records post-editing activities generating an extensive range of post-editing logs.

pdf
Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification
Raphael Rubino | Ekaterina Lapshinova-Koltunski | Josef van Genabith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
BIRA: Improved Predictive Exchange Word Clustering
Jon Dehdari | Liling Tan | Josef van Genabith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Scaling Up Word Clustering
Jon Dehdari | Liling Tan | Josef van Genabith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf
A Neural Network based Approach to Automatic Post-Editing
Santanu Pal | Sudip Kumar Naskar | Mihaela Vela | Josef van Genabith
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
USAAR: An Operation Sequential Model for Automatic Statistical Post-Editing
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

pdf
UdS-Sant: English–German Hybrid Machine Translation System
Santanu Pal | Sudip Naskar | Josef van Genabith
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf
USAAR-SAPE: An English–Spanish Statistical Automatic Post-Editing System
Santanu Pal | Mihaela Vela | Sudip Kumar Naskar | Josef van Genabith
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf
Machine Translation Evaluation using Recurrent Neural Networks
Rohit Gupta | Constantin Orăsan | Josef van Genabith
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf
Passive and Pervasive Use of Bilingual Dictionary in Statistical Machine Translation
Liling Tan | Josef van Genabith | Francis Bond
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

pdf
Can Translation Memories afford not to use paraphrasing?
Rohit Gupta | Constantin Orăsan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation
Carolina Scarton | Marcos Zampieri | Mihaela Vela | Josef van Genabith | Lucia Specia
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Re-assessing the WMT2013 Human Evaluation with Professional Translators Trainees
Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
An Awkward Disparity between BLEU / RIBES Scores and Human Judgements in Machine Translation
Liling Tan | Jon Dehdari | Josef van Genabith
Proceedings of the 2nd Workshop on Asian Translation (WAT2015)

pdf
Comparing Approaches to the Identification of Similar Languages
Marcos Zampieri | Binyam Gebrekidan Gebre | Hernani Costa | Josef van Genabith
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf
USAAR-SHEFFIELD: Semantic Textual Similarity with Deep Regression and Machine Translation Evaluation Metrics
Liling Tan | Carolina Scarton | Lucia Specia | Josef van Genabith
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf
USAAR-WLV: Hypernym Generation with Deep Neural Nets
Liling Tan | Rohit Gupta | Josef van Genabith
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf
Can Translation Memories afford not to use paraphrasing ?
Rohit Gupta | Constantin Orasan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks
Rohit Gupta | Constantin Orăsan | Josef van Genabith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf
Active Learning for Post-Editing Based Incrementally Retrained MT
Aswarth Abhilash Dara | Josef van Genabith | Qun Liu | John Judge | Antonio Toral
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiatives work throughout Europe in order to boost progress and innovation in our field.

2013

pdf
Shallow Semantically-Informed PBSMT and HPBSMT
Tsuyoshi Okita | Qun Liu | Josef van Genabith
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf
TMTprime: A Recommender System for MT and TM Integration
Aswarth Abhilash Dara | Sandipan Dandapat | Declan Groves | Josef van Genabith
Proceedings of the 2013 NAACL HLT Demonstration Session

pdf
CNGL-CORE: Referential Translation Machines for Measuring Semantic Similarity
Ergun Biçici | Josef van Genabith
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf
CNGL: Grading Student Answers by Acts of Translation
Ergun Biçici | Josef van Genabith
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf
Quality Estimation-guided Data Selection for Domain Adaptation of SMT
Pratyush Banerjee | Raphael Rubino | Johann Roturier | Josef van Genabith
Proceedings of Machine Translation Summit XIV: Papers

2012

pdf abs
Irish Treebanking and Parsing: A Preliminary Evaluation
Teresa Lynn | Özlem Çetinoğlu | Jennifer Foster | Elaine Uí Dhonnchadha | Mark Dras | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language resources are essential for linguistic research and the development of NLP applications. Low-density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish ― namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage upon the few existing resources. We discuss language-specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula, and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development.

pdf abs
A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation
Eleftherios Avramidis | Marta R. Costa-jussà | Christian Federmann | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual component translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data.

pdf abs
Arabic Word Generation and Modelling for Spell Checking
Khaled Shaalan | Mohammed Attia | Pavel Pecina | Younes Samih | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Arabic is a language known for its rich and complex morphology. Although many research projects have focused on the problem of Arabic morphological analysis using different techniques and approaches, very few have addressed the issue of generation of fully inflected words for the purpose of text authoring. Available open-source spell checking resources for Arabic are too small and inadequate. Ayaspell, for example, the official resource used with OpenOffice applications, contains only 300,000 fully inflected words. We try to bridge this critical gap by creating an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words. Furthermore, from a large list of valid forms and invalid forms we create a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors. Testing of this language model gives a precision of 98.2% at a recall of 100%. We take our research a step further by creating a context-independent spelling correction tool using a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules. Our system performs significantly better than Hunspell in choosing the best solution, but it is still below the MS Spell Checker.

pdf abs
Automatic Extraction and Evaluation of Arabic LFG Resources
Mohammed Attia | Khaled Shaalan | Lamia Tounsi | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the results of an approach to automatically acquire large-scale, probabilistic Lexical-Functional Grammar (LFG) resources for Arabic from the Penn Arabic Treebank (ATB). Our starting point is the earlier, work of (Tounsi et al., 2009) on automatic LFG f(eature)-structure annotation for Arabic using the ATB. They exploit tree configuration, POS categories, functional tags, local heads and trace information to annotate nodes with LFG feature-structure equations. We utilize this annotation to automatically acquire grammatical function (dependency) based subcategorization frames and paths linking long-distance dependencies (LDDs). Many state-of-the-art treebank-based probabilistic parsing approaches are scalable and robust but often also shallow: they do not capture LDDs and represent only local information. Subcategorization frames and LDD paths can be used to recover LDDs from such parser output to capture deep linguistic information. Automatic acquisition of language resources from existing treebanks saves time and effort involved in creating such resources by hand. Moreover, data-driven automatic acquisition naturally associates probabilistic information with subcategorization frames and LDD paths. Finally, based on the statistical distribution of LDD path types, we propose empirical bounds on traditional regular expression based functional uncertainty equations used to handle LDDs in LFG.

pdf abs
The ML4HMT Workshop on Optimising the Division of Labour in Hybrid Machine Translation
Christian Federmann | Eleftherios Avramidis | Marta R. Costa-jussà | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation (ML4HMT) which aims to foster research on improved system combination approaches for machine translation (MT). Participants of the challenge are requested to build hybrid translations by combining the output of several MT systems of different types. We first describe the ML4HMT corpus used in the shared task, then explain the XLIFF-based annotation format we have designed for it, and briefly summarize the participating systems. Using both automated metrics scores and extensive manual evaluation, we discuss the individual performance of the various systems. An interesting result from the shared task is the fact that we were able to observe different systems winning according to the automated metrics scores when compared to the results from the manual evaluation. We conclude by summarising the first edition of the challenge and by giving an outlook to future work.

pdf
Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study
Pavel Pecina | Antonio Toral | Vassilis Papavassiliou | Prokopis Prokopidis | Josef van Genabith
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf
Domain Adaptation in SMT of User-Generated Forum Content Guided by OOV Word Reduction: Normalization and/or Supplementary Data
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf
Head-Driven Hierarchical Phrase-based Translation
Junhui Li | Zhaopeng Tu | Guodong Zhou | Josef van Genabith
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification
Zhaopeng Tu | Yifan He | Jennifer Foster | Josef van Genabith | Qun Liu | Shouxun Lin
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the Detection and Lemmatization of Unknown Words
Mohammed Attia | Younes Samih | Khaled Shaalan | Josef van Genabith
Proceedings of COLING 2012

pdf
Translation Quality-Based Supplementary Data Selection by Incremental Update of Translation Models
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of COLING 2012

pdf
An Evaluation of Statistical Post-Editing Systems Applied to RBMT and SMT Systems
Hanna Béchara | Raphaël Rubino | Yifan He | Yanjun Ma | Josef van Genabith
Proceedings of COLING 2012

pdf
Simple and Effective Parameter Tuning for Domain Adaptation of Statistical Machine Translation
Pavel Pecina | Antonio Toral | Josef van Genabith
Proceedings of COLING 2012

pdf
Improved Spelling Error Detection and Correction for Arabic
Mohammed Attia | Pavel Pecina | Younes Samih | Khaled Shaalan | Josef van Genabith
Proceedings of COLING 2012: Posters

pdf
Combining EBMT, SMT, TM and IR Technologies for Quality and Scale
Sandipan Dandapat | Sara Morrissey | Andy Way | Josef van Genabith
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

pdf
Using Syntactic Head Information in Hierarchical Phrase-Based Translation
Junhui Li | Zhaopeng Tu | Guodong Zhou | Josef van Genabith
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT
Josef van Genabith | Toni Badia | Christian Federmann | Maite Melero | Marta R. Costa-jussà | Tsuyoshi Okita
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

pdf
System Combination with Extra Alignment Information
Xiaofeng Wu | Tsuyoshi Okita | Josef van Genabith | Qun Liu
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

pdf
Topic Modeling-based Domain Adaptation for System Combination
Tsuyoshi Okita | Antonio Toral | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

pdf
Sentence-Level Quality Estimation for MT System Combination
Tsuyoshi Okita | Raphaël Rubino | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

pdf
Results from the ML4HMT-12 Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation
Christian Federmann | Tsuyoshi Okita | Maite Melero | Marta R. Costa-Jussa | Toni Badia | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

2011

pdf
Decreasing Lexical Data Sparsity in Statistical Syntactic Parsing - Experiments with Named Entities
Deirdre Hogan | Jennifer Foster | Josef van Genabith
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf
DCU at Generation Challenges 2011 Surface Realisation Track
Yuqing Guo | Deirdre Hogan | Josef van Genabith
Proceedings of the 13th European Workshop on Natural Language Generation

pdf
Comparing the Use of Edited and Unedited Text in Parser Self-Training
Jennifer Foster | Özlem Çetinoğlu | Joachim Wagner | Josef van Genabith
Proceedings of the 12th International Conference on Parsing Technologies

pdf bib
Morphological Features for Parsing Morphologically-rich Languages: A Case of Arabic
Jon Dehdari | Lamia Tounsi | Josef van Genabith
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages

pdf
An Open-Source Finite State Morphological Transducer for Modern Standard Arabic
Mohammed Attia | Pavel Pecina | Antonio Toral | Lamia Tounsi | Josef van Genabith
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

In this paper, we provide a description of the Dublin City University’s (DCU) submissions in the IWSLT 2011 evaluationcampaign.1 WeparticipatedintheArabic-Englishand Chinese-English Machine Translation(MT) track translation tasks. We use phrase-based statistical machine translation (PBSMT) models to create the baseline system. Due to the open-domain nature of the data to be translated, we use domain adaptation techniques to improve the quality of translation. Furthermore, we explore target-side syntactic augmentation for an Hierarchical Phrase-Based (HPB) SMT model. Combinatory Categorial Grammar (CCG) is used to extract labels for target-side phrases and non-terminals in the HPB system. Combining the domain adapted language models with the CCG-augmented HPB system gave us the best translations for both language pairs providing statistically significant improvements of 6.09 absolute BLEU points (25.94% relative) and 1.69 absolute BLEU points (15.89% relative) over the unadapted PBSMT baselines for the Arabic-English and Chinese-English language pairs, respectively.

pdf
Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
Yanjun Ma | Yifan He | Andy Way | Josef van Genabith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component Level Mixture Modelling
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf
Statistical Post-Editing for a Statistical MT System
Hanna Bechara | Yanjun Ma | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf
Rich Linguistic Features for Translation Memory-Inspired Consistent Translation
Yifan He | Yanjun Ma | Andy Way | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf bib abs
From the Confidence Estimation of Machine Translation to the Integration of MT and Translation Memory
Yanjun Ma | Yifan He | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Tutorial Abstracts

In this tutorial, we cover techniques that facilitate the integration of Machine Translation (MT) and Translation Memory (TM), which can help the adoption of MT technology in localisation industry. The tutorial covers four parts: i) brief introduction of MT and TM systems, ii) MT confidence estimation measures tailored for the TM environment, iii) segment-level MT and MT integration, iv) sub-segment level MT and TM integration, and v) human evaluation of MT and TM integration. We will first briefly describe and compare how translations are generated in MT and TM systems, and suggest possible avenues to combines these two systems. We will also cover current quality / cost estimation measures applied in MT and TM systems, such as the fuzzy-match score in the TM, and the evaluation/confidence metrics used to judge MT outputs. We then move on to introduce the recent developments in the field of MT confidence estimation tailored towards predicting post-editing efforts. We will especially focus on the confidence metrics proposed by Specia et al., which is shown to have high correlation with human preference, as well as post-editing time. For segment-level MT and TM integration, we present translation recommendation and translation re-ranking models, where the integration happens at the 1-best or the N-best level, respectively. Given an input to be translated, MT-TM recommendation compares the output from the MT and the TM systems, and presents the better one to the post-editor. MT-TM re-ranking, on the other hand, combines k-best lists from both systems, and generates a new list according to estimated post-editing effort. We observe high precision of these models in automatic and human evaluations, indicating that they can be integrated into TM environments without the risk of deteriorating the quality of the post-editing candidate. For sub-segment level MT and TM integration, we try to reuse high quality TM chunks to improve the quality of MT systems. We can also predict whether phrase pairs derived from fuzzy matches should be used to constrain the translation of an input segment. Using a series of linguistically- motivated features, our constraints lead both to more consistent translation output, and to improved translation quality, as is measured by automatic evaluation scores. Finally, we present several methodologies that can be used to track post-editing effort, perform human evaluation of MT-TM integration, or help translators to access MT outputs in a TM environment.

2010

pdf
Integrating N-best SMT Outputs into a TM System
Yifan He | Yanjun Ma | Andy Way | Josef van Genabith
Coling 2010: Posters

pdf
Bridging SMT and TM with Translation Recommendation
Yifan He | Yanjun Ma | Josef van Genabith | Andy Way
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf
Hard Constraints for Grammatical Function Labelling
Wolfgang Seeker | Ines Rehbein | Jonas Kuhn | Josef van Genabith
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Wide-Coverage NLP with Linguistically Expressive Grammars
Julia Hockenmaier | Yusuke Miyao | Josef van Genabith
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf
Handling Unknown Words in Statistical Latent-Variable Parsing Models for Arabic, English and French
Mohammed Attia | Jennifer Foster | Deirdre Hogan | Joseph Le Roux | Lamia Tounsi | Josef van Genabith
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf
Lemmatization and Lexicalized Statistical Parsing of Morphologically-Rich Languages: the Case of French
Djamé Seddah | Grzegorz Chrupała | Özlem Çetinoğlu | Josef van Genabith | Marie Candito
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf
The DCU Dependency-Based Metric in WMT-MetricsMATR 2010
Yifan He | Jinhua Du | Andy Way | Josef van Genabith
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
Automatic Extraction of Arabic Multiword Expressions
Mohammed Attia | Antonio Toral | Lamia Tounsi | Pavel Pecina | Josef van Genabith
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf
Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment
Ventsislav Zhechev | Josef van Genabith
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
Deep Syntax Language Models and Statistical Machine Translation
Yvette Graham | Josef van Genabith
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
Finding Common Ground: Towards a Surface Realisation Shared Task
Anja Belz | Mike White | Josef van Genabith | Deirdre Hogan | Amanda Stent
Proceedings of the 6th International Natural Language Generation Conference

pdf abs
An Automatically Built Named Entity Lexicon for Arabic
Mohammed Attia | Antonio Toral | Lamia Tounsi | Monica Monachini | Josef van Genabith
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We have adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWNs instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the LMF ISO standard. We conduct a quantitative and qualitative evaluation against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold.

pdf abs
Arabic Parsing Using Grammar Transforms
Lamia Tounsi | Josef van Genabith
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We investigate Arabic Context Free Grammar parsing with dependency annotation comparing lexicalised and unlexicalised parsers. We study how morphosyntactic as well as function tag information percolation in the form of grammar transforms (Johnson, 1998, Kulick et al., 2006) affects the performance of a parser and helps dependency assignment. We focus on the three most frequent functional tags in the Arabic Penn Treebank: subjects, direct objects and predicates . We merge these functional tags with their phrasal categories and (where appropriate) percolate case information to the non-terminal (POS) category to train the parsers. We then automatically enrich the output of these parsers with full dependency information in order to annotate trees with Lexical Functional Grammar (LFG) f-structure equations with produce f-structures, i.e. attribute-value matrices approximating to basic predicate-argument-adjunct structure representations. We present a series of experiments evaluating how well lexicalized, history-based, generative (Bikel) as well as latent variable PCFG (Berkeley) parsers cope with the enriched Arabic data. We measure quality and coverage of both the output trees and the generated LFG f-structures. We show that joint functional and morphological information percolation improves both the recovery of trees as well as dependency results in the form of LFG f-structures.

pdf abs
Partial Dependency Parsing for Irish
Elaine Uí Dhonnchadha | Josef Van Genabith
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a partial dependency parser for Irish. Constraint Grammar (CG) based rules are used to annotate dependency relations and grammatical functions. Chunking is performed using a regular-expression grammar which operates on the dependency tagged sentences. As this is the first implementation of a parser for unrestricted Irish text (to our knowledge), there were no guidelines or precedents available. Therefore deciding what constitutes a syntactic unit, and how it should be annotated, accounts for a major part of the early development effort. Currently, all tokens in a sentence are tagged for grammatical function and local dependency. Long-distance dependencies, prepositional attachments or coordination are not handled, resulting in a partial dependency analysis. Evaluations show that the partial dependency analysis achieves an f-score of 93.60% on development data and 94.28% on unseen test data, while the chunker achieves an f-score of 97.20% on development data and 93.50% on unseen test data.

pdf
Factor templates for factored machine translation models
Yvette Graham | Josef van Genabith
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

pdf abs
f-align: An Open-Source Alignment Tool for LFG f-Structures
Anton Bryl | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Lexical-Functional Grammar (LFG) f-structures (Kaplan and Bresnan, 1982) have attracted some attention in recent years as an intermediate data representation for statistical machine translation. So far, however, there are no alignment tools capable of aligning f-structures directly, and plain word alignment is used for this purpose. In this way no use is made of the structural information contained in f-structures. We present the first version of a specialized f-structure alignment open-source software.

pdf abs
Combining Multi-Domain Statistical Machine Translation Models using Automatic Classifiers
Pratyush Banerjee | Jinhua Du | Baoli Li | Sudip Naskar | Andy Way | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper presents a set of experiments on Domain Adaptation of Statistical Machine Translation systems. The experiments focus on Chinese-English and two domain-specific corpora. The paper presents a novel approach for combining multiple domain-trained translation models to achieve improved translation quality for both domain-specific as well as combined sets of sentences. We train a statistical classifier to classify sentences according to the appropriate domain and utilize the corresponding domain-specific MT models to translate them. Experimental results show that the method achieves a statistically significant absolute improvement of 1.58 BLEU (2.86% relative improvement) score over a translation model trained on combined data, and considerable improvements over a model using multiple decoding paths of the Moses decoder, for the combined domain test set. Furthermore, even for domain-specific test sets, our approach works almost as well as dedicated domain-specific models and perfect classification.

pdf abs
Maximizing TM Performance through Sub-Tree Alignment and SMT
Ventsislav Zhechev | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, in particular focusing on the use of high-quality Statistical Machine Translation (SMT) supplementing the established Translation Memory (TM) technology. In this paper, we present a novel modular approach that utilises state-of-the-art sub-tree alignment and SMT techniques to turn the fuzzy matches from a TM into near-perfect translations. Rather than relegate SMT to a last-resort status where it is only used should the TM system fail to produce the desired output, for us SMT is an integral part of the translation process that we rely on to obtain high-quality results. We show that the presented system consistently produces better-quality output than the TM and performs on par or better than the standalone SMT system.

pdf abs
Improving the Post-Editing Experience using Translation Recommendation: A User Study
Yifan He | Yanjun Ma | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We report findings from a user study with professional post-editors using a translation recommendation framework (He et al., 2010) to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM. We analyze the effectiveness of the model as well as the reaction of potential users. Based on the performance statistics and the users’ comments, we find that translation recommendation can reduce the workload of professional post-editors and improve the acceptance of MT in the localization industry.

2009

pdf
Experiments on Domain Adaptation for English–Hindi SMT
Rejwanul Haque | Sudip Kumar Naskar | Josef van Genabith | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf
Automatic Treebank-Based Acquisition of Arabic LFG Dependency Structures
Lamia Tounsi | Mohammed Attia | Josef van Genabith
Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages

pdf
Guessing the Grammatical Function of a Non-Root F-Structure in LFG
Anton Bryl | Josef van Genabith | Yvette Graham
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf
Dependency Parsing Resources for French: Converting Acquired Lexical Functional Grammar F-Structure Annotations and Parsing F-Structures Directly
Natalie Schluter | Josef van Genabith
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

pdf
Dependency-Based N-Gram Models for General Purpose Sentence Realisation
Yuqing Guo | Josef van Genabith | Haifeng Wang
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf abs
Parser Evaluation and the BNC: Evaluating 4 constituency parsers with 3 metrics
Jennifer Foster | Josef van Genabith
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We evaluate discriminative parse reranking and parser self-training on a new English test set using four versions of the Charniak parser and a variety of parser evaluation metrics. The new test set consists of 1,000 hand-corrected British National Corpus parse trees. We directly evaluate parser output using both the Parseval and the Leaf Ancestor metrics. We also convert the hand-corrected and parser output phrase structure trees to dependency trees using a state-of-the-art functional tag labeller and constituent-to-dependency conversion tool, and then calculate label accuracy, unlabelled attachment and labelled attachment scores over the dependency structures. We find that reranking leads to a performance improvement on the new test set (albeit a modest one). We find that self-training using BNC data leads to significantly better results. However, it is not clear how effective self-training is when the training material comes from the North American News Corpus.

pdf abs
Learning Morphology with Morfette
Grzegorz Chrupala | Georgiana Dinu | Josef van Genabith
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum-Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from word forms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.

pdf abs
Treebank-Based Acquisition of LFG Parsing Resources for French
Natalie Schluter | Josef van Genabith
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Motivated by the expense in time and other resources to produce hand-crafted grammars, there has been increased interest in automatically obtained wide-coverage grammars from treebanks for natural language processing. In particular, recent years have seen the growth in interest in automatically obtained deep resources that can represent information absent from simple CFG-type structured treebanks and which are considered to produce more language-neutral linguistic representations, such as dependency syntactic trees. As is often the case in early pioneering work on natural language processing, English has provided the focus of first efforts towards acquiring deep-grammar resources, followed by successful treatments of, for example, German, Japanese, Chinese and Spanish. However, no comparable large-scale automatically acquired deep-grammar resources have been obtained for French to date. The goal of this paper is to present the application of treebank-based language acquisition to the case of French. We show that with modest changes to the established parsing architectures, encouraging results can be obtained for French, with an overall best dependency structure f-score of 86.73%.

pdf
Adapting a WSJ-Trained Parser to Grammatically Noisy Text
Jennifer Foster | Joachim Wagner | Josef van Genabith
Proceedings of ACL-08: HLT, Short Papers

pdf
Accurate and Robust LFG-Based Generation for Chinese
Yuqing Guo | Haifeng Wang | Josef van Genabith
Proceedings of the Fifth International Natural Language Generation Conference

pdf
Parser-Based Retraining for Domain Adaptation of Probabilistic Generators
Deirdre Hogan | Jennifer Foster | Joachim Wagner | Josef van Genabith
Proceedings of the Fifth International Natural Language Generation Conference

pdf
Packed rules for automatic transfer-rule induction
Yvette Graham | Josef van Genabith
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

2007

pdf
Dependency-Based Automatic Evaluation for Machine Translation
Karolina Owczarzak | Josef van Genabith | Andy Way
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

pdf
Labelled Dependencies in Machine Translation Evaluation
Karolina Owczarzak | Josef van Genabith | Andy Way
Proceedings of the Second Workshop on Statistical Machine Translation

pdf
Adapting WSJ-Trained Parsers to the British National Corpus using In-Domain Self-Training
Jennifer Foster | Joachim Wagner | Djamé Seddah | Josef van Genabith
Proceedings of the Tenth International Conference on Parsing Technologies

pdf
Evaluating Evaluation Measures
Ines Rehbein | Josef van Genabith
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

pdf bib
Automatic evaluation of generation and parsing for machine translation with automatically acquired transfer rules
Yvette Graham | Deirdre Hogan | Josef van Genabith
Proceedings of the Workshop on Using corpora for natural language generation

pdf
Automatic Acquisition of Lexical-Functional Grammar Resources from a Japanese Dependency Corpus
Masanori Oya | Josef van Genabith
Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation

pdf
A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors
Joachim Wagner | Jennifer Foster | Josef van Genabith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
Recovering Non-Local Dependencies for Chinese
Yuqing Guo | Haifeng Wang | Josef van Genabith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
Exploiting Multi-Word Units in History-Based Probabilistic Generation
Deirdre Hogan | Conor Cafferkey | Aoife Cahill | Josef van Genabith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
Treebank Annotation Schemes and Parser Evaluation for German
Ines Rehbein | Josef van Genabith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf
QuestionBank: Creating a Corpus of Parse-Annotated Questions
John Judge | Aoife Cahill | Josef van Genabith
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf
Robust PCFG-Based Generation Using Automatically Acquired LFG Approximations
Aoife Cahill | Josef van Genabith
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf
Using Machine-Learning to Assign Function Labels to Parser Output for Spanish
Grzegorz Chrupała | Josef van Genabith
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf abs
A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation
E. Uí Dhonnchadha | J. Van Genabith
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the methodology used to develop a part-of-speech tagger for Irish, which is used to annotate a corpus of 30 million words of text with part-of-speech tags and lemmas. The tagger is evaluated using a manually disambiguated test corpus and it currently achieves 95% accuracy on unrestricted text. To our knowledge, this is the first part-of-speech tagger for Irish.

pdf abs
Multi-Engine Machine Translation by Recursive Sentence Decomposition
Bart Mellebeek | Karolina Owczarzak | Josef Van Genabith | Andy Way
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

In this paper, we present a novel approach to combine the outputs of multiple MT engines into a consensus translation. In contrast to previous Multi-Engine Machine Translation (MEMT) techniques, we do not rely on word alignments of output hypotheses, but prepare the input sentence for multi-engine processing. We do this by using a recursive decomposition algorithm that produces simple chunks as input to the MT engines. A consensus translation is produced by combining the best chunk translations, selected through majority voting, a trigram language model score and a confidence score assigned to each MT engine. We report statistically significant relative improvements of up to 9% BLEU score in experiments (English→Spanish) carried out on an 800-sentence test set extracted from the Penn-II Treebank.

pdf abs
Wrapper Syntax for Example-based Machine Translation
Karolina Owczarzak | Bart Mellebeek | Declan Groves | Josef Van Genabith | Andy Way
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation systems. Using linguistically motivated syntactic information, it automatically decomposes source language sentences into shorter and syntactically simpler chunks, and recomposes their translation to form target language sentences. This generally improves both the word order and lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets extracted from Europarl and the Penn II Treebank we show that our method can raise the BLEU score up to 3.8% relative to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces a better output in terms of fluency than the baseline EBMT in 55% of the cases and in terms of accuracy in 53% of the cases.

pdf
German Particle Verbs and Pleonastic Prepositions
Ines Rehbein | Josef van Genabith
Proceedings of the Third ACL-SIGSEM Workshop on Prepositions

pdf
Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation
Karolina Owczarzak | Declan Groves | Josef Van Genabith | Andy Way
Proceedings on the Workshop on Statistical Machine Translation

pdf
A Syntactic Skeleton for Statistical Machine Translation
Bart Mellebeek | Karolina Owczarzak | Declan Groves | Josef Van Genabith | Andy Way
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

pdf
Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks
Ruth O’Donovan | Michael Burke | Aoife Cahill | Josef van Genabith | Andy Way
Computational Linguistics, Volume 31, Number 3, September 2005

pdf abs
Improving Online Machine Translation Systems
Bart Mellebeek | Anna Khasin | Karolina Owczarzak | Josef Van Genabith | Andy Way
Proceedings of Machine Translation Summit X: Papers

In (Mellebeek et al., 2005), we proposed the design, implementation and evaluation of a novel and modular approach to boost the translation performance of existing, wide-coverage, freely available machine translation systems, based on reliable and fast automatic decomposition of the translation input and corresponding composition of translation output. Despite showing some initial promise, our method did not improve on the baseline Logomedia1 and Systran2 MT systems. In this paper, we improve on the algorithm presented in (Mellebeek et al., 2005), and on the same test data, show increased scores for a range of automatic evaluation metrics. Our algorithm now outperforms Logomedia, obtains similar results to SDL3 and falls tantalisingly short of the performance achieved by Systran.

pdf
TransBooster: boosting the performance of wide-coverage machine translation systems
Bart Mellebeek | Anna Khasin | Josef Van Genabith | Andy Way
Proceedings of the 10th EAMT Conference: Practical applications of machine translation