Ekaterina Lapshinova-Koltunski

Also published as: Ekaterina Lapshinova-koltunski


2022

Using Translation Process Data to Explore Explicitation and Implicitation through Discourse Connectives
Ekaterina Lapshinova-Koltunski | Michael Carl
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

We look into English-German translation process data to analyse explicitation and implicitation phenomena of discourse connectives. For this, we use the database CRITT TPR-DB, which contains translation process data with various features that elicit online translation behaviour. We explore the English-German part of the data for discourse connectives that are either omitted or inserted in the target, as well as for cases in which a weak signal is changed to a strong one, or the other way around. We determine several features that have an impact on cognitive effort during translation for explicitation and implicitation. Our results show that cognitive load caused by implicitation and explicitation may depend on the discourse connectives used, as well as on the strength and the type of the relations the connectives convey.

ParCorFull2.0: a Parallel Corpus Annotated with Full Coreference
Ekaterina Lapshinova-Koltunski | Pedro Augusto Ferreira | Elina Lartaud | Christian Hardmeier
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we describe ParCorFull2.0, a parallel corpus annotated with full coreference chains for multiple languages, which is an extension of the existing corpus ParCorFull (Lapshinova-Koltunski et al., 2018). Similar to the previous version, this corpus has been created to address translation of coreference across languages, a phenomenon still challenging for machine translation (MT) and other multilingual natural language processing (NLP) applications. The current version of the corpus that we present here contains parallel texts not only for the language pair English-German, but also for English-French and English-Portuguese, all of which involve major European languages. The two newly added languages belong to the Romance group. The addition of a new language group creates a need for extension not only in terms of texts added, but also in terms of the annotation guidelines. Both French and Portuguese contain structures not found in English and German. Moreover, Portuguese is a pro-drop language, bringing even more systemic differences in the realisation of coreference into our cross-lingual resources. These differences cause problems for multilingual coreference resolution and machine translation. Our parallel corpus with full annotation of coreference will be a valuable resource with a variety of uses not only for NLP applications, but also for contrastive linguists and researchers in translation studies.

EPIC UdS - Creation and Applications of a Simultaneous Interpreting Corpus
Heike Przybyl | Ekaterina Lapshinova-Koltunski | Katrin Menzel | Stefan Fischer | Elke Teich
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we describe the creation and annotation of EPIC UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the comparable and parallel, aligned corpus variants and explore various applications of the corpus. What makes EPIC UdS relevant is that it is one of the rare interpreting corpora that includes transcripts suitable for research on more than one language pair and on interpreting with regard to German. It not only contains transcribed speeches, but also rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields.

DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations
Ekaterina Lapshinova-Koltunski | Maja Popović | Maarit Koponen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes a new corpus of human translations which contains both professional and student translations. The data consists of English sources – texts from news and reviews – and their translations into Russian and Croatian, as well as a subcorpus containing translations of the review texts into Finnish. All target languages are mid-resourced and less or mid-investigated. The corpus will be valuable for studying variation in translation as it allows a direct comparison between human translations of the same source texts. The corpus will also be a valuable resource for evaluating machine translation systems. We believe that this resource will facilitate understanding and improvement of the quality issues in both human and machine translation. In the paper, we describe how the data was collected, provide information on translator groups and summarise the differences between the human translations at hand based on our preliminary results with shallow features.

DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations
Ekaterina Lapshinova-Koltunski | Maja Popović | Maarit Koponen
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This project aimed to design a corpus of parallel human translations (HTs) of the same source texts by professionals and students. The resulting corpus consists of English news and reviews source texts, their translations into Russian and Croatian, and translations of the reviews into Finnish. The corpus will be valuable for both studying variation in translation and evaluating machine translation (MT) systems.

Linguistically Motivated Evaluation of the 2022 State-of-the-art Machine Translation Systems for Three Language Directions
Vivien Macketanz | Shushen Manakhimova | Eleftherios Avramidis | Ekaterina Lapshinova-koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Seventh Conference on Machine Translation (WMT)

This document describes a fine-grained linguistically motivated analysis of 29 machine translation systems submitted at the Shared Task of the 7th Conference of Machine Translation (WMT22). This submission expands the test suite work of previous years by adding the language direction of English–Russian. As a result, evaluation takes place for the language directions of German–English, English–German, and English–Russian. We find that the German–English systems suffer in translating idioms, some tenses of modal verbs, and resultative predicates, the English–German ones in idioms, transitive-past progressive, and middle voice, whereas the English–Russian ones in pseudogapping and idioms.

2021

Polarity in Translation: Differences between Novice and Experts across Registers
Ekaterina Lapshinova-Koltunski | Fritz Kliche | Anna Moskvina | Johannes Schäfer
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

Found in translation/interpreting: combining data-driven and supervised methods to analyse cross-linguistically mediated communication
Ekaterina Lapshinova-Koltunski | Yuri Bizzoni | Heike Przybyl | Elke Teich
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

Tracing variation in discourse connectives in translation and interpreting through neural semantic spaces
Ekaterina Lapshinova-Koltunski | Heike Przybyl | Yuri Bizzoni
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

In the present paper, we explore lexical contexts of discourse markers in translation and interpreting on the basis of word embeddings. Our special interest is on contextual variation of the same discourse markers in (written) translation vs. (simultaneous) interpreting. To explore this variation at the lexical level, we use a data-driven approach: we compare bilingual neural word embeddings trained on source-to-translation and source-to-interpreting aligned corpora. Our results show more variation of semantically related items in translation spaces vs. interpreting ones and a more consistent use of fewer connectives in interpreting. We also observe different trends with regard to the discourse relation types.

Measuring Translationese across Levels of Expertise: Are Professionals more Surprising than Students?
Yuri Bizzoni | Ekaterina Lapshinova-Koltunski
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

The present paper deals with a computational analysis of translationese in professional and student English-to-German translations belonging to different registers. Building upon an information-theoretical approach, we test translation conformity to source and target language in terms of a neural language model’s perplexity over Part of Speech (PoS) sequences. Our primary focus is on register diversification vs. convergence, reflected in the use of constructions eliciting a higher vs. lower perplexity score. Our results show that, against our expectations, professional translations elicit higher perplexity scores from a target language model than students’ translations. An analysis of the distribution of PoS patterns across registers shows that this apparent paradox is the effect of higher stylistic diversification and register sensitivity in professional translations. Our results contribute to the understanding of human translationese and shed light on the variation in texts generated by different translators, which is valuable for translation studies, multilingual language processing, and machine translation.
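To make the perplexity measure in the abstract above concrete, here is a minimal sketch that scores a PoS-tag sequence with an add-one-smoothed bigram model. The bigram model is a deliberate simplification standing in for the neural language model the paper reports using, and the function names (`train_bigram`, `perplexity`) and the toy tag sequences are assumptions made purely for illustration.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Collect bigram statistics over PoS-tag sequences (toy stand-in for a neural LM)."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        tags = ["<s>"] + list(seq)  # "<s>" marks the sequence start
        vocab.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def perplexity(seq, model):
    """Perplexity = exp of the average negative log-probability per tag."""
    context_counts, bigram_counts, vocab = model
    tags = ["<s>"] + list(seq)
    v = len(vocab)
    log_prob = 0.0
    for prev, cur in zip(tags, tags[1:]):
        # Add-one smoothing so unseen tag transitions get non-zero probability.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tags) - 1))

# Hypothetical training data: five identical tag sequences.
model = train_bigram([["DET", "NOUN", "VERB"]] * 5)
familiar = perplexity(["DET", "NOUN", "VERB"], model)
unusual = perplexity(["VERB", "NOUN", "DET"], model)
```

In this toy setting the sequence seen in training receives a lower perplexity than the reordered one, which is the sense in which a higher score indicates patterns the model finds more surprising.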

Fiction in Russian Translation: A Translationese Study
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper presents a translationese study based on the parallel data from the Russian National Corpus (RNC). We explored differences between literary texts originally authored in Russian and fiction translated into Russian from 11 languages. The texts are represented with frequency-based features that capture structural and lexical properties of language. Binary classification results indicate that literary translations can be distinguished from non-translations with an accuracy ranging from 82 to 92% depending on the source language and feature set. Multiclass classification confirms that translations from distant languages are more distinct from non-translations than translations from languages that are typologically close to Russian. It also demonstrates that translations from same-family source languages share translationese properties. Structural features return more consistent results than features relying on external resources and capturing lexical properties of texts in both translationese detection and source language identification tasks.

Translationese in Russian Literary Texts
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski | Ruslan Mitkov
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and that the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.

2020

Exploring Coreference Features in Heterogeneous Data
Ekaterina Lapshinova-Koltunski | Kerstin Kunz
Proceedings of the First Workshop on Computational Approaches to Discourse

The present paper focuses on variation phenomena in coreference chains. We address the hypothesis that the degree of structural variation between chain elements depends on language-specific constraints and preferences and, even more, on the communicative situation of language production. We define coreference features that also include reference to abstract entities and events. These features are inspired by several sources – cognitive parameters, pragmatic factors and typological status. We pay attention to the distributions of these features in a dataset containing English and German texts of spoken and written discourse mode, which can be classified into seven different registers. We apply text classification and feature selection to find out how these variational dimensions (language, mode and register) impact coreference features. Knowledge of the variation under analysis is valuable for contrastive linguistics, translation studies and multilingual natural language processing (NLP), e.g. machine translation or cross-lingual coreference resolution.

Coreference Strategies in English-German Translation
Ekaterina Lapshinova-Koltunski | Marie-Pauline Krielke | Christian Hardmeier
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

We present a study focusing on variation of coreferential devices in English original TED talks and news texts and their German translations. Using exploratory techniques we contemplate a diverse set of coreference devices as features which we assume indicate language-specific and register-based variation as well as potential translation strategies. Our findings reflect differences on both dimensions with stronger variation along the lines of register than between languages. By exposing interactions between text type and cross-linguistic variation, they can also inform multilingual NLP applications, especially machine translation.

Lexicogrammatic translationese across two targets and competence levels
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski
Proceedings of the Twelfth Language Resources and Evaluation Conference

This research employs genre-comparable data from a number of parallel and comparable corpora to explore the specificity of translations from English into German and Russian produced by students and professional translators. We introduce an elaborate set of human-interpretable lexicogrammatic translationese indicators and calculate the amount of translationese manifested in the data for each target language and translation variety. By placing translations into the same feature space as their sources and the genre-comparable non-translated reference texts in the target language, we observe two separate translationese effects: a shift of translations into the gap between the two languages and a shift away from either language. These trends are linked to the features that contribute to each of the effects. Finally, we compare the translation varieties and find out that the professionalism levels seem to have some correlation with the amount and types of translationese detected, while each language pair demonstrates a specific socio-linguistically determined combination of the translationese effects.

2019

Cross-lingual Incongruences in the Annotation of Coreference
Ekaterina Lapshinova-Koltunski | Sharid Loáiciga | Christian Hardmeier | Pauline Krielke
Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference

In the present paper, we deal with incongruences in English-German multilingual coreference annotation and present automated methods to discover them. More specifically, we automatically detect full coreference chains in parallel texts and analyse discrepancies in their annotations. In doing so, we wish to find out whether the discrepancies rather derive from language typological constraints, from the translation or the actual annotation process. The results of our study contribute to the referential analysis of similarities and differences across languages and support evaluation of cross-lingual coreference annotation. They are also useful for cross-lingual coreference resolution systems and contrastive linguistic studies.

Translationese Features as Indicators of Quality in English-Russian Human Translation
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski
Proceedings of the Human-Informed Translation and Interpreting Technology Workshop (HiT-IT 2019)

We use a range of morpho-syntactic features inspired by research in register studies (e.g. Biber, 1995; Neumann, 2013) and translation studies (e.g. Ilisei et al., 2010; Zanettin, 2013; Kunilovskaya and Kutuzov, 2018) to reveal the association between translationese and human translation quality. Translationese is understood as any statistical deviations of translations from non-translations (Baker, 1993) and is assumed to affect the fluency of translations, rendering them foreign-sounding and clumsy in wording and structure. This connection is often posited or implied in the studies of translationese or translational varieties (De Sutter et al., 2017), but is rarely directly tested. Our 45 features include frequencies of selected morphological forms and categories, some types of syntactic structures and relations, as well as several overall text measures extracted from Universal Dependencies annotation. The research corpora include English-to-Russian professional and student translations of informational or argumentative newspaper texts and a comparable corpus of non-translated Russian. Our results indicate a lack of direct association between translationese and quality in our data: while our features distinguish translations and non-translations with near-perfect accuracy, the performance of the same algorithm on the quality classes barely exceeds the chance level.

Analysing Coreference in Transformer Outputs
Ekaterina Lapshinova-Koltunski | Cristina España-Bonet | Josef van Genabith
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

2018

ParCorFull: a Parallel Corpus Annotated with Full Coreference
Ekaterina Lapshinova-Koltunski | Christian Hardmeier | Pauline Krielke
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Pronoun Test Suite Evaluation of the English–German MT Systems at WMT 2018
Liane Guillou | Christian Hardmeier | Ekaterina Lapshinova-Koltunski | Sharid Loáiciga
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We evaluate the output of 16 English-to-German MT systems with respect to the translation of pronouns in the context of the WMT 2018 competition. We work with a test suite specifically designed to assess system quality in various fine-grained categories known to be problematic. The main evaluation scores come from a semi-automatic process, combining automatic reference matching with extensive manual annotation of uncertain cases. We find that current NMT systems are good at translating pronouns with intra-sentential reference, but the inter-sentential cases remain difficult. NMT systems are also good at the translation of event pronouns, unlike systems from the phrase-based SMT paradigm. No single system performs best at translating all types of anaphoric pronouns, suggesting unexplained random effects influencing the translation of pronouns with NMT.
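The semi-automatic scoring process mentioned in the abstract above (automatic reference matching plus manual annotation of uncertain cases) could be sketched roughly as follows. This is only an illustration under assumptions: the record fields (`hyp_pronoun`, `ref_pronoun`), the function name `triage`, and the rule that any mismatch counts as an uncertain case are all hypothetical and are not taken from the paper.

```python
def triage(examples):
    """Split test-suite items into auto-accepted matches and cases left for annotators.

    Assumed rule: a system pronoun identical to the reference pronoun
    (ignoring case) is accepted automatically; everything else is treated
    as uncertain and queued for manual annotation.
    """
    auto_accepted, needs_manual = [], []
    for ex in examples:
        if ex["hyp_pronoun"].lower() == ex["ref_pronoun"].lower():
            auto_accepted.append(ex)
        else:
            needs_manual.append(ex)
    return auto_accepted, needs_manual

# Hypothetical items: German pronouns produced by a system vs. the reference.
items = [
    {"id": 1, "hyp_pronoun": "Sie", "ref_pronoun": "sie"},
    {"id": 2, "hyp_pronoun": "es", "ref_pronoun": "er"},
    {"id": 3, "hyp_pronoun": "er", "ref_pronoun": "er"},
]
auto_accepted, needs_manual = triage(items)
```

Only the mismatching item is routed to annotators here, which shows how automatic matching can reduce the manual workload without deciding the difficult cases by itself.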

2017

Discovery of Discourse-Related Language Contrasts through Alignment Discrepancies in English-German Translation
Ekaterina Lapshinova-Koltunski | Christian Hardmeier
Proceedings of the Third Workshop on Discourse in Machine Translation

In this paper, we analyse alignment discrepancies for discourse structures in English-German parallel data – sentence pairs in which discourse structures in target or source texts have no alignment in the corresponding parallel sentences. The discourse-related structures are designed in the form of linguistic patterns based on the information delivered by automatic part-of-speech and dependency annotation. In addition to alignment errors (existing structures left unaligned), these alignment discrepancies can be caused by language contrasts or through the phenomena of explicitation and implicitation in the translation process. We propose a new approach including a new type of resource for corpus-based language contrast analysis and apply it to study and classify the contrasts found in our English-German parallel corpus. As unaligned discourse structures may also result in the loss of discourse information in the MT training data, we hope to deliver information in support of discourse-aware machine translation (MT).

2016

From Interoperable Annotations towards Interoperable Resources: A Multilingual Approach to the Analysis of Discourse
Ekaterina Lapshinova-Koltunski | Kerstin Anna Kunz | Anna Nedoluzhko
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the present paper, we analyse variation of discourse phenomena in two typologically different languages, i.e. in German and Czech. The novelty of our approach lies in the nature of the resources we are using. Advantage is taken of existing resources, which are, however, annotated on the basis of two different frameworks. We use an interoperable scheme unifying discourse phenomena in both frameworks into more abstract categories and considering only those phenomena that have a direct match in German and Czech. The discourse properties we focus on are relations of identity, semantic similarity, ellipsis and discourse relations. Our study shows that the application of interoperable schemes allows an exploitation of discourse-related phenomena analysed in different projects and on the basis of different frameworks. As corpus compilation and annotation is a time-consuming task, positive results of this experiment open up new paths for contrastive linguistics, translation studies and NLP, including machine translation.

Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification
Raphael Rubino | Ekaterina Lapshinova-Koltunski | Josef van Genabith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Beyond Identity Coreference: Contrasting Indicators of Textual Coherence in English and German
Kerstin Kunz | Ekaterina Lapshinova-Koltunski | José Manuel Martínez
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

Abstract Coreference in a Multilingual Perspective: a View on Czech and German
Anna Nedoluzhko | Ekaterina Lapshinova-Koltunski
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

2015

Register-based machine translation evaluation with text classification techniques
Mihaela Vela | Ekaterina Lapshinova-Koltunski
Proceedings of Machine Translation Summit XV: Papers

Across Languages and Genres: Creating a Universal Annotation Scheme for Textual Relations
Ekaterina Lapshinova-Koltunski | Anna Nedoluzhko | Kerstin Anna Kunz
Proceedings of the 9th Linguistic Annotation Workshop

Measuring ‘Registerness’ in Human and Machine Translation: A Text Classification Approach
Ekaterina Lapshinova-Koltunski | Mihaela Vela
Proceedings of the Second Workshop on Discourse in Machine Translation

Exploration of Inter- and Intralingual Variation of Discourse Phenomena
Ekaterina Lapshinova-Koltunski
Proceedings of the Second Workshop on Discourse in Machine Translation

2014

Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
Stefania Degaetano-Ortlieb | Peter Fankhauser | Hannah Kermes | Ekaterina Lapshinova-Koltunski | Noam Ordan | Elke Teich
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.

2013

Scientific registers and disciplinary diversification: a comparable corpus approach
Elke Teich | Stefania Degaetano-Ortlieb | Hannah Kermes | Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

VARTRA: A Comparable Corpus for Analysis of Translation Variation
Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Feature Discovery for Diachronic Register Analysis: a Semi-Automatic Approach
Stefania Degaetano-Ortlieb | Ekaterina Lapshinova-Koltunski | Elke Teich
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present corpus-based procedures to semi-automatically discover features relevant for the study of recent language change in scientific registers. First, linguistic features potentially adherent to recent language change are extracted from the SciTex Corpus. Second, features are assessed for their relevance for the study of recent language change in scientific registers by means of correspondence analysis. The discovered features will serve for further investigations of the linguistic evolution of newly emerged scientific registers.

Coreference in Spoken vs. Written Texts: a Corpus-based Analysis
Marilisa Amoia | Kerstin Kunz | Ekaterina Lapshinova-Koltunski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes an empirical study of coreference in spoken vs. written text. We focus on the comparison of two particular text types, interviews and popular science texts, as instances of spoken and written texts since they display quite different discourse structures. We believe, in fact, that the correlation of difficulties in coreference resolution and varying discourse structures requires a deeper analysis that accounts for the diversity of coreference strategies or their sub-phenomena as indicators of text type or genre. In this work, we therefore aim at defining specific parameters that classify differences in genres of spoken and written texts, such as the preferred segmentation strategy, the maximal allowed distance in, or the length and size of, coreference chains, as well as the correlation of structural and syntactic features of coreferring expressions. We argue that a characterization of such genre-dependent parameters might improve the performance of current state-of-the-art coreference resolution technology.

Visualising Linguistic Evolution in Academic Discourse
Verena Lyding | Ekaterina Lapshinova-Koltunski | Stefania Degaetano-Ortlieb | Henrik Dittmann | Chris Culy
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

2011

Discontinuous Constituents: a Problematic Case for Parallel Corpora Annotation and Querying
Marilisa Amoia | Kerstin Kunz | Ekaterina Lapshinova-Koltunski
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora

2008

Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.
Ekaterina Lapshinova-Koltunski | Ulrich Heid
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we discuss an approach to the semi-automatic extraction and classification of compounds from German corpora. Compound nominals are semi-automatically extracted from text corpora along with their sentential complements. In this study we concentrate on that-, wh- or if-subclauses, although our methods can be applied to other complements as well. We elaborate an architecture using linguistic knowledge about the phenomena we extract, and aim at answering the following questions: how can data about subcategorisation properties of nominal compounds be extracted from text corpora, and how can compounds be classified according to their subcategorisation properties? Our classification is based on the relationships between the subcategorisation of nominal compounds, e.g. Grundfrage, Wettstreit and Beweismittel, and that of their constituent parts, such as Frage, Streit, Beweis, etc. We show that there are cases which do not match the commonly accepted assumption that the head of a compound is its valency bearer. Such cases should receive a specific treatment in NLP dictionary building. This calls for tools to identify and classify such cases by means of data extraction from corpora. We propose precision-oriented semi-automatic extraction which can operate on tokenized, tagged and lemmatized texts. In the future, we are going to extend the kinds of extracted complements beyond subclauses and analyze the nature of the non-head valency-bearer of compounds.