Ekaterina Lapshinova-Koltunski


2021

Measuring Translationese across Levels of Expertise: Are Professionals more Surprising than Students?
Yuri Bizzoni | Ekaterina Lapshinova-Koltunski
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

The present paper deals with a computational analysis of translationese in professional and student English-to-German translations belonging to different registers. Building upon an information-theoretical approach, we test translation conformity to source and target language in terms of a neural language model’s perplexity over Part of Speech (PoS) sequences. Our primary focus is on register diversification vs. convergence, reflected in the use of constructions eliciting a higher vs. lower perplexity score. Our results show that, against our expectations, professional translations elicit higher perplexity scores from a target language model than students’ translations. An analysis of the distribution of PoS patterns across registers shows that this apparent paradox is the effect of higher stylistic diversification and register sensitivity in professional translations. Our results contribute to the understanding of human translationese and shed light on the variation in texts generated by different translators, which is valuable for translation studies, multilingual language processing, and machine translation.
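The perplexity measure over PoS sequences mentioned in the abstract can be illustrated with a minimal sketch. The paper uses a neural language model; the example below swaps in an add-one smoothed bigram model purely to stay self-contained, and the tag sequences, function names and toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Collect context and bigram counts over PoS tag sequences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        tags = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(tags[1:])  # every symbol that can be predicted
        for prev, cur in zip(tags, tags[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def perplexity(seq, model):
    """Perplexity of one PoS sequence under the add-one smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    tags = ["<s>"] + list(seq) + ["</s>"]
    v = len(vocab) + 1  # one extra slot reserves mass for unseen tags
    log_prob = 0.0
    for prev, cur in zip(tags, tags[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / (len(tags) - 1))

# Hypothetical toy "target language" training data (not from the paper).
train = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
]
model = train_bigram(train)

typical = ["DET", "NOUN", "VERB", "DET", "NOUN"]
unusual = ["NOUN", "DET", "VERB", "NOUN", "DET"]
print(perplexity(typical, model), perplexity(unusual, model))
```

A sequence that follows the patterns seen in training receives a lower perplexity than one that does not, which is the sense in which a text can be more or less "surprising" to a target language model.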

Polarity in Translation: Differences between Novice and Experts across Registers
Ekaterina Lapshinova-Koltunski | Fritz Kliche | Anna Moskvina | Johannes Schäfer
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

Found in translation/interpreting: combining data-driven and supervised methods to analyse cross-linguistically mediated communication
Ekaterina Lapshinova-Koltunski | Yuri Bizzoni | Heike Przybyl | Elke Teich
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

Fiction in Russian Translation: A Translationese Study
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski | Ruslan Mitkov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper presents a translationese study based on the parallel data from the Russian National Corpus (RNC). We explored differences between literary texts originally authored in Russian and fiction translated into Russian from 11 languages. The texts are represented with frequency-based features that capture structural and lexical properties of language. Binary classification results indicate that literary translations can be distinguished from non-translations with an accuracy ranging from 82 to 92% depending on the source language and feature set. Multiclass classification confirms that translations from distant languages are more distinct from non-translations than translations from languages that are typologically close to Russian. It also demonstrates that translations from same-family source languages share translationese properties. Structural features return more consistent results than features relying on external resources and capturing lexical properties of texts in both translationese detection and source language identification tasks.

Tracing variation in discourse connectives in translation and interpreting through neural semantic spaces
Ekaterina Lapshinova-Koltunski | Heike Przybyl | Yuri Bizzoni
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

In the present paper, we explore lexical contexts of discourse markers in translation and interpreting on the basis of word embeddings. Our special interest is in contextual variation of the same discourse markers in (written) translation vs. (simultaneous) interpreting. To explore this variation at the lexical level, we use a data-driven approach: we compare bilingual neural word embeddings trained on source-to-translation and source-to-interpreting aligned corpora. Our results show more variation of semantically related items in translation spaces vs. interpreting ones and a more consistent use of fewer connectives in interpreting. We also observe different trends with regard to the discourse relation types.

Translationese in Russian Literary Texts
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski | Ruslan Mitkov
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and that the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.

2020

Coreference Strategies in English-German Translation
Ekaterina Lapshinova-Koltunski | Marie-Pauline Krielke | Christian Hardmeier
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

We present a study focusing on variation of coreferential devices in English original TED talks and news texts and their German translations. Using exploratory techniques, we consider a diverse set of coreference devices as features which we assume indicate language-specific and register-based variation as well as potential translation strategies. Our findings reflect differences on both dimensions with stronger variation along the lines of register than between languages. By exposing interactions between text type and cross-linguistic variation, they can also inform multilingual NLP applications, especially machine translation.

Exploring Coreference Features in Heterogeneous Data
Ekaterina Lapshinova-Koltunski | Kerstin Kunz
Proceedings of the First Workshop on Computational Approaches to Discourse

The present paper focuses on variation phenomena in coreference chains. We address the hypothesis that the degree of structural variation between chain elements depends on language-specific constraints and preferences and, even more, on the communicative situation of language production. We define coreference features that also include reference to abstract entities and events. These features are inspired by several sources – cognitive parameters, pragmatic factors and typological status. We pay attention to the distributions of these features in a dataset containing English and German texts of spoken and written discourse mode, which can be classified into seven different registers. We apply text classification and feature selection to find out how these variational dimensions (language, mode and register) affect coreference features. Knowledge of the variation under analysis is valuable for contrastive linguistics, translation studies and multilingual natural language processing (NLP), e.g. machine translation or cross-lingual coreference resolution.

Lexicogrammatic translationese across two targets and competence levels
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski
Proceedings of the 12th Language Resources and Evaluation Conference

This research employs genre-comparable data from a number of parallel and comparable corpora to explore the specificity of translations from English into German and Russian produced by students and professional translators. We introduce an elaborate set of human-interpretable lexicogrammatic translationese indicators and calculate the amount of translationese manifested in the data for each target language and translation variety. By placing translations into the same feature space as their sources and the genre-comparable non-translated reference texts in the target language, we observe two separate translationese effects: a shift of translations into the gap between the two languages and a shift away from either language. These trends are linked to the features that contribute to each of the effects. Finally, we compare the translation varieties and find that professionalism levels seem to have some correlation with the amount and types of translationese detected, while each language pair demonstrates a specific socio-linguistically determined combination of the translationese effects.

2019

Cross-lingual Incongruences in the Annotation of Coreference
Ekaterina Lapshinova-Koltunski | Sharid Loáiciga | Christian Hardmeier | Pauline Krielke
Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference

In the present paper, we deal with incongruences in English-German multilingual coreference annotation and present automated methods to discover them. More specifically, we automatically detect full coreference chains in parallel texts and analyse discrepancies in their annotations. In doing so, we wish to find out whether the discrepancies derive from language-typological constraints, from the translation process, or from the annotation process itself. The results of our study contribute to the referential analysis of similarities and differences across languages and support evaluation of cross-lingual coreference annotation. They are also useful for cross-lingual coreference resolution systems and contrastive linguistic studies.

Translationese Features as Indicators of Quality in English-Russian Human Translation
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski
Proceedings of the Human-Informed Translation and Interpreting Technology Workshop (HiT-IT 2019)

We use a range of morpho-syntactic features inspired by research in register studies (e.g. Biber, 1995; Neumann, 2013) and translation studies (e.g. Ilisei et al., 2010; Zanettin, 2013; Kunilovskaya and Kutuzov, 2018) to reveal the association between translationese and human translation quality. Translationese is understood as any statistical deviations of translations from non-translations (Baker, 1993) and is assumed to affect the fluency of translations, rendering them foreign-sounding and clumsy in wording and structure. This connection is often posited or implied in the studies of translationese or translational varieties (De Sutter et al., 2017), but is rarely directly tested. Our 45 features include frequencies of selected morphological forms and categories, some types of syntactic structures and relations, as well as several overall text measures extracted from Universal Dependencies annotation. The research corpora include English-to-Russian professional and student translations of informational or argumentative newspaper texts and a comparable corpus of non-translated Russian. Our results indicate a lack of direct association between translationese and quality in our data: while our features distinguish translations and non-translations with near-perfect accuracy, the performance of the same algorithm on the quality classes barely exceeds the chance level.

Analysing Coreference in Transformer Outputs
Ekaterina Lapshinova-Koltunski | Cristina España-Bonet | Josef van Genabith
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate the (possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

2018

A Pronoun Test Suite Evaluation of the English–German MT Systems at WMT 2018
Liane Guillou | Christian Hardmeier | Ekaterina Lapshinova-Koltunski | Sharid Loáiciga
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We evaluate the output of 16 English-to-German MT systems with respect to the translation of pronouns in the context of the WMT 2018 competition. We work with a test suite specifically designed to assess system quality in various fine-grained categories known to be problematic. The main evaluation scores come from a semi-automatic process, combining automatic reference matching with extensive manual annotation of uncertain cases. We find that current NMT systems are good at translating pronouns with intra-sentential reference, but the inter-sentential cases remain difficult. NMT systems are also good at the translation of event pronouns, unlike systems from the phrase-based SMT paradigm. No single system performs best at translating all types of anaphoric pronouns, suggesting unexplained random effects influencing the translation of pronouns with NMT.

ParCorFull: a Parallel Corpus Annotated with Full Coreference
Ekaterina Lapshinova-Koltunski | Christian Hardmeier | Pauline Krielke
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Discovery of Discourse-Related Language Contrasts through Alignment Discrepancies in English-German Translation
Ekaterina Lapshinova-Koltunski | Christian Hardmeier
Proceedings of the Third Workshop on Discourse in Machine Translation

In this paper, we analyse alignment discrepancies for discourse structures in English-German parallel data – sentence pairs in which discourse structures in target or source texts have no alignment in the corresponding parallel sentences. The discourse-related structures are designed in the form of linguistic patterns based on the information delivered by automatic part-of-speech and dependency annotation. In addition to alignment errors (existing structures left unaligned), these alignment discrepancies can be caused by language contrasts or through the phenomena of explicitation and implicitation in the translation process. We propose a new approach including a new type of resource for corpus-based language contrast analysis and apply it to study and classify the contrasts found in our English-German parallel corpus. As unaligned discourse structures may also result in the loss of discourse information in the MT training data, we hope to deliver information in support of discourse-aware machine translation (MT).
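The notion of an alignment discrepancy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes discourse structures have already been identified as lists of token indices and that word alignments are given as (source index, target index) pairs. The data format, the function name and the example sentence pair are all hypothetical.

```python
def unaligned_structures(structures, alignment, side="source"):
    """Return the structures none of whose tokens take part in any alignment link."""
    position = 0 if side == "source" else 1
    aligned_tokens = {link[position] for link in alignment}
    return [s for s in structures if not aligned_tokens.intersection(s["tokens"])]

# Hypothetical sentence pair: "However , he left" / "Er ging"
# Only "he"-"Er" and "left"-"ging" are linked; the connective has no counterpart.
alignment = [(2, 0), (3, 1)]
source_structures = [
    {"label": "connective", "tokens": [0]},
    {"label": "pronoun", "tokens": [2]},
]

discrepancies = unaligned_structures(source_structures, alignment, side="source")
print([s["label"] for s in discrepancies])
```

Whether such a discrepancy reflects an alignment error, a language contrast, or implicitation in translation cannot be decided from the alignment alone; the sketch only surfaces the candidates for that classification.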

2016

Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification
Raphael Rubino | Ekaterina Lapshinova-Koltunski | Josef van Genabith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Beyond Identity Coreference: Contrasting Indicators of Textual Coherence in English and German
Kerstin Kunz | Ekaterina Lapshinova-Koltunski | José Manuel Martínez
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

Abstract Coreference in a Multilingual Perspective: a View on Czech and German
Anna Nedoluzhko | Ekaterina Lapshinova-Koltunski
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

From Interoperable Annotations towards Interoperable Resources: A Multilingual Approach to the Analysis of Discourse
Ekaterina Lapshinova-Koltunski | Kerstin Anna Kunz | Anna Nedoluzhko
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the present paper, we analyse variation of discourse phenomena in two typologically different languages, i.e. in German and Czech. The novelty of our approach lies in the nature of the resources we are using. Advantage is taken of existing resources, which are, however, annotated on the basis of two different frameworks. We use an interoperable scheme unifying discourse phenomena in both frameworks into more abstract categories and considering only those phenomena that have a direct match in German and Czech. The discourse properties we focus on are relations of identity, semantic similarity, ellipsis and discourse relations. Our study shows that the application of interoperable schemes allows an exploitation of discourse-related phenomena analysed in different projects and on the basis of different frameworks. As corpus compilation and annotation is a time-consuming task, positive results of this experiment open up new paths for contrastive linguistics, translation studies and NLP, including machine translation.

2015

Across Languages and Genres: Creating a Universal Annotation Scheme for Textual Relations
Ekaterina Lapshinova-Koltunski | Anna Nedoluzhko | Kerstin Anna Kunz
Proceedings of The 9th Linguistic Annotation Workshop

Measuring ‘Registerness’ in Human and Machine Translation: A Text Classification Approach
Ekaterina Lapshinova-Koltunski | Mihaela Vela
Proceedings of the Second Workshop on Discourse in Machine Translation

Exploration of Inter- and Intralingual Variation of Discourse Phenomena
Ekaterina Lapshinova-Koltunski
Proceedings of the Second Workshop on Discourse in Machine Translation

Register-based machine translation evaluation with text classification techniques
Mihaela Vela | Ekaterina Lapshinova-Koltunski
Proceedings of Machine Translation Summit XV: Papers

2014

Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
Stefania Degaetano-Ortlieb | Peter Fankhauser | Hannah Kermes | Ekaterina Lapshinova-Koltunski | Noam Ordan | Elke Teich
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.

2013

Scientific registers and disciplinary diversification: a comparable corpus approach
Elke Teich | Stefania Degaetano-Ortlieb | Hannah Kermes | Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

VARTRA: A Comparable Corpus for Analysis of Translation Variation
Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Visualising Linguistic Evolution in Academic Discourse
Verena Lyding | Ekaterina Lapshinova-Koltunski | Stefania Degaetano-Ortlieb | Henrik Dittmann | Chris Culy
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Feature Discovery for Diachronic Register Analysis: a Semi-Automatic Approach
Stefania Degaetano-Ortlieb | Ekaterina Lapshinova-Koltunski | Elke Teich
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present corpus-based procedures to semi-automatically discover features relevant for the study of recent language change in scientific registers. First, linguistic features potentially indicative of recent language change are extracted from the SciTex Corpus. Second, features are assessed for their relevance for the study of recent language change in scientific registers by means of correspondence analysis. The discovered features will serve for further investigations of the linguistic evolution of newly emerged scientific registers.

Coreference in Spoken vs. Written Texts: a Corpus-based Analysis
Marilisa Amoia | Kerstin Kunz | Ekaterina Lapshinova-Koltunski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes an empirical study of coreference in spoken vs. written text. We focus on the comparison of two particular text types, interviews and popular science texts, as instances of spoken and written texts since they display quite different discourse structures. We believe, in fact, that the correlation of difficulties in coreference resolution and varying discourse structures requires a deeper analysis that accounts for the diversity of coreference strategies or their sub-phenomena as indicators of text type or genre. In this work, we therefore aim at defining specific parameters that classify differences in genres of spoken and written texts, such as the preferred segmentation strategy, the maximal allowed distance in, or the length and size of, coreference chains, as well as the correlation of structural and syntactic features of coreferring expressions. We argue that a characterization of such genre-dependent parameters might improve the performance of current state-of-the-art coreference resolution technology.

2011

Discontinuous Constituents: a Problematic Case for Parallel Corpora Annotation and Querying
Marilisa Amoia | Kerstin Kunz | Ekaterina Lapshinova-Koltunski
Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora

2008

Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.
Ekaterina Lapshinova-Koltunski | Ulrich Heid
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we discuss an approach to the semi-automatic extraction and classification of compounds from German corpora. Compound nominals are semi-automatically extracted from text corpora along with their sentential complements. In this study we concentrate on that-, wh- or if-subclauses, although our methods can be applied to other complements as well. We elaborate an architecture using linguistic knowledge about the phenomena we extract, and aim at answering the following questions: how can data about subcategorisation properties of nominal compounds be extracted from text corpora, and how can compounds be classified according to their subcategorisation properties? Our classification is based on the relationships between the subcategorisation of nominal compounds, e.g. Grundfrage, Wettstreit and Beweismittel, and that of their constituent parts, such as Frage, Streit, Beweis, etc. We show that there are cases which do not match the commonly accepted assumption that the head of a compound is its valency bearer. Such cases should receive a specific treatment in NLP dictionary building. This calls for tools to identify and classify such cases by means of data extraction from corpora. We propose precision-oriented semi-automatic extraction which can operate on tokenized, tagged and lemmatized texts. In the future, we are going to extend the kinds of extracted complements beyond subclauses and analyze the nature of the non-head valency-bearer of compounds.