2024
DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis
Sarah Jablotschkin | Elke Teich | Heike Zinsmeister
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been requested to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctiveness of the two language variants.
Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English
Diego Alves | Stefania Degaetano-Ortlieb | Elena Schmidt | Elke Teich
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.
Multi-word Expressions in English Scientific Writing
Diego Alves | Stefan Fischer | Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
Multi-Word Expressions (MWEs) play a pivotal role in language use overall and in register formation more specifically, e.g. encoding field-specific terminology. Our study focuses on the identification and categorization of MWEs used in scientific writing, considering their formal characteristics as well as their developmental trajectory over time from the mid-17th century to the present. For this, we develop an approach combining three different types of methods to identify MWEs (Universal Dependency annotation, Partitioner and the Academic Formulas List) and selected measures to characterize MWE properties (e.g., dispersion by Kullback-Leibler Divergence and several association measures). This allows us to inspect MWE types in a novel data-driven way regarding their functions and change over time in specialized discourse.
A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication
Stefan Fischer | Kateryna Haidarzhyi | Jörg Knappen | Olha Polishchuk | Yuliya Stodolinska | Elke Teich
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024
We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbrücken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows users to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.
2023
Simultaneous Interpreting as a Noisy Channel: How Much Information Gets Through
Maria Kunilovskaya | Heike Przybyl | Ekaterina Lapshinova-Koltunski | Elke Teich
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
We explore the relationship between information density/surprisal of source and target texts in translation and interpreting in the language pair English-German, looking at the specific properties of translation (“translationese”). Our data comes from two bidirectional English-German subcorpora representing written and spoken mediation modes collected from European Parliament proceedings. Within each language, we (a) compare original speeches to their translated or interpreted counterparts, and (b) explore the association between segment-aligned sources and targets in each translation direction. As additional variables, we consider source delivery mode (read-out, impromptu) and source speech rate in interpreting. We use language modelling to measure the information rendered by words in a segment and to characterise the cross-lingual transfer of information under various conditions. Our approach is based on statistical analyses of surprisal values, extracted from n-gram models of our dataset. The analysis reveals that while there is a considerable positive correlation between the average surprisal of source and target segments in both modes, information output in interpreting is lower than in translation, given the same amount of input. Significantly lower information density in spoken mediated production compared to non-mediated speech in the same language can indicate a possible simplification effect in interpreting.
2022
EPIC UdS - Creation and Applications of a Simultaneous Interpreting Corpus
Heike Przybyl | Ekaterina Lapshinova-Koltunski | Katrin Menzel | Stefan Fischer | Elke Teich
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper, we describe the creation and annotation of EPIC UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the comparable and parallel, aligned corpus variants and explore various applications of the corpus. What makes EPIC UdS relevant is that it is one of the rare interpreting corpora that includes transcripts suitable for research on more than one language pair and on interpreting with regard to German. It not only contains transcribed speeches, but also rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields.
2021
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
Yuri Bizzoni | Elke Teich | Cristina España-Bonet | Josef van Genabith
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
Found in translation/interpreting: combining data-driven and supervised methods to analyse cross-linguistically mediated communication
Ekaterina Lapshinova-Koltunski | Yuri Bizzoni | Heike Przybyl | Elke Teich
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
The diffusion of scientific terms – tracing individuals’ influence in the history of science for English
Yuri Bizzoni | Stefania Degaetano-Ortlieb | Katrin Menzel | Elke Teich
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Tracing the influence of individuals or groups in social networks is an increasingly popular task in sociolinguistic studies. While methods to determine someone’s influence in short-term contexts (e.g., social media, on-line political debates) are widespread, influence in long-term contexts is less investigated and may be harder to capture. We study the diffusion of scientific terms in an English diachronic scientific corpus, applying Hawkes Processes to capture the role of individual scientists as “influencers” or “influencees” in the diffusion of new concepts. Our findings on two major scientific discoveries in chemistry and astronomy of the 18th century reveal that modelling both the introduction and diffusion of scientific terms in a historical corpus as Hawkes Processes allows detecting patterns of influence between authors on a long-term scale.
2020
The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study
Stefan Fischer | Jörg Knappen | Katrin Menzel | Elke Teich
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings. The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copyrighted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases. We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.
How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech
Yuri Bizzoni | Tom S Juzek | Cristina España-Bonet | Koel Dutta Chowdhury | Josef van Genabith | Elke Teich
Proceedings of the 17th International Conference on Spoken Language Translation
Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis, we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs machine) rather than to the data (written vs spoken).
Exploring diachronic syntactic shifts with dependency length: the case of scientific English
Tom S Juzek | Marie-Pauline Krielke | Elke Teich
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
We report on an application of universal dependencies for the study of diachronic shifts in syntactic usage patterns. Our focus is on the evolution of Scientific English in the Late Modern English period (ca. 1700-1900). Our data set is the Royal Society Corpus (RSC), comprising the full set of publications of the Royal Society of London between 1665 and 1996. Our starting assumption is that over time, Scientific English develops specific syntactic choice preferences that increase efficiency in (expert-to-expert) communication. The specific hypothesis we pursue in this paper is that changing syntactic choice preferences lead to greater dependency locality/dependency length minimization, which is associated with positive effects for the efficiency of human as well as computational linguistic processing. As a basis for our measurements, we parsed the RSC using Stanford CoreNLP. Overall, we observe a decrease in dependency length, with long dependency structures becoming less frequent and short dependency structures becoming more frequent over time, notably pertaining to the nominal phrase, thus marking an overall push towards greater communicative efficiency.
2019
Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings
Yuri Bizzoni | Stefania Degaetano-Ortlieb | Katrin Menzel | Pauline Krielke | Elke Teich
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change
The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17th–19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.
2018
Using relative entropy for detection and analysis of periods of diachronic linguistic change
Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler Divergence) comparing temporally adjacent periods and sliding over the time line from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges.
2017
The Making of the Royal Society Corpus
Jörg Knappen | Stefan Fischer | Hannah Kermes | Elke Teich | Peter Fankhauser
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns
Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).
2016
Modeling Diachronic Change in Scientific Writing with Information Density
Raphael Rubino | Stefania Degaetano-Ortlieb | Elke Teich | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Previous linguistic research on scientific writing has shown that language use in the scientific domain varies considerably in register and style over time. In this paper we investigate the introduction of information theory inspired features to study long term diachronic change on three levels: lexis, part-of-speech and syntax. Our approach is based on distinguishing between sentences from 19th and 20th century scientific abstracts using supervised classification models. To the best of our knowledge, the introduction of information theoretic features to this task is novel. We show that these features outperform more traditional features, such as token or character n-grams, while leading to more compact models. We present a detailed analysis of feature informativeness in order to gain a better understanding of diachronic change on different linguistic levels.
Information-based Modeling of Diachronic Linguistic Change: from Typicality to Productivity
Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
The Royal Society Corpus: From Uncharted Data to Corpus
Hannah Kermes | Stefania Degaetano-Ortlieb | Ashraf Khamis | Jörg Knappen | Elke Teich
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665–1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.
2014
Exploring and Visualizing Variation in Language Resources
Peter Fankhauser | Jörg Knappen | Elke Teich
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Language resources are often compiled for the purpose of variational analysis, such as studying differences between genres, registers, and disciplines, regional and diachronic variation, influence of gender, cultural context, etc. Often the sheer number of potentially interesting contrastive pairs can get overwhelming due to the combinatorial explosion of possible combinations. In this paper, we present an approach that combines well-understood techniques for visualization (heatmaps and word clouds) with intuitive paradigms for exploration (drill-down and side-by-side comparison) to facilitate the analysis of language variation in such highly combinatorial situations. Heatmaps assist in analyzing the overall pattern of variation in a corpus, and word clouds allow for inspecting variation at the level of words.
Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
Stefania Degaetano-Ortlieb | Peter Fankhauser | Hannah Kermes | Ekaterina Lapshinova-Koltunski | Noam Ordan | Elke Teich
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.
2013
Scientific registers and disciplinary diversification: a comparable corpus approach
Elke Teich | Stefania Degaetano-Ortlieb | Hannah Kermes | Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
2012
Feature Discovery for Diachronic Register Analysis: a Semi-Automatic Approach
Stefania Degaetano-Ortlieb | Ekaterina Lapshinova-Koltunski | Elke Teich
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In this paper, we present corpus-based procedures to semi-automatically discover features relevant for the study of recent language change in scientific registers. First, linguistic features potentially indicative of recent language change are extracted from the SciTex Corpus. Second, features are assessed for their relevance for the study of recent language change in scientific registers by means of correspondence analysis. The discovered features will serve for further investigations of the linguistic evolution of newly emerged scientific registers.
2006
Corpus Annotation by Generation
Elke Teich | John A. Bateman | Richard Eckart
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006
2004
Multi-dimensional annotation of linguistic corpora for investigating information structure
Stefan Baumann | Caren Brinckmann | Silvia Hansen-Schirra | Geert-Jan Kruijff | Ivana Kruijff-Korbayová | Stella Neumann | Elke Teich
Proceedings of the Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004
The MULI Project: Annotation and Analysis of Information Structure in German and English
Stefan Baumann | Caren Brinckmann | Silvia Hansen-Schirra | Geert-Jan Kruijff | Ivana Kruijff-Korbayová | Stella Neumann | Erich Steiner | Elke Teich | Hans Uszkoreit
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2001
Linear Order as Higher-Level Decision: Information Structure in Strategic and Tactical Generation
Geert-Jan M. Kruijff | Ivana Kruijff-Korbayovà | John Bateman | Elke Teich
Proceedings of the ACL 2001 Eighth European Workshop on Natural Language Generation (EWNLG)
2000
Multilinguality in a Text Generation System For Three Slavic Languages
Geert-Jan Kruijff | Elke Teich | John Bateman | Ivana Kruijff-Korbayova | Hana Skoumalova | Serge Sharoff | Lena Sokolova | Tony Hartley | Kamenka Staykova | Jiri Hana
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics
Matching a tone-based and tune-based approach to English intonation for concept-to-speech generation
Elke Teich | Catherine I. Watson | Cecile Pereira
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics
Resources for Multilingual Text Generation in Three Slavic Languages
John Bateman | Elke Teich | Geert-Jan Kruijff | Ivana Kruijff-Korbayová | Serge Sharoff | Hana Skoumalová
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
1998
Types of syntagmatic grammatical relations and their representation
Elke Teich
Processing of Dependency-Based Grammars
1994
Towards the Application of Text Generation in an Integrated Publication System
Elke Teich | John Bateman
Proceedings of the Seventh International Workshop on Natural Language Generation