Stefania Degaetano-Ortlieb

Also published as: Stefania Degaetano-ortlieb

2024

Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal
Julius Steuer | Marie-Pauline Krielke | Stefan Fischer | Stefania Degaetano-Ortlieb | Marius Mosbach | Dietrich Klakow
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English
Diego Alves | Stefania Degaetano-Ortlieb | Elena Schmidt | Elke Teich
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.

Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
Yuri Bizzoni | Stefania Degaetano-Ortlieb | Anna Kazantseva | Stan Szpakowicz

Multi-word Expressions in English Scientific Writing
Diego Alves | Stefan Fischer | Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

Multi-Word Expressions (MWEs) play a pivotal role in language use overall and in register formation more specifically, e.g., encoding field-specific terminology. Our study focuses on the identification and categorization of MWEs used in scientific writing, considering their formal characteristics as well as their developmental trajectory over time from the mid-17th century to the present. For this, we develop an approach combining three different types of methods to identify MWEs (Universal Dependency annotation, Partitioner and the Academic Formulas List) and selected measures to characterize MWE properties (e.g., dispersion by Kullback-Leibler Divergence and several association measures). This allows us to inspect MWE types in a novel data-driven way regarding their functions and change over time in specialized discourse.

Applying Information-theoretic Notions to Measure Effects of the Plain English Movement on English Law Reports and Scientific Articles
Sergei Bagdasarov | Stefania Degaetano-Ortlieb
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

We investigate the impact of the Plain English Movement (PEM) on the complexity of legal language in UK law reports from the 1950s-2010s, contrasting it with the evolution of scientific language. The PEM, emerging in the late 20th century, advocated for clear and understandable legal language. We define complexity through the concept of surprisal, an information-theoretic measure correlating with cognitive processing difficulty. Our research contrasts surprisal with traditional readability measures, which often overlook content. We hypothesize that, if the PEM has influenced legal language, there would be a reduction in complexity over time and a shift from a nominal to a more verbal style. We analyze text complexity and lexico-grammatical changes in line with PEM recommendations. Results indicate minimal impact of the PEM on both legal and scientific domains. This finding suggests future research should consider processing effort when advocating for linguistic norms to enhance accessibility.
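
For readers unfamiliar with the measure, surprisal is the negative log probability of a word given its context, in bits. The sketch below is an illustration only and not the authors' implementation: it assumes a simple bigram model with add-alpha smoothing, and the function names (`train_bigram`, `surprisal`, `mean_surprisal`) are hypothetical.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect context and bigram counts from a token list (toy model, an assumption)."""
    contexts = Counter(tokens[:-1])             # how often each word serves as context
    bigrams = Counter(zip(tokens, tokens[1:]))  # (context, word) pair counts
    vocab_size = len(set(tokens))
    return contexts, bigrams, vocab_size


def surprisal(word, context, model, alpha=1.0):
    """Surprisal in bits: -log2 p(word | context), with add-alpha smoothing."""
    contexts, bigrams, vocab_size = model
    p = (bigrams[(context, word)] + alpha) / (contexts[context] + alpha * vocab_size)
    return -math.log2(p)


def mean_surprisal(tokens, model):
    """Average surprisal over a text with at least two tokens."""
    values = [surprisal(w, c, model) for c, w in zip(tokens, tokens[1:])]
    return sum(values) / len(values)


model = train_bigram("the court held that the court".split())
expected = surprisal("court", "the", model)    # frequent continuation, low surprisal
unexpected = surprisal("held", "the", model)   # unseen continuation, higher surprisal
```

Under this toy model, a continuation that was seen often after a given context receives lower surprisal than an unseen one, which is the property that links the measure to processing effort.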

2023

Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Stefania Degaetano-Ortlieb | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

Fractality of informativity in 300 years of English scientific writing
Yuri Bizzoni | Stefania Degaetano-ortlieb
Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Scientific writing is assumed to have become more informationally dense over time (Halliday, 1988; Biber and Gray, 2016). By means of fractal analysis, we study whether over time the degree of informativity has become more persistent with predictable patterns of gradual changes between high vs. low informational content, indicating a trend towards an optimal code for scientific communication.

2021

Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Stefania Degaetano-Ortlieb | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

The diffusion of scientific terms – tracing individuals’ influence in the history of science for English
Yuri Bizzoni | Stefania Degaetano-Ortlieb | Katrin Menzel | Elke Teich
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Tracing the influence of individuals or groups in social networks is an increasingly popular task in sociolinguistic studies. While methods to determine someone’s influence in short-term contexts (e.g., social media, on-line political debates) are widespread, influence in long-term contexts is less investigated and may be harder to capture. We study the diffusion of scientific terms in an English diachronic scientific corpus, applying Hawkes Processes to capture the role of individual scientists as “influencers” or “influencees” in the diffusion of new concepts. Our findings on two major scientific discoveries in chemistry and astronomy of the 18th century reveal that modelling both the introduction and diffusion of scientific terms in a historical corpus as Hawkes Processes allows detecting patterns of influence between authors on a long-term scale.

2020

A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English
Marius Mosbach | Stefania Degaetano-Ortlieb | Marie-Pauline Krielke | Badr M. Abdullah | Dietrich Klakow
Proceedings of the 28th International Conference on Computational Linguistics

Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses, especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a) model comparison in evaluation tasks and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.

2019

Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

The Scientization of Literary Study
Stefania Degaetano-Ortlieb | Andrew Piper
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).

Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings
Yuri Bizzoni | Stefania Degaetano-Ortlieb | Katrin Menzel | Pauline Krielke | Elke Teich
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17th-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.

Some steps towards the generation of diachronic WordNets
Yuri Bizzoni | Marius Mosbach | Dietrich Klakow | Stefania Degaetano-Ortlieb
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We apply hyperbolic embeddings to trace the dynamics of change of conceptual-semantic relationships in a large diachronic scientific corpus (200 years). Our focus is on emerging scientific fields and the increasingly specialized terminology establishing around them. Reproducing high-quality hierarchical structures such as WordNet on a diachronic scale is a very difficult task. Hyperbolic embeddings can map partial graphs into low-dimensional, continuous hierarchical spaces, making more explicit the latent structure of the input. We show that starting from simple lists of word pairs (rather than a list of entities with directional links) it is possible to build diachronic hierarchical semantic spaces which allow us to model a process towards specialization for selected scientific fields.

2018

Stylistic variation over 200 years of court proceedings according to gender and social class
Stefania Degaetano-Ortlieb
Proceedings of the Second Workshop on Stylistic Variation

We present an approach to detect stylistic variation across social variables (here: gender and social class), considering also diachronic change in language use. For detection of stylistic variation, we use relative entropy, measuring the difference between probability distributions at different linguistic levels (here: lexis and grammar). In addition, by relative entropy, we can determine which linguistic units are related to stylistic variation.

Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Feldman | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

Using relative entropy for detection and analysis of periods of diachronic linguistic change
Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler Divergence) comparing temporally adjacent periods and sliding over the time line from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges.
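
As an illustration of the measure named in this abstract, the sketch below computes relative entropy (Kullback-Leibler Divergence) between temporally adjacent periods and slides along the timeline. It is a minimal sketch under stated assumptions, not the authors' implementation: the unigram word distributions, the additive smoothing, the direction of comparison (later period against the preceding one), and the names `kl_divergence` and `sliding_kl` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def kl_divergence(tokens_p, tokens_q, smoothing=1.0):
    """D(P || Q) in bits over unigram distributions, with additive smoothing (an assumption).

    Returns the total divergence and each word's pointwise contribution.
    """
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    vocab = set(counts_p) | set(counts_q)
    total_p = len(tokens_p) + smoothing * len(vocab)
    total_q = len(tokens_q) + smoothing * len(vocab)
    contributions = {}
    for word in vocab:
        p = (counts_p[word] + smoothing) / total_p
        q = (counts_q[word] + smoothing) / total_q
        contributions[word] = p * math.log2(p / q)
    return sum(contributions.values()), contributions


def sliding_kl(periods, top_n=3):
    """Compare each period with the one before it, moving from past to present.

    periods: list of (label, tokens) pairs in chronological order.
    """
    results = []
    for (prev_label, prev_tokens), (label, tokens) in zip(periods, periods[1:]):
        total, contributions = kl_divergence(tokens, prev_tokens)
        top_words = sorted(contributions, key=contributions.get, reverse=True)[:top_n]
        results.append((prev_label, label, total, top_words))
    return results


# Toy data, invented for illustration only.
toy_periods = [
    ("1700", ["the", "air", "the"]),
    ("1750", ["the", "oxygen", "oxygen"]),
]
comparisons = sliding_kl(toy_periods)
```

Each result holds the two period labels, the overall divergence, and the words contributing most to it, which mirrors the abstract's idea of identifying both periods of change and the features driving them.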

2017

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Feldman | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns
Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).

2016

Modeling Diachronic Change in Scientific Writing with Information Density
Raphael Rubino | Stefania Degaetano-Ortlieb | Elke Teich | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Previous linguistic research on scientific writing has shown that language use in the scientific domain varies considerably in register and style over time. In this paper we investigate the introduction of information-theory-inspired features to study long-term diachronic change on three levels: lexis, part-of-speech and syntax. Our approach is based on distinguishing between sentences from 19th and 20th century scientific abstracts using supervised classification models. To the best of our knowledge, the introduction of information-theoretic features to this task is novel. We show that these features outperform more traditional features, such as token or character n-grams, while leading to more compact models. We present a detailed analysis of feature informativeness in order to gain a better understanding of diachronic change on different linguistic levels.

Information-based Modeling of Diachronic Linguistic Change: from Typicality to Productivity
Stefania Degaetano-Ortlieb | Elke Teich
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

The Royal Society Corpus: From Uncharted Data to Corpus
Hannah Kermes | Stefania Degaetano-Ortlieb | Ashraf Khamis | Jörg Knappen | Elke Teich
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665–1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.

2014

Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
Stefania Degaetano-Ortlieb | Peter Fankhauser | Hannah Kermes | Ekaterina Lapshinova-Koltunski | Noam Ordan | Elke Teich
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.

2013

Scientific registers and disciplinary diversification: a comparable corpus approach
Elke Teich | Stefania Degaetano-Ortlieb | Hannah Kermes | Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Visualising Linguistic Evolution in Academic Discourse
Verena Lyding | Ekaterina Lapshinova-Koltunski | Stefania Degaetano-Ortlieb | Henrik Dittmann | Chris Culy
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Feature Discovery for Diachronic Register Analysis: a Semi-Automatic Approach
Stefania Degaetano-Ortlieb | Ekaterina Lapshinova-Koltunski | Elke Teich
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present corpus-based procedures to semi-automatically discover features relevant for the study of recent language change in scientific registers. First, linguistic features potentially indicative of recent language change are extracted from the SciTex Corpus. Second, features are assessed for their relevance for the study of recent language change in scientific registers by means of correspondence analysis. The discovered features will serve for further investigations of the linguistic evolution of newly emerged scientific registers.
