2017
The Making of the Royal Society Corpus
Jörg Knappen
|
Stefan Fischer
|
Hannah Kermes
|
Elke Teich
|
Peter Fankhauser
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
2016
The Royal Society Corpus: From Uncharted Data to Corpus
Hannah Kermes
|
Stefania Degaetano-Ortlieb
|
Ashraf Khamis
|
Jörg Knappen
|
Elke Teich
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665–1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.
2014
Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers
Stefania Degaetano-Ortlieb
|
Peter Fankhauser
|
Hannah Kermes
|
Ekaterina Lapshinova-Koltunski
|
Noam Ordan
|
Elke Teich
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a methodology to analyze the linguistic evolution of scientific registers with data mining techniques, comparing the insights gained from shallow vs. linguistic features. The focus is on selected scientific disciplines at the boundaries to computer science (computational linguistics, bioinformatics, digital construction, microelectronics). The data basis is the English Scientific Text Corpus (SCITEX) which covers a time range of roughly thirty years (1970/80s to early 2000s) (Degaetano-Ortlieb et al., 2013; Teich and Fankhauser, 2010). In particular, we investigate the diversification of scientific registers over time. Our theoretical basis is Systemic Functional Linguistics (SFL) and its specific incarnation of register theory (Halliday and Hasan, 1985). In terms of methods, we combine corpus-based methods of feature extraction and data mining techniques.
2013
Scientific registers and disciplinary diversification: a comparable corpus approach
Elke Teich
|
Stefania Degaetano-Ortlieb
|
Hannah Kermes
|
Ekaterina Lapshinova-Koltunski
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
2012
A methodology for the extraction of information about the usage of formulaic expressions in scientific texts
Hannah Kermes
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In this paper, we present a methodology for the extraction of formulaic expressions, which goes beyond the mere extraction of candidate patterns. Using a pipeline we are able to extract information about the usage of formulaic expressions automatically from text corpora. According to Biber and Barbieri (2007) formulaic expressions are important building blocks of discourse in spoken and written registers. The automatic extraction procedure can help to investigate the usage and function of these recurrent patterns in different registers and domains. Formulaic expressions are commonplace not only in everyday language but also in scientific writing. Patterns such as 'in this paper', 'the number of', 'on the basis of' are often used by scientists to convey research interests, the theoretical basis of their studies, results of experiments, scientific findings as well as conclusions and are used as discourse organizers. For Hyland (2008) they help to shape meanings in specific context and contribute to our sense of coherence in a text. We are interested in: (i) which and what type of formulaic expressions are used in scientific texts? (ii) the distribution of formulaic expressions across different scientific disciplines, (iii) where do formulaic expressions occur within a text?
2004
Text Analysis Meets Computational Lexicography
Hannah Kermes
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics
2003
Experiments on Candidate Data for Collocation Extraction
Stefan Evert
|
Hannah Kermes
10th Conference of the European Chapter of the Association for Computational Linguistics
2002
YAC - A Recursive Chunker for Unrestricted German Text
Hannah Kermes
|
Stefan Evert
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02)