2020
Preserving Semantic Information from Old Dictionaries: Linking Senses of the ‘Altfranzösisches Wörterbuch’ to WordNet
Achim Stein
Proceedings of the Twelfth Language Resources and Evaluation Conference
Historical dictionaries of the pre-digital period are important resources for the study of older languages. Taking the example of the ‘Altfranzösisches Wörterbuch’, an Old French dictionary published from 1925 onwards, this contribution shows how such printed dictionaries can be turned into a more easily accessible and more sustainable lexical database, even though a full-text retro-conversion is too costly. Over 57,000 German sense definitions were identified in uncorrected OCR output. For verbs and nouns, 34,000 senses of more than 20,000 lemmas were matched with GermaNet, a semantic network for German, and, in a second step, linked to synsets of the English WordNet. These results are relevant for the automatic processing of Old French, for the annotation and exploitation of Old French text corpora, and for the philological study of Old French in general.
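To illustrate the linking step described in this abstract, here is a minimal sketch that matches an extracted German sense definition against GermaNet lemmas and follows a GermaNet-to-WordNet mapping. The file names, TSV formats and the simple lemma-overlap heuristic are assumptions for illustration, not the paper's actual implementation.

# Minimal sketch: match extracted German sense definitions against GermaNet
# lemmas and follow an existing GermaNet->WordNet mapping. File names, formats
# and the lemma-overlap heuristic are assumptions for illustration only.
import csv
from collections import defaultdict

def load_germanet_lemmas(path):
    """Map each German lemma to the GermaNet synset ids it occurs in.
    Expected TSV columns: synset_id <TAB> lemma (assumed format)."""
    lemma_to_synsets = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for synset_id, lemma in csv.reader(f, delimiter="\t"):
            lemma_to_synsets[lemma.lower()].add(synset_id)
    return lemma_to_synsets

def load_ili_mapping(path):
    """Map GermaNet synset ids to English WordNet synset ids (assumed TSV)."""
    with open(path, encoding="utf-8") as f:
        return dict(csv.reader(f, delimiter="\t"))

def link_definition(definition, lemma_to_synsets, gn_to_wn):
    """Return candidate WordNet synsets for one German sense definition
    by simple lemma overlap (a stand-in for the paper's matching step)."""
    candidates = set()
    for token in definition.lower().split():
        for gn_id in lemma_to_synsets.get(token, ()):
            wn_id = gn_to_wn.get(gn_id)
            if wn_id:
                candidates.add(wn_id)
    return candidates

if __name__ == "__main__":
    lemmas = load_germanet_lemmas("germanet_lemmas.tsv")   # assumed file
    gn_to_wn = load_ili_mapping("gn_to_wn.tsv")            # assumed file
    print(link_definition("eine Stadt belagern", lemmas, gn_to_wn))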
2016
Old French Dependency Parsing: Results of Two Parsers Analysed from a Linguistic Point of View
Achim Stein
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The treatment of medieval texts is a particular challenge for parsers. I compare how two dependency parsers, one graph-based, the other transition-based, perform on Old French, confronting typical problems of medieval texts: graphical variation, relatively free word order, and syntactic variation of several parameters over a diachronic period of about 300 years. Both parsers were trained and evaluated on the “Syntactic Reference Corpus of Medieval French” (SRCMF), a manually annotated dependency treebank. I discuss the relation between types of parsers and types of language, as well as the differences between their analyses from a linguistic point of view.
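A minimal sketch of the kind of linguistic error comparison described above: it counts, per gold dependency label, the tokens that one parser attaches correctly and the other does not. The file names and the CoNLL-style column layout (HEAD in column 7, DEPREL in column 8) are assumptions, not the paper's setup.

# Minimal sketch: contrast two parsers' CoNLL-style outputs against gold
# annotation and report, per label, where they disagree.
from collections import Counter

def read_tokens(path):
    """Yield (head, deprel) for every token line of a CoNLL-style file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 8 and cols[0].isdigit():
                yield cols[6], cols[7]

def disagreements(gold_path, graph_path, trans_path):
    """Count, per gold label, tokens one parser gets right and the other wrong."""
    only_graph, only_trans = Counter(), Counter()
    for (gh, gr), (ah, ar), (bh, br) in zip(read_tokens(gold_path),
                                            read_tokens(graph_path),
                                            read_tokens(trans_path)):
        a_ok = (ah, ar) == (gh, gr)
        b_ok = (bh, br) == (gh, gr)
        if a_ok and not b_ok:
            only_graph[gr] += 1
        elif b_ok and not a_ok:
            only_trans[gr] += 1
    return only_graph, only_trans

if __name__ == "__main__":
    g_only, t_only = disagreements("srcmf_gold.conll",        # assumed files
                                   "graph_based.conll",
                                   "transition_based.conll")
    print("graph-based better:", g_only.most_common(5))
    print("transition-based better:", t_only.most_common(5))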
LVF-lemon ― Towards a Linked Data Representation of “Les Verbes français”
Ingrid Falk, Achim Stein
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this study we elaborate a road map for the conversion of a traditional lexical syntactico-semantic resource for French into a linguistic linked open data (LLOD) model. Our approach uses current best practices and the analyses of earlier similar undertakings (lemonUBY and PDEV-lemon) to tease out the most appropriate representation for our resource.
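As a rough illustration of what a lemon-style linked data representation looks like, here is a minimal sketch that serialises one verb entry as RDF with rdflib. The base URI, the entry "abandonner", its sense and gloss are invented for illustration and do not reproduce the mapping developed in the paper.

# Minimal sketch: one hypothetical French verb entry in lemon-style RDF,
# built with rdflib and printed as Turtle.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

LEMON = Namespace("http://lemon-model.net/lemon#")
LEXINFO = Namespace("http://www.lexinfo.net/ontology/2.0/lexinfo#")
BASE = Namespace("http://example.org/lvf/")            # assumed base URI

g = Graph()
g.bind("lemon", LEMON)
g.bind("lexinfo", LEXINFO)

entry = BASE["abandonner_1"]                           # hypothetical entry
form = BASE["abandonner_1_form"]
sense = BASE["abandonner_1_sense1"]

g.add((entry, RDF.type, LEMON.LexicalEntry))
g.add((entry, LEXINFO.partOfSpeech, LEXINFO.verb))
g.add((entry, LEMON.canonicalForm, form))
g.add((form, LEMON.writtenRep, Literal("abandonner", lang="fr")))
g.add((entry, LEMON.sense, sense))
g.add((sense, RDFS.comment, Literal("quitter, laisser (illustrative gloss)", lang="fr")))

print(g.serialize(format="turtle"))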
2014
Parsing Heterogeneous Corpora with a Rich Dependency Grammar
Achim Stein
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Grammar models conceived for parsing purposes are often poorer than models that are motivated linguistically. We present a grammar model which is linguistically satisfactory and based on the principles of traditional dependency grammar. We show how a state-of-the-art dependency parser (mate tools) performs with this model, trained on the Syntactic Reference Corpus of Medieval French (SRCMF), a manually annotated corpus of medieval (Old French) texts. We focus on the problems caused by small and heterogeneous training sets typical for corpora of older periods. The result is the first publicly available dependency parser for Old French. On a 90/10 training/evaluation split of eleven Old French texts (206,000 words), we obtained a UAS of 89.68% and an LAS of 82.62%. Three experiments showed how heterogeneity, typical of medieval corpora, affects the parsing results: (a) a ‘one-on-one’ cross evaluation for individual texts, (b) a ‘leave-one-out’ cross evaluation, and (c) a prose/verse cross evaluation.
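The UAS and LAS figures quoted above can be computed from token-aligned gold and predicted CoNLL-style files with a few lines. A minimal sketch follows; the file names and the assumption that both files contain the same tokens in the same order are illustrative.

# Minimal sketch: compute unlabeled (UAS) and labeled (LAS) attachment scores
# from a gold and a predicted CoNLL-style file.
def heads_and_labels(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 8 and cols[0].isdigit():
                pairs.append((cols[6], cols[7]))   # (HEAD, DEPREL)
    return pairs

def attachment_scores(gold_path, pred_path):
    gold, pred = heads_and_labels(gold_path), heads_and_labels(pred_path)
    assert len(gold) == len(pred), "files must be token-aligned"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

if __name__ == "__main__":
    uas, las = attachment_scores("srcmf_test_gold.conll",   # assumed files
                                 "mate_output.conll")
    print(f"UAS {uas:.2%}  LAS {las:.2%}")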
2010
Identification of Rare & Novel Senses Using Translations in a Parallel Corpus
Richard Schwarz, Hinrich Schütze, Fabienne Martin, Achim Stein
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The identification of rare and novel senses is a challenge in lexicography. In this paper, we present a new method for finding such senses using a word-aligned multilingual parallel corpus. We use the Europarl corpus and concentrate on French verbs. We represent each occurrence of a French verb as a high-dimensional term vector. The dimensions of such a vector are the possible translations of the verb according to the underlying word alignment, weighted to reflect the significance of each particular translation. After collecting these vectors we apply variants of the K-means algorithm to the resulting vector space to produce clusters of distinct senses, so that standard uses produce large homogeneous clusters while rare and novel uses appear in small or heterogeneous clusters. We show in a qualitative and quantitative evaluation that the method can successfully find rare and novel senses.
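A minimal sketch of the clustering step described above: each occurrence of a verb is represented by its aligned translations, the counts are weighted (tf-idf is used here as one possible weighting scheme), K-means clusters the occurrences, and very small clusters are flagged as candidates for rare or novel senses. The toy occurrences, the value of k and the size threshold are assumptions for illustration.

# Minimal sketch: cluster translation vectors with K-means and flag small
# clusters as candidate rare/novel senses.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans

# One Counter of aligned translations per occurrence of a French verb (toy data).
occurrences = [
    Counter({"leave": 1}), Counter({"leave": 1}), Counter({"leave": 1}),
    Counter({"abandon": 1}), Counter({"abandon": 1}),
    Counter({"waive": 1}),          # a rare use we hope to isolate
]

vec = DictVectorizer()
counts = vec.fit_transform(occurrences)
weighted = TfidfTransformer().fit_transform(counts)    # weighting scheme

k = 3                                                   # assumed number of clusters
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(weighted)

sizes = Counter(labels)
for occ, lab in zip(occurrences, labels):
    flag = "RARE/NOVEL?" if sizes[lab] <= 1 else ""     # assumed size threshold
    print(lab, dict(occ), flag)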
2009
Disambiguation of Polysemous Verbs for Rule-based Inferencing
Fabienne Martin, Dennis Spohr, Achim Stein
Proceedings of the Eighth International Conference on Computational Semantics