2024
Bridging Computational Lexicography and Corpus Linguistics: A Query Extension for OntoLex-FrAC
Christian Chiarcos | Ranka Stanković | Maxim Ionov | Gilles Sérasset
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
OntoLex, the dominant community standard for machine-readable lexical resources in the context of RDF, Linked Data and Semantic Web technologies, is currently being extended with a designated module for Frequency, Attestations and Corpus-based Information (OntoLex-FrAC). We propose a novel component for OntoLex-FrAC, addressing the incorporation of corpus queries for (a) linking dictionaries with corpus engines, (b) enabling RDF-based web services to exchange corpus queries and response data dynamically, and (c) using conventional query languages to formalize the internal structure of collocations, word sketches, and colligations. The primary field of application of the query extension is in digital lexicography and corpus linguistics, and we present a proof-of-principle implementation in backend components of a novel platform designed to support digital lexicography for the Serbian language.
On Modelling Corpus Citations in Computational Lexical Resources
Fahad Khan | Maxim Ionov | Christian Chiarcos | Laurent Romary | Gilles Sérasset | Besim Kabashi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this article we look at how two different standards for lexical resources, TEI and OntoLex, deal with corpus citations in lexicons. We will focus on how corpus citations in retrodigitised dictionaries can be modelled using each of the two standards since this provides us with a suitably challenging use case. After looking at the structure of an example entry from a legacy dictionary, we examine the two approaches offered by the two different standards by outlining an encoding for the example entry using both of them (note that this article features the first extended discussion of how the Frequency Attestation and Corpus (FrAC) module of OntoLex deals with citations). After comparing the two approaches and looking at the advantages and disadvantages of both, we argue for a combination of both. In the last part of the article we discuss different ways of doing this, giving our preference for a strategy which makes use of RDFa.
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Christian Chiarcos | Katerina Gkirtzou | Maxim Ionov | Fahad Khan | John P. McCrae | Elena Montiel Ponsoda | Patricia Martín Chozas
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Linguistic LOD for Interoperable Morphological Description
Michael Rosner | Maxim Ionov
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Interoperability is a characteristic of a product or system that seamlessly works with another product or system and implies a certain level of independence from the context of use. Turning to language resources, interoperability is frequently cited as one important rationale underlying the use of LLOD representations and is generally regarded as highly desirable. In this paper we further elaborate this theme, distinguishing three different kinds of interoperability and providing practical implementations with examples from morphology.
OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian
Ranka Stanković | Maxim Ionov | Medina Bajtarević | Lorena Ninčević
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
This paper introduces a novel language resource for retrieving and researching verbal aspectual pairs in BCS (Bosnian, Croatian, and Serbian) created using Linguistic Linked Open Data (LLOD) principles. As there is no resource to help learners of Bosnian, Croatian, and Serbian as foreign languages to recognize the aspect of a verb or its pairs, we have created a new resource that will provide users with information about the aspect, as well as the link to a verb’s aspectual counterparts. This resource also contains external links to monolingual dictionaries, Wordnet, and BabelNet. As this is a work in progress, our resource only includes verbs and their perfective pairs formed with prefixes “pro”, “od”, “ot”, “iz”, “is” and “na”. The goal of this project is to have a complete dataset of all the aspectual pairs in these three languages. We believe it will be useful for research in the field of aspectology, as well as machine translation and other NLP tasks. Using this resource as an example, we also propose a sustainable approach to publishing small to moderate LLOD resources on the Web, both in a user-friendly way and according to the Linked Data principles.
2023
Beyond Concatenative Morphology: Applying OntoLex-Morph to Maltese
Maxim Ionov | Mike Rosner
Proceedings of the 4th Conference on Language, Data and Knowledge
2022
Modelling Collocations in OntoLex-FrAC
Christian Chiarcos | Katerina Gkirtzou | Maxim Ionov | Besim Kabashi | Fahad Khan | Ciprian-Octavian Truică
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference
Following presentations of frequency and attestations, and embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous data sets, with the goal of eliciting feedback from reviewers, workshop audience and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how to retrieve and aggregate collocation information by means of SPARQL and how to export it to a tabular format, so that it can be easily processed in downstream applications.
Querying a Dozen Corpora and a Thousand Years with Fintan
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Large-scale diachronic corpus studies covering longer time periods are difficult if more than one corpus is to be consulted and, as a result, different formats and annotation schemas need to be processed and queried in a uniform, comparable and replicable manner. We describe the application of the Flexible Integrated Transformation and Annotation eNgineering (Fintan) platform for studying word order in German using syntactically annotated corpora that represent its entire written history. Focusing on nominal dative and accusative arguments, this study hints at two major phases in the development of scrambling in modern German. Against more recent assumptions, it supports the traditional view that word order flexibility decreased over time, but it also indicates that this was a relatively sharp transition in Early New High German. The successful case study demonstrates the potential of Fintan and the underlying LLOD technology for historical linguistics, linguistic typology and corpus linguistics. The technological contribution of this paper is to demonstrate the applicability of Fintan for querying across heterogeneously annotated corpora, as it had previously been applied only to transformation tasks. With its focus on quantitative analysis, Fintan is a natural complement for existing multi-layer technologies that focus on query and exploration.
Unifying Morphology Resources with OntoLex-Morph. A Case Study in German
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The OntoLex vocabulary has become a widely used community standard for machine-readable lexical resources on the web. The primary motivation to use OntoLex in favor of tool- or application-specific formalisms is to facilitate interoperability and information integration across different resources. One of its extensions currently being developed is a module for representing morphology, OntoLex-Morph. In this paper, we show how OntoLex-Morph can be used for the encoding and integration of different types of morphological resources on a unified basis. With German as the example, we demonstrate it for (a) a full-form dictionary with inflection information (Unimorph), (b) a dictionary of base forms and their derivations (UDer), (c) a dictionary of compounds (from GermaNet), and (d) lexicon and inflection rules of a finite-state parser/generator (SMOR/Morphisto). These data are converted to OntoLex-Morph, their linguistic information is consolidated and corresponding lexical entries are linked with each other.
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
Thierry Declerck | John P. McCrae | Elena Montiel | Christian Chiarcos | Maxim Ionov
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
2021
Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos | Thierry Declerck | Maxim Ionov
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)
2020
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)
Maxim Ionov | John P. McCrae | Christian Chiarcos | Thierry Declerck | Julia Bosque-Gil | Jorge Gracia
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)
The ACoLi Dictionary Graph
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we report the release of the ACoLi Dictionary Graph, a large-scale collection of multilingual open source dictionaries available in two machine-readable formats, a graph representation in RDF, using the OntoLex-Lemon vocabulary, and a simple tabular data format to facilitate their use in NLP tasks, such as translation inference across dictionaries. We describe the mapping and harmonization of the underlying data structures into a unified representation, its serialization in RDF and TSV, and the release of a massive and coherent amount of lexical data under open licenses.
Fintan - Flexible, Integrated Transformation and Annotation eNgineering
Christian Fäth | Christian Chiarcos | Björn Ebbrecht | Maxim Ionov
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the Flexible and Integrated Transformation and Annotation eNgineering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and augmenting them with extended graph processing capabilities: Existing converters can be easily deployed to the system by means of an ontological data structure which renders their properties and the dependencies between transformation steps. Development of subsequent graph transformation steps for resource transformation, annotation engineering or entity linking is further facilitated by a novel visual rendering of SPARQL queries. A graphical workflow manager makes it easy to manage the converter modules and combine them into new transformation pipelines. Employing the stream-based graph processing approach first implemented with CoNLL-RDF, we address common challenges and scalability issues when transforming resources and showcase the performance of Fintan by means of a purely graph-based transformation of the Universal Morphology data to RDF.
Modelling Frequency and Attestations for OntoLex-Lemon
Christian Chiarcos | Maxim Ionov | Jesse de Does | Katrien Depuydt | Anas Fahad Khan | Sander Stolk | Thierry Declerck | John Philip McCrae
Proceedings of the 2020 Globalex Workshop on Linked Lexicography
The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora, such as frequency information, links to attestations, and collocation data, was considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward a proposal for a novel module for frequency, attestation and corpus information (FrAC), which not only covers the requirements of digital lexicography but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary and describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.
2018
Universal Morphologies for the Caucasus region
Christian Chiarcos | Kathrin Donandt | Maxim Ionov | Monika Rind-Pawlowski | Hasmik Sargsian | Jesse Wichers Schreur | Frank Abromeit | Christian Fäth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2015
Expanding the horizons: adding a new language to the news personalization system
Andrey Fedorovsky | Maxim Ionov | Varvara Litvinova | Tatyana Olenina | Darya Trofimova
Proceedings of the First Workshop on Computing News Storylines
2012
RU-EVAL-2012: Evaluating Dependency Parsers for Russian
Anastasia Gareyshina | Maxim Ionov | Olga Lyashevskaya | Dmitry Privoznov | Elena Sokolova | Svetlana Toldova
Proceedings of COLING 2012: Posters