Christian Chiarcos

2021

pdf bib
Embeddings for the Lexicon: Modelling and Representation
Christian Chiarcos | Thierry Declerck | Maxim Ionov
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

2020

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

pdf bib abs
Translation Inference by Concept Propagation
Christian Chiarcos | Niko Schenk | Christian Fäth
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

This paper describes our contribution to the Third Shared Task on Translation Inference across Dictionaries (TIAD-2020). We describe an approach on translation inference based on symbolic methods, the propagation of concepts over a graph of interconnected dictionaries: Given a mapping from source language words to lexical concepts (e.g., synsets) as a seed, we use bilingual dictionaries to extrapolate a mapping of pivot and target language words to these lexical concepts. Translation inference is then performed by looking up the lexical concept(s) of a source language word and returning the target language word(s) for which these lexical concepts have the respective highest score. We present two instantiations of this system: One using WordNet synsets as concepts, and one using lexical entries (translations) as concepts. With a threshold of 0, the latter configuration is the second among participant systems in terms of F1 score. We also describe additional evaluation experiments on Apertium data, a comparison with an earlier approach based on embedding projection, and an approach for constrained projection that outperforms the TIAD-2020 vanilla system by a large margin.

pdf bib abs
On the Linguistic Linked Open Data Infrastructure
Christian Chiarcos | Bettina Klimek | Christian Fäth | Thierry Declerck | John Philip McCrae
Proceedings of the 1st International Workshop on Language Technology Platforms

In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD(sub-)cloud of linguistic resources, which covers various linguistic data bases, lexicons, corpora, terminology and metadata repositories.We give in some details an overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-useMultilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD.

With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.

pdf bib abs
Towards the First Machine Translation System for Sumerian Transliterations
Ravneet Punia | Niko Schenk | Christian Chiarcos | Émilie Pagé-Perron
Proceedings of the 28th International Conference on Computational Linguistics

The Sumerian cuneiform script was invented more than 5,000 years ago and represents one of the oldest in history. We present the first attempt to translate Sumerian texts into English automatically. We publicly release high-quality corpora for standardized training and evaluation and report results on experiments with supervised, phrase-based, and transfer learning techniques for machine translation. Quantitative and qualitative evaluations indicate the usefulness of the translations. Our proposed methodology provides a broader audience of researchers with novel access to the data, accelerates the costly and time-consuming manual translation process, and helps them better explore the relationships between Sumerian cuneiform and Mesopotamian culture.

pdf bib abs
The ACoLi Dictionary Graph
Christian Chiarcos | Christian Fäth | Maxim Ionov
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we report the release of the ACoLi Dictionary Graph, a large-scale collection of multilingual open source dictionaries available in two machine-readable formats, a graph representation in RDF, using the OntoLex-Lemon vocabulary, and a simple tabular data format to facilitate their use in NLP tasks, such as translation inference across dictionaries. We describe the mapping and harmonization of the underlying data structures into a unified representation, its serialization in RDF and TSV, and the release of a massive and coherent amount of lexical data under open licenses.

In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.

pdf bib abs
Annotation Interoperability for the Post-ISOCat Era
Christian Chiarcos | Christian Fäth | Frank Abromeit
Proceedings of the 12th Language Resources and Evaluation Conference

With this paper, we provide an overview over ISOCat successor solutions and annotation standardization efforts since 2010, and we describe the low-cost harmonization of post-ISOCat vocabularies by means of modular, linked ontologies: The CLARIN Concept Registry, LexInfo, Universal Parts of Speech, Universal Dependencies and UniMorph are linked with the Ontologies of Linguistic Annotation and through it with ISOCat, the GOLD ontology, the Typological Database Systems ontology and a large number of annotation schemes.

pdf bib abs
A Tree Extension for CoNLL-RDF
Christian Chiarcos | Luis Glaser
Proceedings of the 12th Language Resources and Evaluation Conference

The technological bridges between knowledge graphs and natural language processing are of utmost importance for the future development of language technology. CoNLL-RDF is a technology that provides such a bridge for popular one-word-per-line formats as widely used in NLP (e.g., the CoNLL Shared Tasks), annotation (Universal Dependencies, Unimorph), corpus linguistics (Corpus WorkBench, CWB) and digital lexicography (SketchEngine): Every empty-line separated table (usually a sentence) is parsed into an graph, can be freely manipulated and enriched using W3C-standardized RDF technology, and then be serialized back into in a TSV format, RDF or other formats. An important limitation is that CoNLL-RDF provides native support for word-level annotations only. This does include dependency syntax and semantic role annotations, but neither phrase structures nor text structure. We describe the extension of the CoNLL-RDF technology stack for two vocabulary extensions of CoNLL-TSV, the PTB bracket notation used in earlier CoNLL Shared Tasks and the extension with XML markup elements featured by CWB and SketchEngine. In order to represent the necessary extensions of the CoNLL vocabulary in an adequate fashion, we employ the POWLA vocabulary for representing and navigating in tree structures.

pdf bib abs
Fintan - Flexible, Integrated Transformation and Annotation eNgineering
Christian Fäth | Christian Chiarcos | Björn Ebbrecht | Maxim Ionov
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce the Flexible and Integrated Transformation and Annotation eNgeneering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and augmenting them with extended graph processing capabilities: Existing converters can be easily deployed to the system by means of an ontological data structure which renders their properties and the dependencies between transformation steps. Development of subsequent graph transformation steps for resource transformation, annotation engineering or entity linking is further facilitated by a novel visual rendering of SPARQL queries. A graphical workflow manager allows to easily manage the converter modules and combine them to new transformation pipelines. Employing the stream-based graph processing approach first implemented with CoNLL-RDF, we address common challenges and scalability issues when transforming resources and showcase the performance of Fintan by means of a purely graph-based transformation of the Universal Morphology data to RDF.

2018

pdf bib
The ACoLi CoNLL Libraries: Beyond Tab-Separated Values
Christian Chiarcos | Niko Schenk
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Towards a Linked Open Data Edition of Sumerian Corpora
Christian Chiarcos | Émilie Pagé-Perron | Ilya Khait | Niko Schenk | Lucas Reckling
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Analyzing Middle High German Syntax with RDF and SPARQL
Christian Chiarcos | Benjamin Kosmehl | Christian Fäth | Maria Sukhareva
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Interoperability of Language-related Information: Mapping the BLL Thesaurus to Lexvo and Glottolog
Vanya Dimitrova | Christian Fäth | Christian Chiarcos | Heike Renner-Westermann | Frank Abromeit
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
A Recurrent Neural Model with Attention for the Recognition of Chinese Implicit Discourse Relations
Samuel Rönnqvist | Niko Schenk | Christian Chiarcos
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce an attention-based Bi-LSTM for Chinese implicit discourse relations and demonstrate that modeling argument pairs as a joint sequence can outperform word order-agnostic approaches. Our model benefits from a partial sampling scheme and is conceptually simple, yet achieves state-of-the-art performance on the Chinese Discourse Treebank. We also visualize its attention activity to illustrate the model’s ability to selectively focus on the relevant parts of an input sequence.

pdf bib abs
Resource-Lean Modeling of Coherence in Commonsense Stories
Niko Schenk | Christian Chiarcos
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

We present a resource-lean neural recognizer for modeling coherence in commonsense stories. Our lightweight system is inspired by successful attempts to modeling discourse relations and stands out due to its simplicity and easy optimization compared to prior approaches to narrative script learning. We evaluate our approach in the Story Cloze Test demonstrating an absolute improvement in accuracy of 4.7% over state-of-the-art implementations.

pdf bib abs
Machine Translation and Automated Analysis of the Sumerian Language
Émilie Pagé-Perron | Maria Sukhareva | Ilya Khait | Christian Chiarcos
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents a newly funded international project for machine translation and automated analysis of ancient cuneiform languages where NLP specialists and Assyriologists collaborate to create an information retrieval system for Sumerian. This research is conceived in response to the need to translate large numbers of administrative texts that are only available in transcription, in order to make them accessible to a wider audience. The methodology includes creation of a specialized NLP pipeline and also the use of linguistic linked open data to increase access to the results.

2016

pdf bib
Unsupervised Learning of Prototypical Fillers for Implicit Semantic Role Labeling
Niko Schenk | Christian Chiarcos
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib abs
Corpora and Linguistic Linked Open Data: Motivations, Applications, Limitations
Christian Chiarcos
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. Volume 4 : Conférences invitées

Linguistic Linked Open Data (LLOD) is a technology and a movement in several disciplines working with language resources, including Natural Language Processing, general linguistics, computational lexicography and the localization industry. This talk describes basic principles of Linguistic Linked Open Data and their application to linguistically annotated corpora, it summarizes the current status of the Linguistic Linked Open Data cloud and gives an overview over selected LLOD vocabularies and their uses.

pdf bib
Do We Really Need All Those Rich Linguistic Features? A Neural Network-Based Approach to Implicit Sense Labeling
Niko Schenk | Christian Chiarcos | Kathrin Donandt | Samuel Rönnqvist | Evgeny Stepanov | Giuseppe Riccardi
Proceedings of the CoNLL-16 shared task

pdf bib abs
Combining Ontologies and Neural Networks for Analyzing Historical Language Varieties. A Case Study in Middle Low German
Maria Sukhareva | Christian Chiarcos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we describe experiments on the morphosyntactic annotation of historical language varieties for the example of Middle Low German (MLG), the official language of the German Hanse during the Middle Ages and a dominant language around the Baltic Sea by the time. To our best knowledge, this is the first experiment in automatically producing morphosyntactic annotations for Middle Low German, and accordingly, no part-of-speech (POS) tagset is currently agreed upon. In our experiment, we illustrate how ontology-based specifications of projected annotations can be employed to circumvent this issue: Instead of training and evaluating against a given tagset, we decomponse it into independent features which are predicted independently by a neural network. Using consistency constraints (axioms) from an ontology, then, the predicted feature probabilities are decoded into a sound ontological representation. Using these representations, we can finally bootstrap a POS tagset capturing only morphosyntactic features which could be reliably predicted. In this way, our approach is capable to optimize precision and recall of morphosyntactic annotations simultaneously with bootstrapping a tagset rather than performing iterative cycles.

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

pdf bib abs
Word Segmentation for Akkadian Cuneiform
Timo Homburg | Christian Chiarcos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about 3 millennia in the ancient Near East. To our best knowledge, this is the first study of this kind applied to either the Akkadian language or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurally resembles Eastern Asian writing systems, so, we employ word segmentation algorithms originally developed for Chinese and Japanese. We describe results of rule-based algorithms, dictionary-based algorithms, statistical and machine learning approaches. Our results may indicate possible promising steps in cuneiform word segmentation that can create and improve natural language processing in this area.

This paper introduces a novel research tool for the field of linguistics: The Lin|gu|is|tik web portal provides a virtual library which offers scientific information on every linguistic subject. It comprises selected internet sources and databases as well as catalogues for linguistic literature, and addresses an interdisciplinary audience. The virtual library is the most recent outcome of the Special Subject Collection Linguistics of the German Research Foundation (DFG), and also integrates the knowledge accumulated in the Bibliography of Linguistic Literature. In addition to the portal, we describe long-term goals and prospects with a special focus on ongoing efforts regarding an extension towards integrating language resources and Linguistic Linked Open Data.

2015

pdf bib
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications
Christian Chiarcos | John Philip McCrae | Petya Osenova | Philipp Cimiano | Nancy Ide
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf bib
Memory-Based Acquisition of Argument Structures and its Application to Implicit Role Detection
Christian Chiarcos | Niko Schenk
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
An Ontology-based Approach To Automatic Part-of-Speech Tagging Using Heterogeneously Annotated Corpora
Maria Sukhareva | Christian Chiarcos
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

pdf bib
A Minimalist Approach to Shallow Discourse Parsing and Implicit Relation Recognition
Christian Chiarcos | Niko Schenk
Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task

pdf bib
Towards the Unsupervised Acquisition of Implicit Semantic Roles
Niko Schenk | Christian Chiarcos | Maria Sukhareva
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib abs
Towards interoperable discourse annotation. Discourse features in the Ontologies of Linguistic Annotation
Christian Chiarcos
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the extension of the Ontologies of Linguistic Annotation (OLiA) with respect to discourse features. The OLiA ontologies provide a a terminology repository that can be employed to facilitate the conceptual (semantic) interoperability of annotations of discourse phenomena as found in the most important corpora available to the community, including OntoNotes, the RST Discourse Treebank and the Penn Discourse Treebank. Along with selected schemes for information structure and coreference, discourse relations are discussed with special emphasis on the Penn Discourse Treebank and the RST Discourse Treebank. For an example contained in the intersection of both corpora, I show how ontologies can be employed to generalize over divergent annotation schemes.

pdf bib
New Technologies for Old Germanic. Resources and Research on Parallel Bibles in Older Continental Western Germanic
Christian Chiarcos | Maria Sukhareva | Roland Mittmann | Timothy Price | Gaye Detmold | Jan Chobotsky
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

pdf bib
Diachronic proximity vs. data sparsity in cross-lingual parser projection. A case study on Germanic
Maria Sukhareva | Christian Chiarcos
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

2013

pdf bib
Linguistic Linked Open Data (LLOD) – Building the cloud
Christian Chiarcos
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

pdf bib
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data
Christian Chiarcos | Philipp Cimiano | Thierry Declerck | John P. McCrae
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

pdf bib
Linguistic Linked Open Data (LLOD). Introduction and Overview
Christian Chiarcos | Philipp Cimiano | Thierry Declerck | John P. McCrae
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

2012

pdf bib
Towards the Unsupervised Acquisition of Discourse Relations
Christian Chiarcos
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib abs
Ontologies of Linguistic Annotation: Survey and perspectives
Christian Chiarcos
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper announces the release of the Ontologies of Linguistic Annotation (OLiA). The OLiA ontologies represent a repository of annotation terminology for various linguistic phenomena on a great band-width of languages. This paper summarizes the results of five years of research, it describes recent developments and directions for further research.

This paper describes the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (OKFN). The OWLG is an initiative concerned with linguistic data by scholars from diverse fields, including linguistics, NLP, and information science. The primary goal of the working group is to promote the idea of open linguistic resources, to develop means for their representation and to encourage the exchange of ideas across different disciplines. This paper summarizes the progress of the working group, goals that have been identified, problems that we are going to address, and recent activities and ongoing developments. Here, we put particular emphasis on the development of a Linked Open Data (sub-)cloud of linguistic resources that is currently being pursued by several OWLG members.

pdf bib abs
A generic formalism to represent linguistic corpora in RDF and OWL/DL
Christian Chiarcos
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes POWLA, a generic formalism to represent linguistic corpora by means of RDF and OWL/DL. Unlike earlier approaches in this direction, POWLA is not tied to a specific selection of annotation layers, but rather, it is designed to support any kind of text-oriented annotation. POWLA inherits its generic character from the underlying data model PAULA (Dipper, 2005; Chiarcos et al., 2009) that is based on early sketches of the ISO TC37/SC4 Linguistic Annotation Framework (Ide and Romary, 2004). As opposed to existing standoff XML linearizations for such generic data models, it uses RDF as representation formalism and OWL/DL for validation. The paper discusses advantages of this approach, in particular with respect to interoperability and queriability, which are illustrated for the MASC corpus, an open multi-layer corpus of American English (Ide et al., 2008).

We present an approach for querying collections of heterogeneous linguistic corpora that are annotated on multiple layers using arbitrary XML-based markup languages. An OWL ontology provides a homogenising view on the conceptually different markup languages so that a common querying framework can be established using the method of ontology-based query expansion. In addition, we present a highly flexible web-based graphical interface that can be used to query corpora with regard to several different linguistic properties such as, for example, syntactic tree fragments. This interface can also be used for ontology-based querying of multiple corpora simultaneously.

pdf bib abs
Ontology-Based Interface Specifications for a NLP Pipeline Architecture
Ekaterina Buyko | Christian Chiarcos | Antonio Pareja Lora
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The high level of heterogeneity between linguistic annotations usually complicates the interoperability of processing modules within an NLP pipeline. In this paper, a framework for the interoperation of NLP components, based on a data-driven architecture, is presented. Here, ontologies of linguistic annotation are employed to provide a conceptual basis for the tagset-neutral processing of linguistic annotations. The framework proposed here is based on a set of structured OWL ontologies: a reference ontology, a set of annotation models which formalize different annotation schemes, and a declarative linking between these, specified separately. This modular architecture is particularly scalable and flexible as it allows for the integration of different reference ontologies of linguistic annotations in order to overcome the absence of a consensus for an ontology of linguistic terminology. Our proposal originates from three lines of research from different fields: research on annotation type systems in UIMA; the ontological architecture OLiA, originally developed for sustainable documentation and annotation-independent corpus browsing, and the ontologies of the OntoTag model, targeted towards the processing of linguistic annotations in Semantic Web applications. We describe how UIMA annotations can be backed up by ontological specifications of annotation schemes as in the OLiA model, and how these are linked to the OntoTag ontologies, which allow for further ontological processing.

Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.