Modelling and Annotating Interlinear Glossed Text from 280 Different Endangered Languages as Linked Data with LIGT

Sebastian Nordhoff


Abstract
This paper reports on the harvesting, analysis, and enrichment of 20k documents from 4 different endangered language archives in 300 different low-resource languages. The documents are heterogeneous in their provenance (holding archive, language, geographical area, creator) and internal structure (annotation types, metalanguages), but they share the ELAN XML format. Typical annotations include sentence-level translations, morpheme segmentation, morpheme-level translations, and parts of speech. Because the ELAN format gives document creators a lot of freedom, the data set is very heterogeneous. We use regularities in the ELAN format to arrive at a common internal representation of sentences, words, and morphemes, with translations into one or more additional languages. Building upon the paradigm of Linguistic Linked Open Data (LLOD; Chiarcos, Nordhoff, et al. 2012), the document elements receive unique identifiers and are linked to other resources such as Glottolog for languages, Wikidata for semantic concepts, and the Leipzig Glossing Rules list for category abbreviations. We provide an RDF export in the LIGT format (Chiarcos & Ionov 2019), enabling uniform and interoperable access, with some semantic enrichments, to a resource type that was formerly disparate and difficult to access. Two use cases (semantic search and colexification) are presented to show the viability of the approach.
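As a rough illustration of the step from ELAN files to a common representation mentioned in the abstract, the following minimal sketch pairs sentences on one tier with free translations on a dependent tier. It is not the paper's pipeline: the tier names `utterance` and `translation`, the sample document, and the function `pair_sentences_with_translations` are assumptions made for this example, and real archive files vary widely in their tier layout.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ELAN-style document. The tier names and the text
# values are hypothetical placeholders, not data from the paper.
EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example vernacular sentence</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>example free translation</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def pair_sentences_with_translations(eaf_xml, source_tier, translation_tier):
    """Return {annotation_id: {"text": ..., "translation": ...}}.

    Assumes (for this sketch only) that sentences are time-aligned
    annotations and that translations point at them via ANNOTATION_REF.
    """
    root = ET.fromstring(eaf_xml)
    tiers = {tier.get("TIER_ID"): tier for tier in root.iter("TIER")}

    # Collect the sentences from the time-aligned source tier.
    sentences = {}
    for ann in tiers[source_tier].iter("ALIGNABLE_ANNOTATION"):
        sentences[ann.get("ANNOTATION_ID")] = {
            "text": ann.findtext("ANNOTATION_VALUE", default=""),
            "translation": None,
        }

    # Attach each translation to the sentence it references.
    for ref in tiers[translation_tier].iter("REF_ANNOTATION"):
        parent = sentences.get(ref.get("ANNOTATION_REF"))
        if parent is not None:
            parent["translation"] = ref.findtext("ANNOTATION_VALUE", default="")

    return sentences


result = pair_sentences_with_translations(EAF, "utterance", "translation")
```

Keying the result by annotation ID keeps a stable handle on each sentence, which is the kind of identifier a later linked-data export could build on.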
Anthology ID:
2020.law-1.9
Volume:
Proceedings of the 14th Linguistic Annotation Workshop
Month:
December
Year:
2020
Address:
Barcelona, Spain
Venue:
LAW
SIG:
SIGANN
Publisher:
Association for Computational Linguistics
Pages:
93–104
URL:
https://aclanthology.org/2020.law-1.9
Cite (ACL):
Sebastian Nordhoff. 2020. Modelling and Annotating Interlinear Glossed Text from 280 Different Endangered Languages as Linked Data with LIGT. In Proceedings of the 14th Linguistic Annotation Workshop, pages 93–104, Barcelona, Spain. Association for Computational Linguistics.
Cite (Informal):
Modelling and Annotating Interlinear Glossed Text from 280 Different Endangered Languages as Linked Data with LIGT (Nordhoff, LAW 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.law-1.9.pdf