The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek

Alek Keersmaekers

doi:10.18653/v1/2021.lchange-1.6

Abstract

This paper describes the GLAUx project (“the Greek Language Automated”), an ongoing effort to develop a large long-term diachronic corpus of Greek, covering sixteen centuries of literary and non-literary material annotated with NLP methods. After providing an overview of related corpus projects and discussing the general architecture of the corpus, it zooms in on a number of larger methodological issues in the design of historical corpora. These include the encoding of textual variants, handling extralinguistic variation and annotating linguistic ambiguity. Finally, the long- and short-term perspectives of this project are discussed.

Anthology ID: 2021.lchange-1.6
Volume: Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
Month: August
Year: 2021
Address: Online
Editors: Nina Tahmasebi, Adam Jatowt, Yang Xu, Simon Hengchen, Syrielle Montariol, Haim Dubossarsky
Venue: LChange
Publisher: Association for Computational Linguistics
Pages: 39–50
URL: https://aclanthology.org/2021.lchange-1.6
DOI: 10.18653/v1/2021.lchange-1.6
Cite (ACL): Alek Keersmaekers. 2021. The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek. In Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, pages 39–50, Online. Association for Computational Linguistics.
Cite (Informal): The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek (Keersmaekers, LChange 2021)
PDF: https://preview.aclanthology.org/naacl24-info/2021.lchange-1.6.pdf
Data: Universal Dependencies
