Abstract
This paper describes the GLAUx project (“the Greek Language Automated”), an ongoing effort to develop a large long-term diachronic corpus of Greek, covering sixteen centuries of literary and non-literary material annotated with NLP methods. After providing an overview of related corpus projects and discussing the general architecture of the corpus, it zooms in on a number of larger methodological issues in the design of historical corpora. These include the encoding of textual variants, handling extralinguistic variation and annotating linguistic ambiguity. Finally, the long- and short-term perspectives of this project are discussed.
- Anthology ID:
- 2021.lchange-1.6
- Volume:
- Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Nina Tahmasebi, Adam Jatowt, Yang Xu, Simon Hengchen, Syrielle Montariol, Haim Dubossarsky
- Venue:
- LChange
- Publisher:
- Association for Computational Linguistics
- Pages:
- 39–50
- URL:
- https://aclanthology.org/2021.lchange-1.6
- DOI:
- 10.18653/v1/2021.lchange-1.6
- Cite (ACL):
- Alek Keersmaekers. 2021. The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek. In Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, pages 39–50, Online. Association for Computational Linguistics.
- Cite (Informal):
- The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek (Keersmaekers, LChange 2021)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2021.lchange-1.6.pdf
- Data:
- Universal Dependencies