Searchable Language Documentation Corpora: DoReCo meets TEITOK

Maarten Janssen, Frank Seifart


Abstract
In this paper, we describe a newly created searchable interface for DoReCo, a database that contains spoken corpora from a worldwide sample of 53 mostly lesser-described languages, with audio, transcription, translation, and (for most languages) interlinear morpheme glosses. Until now, DoReCo data were available for download in a number of different formats via the DoReCo website and the Nakala repository, but were not directly accessible online. We created a graphical interface to view, listen to, and search these data online, providing direct and intuitive access for linguists and laypeople. The new interface uses the TEITOK corpus infrastructure to provide several visualizations of individual documents in DoReCo, as well as a search interface for detailed searches within individual languages. The use of TEITOK also makes the corpus usable with NLP pipelines, either to train NLP models on the data or to enrich the data with NLP models.
Anthology ID:
2025.fieldmatters-1.5
Volume:
Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Éric Le Ferrand, Elena Klyachko, Anna Postnikova, Tatiana Shavrina, Oleg Serikov, Ekaterina Voloshina, Ekaterina Vylomova
Venues:
FieldMatters | WS
Publisher:
Association for Computational Linguistics
Pages:
58–64
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.fieldmatters-1.5/
Cite (ACL):
Maarten Janssen and Frank Seifart. 2025. Searchable Language Documentation Corpora: DoReCo meets TEITOK. In Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics, pages 58–64, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Searchable Language Documentation Corpora: DoReCo meets TEITOK (Janssen & Seifart, FieldMatters 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.fieldmatters-1.5.pdf