Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak’wala Legacy Texts

Milind Agarwal, Antonios Anastasopoulos, Daisy Rosenblum


Abstract
Kwak’wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak’wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak’wala text, along with post-correction models, to produce a final high-quality transcription.
Anthology ID:
2025.computel-main.15
Volume:
Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month:
March
Year:
2025
Address:
Honolulu, Hawaii, USA
Editors:
Jordan Lachler, Godfred Agyapong, Antti Arppe, Sarah Moeller, Aditi Chaudhary, Shruti Rijhwani, Daisy Rosenblum
Venues:
ComputEL | WS
Publisher:
Association for Computational Linguistics
Pages:
133–138
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.15/
Cite (ACL):
Milind Agarwal, Antonios Anastasopoulos, and Daisy Rosenblum. 2025. Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak’wala Legacy Texts. In Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 133–138, Honolulu, Hawaii, USA. Association for Computational Linguistics.
Cite (Informal):
Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak’wala Legacy Texts (Agarwal et al., ComputEL 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.computel-main.15.pdf