Daisy Rosenblum

2023

Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Atticus Harrigan | Aditi Chaudhary | Shruti Rijhwani | Sarah Moeller | Antti Arppe | Alexis Palmer | Ryan Henke | Daisy Rosenblum
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

User-Centric Evaluation of OCR Systems for Kwak’wala
Shruti Rijhwani | Daisy Rosenblum | Michayla King | Antonios Anastasopoulos | Graham Neubig
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

2021

Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Shruti Rijhwani | Daisy Rosenblum | Antonios Anastasopoulos | Graham Neubig
Transactions of the Association for Computational Linguistics, Volume 9

Abstract: Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.

2020

This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals are central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including: the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family); the release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences, together with experiments with machine translation (MT) systems trained on this corpus; free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages); software for implementing text prediction and read-along audiobooks for Indigenous languages; and several other subprojects.
