Caitlin Richter

2020

This paper describes a new morphology resource created by Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.

2018

pdf
Low-resource Post Processing of Noisy OCR Output for Historical Corpus Digitisation
Caitlin Richter | Matthew Wickes | Deniz Beser | Mitch Marcus
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Evaluating Low-Level Speech Features Against Human Perceptual Data
Caitlin Richter | Naomi H. Feldman | Harini Salgado | Aren Jansen
Transactions of the Association for Computational Linguistics, Volume 5

We introduce a method for measuring the correspondence between low-level speech features and human perception, using a cognitive model of speech perception implemented directly on speech recordings. We evaluate two speaker normalization techniques using this method and find that in both cases, speech features that are normalized across speakers predict human data better than unnormalized speech features, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a new framework for evaluating low-level representations of speech on their match to human perception, and lays the groundwork for creating more ecologically valid models of speech perception.

Co-authors

Venues

lrec2
tacl1