Nathaniel Parkes


2025

Building downstream NLP applications on tokenization systems grounded in morphological segmentation has proven fruitful for certain morphologically rich languages. Yet indigenous and endangered languages, which tend to be highly polysynthetic and are therefore potential beneficiaries of this approach, pose additional difficulties because annotated data for morphological segmentation tasks is scarce. In this study, we develop morphological segmentation models for Hupa, a critically endangered Dene/Athabaskan language of North America. With a total of 595 word types, we seek to identify an optimal morphological segmentation model and illustrate how the tested models perform under different levels of training data limitation. We propose a simple method that casts morphological segmentation as a sequence binary classification task. While this approach does not outperform the established practice of multi-class classification, it outperforms neural alternatives. This work is intended as a starting point for future technological developments with Hupa that seek to leverage its morphological qualities, and we hope it can serve as a point of reference for work with other indigenous languages studied under similar constraints.
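
As an illustration of the binary framing mentioned above, the following minimal sketch shows one plausible reading: each character of a word receives a label of 1 if a morpheme boundary follows it and 0 otherwise. The function names `boundary_labels` and `apply_labels`, the hyphen separator, and the English toy word are assumptions made for illustration; they are not taken from the study's implementation or data.

```python
from typing import List


def boundary_labels(segmented: str, sep: str = "-") -> List[int]:
    """Turn a gold segmentation into one 0/1 label per character.

    A label of 1 means a morpheme boundary follows that character.
    (Hypothetical encoding; the study may define its labels differently.)
    """
    morphemes = [m for m in segmented.split(sep) if m]
    labels: List[int] = []
    for i, morpheme in enumerate(morphemes):
        labels.extend([0] * len(morpheme))
        # Mark the last character of every non-final morpheme.
        if i < len(morphemes) - 1:
            labels[-1] = 1
    return labels


def apply_labels(word: str, labels: List[int], sep: str = "-") -> str:
    """Rebuild a segmented string from a word and its predicted labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    pieces: List[str] = []
    for i, (char, label) in enumerate(zip(word, labels)):
        pieces.append(char)
        # A boundary after the final character is meaningless, so skip it.
        if label == 1 and i < len(word) - 1:
            pieces.append(sep)
    return "".join(pieces)


# Toy English example, not Hupa data.
gold = "un-break-able"
labels = boundary_labels(gold)
restored = apply_labels(gold.replace("-", ""), labels)
```

Under this encoding a classifier only has to make a yes/no decision per position, in contrast to a multi-class scheme that assigns each character one of several positional tags.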