Yangyang Chen


2026

Publicly available spoken language identification (LID) systems provide sparse and inconsistent coverage of indigenous languages of the Americas and languages of the Pacific Islands. On HuggingFace, only the largest variant of Meta’s MMS-LID family covers Central Alaskan Yup’ik, and only three MMS-LID variants cover Samoan; Whisper and VoxLingua107-based models cover neither language, even though they include other Polynesian languages. We describe an ongoing effort to build a coarse-labeled LID dataset for Yup’ik and Samoan from US public broadcast archives, benchmark publicly available LID systems on it, and train a simple MLP classifier on frozen wav2vec 2.0 representations as a prototype. We report preliminary corpus statistics, off-the-shelf model performance, and prototype results. Guided by the distinctive phonological typology of the target languages, we outline a phonologically informed fine-tuning direction as future work.