Siyu Liang
2026
The Tonogenesis Continuum in Tibetan: A Computational Investigation
Siyu Liang | Zhaxi Zerong
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
Tonogenesis (the historical process by which segmental contrasts evolve into lexical tone) has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. By analyzing sensitivity to pitch flattening across a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, fully tonal Ü-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.
Capturing Epistemic Uncertainty in LLM-Based Soft Labeling
Yanru Jiang | Siyu Liang
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
In many human-annotated NLP tasks involving ambiguity or subjective judgment, annotator disagreement reflects epistemic uncertainty rather than noise. Soft labeling (SL), which represents annotations as probability distributions rather than majority-vote (MV) labels, preserves this uncertainty and can improve downstream performance. We extend this perspective to LLM-based annotation by formalizing LLM soft labeling as introducing controlled variation in model-generated annotations to approximate the latent variability underlying human annotations. We distinguish two sources of variation: model-induced (e.g., stochastic decoding and model ensembles) and human-approximated (e.g., persona prompting and human-calibrated in-context annotation). Using the Gab Hate and GoEmotions datasets, we show that SL training consistently outperforms MV training under stronger LLM-based annotation strategies. Model ensembles produce the most informative soft-label distributions, achieving the best human–LLM agreement and downstream classification performance. These findings suggest that scalable LLM-based annotation pipelines can model epistemic uncertainty through diverse model-level variation without explicitly simulating human attributes.
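The contrast between soft labels and majority-vote labels described in the abstract above can be illustrated with a minimal sketch. The function names, the class names, and the five example labels are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
from collections import Counter

def soft_label(annotations, classes):
    """Turn a list of per-run labels into a probability distribution over classes."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]

def majority_vote(annotations, classes):
    """Collapse the same labels to a single class (ties go to the earlier class)."""
    counts = Counter(annotations)
    return max(classes, key=lambda c: counts[c])

# Hypothetical class inventory and labels from five annotation runs,
# e.g. five models in an ensemble or five stochastic decoding samples.
classes = ["hate", "not_hate"]
runs = ["hate", "not_hate", "hate", "hate", "not_hate"]

print(soft_label(runs, classes))     # [0.6, 0.4]
print(majority_vote(runs, classes))  # hate
```

The soft label keeps the 60/40 split that the majority vote discards, which is the disagreement signal the abstract treats as epistemic uncertainty.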
Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Siyu Liang | Talant Mawkanuli | Gina-Anne Levow
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
Bottlenecks of In-Context Learning for Fieldwork ASR: A Case-study of Panãra
Siyu Liang | Myriam Lapierre | Gina-Anne Levow
Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9)
In-context learning (ICL) enables ASR models to transcribe unseen languages by conditioning on a handful of audio-transcript pairs at inference time, with no fine-tuning. This is appealing for language documentation, where transcribed data is scarce and recording conditions vary across sessions. We evaluate ICL on Panãra (Northern Jê, Brazil), a language with a complex practical orthography in which diacritics encode phonemic contrasts, across seven fieldwork recordings varying in speaker, narrative, and recording context. We find substantial within-language variation in transcription accuracy unexplained by any single recording-level factor, and show that diacritics are a systematic bottleneck with pronounced differences across diacritic types. An orthographic manipulation experiment further shows that how diacritics are represented in context transcriptions substantially affects model performance. These results highlight orthographic complexity and recording-level variation as key practical challenges for ICL-assisted fieldwork transcription.
2025
Tone in Perspective: A Computational Typological Analysis of Tone Function in ASR
Siyu Liang | Gina-Anne Levow
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This study investigates the impact of pitch flattening on automatic speech recognition (ASR) performance across tonal and non-tonal languages. Using vocoder-based signal processing techniques, we created pitch-flattened versions of speech recordings and compared ASR performance against original recordings. Results reveal that tonal languages experience substantially larger performance degradation than non-tonal languages. Analysis of tone confusion matrices shows systematic patterns of misidentification where contour tones collapse toward level tones when pitch information is removed. Calculation of tone’s functional load at syllable and word levels demonstrates that syllable-level functional load strongly predicts ASR vulnerability to pitch flattening, while word-level patterns reflect each language’s morphological structure. These findings illuminate the differential importance of pitch information across languages and suggest that ASR systems for languages with high syllable-level functional load require more robust pitch modeling.
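The comparison of ASR performance on original versus pitch-flattened recordings can be sketched as a difference in word error rate (WER). This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, the transcripts are invented tone-numbered syllables, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def flattening_degradation(reference, hyp_original, hyp_flattened):
    """Absolute WER increase when the ASR input has its pitch flattened."""
    return (word_error_rate(reference, hyp_flattened)
            - word_error_rate(reference, hyp_original))

# Hypothetical transcripts: syllables written with tone numbers.
reference = "ma1 ma2 ma3 ma4"
hyp_original = "ma1 ma2 ma3 ma4"   # ASR output on the original audio
hyp_flattened = "ma1 ma1 ma3 ma1"  # ASR output on pitch-flattened audio

print(flattening_degradation(reference, hyp_original, hyp_flattened))  # 0.5
```

A larger degradation value for a language would correspond to the heavier reliance on pitch that the abstract reports for tonal languages.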
Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages
Siyu Liang | Gina-Anne Levow
Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics
The development of Automatic Speech Recognition (ASR) has yielded impressive results, but its use in linguistic fieldwork remains limited. Recordings collected in fieldwork contexts present unique challenges, including spontaneous speech, environmental noise, and severely constrained datasets from under-documented languages. In this paper, we benchmark the performance of two fine-tuned multilingual ASR models, MMS and XLS-R, on five typologically diverse low-resource languages while controlling for training data duration. Our findings show that MMS is best suited when extremely small amounts of training data are available, whereas XLS-R reaches parity once training data exceeds one hour. We provide linguistically grounded analysis to offer insights towards practical guidelines for field linguists, highlighting reproducible ASR adaptation approaches to mitigate the transcription bottleneck in language documentation.
Beyond WER: Probing Whisper’s Sub‐token Decoder Across Diverse Language Resource Levels
Siyu Liang | Nicolas Ballier | Gina-Anne Levow | Richard Wright
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher resource languages benefit from higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage sometimes influenced by typology in our PCA and t-SNE analysis. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
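Two of the metrics named in the abstract above, predictive entropy and whether the correct token is top-ranked, can be sketched for a single decoding step. This is an illustrative sketch only: the function names and the candidate probabilities are hypothetical, and it assumes the candidate probabilities at a step have already been extracted and normalized, which is not necessarily how the paper's beam-search tracing works.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy in nats; zero-probability candidates are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def correct_is_top_ranked(candidates, reference):
    """True if the reference sub-token has the highest probability."""
    best = max(candidates, key=candidates.get)
    return best == reference

# Hypothetical candidate distributions at one decoding step.
confident = {"ing": 0.90, "in": 0.05, "ung": 0.05}
uncertain = {"in": 0.40, "ing": 0.35, "ung": 0.25}

for name, step in [("confident", confident), ("uncertain", uncertain)]:
    entropy = predictive_entropy(step.values())
    top = correct_is_top_ranked(step, reference="ing")
    print(f"{name}: entropy={entropy:.3f} nats, correct on top={top}")
```

In this toy case the flatter distribution has higher entropy and ranks the reference sub-token second, the kind of pattern the abstract associates with lower-resource languages.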