James Kirby
2026
The Chulalongkorn Corpus of Spoken Thai (CCOST)
Pittayawat Pittayaporn | Cathryn Yang | Sujinat Jitwiriyanont | James Kirby
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The Chulalongkorn Corpus of Spoken Thai (CCOST) is a phonetically annotated corpus of Standard Thai. The corpus comprises approximately 7 hours of interview-style spontaneous speech from 49 speakers (19 male, 30 female) ranging in age from 18 to 83 years old. Speakers represent diverse regional backgrounds across Thailand but were instructed to speak in Standard Thai. Each speaker also read a 206-item monosyllabic word list twice and a set of 25 sentences three times. The annotation pipeline combines automatic speech recognition (ASR) and forced alignment using CLARIN-D’s OCTRA and Munich Automatic Segmentation System (MAUS) tools with manual correction by phonetically trained native Thai speakers. Transcriptions include orthographic, word-level, syllable-level, and phone-level annotations including toneme labels. The corpus serves as a resource in the sociophonetic investigation of segmental and tonal variation in spontaneous and controlled speech, enabling examination of individual characteristics as well as group differences across age groups, genders, and regional backgrounds. Hand-corrected annotations will additionally serve to improve forced alignment accuracy for Standard Thai.
2021
Incorporating tone in the calculation of phonotactic probability
James Kirby
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
This paper investigates how the ordering of tone relative to the segmental string influences the calculation of phonotactic probability. Trigram and recurrent neural network models were trained on syllable lexicons of four Asian syllable-tone languages (Mandarin, Thai, Vietnamese, and Cantonese) in which tone was treated as a segment occurring in different positions in the string. For trigram models, the optimal permutation interacted with language, while neural network models were relatively unaffected by tone position in all languages. In addition to providing a baseline for future evaluation, these results suggest that phonotactic probability is robust to choices of how tone is ordered with respect to other elements in the syllable.
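The idea of treating tone as a segment whose position in the string can vary may be easier to see in a small sketch. The code below is illustrative only and is not the paper's implementation: the toy lexicon, the tone labels (T1, T3, T4), the function names, and the use of add-one smoothing are all assumptions made for the example. It builds a trigram model over syllables with the tone symbol placed either first or last and scores a syllable under each ordering.

```python
import math
from collections import Counter

def place_tone(segments, tone, position):
    """Return the syllable as a symbol list with the tone symbol inserted.

    `position` is "initial" (before all segments) or "final" (after them).
    """
    if position == "initial":
        return [tone] + list(segments)
    if position == "final":
        return list(segments) + [tone]
    raise ValueError(f"unknown tone position: {position}")

def train_trigram(lexicon):
    """Count trigrams and their two-symbol contexts over padded syllables."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for syllable in lexicon:
        padded = ["<s>", "<s>"] + list(syllable) + ["</s>"]
        vocab.update(padded[2:])  # predictable symbols, including </s>
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            contexts[tuple(padded[i - 2:i])] += 1
    return trigrams, contexts, vocab

def log_prob(syllable, model):
    """Add-one smoothed trigram log-probability of one syllable."""
    trigrams, contexts, vocab = model
    padded = ["<s>", "<s>"] + list(syllable) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        tri = tuple(padded[i - 2:i + 1])
        ctx = tuple(padded[i - 2:i])
        total += math.log((trigrams[tri] + 1) / (contexts[ctx] + len(vocab)))
    return total

# Hypothetical toy lexicon: (segments, tone) pairs, not real corpus data.
TOY_LEXICON = [
    (["m", "a"], "T1"),
    (["m", "a"], "T3"),
    (["p", "a", "n"], "T1"),
    (["t", "i"], "T4"),
]

def build_model(position):
    """Train a trigram model with tone placed at the given position."""
    return train_trigram(place_tone(s, t, position) for s, t in TOY_LEXICON)

for position in ("initial", "final"):
    model = build_model(position)
    test_item = place_tone(["m", "a"], "T1", position)
    print(position, test_item, round(log_prob(test_item, model), 3))
```

Comparing the scores a held-out syllable receives under each ordering is one simple way to see how tone position can change the trigram contexts that the model conditions on.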
2018
Inducing a lexicon of sociolinguistic variables from code-mixed text
Philippa Shoemark | James Kirby | Sharon Goldwater
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin’) are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.
2017
Topic and audience effects on distinctively Scottish vocabulary usage in Twitter data
Philippa Shoemark | James Kirby | Sharon Goldwater
Proceedings of the Workshop on Stylistic Variation
Sociolinguistic research suggests that speakers modulate their language style in response to their audience. Similar effects have recently been claimed to occur in the informal written context of Twitter, with users choosing less region-specific and non-standard vocabulary when addressing larger audiences. However, these studies have not carefully controlled for the possible confound of topic: that is, tweets addressed to a broad audience might also tend towards topics that engender a more formal style. In addition, it is not clear to what extent previous results generalize to different samples of users. Using mixed-effects models, we show that audience and topic have independent effects on the rate of distinctively Scottish usage in two demographically distinct Twitter user samples. However, not all effects are consistent between the two groups, underscoring the importance of replicating studies on distinct user samples before drawing strong conclusions from social media data.