2024
Multi-Tiered Cantonese Word Segmentation
Charles Lam | Chaak-ming Lau | Jackson L. Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of “word” is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of “word” in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.
2023
The SIGMORPHON 2022 Shared Task on Cross-lingual and Low-Resource Grapheme-to-Phoneme Conversion
Arya D. McCarthy | Jackson L. Lee | Alexandra DeLucia | Travis Bartley | Milind Agarwal | Lucas F.E. Ashby | Luca Del Signore | Cameron Gibson | Reuben Raff | Winston Wu
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The third iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements over the previous year’s task (Ashby et al., 2021), including additional languages, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Three teams submitted a total of fifteen systems, at best achieving relative reductions in word error rate of 14% in the cross-lingual subtask and 14% in the very-low-resource subtask. The generally consistent result is that cross-lingual transfer substantially helps grapheme-to-phoneme modeling, but not to the same degree as in-language examples.
2022
PyCantonese: Cantonese Linguistics and NLP in Python
Jackson L. Lee | Litong Chen | Charles Lam | Chaak Ming Lau | Tsz-Him Tsui
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. After introducing the library's design, implementation, corpus data format, and key included datasets, the paper provides an overview of the currently implemented functionality: stop words, handling Jyutping romanization, word segmentation, part-of-speech tagging, and parsing Cantonese text.
2020
Massively Multilingual Pronunciation Modeling with WikiPron
Jackson L. Lee | Lucas F.E. Ashby | M. Elizabeth Garza | Yeonju Lee-Sikka | Sean Miller | Alan Wong | Arya D. McCarthy | Kyle Gorman
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
2016
Linguistica 5: Unsupervised Learning of Linguistic Structure
Jackson L. Lee | John A. Goldsmith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations
This paper introduces Linguistica 5, a software package for unsupervised learning of linguistic structure. It is a descendant of Goldsmith's (2001, 2006) Linguistica. Open-source and written in Python, the new Linguistica 5 is both a graphical user interface application and a Python library. While Linguistica 5 inherits its predecessors' strength in unsupervised learning of natural language morphology, it incorporates significant improvements in multiple ways. Notable new features include tools for data visualization, as well as straightforward extension of its components and embedding in other programs.
2015
Morphological Paradigms: Computational Structure and Unsupervised Learning
Jackson L. Lee
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
This thesis explores the computational structure of morphological paradigms from the perspective of unsupervised learning. Three topics are studied: (i) stem identification, (ii) paradigmatic similarity, and (iii) paradigm induction. The three topics progressively widen the scope of the data in question. The first and second topics explore structure when morphological paradigms are given, first within a paradigm and then across paradigms. The third topic asks where morphological paradigms come from in the first place, and explores strategies of paradigm induction from child-directed speech. This research is of interest to linguists and natural language processing researchers, for both theoretical questions and applied areas.