Charles Lam


2024

Multi-Tiered Cantonese Word Segmentation
Charles Lam | Chaak-ming Lau | Jackson L. Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Word segmentation for Chinese text data is essential for compiling corpora and for any other task where the notion of “word” is assumed, since Chinese orthography does not mark word boundaries by convention as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of “word” in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented according to this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.

Quantitative metrics to the CARS model in academic discourse in biology introductions
Charles Lam | Nonso Nnamoko
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

Writing research articles is crucial in any academic’s development and is thus an important component of academic discourse. Within the research article genre, the Introduction section is often seen as difficult to write. This study presents two metrics of rhetorical moves in academic writing: step-n-grams and lengths of steps. While scholars agree that expert writers follow the general pattern described in the CARS model (Swales, 1990), this study complements previous studies with empirical quantitative data that highlight how writers progress from one rhetorical function to another in practice, based on 50 recent papers by expert writers. The discussion shows the significance of the results for writing instructors and data-driven learning.
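
The two metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the step labels (`M1S1` and so on), the function names, and the choice to measure step length in sentences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def step_lengths(sentence_labels):
    """Collapse runs of identical labels into (label, run_length) pairs.

    Assumption: one label per sentence, so length is counted in sentences.
    """
    return [(label, sum(1 for _ in run)) for label, run in groupby(sentence_labels)]


def step_ngrams(steps, n):
    """Count contiguous n-grams over a sequence of step labels."""
    return Counter(tuple(steps[i:i + n]) for i in range(len(steps) - n + 1))


# Hypothetical sentence-level annotation of one Introduction section.
labels = ["M1S1", "M1S1", "M1S2", "M2S1", "M3S1", "M3S1", "M3S1"]

runs = step_lengths(labels)                    # how long each step lasts
step_sequence = [label for label, _ in runs]   # order of steps, repeats collapsed
bigrams = step_ngrams(step_sequence, 2)        # transitions between adjacent steps
```

Counting n-grams over the collapsed sequence, rather than over raw sentence labels, keeps the transition counts from being dominated by long steps that repeat the same label.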

2022

PyCantonese: Cantonese Linguistics and NLP in Python
Jackson Lee | Litong Chen | Charles Lam | Chaak Ming Lau | Tsz-Him Tsui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. After introducing the library design, implementation, corpus data format, and key datasets included, the paper provides an overview of the currently implemented functionality: stop words, handling Jyutping romanization, word segmentation, part-of-speech tagging, and parsing Cantonese text.
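
To give a sense of what "handling Jyutping romanization" can involve, here is a self-contained sketch that splits one Jyutping syllable into onset, nucleus, coda, and tone. It does not use PyCantonese and is not the library's implementation; the name `parse_syllable` and the regular expression are assumptions made for illustration. The pattern is deliberately loose: it does not check phonotactic validity and does not cover syllabic nasals such as `m4` or `ng5`.

```python
import re
from typing import NamedTuple, Optional


class Syllable(NamedTuple):
    onset: str
    nucleus: str
    coda: str
    tone: str


# Longer alternatives come first so that "ng", "gw", "aa", "yu" and similar
# digraphs are tried before their single-letter prefixes.
_SYLLABLE = re.compile(
    r"(?P<onset>ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou])"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)


def parse_syllable(text: str) -> Optional[Syllable]:
    """Return the parts of one Jyutping syllable, or None if it does not match."""
    match = _SYLLABLE.fullmatch(text)
    if match is None:
        return None
    return Syllable(
        match["onset"] or "",   # optional groups are None when absent
        match["nucleus"],
        match["coda"] or "",
        match["tone"],
    )


parsed = [parse_syllable(s) for s in ("gwong2", "dung1", "waa2")]
```

For real work, the library's own Jyutping utilities are the better choice, since a hand-written pattern like this one accepts strings that are not valid Cantonese syllables.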

2020

Forms and Meanings of Lexical Reduplications in Cantonese: a corpus study
Charles Lam
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2014

A Unified Analysis to Surpass Comparative and Experiential Aspect
Charles Lam
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

2013

Reduplication across Categories in Cantonese
Charles Lam
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)