Charles Lam

2024

pdf abs
Quantitative metrics to the CARS model in academic discourse in biology introductions
Charles Lam | Nonso Nnamoko
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

Writing research articles is crucial in any academic’s development and is thus an important component of the academic discourse. The Introduction section is often seen as a difficult task within the research article genre. This study presents two metrics of rhetorical moves in academic writing: step-n-grams and lengths of steps. While scholars agree that expert writers follow the general pattern described in the CARS model (Swales, 1990), this study complements previous studies with empirical quantitative data that highlight how writers progress from one rhetorical function to another in practice, based on 50 recent papers by expert writers. The discussion shows the significance of the results in relation to writing instructors and data-driven learning.

pdf abs
Multi-Tiered Cantonese Word Segmentation
Charles Lam | Chaak-ming Lau | Jackson L. Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of “word” is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of “word” in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.

2022

pdf abs
PyCantonese: Cantonese Linguistics and NLP in Python
Jackson Lee | Litong Chen | Charles Lam | Chaak Ming Lau | Tsz-Him Tsui
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. After the library design, implementation, corpus data format, and key datasets included are introduced, the paper provides an overview of the currently implemented functionality: stop words, handling Jyutping romanization, word segmentation, part-of-speech tagging, and parsing Cantonese text.

Charles Lam

2024

2022

2020

2014

2013

Co-authors

Venues