John S. Y. Lee

Also published as: John Lee

2022

We present a corpus of simulated counselling sessions consisting of speech- and text-based dialogs in Cantonese. The corpus comprises 152K Chinese characters. It labels the dialog act of both client and counsellor utterances, segments each dialog into stages, and identifies the forward and backward links in the dialog. We analyze the distribution of client and counsellor communicative intentions in the various stages, and discuss significant patterns of the dialog flow.

Automatic Nominalization of Clauses through Textual Entailment
John S. Y. Lee | Ho Hung Lim | Carol Webster | Anton Melser
Proceedings of the 29th International Conference on Computational Linguistics

Nominalization re-writes a clause as a noun phrase. It requires the transformation of the head verb of the clause into a deverbal noun, and the verb’s modifiers into nominal modifiers. Past research has focused on the selection of deverbal nouns, but has paid less attention to predicting the word positions and word forms for the nominal modifiers. We propose the use of a textual entailment model for clause nominalization. We obtained the best performance by fine-tuning a textual entailment model on this task, outperforming a number of unsupervised approaches using language model scores from a state-of-the-art neural language model.

2021

Restatement and Question Generation for Counsellor Chatbot
John Lee | Baikun Liang | Haley Fong
Proceedings of the 1st Workshop on NLP for Positive Impact

Amidst rising mental health needs in society, virtual agents are increasingly deployed in counselling. In order to give pertinent advice, counsellors must first gain an understanding of the issues at hand by eliciting sharing from the counsellee. It is thus important for the counsellor chatbot to encourage the user to open up and talk. One way to sustain the conversation flow is to acknowledge the counsellee’s key points by restating them, or probing them further with questions. This paper applies models from two closely related NLP tasks — summarization and question generation — to restatement and question generation in the counselling context. We conducted experiments on a manually annotated dataset of Cantonese post-reply pairs on topics related to loneliness, academic anxiety and test anxiety. We obtained the best performance in both restatement and question generation by fine-tuning BertSum, a state-of-the-art summarization model, with the in-domain manual dataset augmented with a large-scale, automatically mined open-domain dataset.

Character Set Construction for Chinese Language Learning
Chak Yan Yeung | John Lee
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

To promote efficient learning of Chinese characters, pedagogical materials may present not only a single character, but a set of characters that are related in meaning and in written form. This paper investigates automatic construction of these character sets. The proposed model represents a character as averaged word vectors of common words containing the character. It then identifies sets of characters with high semantic similarity through clustering. Human evaluation shows that this representation outperforms direct use of character embeddings, and that the resulting character sets capture distinct semantic ranges.
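
The character representation described in this abstract can be illustrated with a minimal sketch. The toy two-dimensional word vectors, the example words, and the function names `char_vector` and `cosine` are assumptions made for illustration and are not taken from the paper. The sketch stops at pairwise similarity and does not reproduce the clustering step.

```python
from math import sqrt

# Toy 2-d word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "河流": [0.9, 0.1],
    "河水": [0.8, 0.2],
    "湖水": [0.7, 0.3],
    "湖泊": [0.9, 0.2],
    "跑步": [0.1, 0.9],
    "跑道": [0.2, 0.8],
}


def char_vector(char, word_vectors):
    """Represent a character as the average vector of the words containing it."""
    vecs = [vec for word, vec in word_vectors.items() if char in word]
    if not vecs:
        raise KeyError(f"no word contains {char!r}")
    dim = len(vecs[0])
    return [sum(vec[i] for vec in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


river = char_vector("河", WORD_VECTORS)
lake = char_vector("湖", WORD_VECTORS)
run = char_vector("跑", WORD_VECTORS)

# With these toy vectors, the two water-related characters are closer
# to each other than either is to the character for "run".
print(cosine(river, lake) > cosine(river, run))
```

A clustering algorithm applied to such character vectors would then group characters with high pairwise similarity into sets; the choice of algorithm is left open here.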

Text Retrieval for Language Learners: Graded Vocabulary vs. Open Learner Model
John Lee | Chak Yan Yeung
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

A text retrieval system for language learning returns reading materials at the appropriate difficulty level for the user. The system typically maintains a learner model on the user’s vocabulary knowledge, and identifies texts that best fit the model. As the user’s language proficiency increases, model updates are necessary to retrieve texts with the corresponding lexical complexity. We investigate an open learner model that allows user modification of its content, and evaluate its effectiveness with respect to the amount of user update effort. We compare this model with the graded approach, in which the system returns texts at the optimal grade. When the user makes at least half of the expected updates to the open learner model, simulation results show that it outperforms the graded approach in retrieving texts that fit user preference for new-word density.

Discourse Tree Structure and Dependency Distance in EFL Writing
Jingting Yuan | Qiuhan Lin | John S. Y. Lee
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

Unsupervised Adverbial Identification in Modern Chinese Literature
Wenxiu Xie | John Lee | Fangqiong Zhan | Xiao Han | Chi-Yin Chow
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In many languages, adverbials can be derived from words of various parts-of-speech. In Chinese, the derivation may be marked either with the standard adverbial marker DI, or the non-standard marker DE. Since DE also serves double duty as the attributive marker, accurate identification of adverbials requires disambiguation of its syntactic role. As parsers are trained predominantly on texts using the standard adverbial marker DI, they often fail to recognize adverbials suffixed with the non-standard DE. This paper addresses this problem with an unsupervised, rule-based approach for adverbial identification that utilizes dependency tree patterns. Experimental results show that this approach outperforms a masked language model baseline. We apply this approach to analyze standard and non-standard adverbial marker usage in modern Chinese literature.

Paraphrasing Compound Nominalizations
John Lee | Ho Hung Lim | Carol Webster
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A nominalization uses a deverbal noun to describe an event associated with its underlying verb. Commonly found in academic and formal texts, nominalizations can be difficult to interpret because of ambiguous semantic relations between the deverbal noun and its arguments. Our goal is to interpret nominalizations by generating clausal paraphrases. We address compound nominalizations with both nominal and adjectival modifiers, as well as prepositional phrases. In evaluations on a number of unsupervised methods, we obtained the strongest performance by using a pre-trained contextualized language model to re-rank paraphrase candidates identified by a textual entailment model.

2020

Using Verb Frames for Text Difficulty Assessment
John Lee | Meichun Liu | Tianyuan Cai
Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet

This paper presents the first investigation into using semantic frames to assess text difficulty. Based on Mandarin VerbNet, a verbal semantic database that adopts a frame-based approach, we examine usage patterns of ten verbs in a corpus of graded Chinese texts. We identify a number of characteristics in texts at advanced grades: more frequent use of non-core frame elements; more frequent omission of some core frame elements; increased preference for noun phrases rather than clauses as verb arguments; and more frequent metaphoric usage. These characteristics can potentially be useful for automatic prediction of text readability.

A Counselling Corpus in Cantonese
John Lee | Tianyuan Cai | Wenxiu Xie | Lam Xing
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Virtual agents are increasingly used for delivering health information in general, and mental health assistance in particular. This paper presents a corpus designed for training a virtual counsellor in Cantonese, a variety of Chinese. The corpus consists of a domain-independent subcorpus that supports small talk for rapport building with users, and a domain-specific subcorpus that provides material for a particular area of counselling. The former consists of ELIZA style responses, chitchat expressions, and a dataset of general dialog, all of which are reusable across counselling domains. The latter consists of example user inputs and appropriate chatbot replies relevant to the specific domain. In a case study, we created a chatbot with a domain-specific subcorpus that addressed 25 issues in test anxiety, with 436 inputs solicited from native speakers of Cantonese and 150 chatbot replies harvested from mental health websites. Preliminary evaluations show that Word Mover’s Distance achieved 56% accuracy in identifying the issue in user input, outperforming a number of baselines.

A Dataset for Investigating the Impact of Feedback on Student Revision Outcome
Ildiko Pilan | John Lee | Chak Yan Yeung | Jonathan Webster
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present an annotation scheme and a dataset of teacher feedback provided for texts written by non-native speakers of English. The dataset consists of student-written sentences in their original and revised versions, with teacher feedback provided for the errors. Feedback appears both in the form of open-ended comments and error category tags. We focus on a specific error type, namely linking adverbial (e.g. however, moreover) errors. The dataset has been annotated for two aspects: (i) revision outcome, establishing whether the re-written student sentence was correct, and (ii) directness, indicating whether teachers explicitly provided the correction in their feedback. This dataset allows for studies of the characteristics of teacher feedback and how these influence students’ revision outcomes. We describe the data preparation process and present initial statistical investigations regarding the effect of different feedback characteristics on revision outcome. These show that open-ended comments and mitigating expressions appear in a higher proportion of successful revisions than unsuccessful ones, while directness and metalinguistic terms have no effect. Given that the use of this type of data is relatively unexplored in natural language processing (NLP) applications, we also report some observations and challenges when working with feedback data.

Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System
Seid Muhie Yimam | Gopalakrishnan Venkatesh | John Lee | Chris Biemann
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the first approach to automatically building resources for academic writing. The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing. On top of existing academic resources, such as the Corpus of Contemporary American English (COCA) Academic Word List, the New Academic Word List, and the Academic Collocation List, we also explore how to dynamically build such resources that would be used to automatically identify informal or non-academic words or phrases. The resources are compiled using different generic approaches that can be extended for different domains and languages. We describe the evaluation of resources with a system implementation. The system consists of informal word identification (IWI), academic candidate paraphrase generation, and paraphrase ranking components. To generate candidates and rank them in context, we have used the PPDB and WordNet paraphrase resources. We use the Concepts in Context (CoInCo) “All-Words” lexical substitution dataset both for the informal word identification and paraphrase generation experiments. Our informal word identification component achieves an F-1 score of 82%, significantly outperforming a stratified classifier baseline. The main contribution of this work is a domain-independent methodology to build targeted resources for writing aids.

Automatic Assistance for Academic Word Usage
Dariush Saberi | John Lee | Jonathan James Webster
Proceedings of the 28th International Conference on Computational Linguistics

This paper describes a writing assistance system that helps students improve their academic writing. Given an input text, the system suggests lexical substitutions that aim to incorporate more academic vocabulary. The substitution candidates are drawn from an academic word list and ranked by a masked language model. Experimental results show that lexical formality analysis can improve the quality of the suggestions, in comparison to a baseline that relies on the masked language model only.

Using Bilingual Patents for Translation Training
John Lee | Benjamin Tsou | Tianyuan Cai
Proceedings of the 28th International Conference on Computational Linguistics

While bilingual corpora have been instrumental for machine translation, their utility for training translators has been less explored. We investigate the use of bilingual corpora as pedagogical tools for translation in the technical domain. In a user study, novice translators revised Chinese translations of English patents through bilingual concordancing. Results show that concordancing with an in-domain bilingual corpus can yield greater improvement in translation quality of technical terms than a general-domain bilingual corpus.

Bilingual Multi-word Expressions, Multiple-correspondence, and their cultivation from parallel patents: The Chinese-English case
Benjamin K. Tsou | Ka Po Chow | John Lee | Ka-Fai Yip | Yaxuan Ji | Kevin Wu
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2019

Noun Generation for Nominalization in Academic Writing
Dariush Saberi | John Lee
Proceedings of the 4th Workshop on Computational Creativity in Language Generation

Difficulty-aware Distractor Generation for Gap-Fill Items
Chak Yan Yeung | John Lee | Benjamin Tsou
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Personalized Substitution Ranking for Lexical Simplification
John Lee | Chak Yan Yeung
Proceedings of the 12th International Conference on Natural Language Generation

A lexical simplification (LS) system substitutes difficult words in a text with simpler ones to make it easier for the user to understand. In the typical LS pipeline, the Substitution Ranking step determines the best substitution out of a set of candidates. Most current systems do not consider the user’s vocabulary proficiency, and always aim for the simplest candidate. This approach may overlook less-simple candidates that the user can understand, and that are semantically closer to the original word. We propose a personalized approach for Substitution Ranking to identify the candidate that is the closest synonym and is non-complex for the user. In experiments on learners of English at different proficiency levels, we show that this approach enhances the semantic faithfulness of the output, at the cost of a relatively small increase in the number of complex words.

2018

Personalizing Lexical Simplification
John Lee | Chak Yan Yeung
Proceedings of the 27th International Conference on Computational Linguistics

A lexical simplification (LS) system aims to substitute complex words with simple words in a text, while preserving its meaning and grammaticality. Despite individual users’ differences in vocabulary knowledge, current systems do not consider these variations; rather, they are trained to find one optimal substitution or ranked list of substitutions for all users. We evaluate the performance of a state-of-the-art LS system on individual learners of English at different proficiency levels, and measure the benefits of using complex word identification (CWI) models to personalize the system. Experimental results show that even a simple personalized CWI model, based on graded vocabulary lists, can help the system avoid some unnecessary simplifications and produce more readable output.

Personalized Text Retrieval for Learners of Chinese as a Foreign Language
Chak Yan Yeung | John Lee
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes a personalized text retrieval algorithm that helps language learners select the most suitable reading material in terms of vocabulary complexity. The user first rates their knowledge of a small set of words, chosen by a graph-based active learning model. The system trains a complex word identification model on this set, and then applies the model to find texts that contain the desired proportion of new, challenging, and familiar vocabulary. In an evaluation on learners of Chinese as a foreign language, we show that this algorithm is effective in identifying simpler texts for low-proficiency learners, and more challenging ones for high-proficiency learners.

Register-sensitive Translation: a Case Study of Mandarin and Cantonese (Non-archival Extended Abstract)
Tak-sum Wong | John Lee
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Assisted Nominalization for Academic English Writing
John Lee | Dariush Saberi | Marvin Lam | Jonathan Webster
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)

L1-L2 Parallel Treebank of Learner Chinese: Overused and Underused Syntactic Structures
Keying Li | John Lee
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Identifying Speakers and Listeners of Quoted Speech in Literary Works
Chak Yan Yeung | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We present the first study that evaluates both speaker and listener identification for direct speech in literary texts. Our approach consists of two steps: identification of speakers and listeners near the quotes, and dialogue chain segmentation. Evaluation results show that this approach outperforms a rule-based approach that is state-of-the-art on a corpus of literary texts.

Lexical Simplification with the Deep Structured Similarity Model
Lis Pereira | Xiaodong Liu | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification. Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the state-of-the-art on two standard datasets used for the task.

We present a web-based interface that automatically assesses reading difficulty of Chinese texts. The system performs word segmentation, part-of-speech tagging and dependency parsing on the input text, and then determines the difficulty levels of the vocabulary items and grammatical constructions in the text. Furthermore, the system highlights the words and phrases that must be simplified or re-written in order to conform to the user-specified target difficulty level. Evaluation results show that the system accurately identifies the vocabulary level of 89.9% of the words, and detects grammar points at 0.79 precision and 0.83 recall.

Towards Universal Dependencies for Learner Chinese
John Lee | Herman Leung | Keying Li
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

Distractor Generation for Chinese Fill-in-the-blank Items
Shu Jiang | John Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper reports the first study on automatic generation of distractors for fill-in-the-blank items for learning Chinese vocabulary. We investigate the quality of distractors generated by a number of criteria, including part-of-speech, difficulty level, spelling, word co-occurrence and semantic similarity. Evaluations show that a semantic similarity measure, based on the word2vec model, yields distractors that are significantly more plausible than those generated by baseline methods.

Carrier Sentence Selection for Fill-in-the-blank Items
Shu Jiang | John Lee
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Fill-in-the-blank items are a common form of exercise in computer-assisted language learning systems. To automatically generate an effective item, the system must be able to select a high-quality carrier sentence that illustrates the usage of the target word. Previous approaches for carrier sentence selection have considered sentence length, vocabulary difficulty, the position of the target word and the presence of finite verbs. This paper investigates the utility of word co-occurrence statistics and lexical similarity as selection criteria. In an evaluation on generating fill-in-the-blank items for learning Chinese as a foreign language, we show that these two criteria can improve carrier sentence quality.

L1-L2 Parallel Dependency Treebank as Learner Corpus
John Lee | Keying Li | Herman Leung
Proceedings of the 15th International Conference on Parsing Technologies

This opinion paper proposes the use of a parallel treebank as a learner corpus. We show how an L1-L2 parallel treebank (i.e., parse trees of non-native sentences, aligned to the parse trees of their target hypotheses) can facilitate retrieval of sentences with specific learner errors. We argue for its benefits, in terms of corpus re-use and interoperability, over a conventional learner corpus annotated with error tags. As a proof of concept, we conduct a case study on word-order errors made by learners of Chinese as a foreign language. We report precision and recall in retrieving a range of word-order error categories from L1-L2 tree pairs annotated in the Universal Dependencies framework.

Splitting Complex English Sentences
John Lee | J. Buddhika K. Pathirage Don
Proceedings of the 15th International Conference on Parsing Technologies

This paper applies parsing technology to the task of syntactic simplification of English sentences, focusing on the identification of text spans that can be removed from a complex sentence. We report the most comprehensive evaluation to date on this task, using a dataset of sentences that exhibit simplification based on coordination, subordination, punctuation/parataxis, adjectival clauses, participial phrases, and appositive phrases. We train a decision tree with features derived from text span length, POS tags and dependency relations, and show that it significantly outperforms a parser-only baseline.

Quantitative Comparative Syntax on the Cantonese-Mandarin Parallel Dependency Treebank
Tak-sum Wong | Kim Gerdes | Herman Leung | John Lee
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

This article proposes a Universal Dependency Annotation Scheme for Mandarin Chinese, including POS tags and dependency analysis. We identify idiosyncrasies of Mandarin Chinese that are difficult to fit into the current schema, which has mainly been based on the descriptions of various Indo-European languages. We discuss differences between our scheme and those of the Stanford Chinese Dependencies and the Chinese Dependency Treebank.

A CALL System for Learning Preposition Usage
John Lee | Donald Sturgeon | Mengqi Luo
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Personalized Exercises for Preposition Learning
John Lee | Mengqi Luo
Proceedings of ACL-2016 System Demonstrations

An Annotated Corpus of Direct Speech
John Lee | Chak Yan Yeung
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We propose a scheme for annotating direct speech in literary texts, based on the Text Encoding Initiative (TEI) and the coreference annotation guidelines from the Message Understanding Conference (MUC). The scheme encodes the speakers and listeners of utterances in a text, as well as the quotative verbs that report the utterances. We measure inter-annotator agreement on this annotation task. We then present statistics on a manually annotated corpus that consists of books from the New Testament. Finally, we visualize the corpus as a conversational network.

A Dependency Treebank of the Chinese Buddhist Canon
Tak-sum Wong | John Lee
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a dependency treebank of the Chinese Buddhist Canon, which contains 1,514 texts with about 50 million Chinese characters. The treebank was created by an automatic parser trained on a smaller treebank, containing four manually annotated sutras (Lee and Kong, 2014). We report results on word segmentation, part-of-speech tagging and dependency parsing, and discuss challenges posed by the processing of medieval Chinese. In a case study, we exploit the treebank to examine verbs frequently associated with Buddha, and to analyze usage patterns of quotative verbs in direct speech. Our results suggest that certain quotative verbs imply status differences between the speaker and the listener.

A Reading Environment for Learners of Chinese as a Foreign Language
John Lee | Chun Yin Lam | Shu Jiang
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a mobile app that provides a reading environment for learners of Chinese as a foreign language. The app includes a text database that offers over 500K articles from Chinese Wikipedia. These articles have been word-segmented; each word is linked to its entry in a Chinese-English dictionary, and to automatically-generated review exercises. The app estimates the reading proficiency of the user based on a “to-learn” list of vocabulary items. It automatically constructs and maintains this list by tracking the user’s dictionary lookup behavior and performance in review exercises. When a user searches for articles to read, search results are filtered such that the proportion of unknown words does not exceed a user-specified threshold.
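
The search filter described in this abstract can be sketched as follows. This is a minimal illustration, assuming that a word counts as unknown when it appears on the user's “to-learn” list; the function names `unknown_ratio` and `filter_articles` and the sample articles are hypothetical and not taken from the app.

```python
def unknown_ratio(words, to_learn):
    """Proportion of tokens in a segmented article that are on the to-learn list."""
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word in to_learn)
    return unknown / len(words)


def filter_articles(articles, to_learn, threshold):
    """Keep the titles of articles whose unknown-word proportion is within the threshold."""
    return [
        title
        for title, words in articles.items()
        if unknown_ratio(words, to_learn) <= threshold
    ]


# Hypothetical word-segmented articles and to-learn list.
articles = {
    "daily life": ["我", "喜歡", "讀書"],
    "physics": ["量子", "糾纏", "現象"],
}
to_learn = {"量子", "糾纏"}

# Only the first article stays under a 50% unknown-word threshold.
print(filter_articles(articles, to_learn, threshold=0.5))
```

Counting tokens rather than distinct word types is one possible design choice; the abstract does not specify which is used.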

A Customizable Editor for Text Simplification
John Lee | Wenlong Zhao | Wenxiu Xie
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a browser-based editor for simplifying English text. Given an input sentence, the editor performs both syntactic and lexical simplification. It splits a complex sentence into shorter ones, and suggests word substitutions in drop-down lists. The user can choose the best substitution from the list, undo any inappropriate splitting, and further edit the sentence as necessary. A significant novelty is that the system accepts a customized vocabulary list for a target reader population. It identifies all words in the text that do not belong to the list, and attempts to substitute them with words from the list, thus producing a text tailored for the targeted readers.

We have recently converted a dependency treebank, consisting of ancient Greek and Latin texts, from one annotation scheme to another that was independently designed. This paper makes two observations about this conversion process. First, we show that, despite significant surface differences between the two treebanks, a number of straightforward transformation rules yield a substantial level of compatibility between them, giving evidence for their sound design and high quality of annotation. Second, we analyze some linguistic annotations that require further disambiguation, proposing some simple yet effective machine learning methods.

2009

Human Evaluation of Article and Noun Number Usage: Influences of Context and Construction Variability
John Lee | Joel Tetreault | Martin Chodorow
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Correcting Misuse of Verb Forms
John Lee | Stephanie Seneff
Proceedings of ACL-08: HLT

A Nearest-Neighbor Approach to the Automatic Analysis of Ancient Greek Morphology
John Lee
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

Detection of Non-Native Sentences Using Machine-Translated Training Data
John Lee | Ming Zhou | Xiaohua Liu
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

A Computational Model of Text Reuse in Ancient Literary Texts
John Lee
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Combining Linguistic and Statistical Methods for Bi-directional English Chinese Translation in the Flight Domain
Stephanie Seneff | Chao Wang | John Lee
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

In this paper, we discuss techniques to combine an interlingua translation framework with phrase-based statistical methods, for translation from Chinese into English. Our goal is to achieve high-quality translation, suitable for use in language tutoring applications. We explore these ideas in the context of a flight domain, for which we have a large corpus of English queries, obtained from users interacting with a dialogue system. Our techniques exploit a pre-existing English-to-Chinese translation system to automatically produce a synthetic bilingual corpus. Several experiments were conducted combining linguistic and statistical methods, and manual evaluation was conducted for a set of 460 Chinese sentences. The best-performing system achieved an “adequate” or better rating (3 or above) on nearly 94% of the parsable subset of 409 sentences. Using a Rover scheme to combine four systems resulted in an “adequate or better” rating for 88% of all the utterances.

Combining Statistical and Knowledge-Based Spoken Language Understanding in Conditional Models
Ye-Yi Wang | Alex Acero | Milind Mahajan | John Lee
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions