2022
pdf
abs
中文糖尿病问题分类体系及标注语料库构建研究(The Construction of Question Taxonomy and An Annotated Chinese Corpus for Diabetes Question Classification)
Xiaobo Qian (钱晓波)
|
Wenxiu Xie (谢文秀)
|
Shaopei Long (龙绍沛)
|
Murong Lan (兰牧融)
|
Yuanyuan Mu (慕媛媛)
|
Tianyong Hao (郝天永)
Proceedings of the 21st Chinese National Conference on Computational Linguistics
“糖尿病作为一种典型慢性疾病已成为全球重大公共卫生挑战之一。随着互联网的快速发展,庞大的二型糖尿病患者和高危人群对糖尿病专业信息获取的需求日益突出,糖尿病自动问答服务对患者和高危人群的日常健康服务也发挥着越来越重要的作用,然而存在缺乏细粒度分类等突出问题。本文设计了一个表示用户意图的新型糖尿病问题分类体系,包括6个大类和23个细类。基于该体系,本文从两个专业医疗问答网站爬取并构建了一个包含122732个问答对的中文糖尿病问答语料库DaCorp,同时对其中的8000个糖尿病问题进行人工标注,形成一个细粒度的糖尿病标注数据集。此外,为评估该标注数据集的质量,本文实现了8个主流基线分类模型。实验结果表明,最佳分类模型的准确率达到88.7%,验证了糖尿病标注数据集及所提分类体系的有效性。Dacorp、糖尿病标注数据集和标注指南已在线发布,可以免费用于学术研究。”
2021
pdf
abs
Unsupervised Adverbial Identification in Modern Chinese Literature
Wenxiu Xie
|
John Lee
|
Fangqiong Zhan
|
Xiao Han
|
Chi-Yin Chow
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
In many languages, adverbials can be derived from words of various parts-of-speech. In Chinese, the derivation may be marked either with the standard adverbial marker DI, or the non-standard marker DE. Since DE also serves double duty as the attributive marker, accurate identification of adverbials requires disambiguation of its syntactic role. As parsers are trained predominantly on texts using the standard adverbial marker DI, they often fail to recognize adverbials suffixed with the non-standard DE. This paper addresses this problem with an unsupervised, rule-based approach for adverbial identification that utilizes dependency tree patterns. Experiment results show that this approach outperforms a masked language model baseline. We apply this approach to analyze standard and non-standard adverbial marker usage in modern Chinese literature.
2020
pdf
abs
A Counselling Corpus in Cantonese
John Lee
|
Tianyuan Cai
|
Wenxiu Xie
|
Lam Xing
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Virtual agents are increasingly used for delivering health information in general, and mental health assistance in particular. This paper presents a corpus designed for training a virtual counsellor in Cantonese, a variety of Chinese. The corpus consists of a domain-independent subcorpus that supports small talk for rapport building with users, and a domain-specific subcorpus that provides material for a particular area of counselling. The former consists of ELIZA style responses, chitchat expressions, and a dataset of general dialog, all of which are reusable across counselling domains. The latter consists of example user inputs and appropriate chatbot replies relevant to the specific domain. In a case study, we created a chatbot with a domain-specific subcorpus that addressed 25 issues in test anxiety, with 436 inputs solicited from native speakers of Cantonese and 150 chatbot replies harvested from mental health websites. Preliminary evaluations show that Word Mover’s Distance achieved 56% accuracy in identifying the issue in user input, outperforming a number of baselines.
2016
pdf
abs
A Customizable Editor for Text Simplification
John Lee
|
Wenlong Zhao
|
Wenxiu Xie
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations
We present a browser-based editor for simplifying English text. Given an input sentence, the editor performs both syntactic and lexical simplification. It splits a complex sentence into shorter ones, and suggests word substitutions in drop-down lists. The user can choose the best substitution from the list, undo any inappropriate splitting, and further edit the sentence as necessary. A significant novelty is that the system accepts a customized vocabulary list for a target reader population. It identifies all words in the text that do not belong to the list, and attempts to substitute them with words from the list, thus producing a text tailored for the targeted readers.