Stephen Wu


2020

Word embeddings have been successfully trained in many languages. However, both intrinsic and extrinsic metrics are variable across languages, especially for languages that depart significantly from English in morphology and orthography. This study focuses on building a word embedding model suitable for the Semitic language of Amharic (Ethiopia), which is both morphologically rich and written as an alphasyllabary (abugida) rather than an alphabet. We compare embeddings from tailored neural models, simple pre-processing steps, off-the-shelf baselines, and parallel tasks on a better-resourced Semitic language – Arabic. Experiments show our model’s performance on word analogy tasks, illustrating the divergent objectives of morphological vs. semantic analogies.

2016

Domain-specific annotations for NLP are often centered on real-world applications of text, and incorrect annotations may be particularly unacceptable. In medical text, the process of manual chart review (of a patient’s medical record) is error-prone due to its complexity. We propose a staggered NLP-assisted approach to the refinement of clinical annotations, an interactive process that allows initial human judgments to be verified or falsified by means of comparison with an improving NLP system. We show on our internal Asthma Timelines dataset that this approach improves the quality of the human-produced clinical annotations.
Privacy concerns have often served as an insurmountable barrier for the production of research and resources in clinical information retrieval (IR). We believe that both clinical IR research innovation and legitimate privacy concerns can be served by the creation of intra-institutional, fully protected resources. In this paper, we provide some principles and tools for IR resource-building in the unique problem setting of patient-level IR, following the tradition of the Cranfield paradigm.

2013

2011

2010

2009