Weiruo Qu


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2008

pdf bib
Targeting Chinese Nominal Compounds in Corpora
Weiruo Qu | Christoph Ringlstetter | Randy Goebel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

For compounding languages, a great part of the topical semantics is transported via nominal compounds. Various applications of natural language processing can profit from explicit access to these compounds, provided by a lexicon. The best way to acquire such a resource is to harvest corpora that represent the domain in question. For Chinese, a significant difficulty lies in the fact that the text comes as a string of characters, only segmented by sentence boundaries. Extraction algorithms that solely rely on context variety do not perform precisely enough. We propose a pipeline of filters that starts from a candidate set established by accessor variety and then employs several methods to improve precision. For the experiments the Xinhua part of the Chinese Gigaword Corpus was used. We extracted a random sample of 200 story texts with 119,509 Hanzi characters. All compound words of this evaluation corpus were tagged, segmented into their morphemes, and augmented with the POS-information of their segments. A cascade of filters applied to a preliminary set of compound candidates led to a very high precision of over 90%, measured for the types. The result also holds for a small corpus where a solely contextual method introduces too much noise, even for the longer compounds. An introduction of MI into the basic candidacy algorithm led to a much higher recall with still reasonable precision for subsequent manual processing. Especially for the four-character compounds, that in our sample represent over 40% of the target data, the method has sufficient efficacy to support the rapid construction of compound dictionaries from domain corpora.