Hongzhi Xu


Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
Carlos Ramisch | Agata Savary | Bruno Guillaume | Jakub Waszczuk | Marie Candito | Ashwini Vaidya | Verginica Barbu Mititelu | Archna Bhatia | Uxoa Iñurrieta | Voula Giouli | Tunga Güngör | Menghan Jiang | Timm Lichte | Chaya Liebeskind | Johanna Monti | Renata Ramisch | Sara Stymne | Abigail Walsh | Hongzhi Xu
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

Modeling Morphological Typology for Unsupervised Learning of Language Morphology
Hongzhi Xu | Jordan Kodner | Mitchell Marcus | Charles Yang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper describes a language-independent model for fully unsupervised morphological analysis that exploits a universal framework leveraging morphological typology. By modeling morphological processes including suffixation, prefixation, infixation, and full and partial reduplication with constrained stem change rules, our system effectively constrains the search space and offers a wide coverage in terms of morphological typology. The system is tested on nine typologically and genetically diverse languages, and shows superior performance over leading systems. We also investigate the effect of an oracle that provides only a handful of bits per language to signal morphological type.

Morphological Segmentation for Low Resource Languages
Justin Mott | Ann Bies | Stephanie Strassel | Jordan Kodner | Caitlin Richter | Hongzhi Xu | Mitchell Marcus
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes a new morphology resource created by Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.


Annotating Chinese Light Verb Constructions according to PARSEME guidelines
Menghan Jiang | Natalia Klyueva | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Unsupervised Morphology Learning with Statistical Paradigms
Hongzhi Xu | Mitchell Marcus | Charles Yang | Lyle Ungar
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes an unsupervised model for morphological segmentation that exploits the notion of paradigms, which are sets of morphological categories (e.g., suffixes) that can be applied to a homogeneous set of words (e.g., nouns or verbs). Our algorithm identifies statistically reliable paradigms from the morphological segmentation result of a probabilistic model, and chooses reliable suffixes from them. The new suffixes can be fed back iteratively to improve the accuracy of the probabilistic model. Finally, the unreliable paradigms are subjected to pruning to eliminate unreliable morphological relations between words. The paradigm-based algorithm significantly improves segmentation accuracy. Our method achieves start-of-the-art results on experiments using the Morpho-Challenge data, including English, Turkish, and Finnish.


Case Studies in the Automatic Characterization of Grammars from Small Wordlists
Jordan Kodner | Spencer Caplan | Hongzhi Xu | Mitchell P. Marcus | Charles Yang
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages


Database of Mandarin Neighborhood Statistics
Karl Neergaard | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the design of controlled experiments with language stimuli, researchers from psycholinguistic, neurolinguistic, and related fields, require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word level statistics for words and nonwords of Mandarin, Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ascii phonetic transcription (sampa), lexical tone, syllable structure, dominant PoS, and syllable, segment and pinyin lengths for each phonological word. It is designed for researchers particularly concerned with language processing of isolated words and made to accommodate multiple existing hypotheses concerning the structure of the Mandarin syllable. The database is divided into multiple files according to the desired search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.


Sentiment Analyzer with Rich Features for Ironic and Sarcastic Tweets
Piyoros Tungthamthiti | Enrico Santus | Hongzhi Xu | Chu-Ren Huang | Kiyoaki Shirai
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

Auditory Synaesthesia and Near Synonyms: A Corpus-Based Analysis of sheng1 and yin1 in Mandarin Chinese
Qingqing Zhao | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

LLT-PolyU: Identifying Sentiment Intensity in Ironic Tweets
Hongzhi Xu | Enrico Santus | Anna Laszlo | Chu-Ren Huang
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)


pdf bib
Corpus-based Study and Identification of Mandarin Chinese Light Verb Variations
Chu-Ren Huang | Jingxia Lin | Menghan Jiang | Hongzhi Xu
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

Annotation and Classification of Light Verbs and Light Verb Variations in Mandarin Chinese
Jingxia Lin | Hongzhi Xu | Menghan Jiang | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Annotate and Identify Modalities, Speech Acts and Finer-Grained Event Types in Chinese Text
Hongzhi Xu | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing


Primitives of Events and the Semantic Representation
Hongzhi Xu | Chu-Ren Huang
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013)

A Rule System for Chinese Time Entity Recognition by Comprehensive Linguistic Study
Hongzhi Xu | Chu-Ren Huang
Proceedings of the Sixth International Joint Conference on Natural Language Processing


A Grammar-informed Corpus-based Sentence Database for Linguistic and Computational Studies
Hongzhi Xu | Helen Kaiyun Chen | Chu-Ren Huang | Qin Lu | Dingxu Shi | Tin-Shing Chiu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We adopt the corpus-informed approach to example sentence selections for the construction of a reference grammar. In the process, a database containing sentences that are carefully selected by linguistic experts including the full range of linguistic facts covered in an authoritative Chinese Reference Grammar is constructed and structured according to the reference grammar. A search engine system is developed to facilitate the process of finding the most typical examples the users need to study a linguistic problem or prove their hypotheses. The database can also be used as a training corpus by computational linguists to train models for Chinese word segmentation, POS tagging and sentence parsing.

Compositionality of NN Compounds: A Case Study on [N1+Artifactual-Type Event Nouns]
Shan Wang | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

The Headedness of Mandarin Chinese Serial Verb Constructions: A Corpus-Based Study
Jingxia Lin | Chu-Ren Huang | Huarui Zhang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation


Expanding Chinese Sentiment Dictionaries from Large Scale Unlabeled Corpus
Hongzhi Xu | Kai Zhao | Likun Qiu | Changjian Hu
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation


Discovery of Dependency Tree Patterns for Relation Extraction
Hongzhi Xu | Changjian Hu | Guoyang Shen
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2


Combining Context Features by Canonical Belief Network for Chinese Part-Of-Speech Tagging
Hongzhi Xu | Chunping Li
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II