Hongzhi Xu


2024

A Deep Analysis of the Impact of Multiword Expressions and Named Entities on Chinese-English Machine Translations
Huacheng Song | Hongzhi Xu
Findings of the Association for Computational Linguistics: EMNLP 2024

In this paper, we present a study on the impact of multiword expressions (MWEs) and multiword named entities (NEs) on the performance of Chinese-English machine translation (MT) systems. Building on an extended version of the data from the WMT22 Metrics Shared Task (with extra labels for 9 types of Chinese MWEs and 19 types of Chinese multiword NEs), which includes scores and error annotations provided by human experts, we further extract MWE- and NE-related translation errors. By investigating the human evaluation scores and the error rates on each category of MWEs and NEs, we find that: 1) MT systems tend to perform significantly worse on Chinese sentences with most kinds of MWEs and NEs; 2) MWEs and NEs, which make up about twenty percent of tokens (i.e., characters in Chinese), account for one-third of translation errors; 3) for 13 categories of MWEs and NEs, the error rates exceed 50%, with the highest being 84.8%. Based on the results, we emphasize that MWEs and NEs are still a bottleneck for MT and that special attention should be paid to them to further improve the performance of MT systems.

Annotating Chinese Word Senses with English WordNet: A Practice on OntoNotes Chinese Sense Inventories
Hongzhi Xu | Jingxia Lin | Sameer Pradhan | Mitchell Marcus | Ming Liu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we present our exploration of annotating Chinese word senses using English WordNet synsets, with examples extracted from OntoNotes Chinese sense inventories. Given a target word along with the example that contains it, the annotators select the WordNet synset that best describes the meaning of the target word in context. The result shows an inter-annotator agreement of 38% between two annotators. We delve into the instances of disagreement by comparing the two annotated synsets, including their positions within the WordNet hierarchy. The examination reveals intriguing patterns among closely related synsets, shedding light on similar concepts represented within the WordNet structure. The data offers an indirect linking of Chinese word senses defined in OntoNotes Chinese sense inventories to WordNet synsets, and thus promotes the value of the OntoNotes corpus. Compared to a direct linking of Chinese word senses to WordNet synsets, the example-based annotation has the merit of not being affected by inaccurate sense definitions and thus offers a new way of mapping WordNets of different languages. At the same time, the annotated data also serves as a valuable linguistic resource for exploring potential lexical differences between English and Chinese, with potential contributions to the broader understanding of cross-linguistic semantic mapping.

Benchmarking the Performance of Machine Translation Evaluation Metrics with Chinese Multiword Expressions
Huacheng Song | Hongzhi Xu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

To investigate the impact of Multiword Expressions (MWEs) on the fine-grained performance of state-of-the-art metrics for Machine Translation Evaluation (MTE), we conduct experiments on the WMT22 Metrics Shared Task dataset with a preliminary focus on the Chinese-to-English language pair. We further annotate 28 types of Chinese MWEs on the source texts and then examine the performance of 31 MTE metrics on groups of sentences containing different MWEs. We report three findings: 1) Machine Translation (MT) systems tend to perform worse on most Chinese MWE categories, confirming the previous claim that MWEs are a bottleneck of MT; 2) automatic metrics tend to overrate the translation of sentences containing MWEs; 3) most neural-network-based metrics perform better than string-overlap-based metrics. We conclude that both MT systems and MTE metrics still suffer from MWEs, which suggests that richer data annotation is needed to facilitate MWE-aware automatic MTE and MT.

How Grammatical Features Impact Machine Translation: A New Test Suite for Chinese-English MT Evaluation
Huacheng Song | Yi Li | Yiwen Wu | Yu Liu | Jingxia Lin | Hongzhi Xu
Proceedings of the Ninth Conference on Machine Translation

Machine translation (MT) evaluation has evolved toward finer granularity, enabling a more precise diagnosis of hidden flaws and weaknesses of MT systems from various perspectives. This paper examines how MT systems are potentially affected by certain grammatical features, offering insights into the challenges these features pose and suggesting possible directions for improvement. We develop a new test suite by extracting 7,848 sentences from a multi-domain Chinese-English parallel corpus. All the Chinese text was further annotated with 43 grammatical features using a semi-automatic method. This test suite was subsequently used to evaluate eight state-of-the-art MT systems according to six different automatic evaluation metrics. The results reveal intriguing patterns of MT performance associated with different domains and various grammatical features, highlighting the test suite’s effectiveness. The test suite has been made publicly available and will serve as an important benchmark for evaluating and diagnosing Chinese-English MT systems.

2023

PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaite | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions (VMWEs). Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

2020

Modeling Morphological Typology for Unsupervised Learning of Language Morphology
Hongzhi Xu | Jordan Kodner | Mitchell Marcus | Charles Yang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper describes a language-independent model for fully unsupervised morphological analysis that exploits a universal framework leveraging morphological typology. By modeling morphological processes including suffixation, prefixation, infixation, and full and partial reduplication with constrained stem change rules, our system effectively constrains the search space and offers a wide coverage in terms of morphological typology. The system is tested on nine typologically and genetically diverse languages, and shows superior performance over leading systems. We also investigate the effect of an oracle that provides only a handful of bits per language to signal morphological type.

Morphological Segmentation for Low Resource Languages
Justin Mott | Ann Bies | Stephanie Strassel | Jordan Kodner | Caitlin Richter | Hongzhi Xu | Mitchell Marcus
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes a new morphology resource created by the Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.

Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
Carlos Ramisch | Agata Savary | Bruno Guillaume | Jakub Waszczuk | Marie Candito | Ashwini Vaidya | Verginica Barbu Mititelu | Archna Bhatia | Uxoa Iñurrieta | Voula Giouli | Tunga Güngör | Menghan Jiang | Timm Lichte | Chaya Liebeskind | Johanna Monti | Renata Ramisch | Sara Stymne | Abigail Walsh | Hongzhi Xu
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

2018

Unsupervised Morphology Learning with Statistical Paradigms
Hongzhi Xu | Mitchell Marcus | Charles Yang | Lyle Ungar
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes an unsupervised model for morphological segmentation that exploits the notion of paradigms, which are sets of morphological categories (e.g., suffixes) that can be applied to a homogeneous set of words (e.g., nouns or verbs). Our algorithm identifies statistically reliable paradigms from the morphological segmentation result of a probabilistic model, and chooses reliable suffixes from them. The new suffixes can be fed back iteratively to improve the accuracy of the probabilistic model. Finally, the unreliable paradigms are subjected to pruning to eliminate unreliable morphological relations between words. The paradigm-based algorithm significantly improves segmentation accuracy. Our method achieves state-of-the-art results on experiments using the Morpho-Challenge data, including English, Turkish, and Finnish.

Annotating Chinese Light Verb Constructions according to PARSEME guidelines
Menghan Jiang | Natalia Klyueva | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Case Studies in the Automatic Characterization of Grammars from Small Wordlists
Jordan Kodner | Spencer Caplan | Hongzhi Xu | Mitchell P. Marcus | Charles Yang
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

Database of Mandarin Neighborhood Statistics
Karl Neergaard | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the design of controlled experiments with language stimuli, researchers from psycholinguistic, neurolinguistic, and related fields require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word-level statistics for words and nonwords of Mandarin Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ASCII phonetic transcription (SAMPA), lexical tone, syllable structure, dominant PoS, and syllable, segment and pinyin lengths for each phonological word. It is designed for researchers particularly concerned with language processing of isolated words and made to accommodate multiple existing hypotheses concerning the structure of the Mandarin syllable. The database is divided into multiple files according to the desired search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.

2015

LLT-PolyU: Identifying Sentiment Intensity in Ironic Tweets
Hongzhi Xu | Enrico Santus | Anna Laszlo | Chu-Ren Huang
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Sentiment Analyzer with Rich Features for Ironic and Sarcastic Tweets
Piyoros Tungthamthiti | Enrico Santus | Hongzhi Xu | Chu-Ren Huang | Kiyoaki Shirai
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

Auditory Synaesthesia and Near Synonyms: A Corpus-Based Analysis of sheng1 and yin1 in Mandarin Chinese
Qingqing Zhao | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

2014

Corpus-based Study and Identification of Mandarin Chinese Light Verb Variations
Chu-Ren Huang | Jingxia Lin | Menghan Jiang | Hongzhi Xu
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

Annotation and Classification of Light Verbs and Light Verb Variations in Mandarin Chinese
Jingxia Lin | Hongzhi Xu | Menghan Jiang | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Annotate and Identify Modalities, Speech Acts and Finer-Grained Event Types in Chinese Text
Hongzhi Xu | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

2013

A Rule System for Chinese Time Entity Recognition by Comprehensive Linguistic Study
Hongzhi Xu | Chu-Ren Huang
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Primitives of Events and the Semantic Representation
Hongzhi Xu | Chu-Ren Huang
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013)

2012

A Grammar-informed Corpus-based Sentence Database for Linguistic and Computational Studies
Hongzhi Xu | Helen Kaiyun Chen | Chu-Ren Huang | Qin Lu | Dingxu Shi | Tin-Shing Chiu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We adopt a corpus-informed approach to example sentence selection for the construction of a reference grammar. In the process, a database of sentences carefully selected by linguistic experts, covering the full range of linguistic facts described in an authoritative Chinese Reference Grammar, is constructed and structured according to the reference grammar. A search engine is developed to help users find the most typical examples they need to study a linguistic problem or prove their hypotheses. The database can also be used as a training corpus by computational linguists to train models for Chinese word segmentation, POS tagging and sentence parsing.

Compositionality of NN Compounds: A Case Study on [N1+Artifactual-Type Event Nouns]
Shan Wang | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

The Headedness of Mandarin Chinese Serial Verb Constructions: A Corpus-Based Study
Jingxia Lin | Chu-Ren Huang | Huarui Zhang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2010

Expanding Chinese Sentiment Dictionaries from Large Scale Unlabeled Corpus
Hongzhi Xu | Kai Zhao | Likun Qiu | Changjian Hu
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

2009

Discovery of Dependency Tree Patterns for Relation Extraction
Hongzhi Xu | Changjian Hu | Guoyang Shen
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2008

Combining Context Features by Canonical Belief Network for Chinese Part-Of-Speech Tagging
Hongzhi Xu | Chunping Li
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II