Hao Xu


2022

ZiNet: Linking Chinese Characters Spanning Three Thousand Years
Yang Chi | Fausto Giunchiglia | Daqian Shi | Xiaolei Diao | Chuntao Li | Hao Xu
Findings of the Association for Computational Linguistics: ACL 2022

Modern Chinese characters have evolved over 3,000 years. To date, tens of thousands of glyphs of ancient characters have been discovered, and these must be deciphered by experts in order to interpret unearthed documents. Experts usually need to compare each ancient character under examination with similar known characters across all historical periods. However, this process is inevitably limited by human memory and experience: it often takes a lot of time, yet the associations found are confined to a small scope. To help researchers discover characters with similar glyphs, this paper introduces ZiNet, the first diachronic knowledge base describing the relationships and evolution of Chinese characters and words. In addition, powered by the knowledge of radical systems in ZiNet, this paper introduces a glyph similarity measurement between ancient Chinese characters, which can capture similar glyph pairs that are potentially related in origin or semantics. Results show strong positive correlations between the scores produced by the method and those given by human experts. Finally, a qualitative analysis and potential future applications are presented.

2020

A Large Scale Speech Sentiment Corpus
Eric Chen | Zhiyun Lu | Hao Xu | Liangliang Cao | Yu Zhang | James Fan
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a multimodal corpus for sentiment analysis based on the existing Switchboard-1 Telephone Speech Corpus released by the Linguistic Data Consortium. This corpus extends the Switchboard-1 Telephone Speech Corpus by adding sentiment labels from 3 different human annotators for every transcript segment. Each sentiment label can be one of three options: positive, negative, or neutral. Annotators were recruited using Google Cloud’s data labeling service, and the labeling task was conducted over the internet. The corpus contains a total of 49,500 labeled speech segments covering 140 hours of audio. To the best of our knowledge, this is the largest multimodal corpus for sentiment analysis that includes both speech and text features.