Chu-Ren Huang

Also published as: Chu-ren Huang

2024

Be Helpful but Don’t Talk too Much - Enhancing Helpfulness in Conversations through Relevance in Multi-Turn Emotional Support
Junlin Li | Bo Peng | Yu-Yin Hsu | Chu-Ren Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

For a conversation to help and support, speakers should maintain an “effect-effort” tradeoff. As outlined in the gist of the “Cognitive Relevance Principle”, helpful speakers should optimize “cognitive relevance” by maximizing the “cognitive effects” and minimizing the “processing effort” imposed on listeners. Although preference learning methods have given rise to a boom of studies in pursuit of “effect-optimization”, none have delved into the critical “effort-optimization” needed to fully cultivate the awareness of “optimal relevance” in the cognition of conversation agents. To address this gap, we integrate the “Cognitive Relevance Principle” into emotional support agents in the environment of multi-turn conversation. The results demonstrate a significant and robust improvement against the baseline systems with respect to response quality, human-likeness and supportiveness. This study offers compelling evidence for the effectiveness of the “Relevance Principle” in generating human-like, helpful, and harmless emotional support conversations. The source code will be available at https://github.com/CN-Eyetk/VLESA-ORL.git

CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese
Le Qiu | Shanyue Guo | Tak-Sum Wong | Emmanuele Chersoni | John Lee | Chu-Ren Huang
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

The prediction of lexical complexity in context is assuming an increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages. In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language model-based features.

Employing Glyphic Information for Chinese Event Extraction with Vision-Language Model
Xiaoyi Bao | Jinghang Gu | Zhongqing Wang | Minjie Qiang | Chu-Ren Huang
Findings of the Association for Computational Linguistics: EMNLP 2024

As a complex task that requires rich information input, event extraction has drawn on features from various aspects. However, most previous work has ignored the value of glyphs, which can carry rich semantic information that pre-trained embeddings cannot fully express in hieroglyphic languages like Chinese. We argue that, compared with combining sophisticated textual features, glyphic information from the visual modality can provide additional and direct semantic information for extracting events. Motivated by this, we propose a glyphic multi-modal Chinese event extraction model with hieroglyphic images to capture the intra- and inter-character morphological structure from the sequence. Extensive experiments establish new state-of-the-art performance on the ACE2005 Chinese and KBP Eval 2017 datasets, which underscores the effectiveness of our proposed glyphic event extraction model; more importantly, the glyphic feature can be obtained at nearly zero cost.

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information
Yu Xi Li | Bo Peng | Yu-Yin Hsu | Chu-Ren Huang
Findings of the Association for Computational Linguistics: EMNLP 2024

The identification of metaphor is a crucial prerequisite for many downstream language tasks, such as sentiment analysis, opinion mining, and textual entailment. State-of-the-art systems of metaphor detection implement heuristic principles such as Metaphor Identification Procedure (MIP) and Selection Preference Violation (SPV). We propose an innovative approach that leverages the cognitive information of embodiment that can be derived from word embeddings, and explicitly models the process of sensorimotor change that has been demonstrated as essential for human metaphor processing. We showed that this cognitively motivated module is effective and can improve metaphor detection, compared with the heuristic MIP that has been applied previously.

PolyuCBS at SMM4H 2024: LLM-based Medical Disorder and Adverse Drug Event Detection with Low-rank Adaptation
Zhai Yu | Xiaoyi Bao | Emmanuele Chersoni | Beatrice Portelli | Sophia Lee | Jinghang Gu | Chu-Ren Huang
Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

This paper describes the systems and results of our team’s participation in the Social Media Mining for Health (SMM4H) 2024 Shared Task. Our team participated in two tasks: Task 1 and Task 5. Task 5 requires the detection of tweet sentences that claim children’s medical disorders from certain users. Task 1 requires teams to extract and normalize Adverse Drug Event terms in tweet sentences. The team selected several Pre-trained Language Models and generative Large Language Models to meet the requirements. Strategies to improve performance include cloze tests, prompt engineering, and Low-Rank Adaptation, among others. On the test set, our system achieves an F1 score of 0.935, Precision of 0.954 and Recall of 0.917 in Task 5, and an overall F1 score of 0.08 in Task 1.

From Text to Historical Ecological Knowledge: The Construction and Application of the Shan Jing Knowledge Base
Ke Liang | Chu-Ren Huang | Xin-Lan Jiang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Traditional Ecological Knowledge (TEK) has been recognized as a shared cultural heritage and a crucial instrument to tackle today’s environmental challenges. In this paper, we deal with historical ecological knowledge, a special type of TEK that is based on ancient language texts. In particular, we aim to build a language resource based on Shanhai Jing (The Classic of Mountains and Seas). Written 2000 years ago, Shanhai Jing is a record of flora and fauna in ancient China, anchored by mountains (shan) and seas (hai). This study focuses on the entities in the Shan Jing part and builds a knowledge base for them. We adopt a pattern-driven and bottom-up strategy to accommodate two features of the source: highly stylized narrative and juxtaposition of knowledge from multiple domains. The PRF values of both entity and relationship extraction are above 96%. Quality assurance measures such as entity disambiguation and resolution were carried out by domain experts. The Neo4j graph database is used to visualize the result. We think the knowledge base, containing 1432 systematically classified entities and 3294 relationships, can provide the foundation for the construction of a historical ecological knowledge base of China. Additionally, the rule-based text-matching method can be helpful in ancient language processing.

2023

ChiWUG: A Graph-based Evaluation Dataset for Chinese Lexical Semantic Change Detection
Jing Chen | Emmanuele Chersoni | Dominik Schlechtweg | Jelena Prokic | Chu-Ren Huang
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Recent studies have suggested that language models are efficient tools for measuring lexical semantic change. In our paper, we present the compilation of the first graph-based evaluation dataset for lexical semantic change in the context of the Chinese language, specifically covering the periods before and after the Reform and Opening Up. Exploiting the existing framework DURel, we collect over 61,000 human semantic relatedness judgments for 40 targets. The inferred word usage graphs and semantic change scores provide a basis for visualization and evaluation of semantic change.

Tracing Social Change through Metaphor: A Diachronic Corpus-Assisted Analysis
Winnie Huiheng Zeng | Kathleen Ahrens | Chu-Ren Huang
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Existence Justifies Reason: A Data Analysis on Chinese Classifiers Based on Eye Tracking and Transformers
Yu Wang | Emmanuele Chersoni | Chu-Ren Huang
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

Cross-strait Variations on Two Near-synonymous Loanwords xie2shang1 and tan2pan4: A Corpus-based Comparative Study
Yueyue Huang | Chu-Ren Huang
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

From Frying to Speculating: Google Ngram evidence to the meaning development of ‘炒’ in Mandarin Chinese
Jing Chen | Chu-Ren Huang
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Gain-framed Buying or Loss-framed Selling? The Analysis of Near Synonyms in Mandarin in Prospect Theory
Xin Luo | Chu-Ren Huang
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Lexicon of Changes: Towards the Evaluation of Diachronic Semantic Shift in Chinese
Jing Chen | Emmanuele Chersoni | Chu-ren Huang
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

Recent research has brought a wave of computational approaches to the classic topic of semantic change, aiming to tackle one of the most challenging issues in the evolution of human language. While several methods for detecting semantic change have been proposed, such studies are limited to the few languages for which evaluation datasets are available. This paper presents the first dataset for evaluating Chinese semantic change in contexts preceding and following the Reform and Opening-up, covering a 50-year period in Modern Chinese. Following the DURel framework, we collected 6,000 human judgments for the dataset. We also report the performance of alignment-based word embedding models on this evaluation dataset, which achieve high and significant correlation scores.

Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference
Mingyu Wan | Chu-Ren Huang
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference

Framing Legitimacy in CSR: A Corpus of Chinese and American Petroleum Company CSR Reports and Preliminary Analysis
Jieyu Chen | Kathleen Ahrens | Chu-Ren Huang
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference

We examine how Chinese and American oil companies use the gain- and loss-framed BUILDING source domain to legitimize their business in Corporate Social Responsibility (CSR) reports. Gain and loss frames can create legitimacy because they can ethically position an issue. We will focus on oil companies in China and the U.S. because different socio-cultural contexts in these two countries can potentially lead to different legitimation strategies in CSR reports, which can shed light on differences in Chinese and American CSR. All of the oil companies in our data are on the Fortune 500 list (2020). The results showed that Chinese oil companies used BUILDING metaphors more frequently than American oil companies. The most frequent keyword in Chinese CSRs “build” highlights environmental achievements in compliance with governments’ policies. American CSRs often used the metaphorical verb “support” to show their alignment with environmental policies and the interests of different stakeholders. The BUILDING source domain was used more often as gain frames in both Chinese and American CSR reports to show how oil companies create benefits for different stakeholders.

Inclusion in CSR Reports: The Lens from a Data-Driven Machine Learning Model
Lu Lu | Jinghang Gu | Chu-Ren Huang
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference

Inclusion, as one of the foundations of the diversity, equity, and inclusion initiative, concerns the degree of being treated as an ingroup member in a workplace. Despite its importance in a corporation’s ecosystem, inclusion strategies and their performance are not adequately addressed in corporate social responsibility (CSR) and CSR reporting. This study proposes a machine learning and big data-based model to examine inclusion through the use of stereotype content in actual language use. The distribution of the stereotype content in general corpora of a given society is utilized as a baseline, with which corporate texts are compared. This study not only proposes a model to identify and classify inclusion in language use, but also provides insights to measure and track progress by including inclusion in CSR reports as a strategy to build an inclusive corporate team.

Discovering Financial Hypernyms by Prompting Masked Language Models
Bo Peng | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

With the rising popularity of Transformer-based language models, several studies have tried to exploit their masked language modeling capabilities to automatically extract relational linguistic knowledge, although this kind of research has rarely investigated semantic relations in specialized domains. The present study aims at testing a general-domain and a domain-adapted Transformer model on two datasets of financial term-hypernym pairs using the prompt methodology. Our results show that differences between prompts critically affect the models’ performance, and that domain adaptation on financial text generally improves the capacity of the models to associate the target terms with the right hypernyms, although the more successful models are those retaining a general-domain vocabulary.

2021

In this contribution, we describe the system presented by the PolyU CBS-Comp Team at Task 1 of SemEval 2021, where the goal was the estimation of the complexity of words in a given sentence context. Our top system, based on a combination of lexical, syntactic, word embedding and Transformer-derived features and on a Gradient Boosting Regressor, achieves a top correlation score of 0.754 on Subtask 1 for single words and 0.659 on Subtask 2 for multiword expressions.

Is Domain Adaptation Worth Your Investment? Comparing BERT and FinBERT on Financial Tasks
Bo Peng | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Proceedings of the Third Workshop on Economics and Natural Language Processing

With the recent rise in popularity of Transformer models in Natural Language Processing, research efforts have been dedicated to the development of domain-adapted versions of BERT-like architectures. In this study, we focus on FinBERT, a Transformer model trained on text from the financial domain. By comparing its performances with the original BERT on a wide variety of financial text processing tasks, we found continual pretraining from the original model to be the more beneficial option. Domain-specific pretraining from scratch, conversely, seems to be less effective.

ROCLING-2021 Shared Task: Dimensional Sentiment Analysis for Educational Texts
Liang-Chih Yu | Jin Wang | Bo Peng | Chu-Ren Huang
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

This paper presents the ROCLING 2021 shared task on dimensional sentiment analysis for educational texts, which seeks to identify real-valued sentiment scores for self-evaluation comments written by Chinese students in both the valence and arousal dimensions. Valence represents the degree of pleasant and unpleasant (or positive and negative) feelings, and arousal represents the degree of excitement and calm. Of the 7 teams registered for this shared task for two-dimensional sentiment analysis, 6 submitted results. We expected that this evaluation campaign could produce more advanced dimensional sentiment analysis techniques for the educational domain. All data sets with gold standards and the scoring script are made publicly available to researchers.

Decoding Word Embeddings with Brain-Based Semantic Features
Emmanuele Chersoni | Enrico Santus | Chu-Ren Huang | Alessandro Lenci
Computational Linguistics, Volume 47, Issue 3 - November 2021

Word embeddings are vectorial semantic representations built with either counting or predicting techniques aimed at capturing shades of meaning from word co-occurrences. Since their introduction, these representations have been criticized for lacking interpretable dimensions. This property of word embeddings limits our understanding of the semantic features they actually encode. Moreover, it contributes to the “black box” nature of the tasks in which they are used, since the reasons for word embedding performance often remain opaque to humans. In this contribution, we explore the semantic properties encoded in word embeddings by mapping them onto interpretable vectors, consisting of explicit and neurobiologically motivated semantic features (Binder et al. 2016). Our exploration takes into account different types of embeddings, including factorized count vectors and predict models (Skip-Gram, GloVe, etc.), as well as the most recent contextualized representations (i.e., ELMo and BERT). In our analysis, we first evaluate the quality of the mapping in a retrieval task, then we shed light on the semantic features that are better encoded in each embedding type. A large number of probing tasks is finally set to assess how the original and the mapped embeddings perform in discriminating semantic categories. For each probing task, we identify the most relevant semantic features and we show that there is a correlation between the embedding performance and how they encode those features. This study sets itself as a step forward in understanding which aspects of meaning are captured by vector spaces, by proposing a new and simple method to carve human-interpretable semantic representations from distributional vectors.

Scikit-talk: A toolkit for processing real-world conversational speech data
Andreas Liesenfeld | Gabor Parti | Chu-Ren Huang
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We present Scikit-talk, an open-source toolkit for processing collections of real-world conversational speech in Python. First of its kind, the toolkit equips those interested in studying or modeling conversations with an easy-to-use interface to build and explore large collections of transcriptions and annotations of talk-in-interaction. Designed for applications in speech processing and Conversational AI, Scikit-talk provides tools to custom-build datasets for tasks such as intent prototyping, dialog flow testing, and conversation design. Its preprocessor module comes with several pre-built interfaces for common transcription formats, which aim to make working across multiple data sources more accessible. The explorer module provides a collection of tools to explore and analyse this data type via string matching and unsupervised machine learning techniques. Scikit-talk serves as a platform to collect and connect different transcription formats and representations of talk, enabling the user to quickly build multilingual datasets of varying detail and granularity. Thus, the toolkit aims to make working with authentic conversational speech data in Python more accessible and to provide the user with comprehensive options to work with representations of talk in appropriate detail for any downstream task. For the latest updates and information on currently supported languages and language resources, please refer to: https://pypi.org/project/scikit-talk/

Aspect or Manner? A Study of Reduplicated Adverbials in Mandarin Chinese
Siaw-Fong Chung | Chu-Ren Huang
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Language change in Chinese political discourse based on the relationship between sentence and clause
Renkui Hou | Chu-Ren Huang | Kathleen Ahrens
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Automatic Analysis of Linguistic Features in Journal Articles of Different Academic Impacts with Feature Engineering Techniques
Siyu Lei | Ruiying Yang | Chu-Ren Huang
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Spatial-temporal attributes in verbal semantics: A corpus-based lexical semantic study of discriminating Mandarin near synonyms of “tui1” and “la1”
Qiangmei Liang | Chu-Ren Huang
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Animosity and suffering: Metaphors of BITTERNESS in English and Chinese
Gabor Parti | Andreas Liesenfeld | Chu-Ren Huang
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

From Near-synonyms to Divergent Viewpoint Foci: A Corpus-based MARVS Driven Account of Two Verbs of Attention
Ziqian Wang | Chu-Ren Huang
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT
Won Ik Cho | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Automatic Learning of Modality Exclusivity Norms with Crosslingual Word Embeddings
Emmanuele Chersoni | Rong Xiang | Qin Lu | Chu-Ren Huang
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Collecting modality exclusivity norms for lexical items has recently become a common practice in psycholinguistics and cognitive research. However, these norms are available only for a relatively small number of languages and often involve a costly and time-consuming collection of ratings. In this work, we aim at learning a mapping between word embeddings and modality norms. Our experiments focused on crosslingual word embeddings, in order to predict modality association scores by training on a high-resource language and testing on a low-resource one. We ran two experiments, one in a monolingual and the other one in a crosslingual setting. Results show that modality prediction using off-the-shelf crosslingual embeddings indeed has moderate-to-high correlations with human ratings even when regression algorithms are trained on an English resource and tested on a completely unseen language.

Comparing Probabilistic, Distributional and Transformer-Based Models on Logical Metonymy Interpretation
Giulia Rambelli | Emmanuele Chersoni | Alessandro Lenci | Philippe Blache | Chu-Ren Huang
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In linguistics and cognitive science, logical metonymies are defined as type clashes between an event-selecting verb and an entity-denoting noun (e.g. The editor finished the article), which are typically interpreted by inferring a hidden event (e.g. reading) on the basis of contextual cues. This paper tackles the problem of logical metonymy interpretation, that is, the retrieval of the covert event via computational methods. We compare different types of models, including the probabilistic and the distributional ones previously introduced in the literature on the topic. For the first time, we also test some of the recent Transformer-based models on this task, such as BERT, RoBERTa, XLNet, and GPT-2. Our results show a complex scenario, in which the best Transformer-based models and some traditional distributional models perform very similarly. However, the low performance on some of the testing datasets suggests that logical metonymy is still a challenging phenomenon for computational modeling.

Sina Mandarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource
Rong Xiang | Mingyu Wan | Qi Su | Chu-Ren Huang | Qin Lu
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Mandarin Alphabetical Word (MAW) is one indispensable component of Modern Chinese that demonstrates unique code-mixing idiosyncrasies influenced by language exchanges. Yet, this interesting phenomenon has not been properly addressed and is mostly excluded from the Chinese language system. This paper addresses the core problem of MAW identification and proposes to construct a large collection of MAWs from Sina Weibo (SMAW) using an automatic web-based technique which includes rule-based identification, informatics-based extraction, as well as Baidu search engine validation. A collection of 16,207 qualified SMAWs are obtained using this technique along with an annotated corpus of more than 200,000 sentences for linguistic research and applicable inquiries.

Sketching the English Translations of Kumārajīva’s The Diamond Sutra: A Comparison of Individual Translators and Translation Teams
Xi Chen | Vincent Xian Wang | Chu-Ren Huang
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Language change in Report on the Work of the Government by Premiers of the People’s Republic of China
Renkui Hou | Chu-Ren Huang | Kathleen Ahrens
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Predicting gender and age categories in English conversations using lexical, non-lexical, and turn-taking features
Andreas Liesenfeld | Gábor Parti | Yuyin Hsu | Chu-Ren Huang
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Sensorimotor Enhanced Neural Network for Metaphor Detection
Mingyu Wan | Baixi Xing | Qi Su | Pengyuan Liu | Chu-Ren Huang
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

A Parallel Corpus-driven Approach to Bilingual Oenology Term Banks: How Culture Differences Influence Wine Tasting Terms
Vincent Xian Wang | Xi Chen | Songnan Quan | Chu-Ren Huang
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Corpus-based Comparison of Verbs of Separation “Qie” and “Ge”
Nga-In Wu | Chu-Ren Huang | Lap-Kei Lee
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Marking Trustworthiness with Near Synonyms: A Corpus-based Study of “Renwei” and “Yiwei” in Chinese
Bei Li | Chu-Ren Huang | Si Chen
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Abstract Meaning Representation for MWE: A study of the mapping of aspectuality based on Mandarin light verb jiayi
Lu Lu | Nianwen Xue | Chu-Ren Huang
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Deep neural network models have played a critical role in sentiment analysis with promising results in the recent decade. One of the essential challenges, however, is how external sentiment knowledge can be effectively utilized. In this work, we propose a novel affection-driven approach to incorporating affective knowledge into neural network models. The affective knowledge is obtained in the form of a lexicon under the Affect Control Theory (ACT), which is represented by vectors of three-dimensional attributes in Evaluation, Potency, and Activity (EPA). The EPA vectors are mapped to an affective influence value and then integrated into Long Short-term Memory (LSTM) models to highlight affective terms. Experimental results show a consistent improvement of our approach over conventional LSTM models by 1.0% to 1.5% in accuracy on three large benchmark datasets. Evaluations across a variety of algorithms have also proven the effectiveness of leveraging affective terms for deep model enhancement.

Are Word Embeddings Really a Bad Fit for the Estimation of Thematic Fit?
Emmanuele Chersoni | Ludovica Pannitto | Enrico Santus | Alessandro Lenci | Chu-Ren Huang
Proceedings of the Twelfth Language Resources and Evaluation Conference

While neural embeddings represent a popular choice for word representation in a wide variety of NLP tasks, their usage for thematic fit modeling has been limited, as they have been reported to lag behind syntax-based count models. In this paper, we propose a complete evaluation of count models and word embeddings on thematic fit estimation, by taking into account a larger number of parameters and verb roles and introducing also dependency-based embeddings in the comparison. Our results show a complex scenario, where a determinant factor for the performance seems to be the availability to the model of reliable syntactic information for building the distributional representations of the roles.

Automatic Chinese irony detection is a challenging task, and it has a strong impact on linguistic research. However, Chinese irony detection often lacks labeled benchmark datasets. In this paper, we introduce Ciron, the first Chinese benchmark dataset available for irony detection for machine learning models. Ciron includes more than 8.7K posts, collected from Weibo, a micro blogging platform. Most importantly, Ciron is collected with no pre-conditions to ensure a much wider coverage. Evaluation on seven different machine learning classifiers proves the usefulness of Ciron as an important resource for Chinese irony detection.

Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources
Emmanuele Chersoni | Barry Devereux | Chu-Ren Huang
Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources

This paper reports a linguistically-enriched method of detecting token-level metaphors for the second shared task on Metaphor Detection. We participate in all four phases of competition with both datasets, i.e. Verbs and AllPOS on the VUA and the TOEFL datasets. We use the modality exclusivity and embodiment norms for constructing a conceptual representation of the nodes and the context. Our system obtains an F-score of 0.652 for the VUA Verbs track, which is 5% higher than the strong baselines. The experimental results across models and datasets indicate the salient contribution of using modality exclusivity and modality shift information for predicting metaphoricity.

2019

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

Distributional Semantics Meets Construction Grammar. Towards a Unified Usage-Based Model of Grammar and Meaning
Giulia Rambelli | Emmanuele Chersoni | Philippe Blache | Chu-Ren Huang | Alessandro Lenci
Proceedings of the First International Workshop on Designing Meaning Representations

In this paper, we propose a new type of semantic representation of Construction Grammar that combines constructions with the vector representations used in Distributional Semantics. We introduce a new framework, Distributional Construction Grammar, where grammar and meaning are systematically modeled from language use, and finally, we discuss the kind of contributions that distributional models can provide to CxG representation from a linguistic and cognitive perspective.

2018

pdf bib
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation
Stephen Politzer-Ahles | Yu-Yin Hsu | Chu-Ren Huang | Yao Yao
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Facilitating and Blocking Conditions of Haplology: A comparative study of Hong Kong Cantonese and Taiwan Mandarin
Sam Yin Wong | I-Hsuan Chen | Chu-Ren Huang
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Semantic Transparency of Radicals in Chinese Characters: An Ontological Perspective
Yike Yang | Chu-Ren Huang | Sicong Dong | Si Chen
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 25th Joint Workshop on Linguistics and Language Processing
Stephen Politzer-Ahles | Yu-Yin Hsu | Chu-Ren Huang | Yao Yao
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 25th Joint Workshop on Linguistics and Language Processing

pdf bib
How do non-tastes taste? A corpus-based study on Chinese people’s perception of spicy and numbing food
Sicong Dong | Yin Zhong | Chu-Ren Huang
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 25th Joint Workshop on Linguistics and Language Processing

pdf
Pleasing to the Mouth of Pleasant Personality: A corpus-based study of conceptualization of desserts in online Chinese food reviews
Yin Zhong | Chu-Ren Huang
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 25th Joint Workshop on Linguistics and Language Processing

pdf bib
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation
Stephen Politzer-Ahles | Yu-Yin Hsu | Chu-Ren Huang | Yao Yao
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

pdf
Annotating Chinese Light Verb Constructions according to PARSEME guidelines
Menghan Jiang | Natalia Klyueva | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
Dual Memory Network Model for Biased Product Review Classification
Yunfei Long | Mingyu Ma | Qin Lu | Rong Xiang | Chu-Ren Huang
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In sentiment analysis (SA) of product reviews, both user and product information are proven to be useful. Current approaches handle user profile and product information in a unified model, which may not be able to learn salient features of users and products effectively. In this work, we propose a dual user and product memory network (DUPMN) model to learn user profiles and product reviews using separate memory networks. Then, the two representations are used jointly for sentiment prediction. The use of separate models aims to capture user profiles and product information more effectively. Compared to state-of-the-art unified prediction models, the evaluations on three benchmark datasets, IMDB, Yelp13, and Yelp14, show that our dual learning model gives performance gains of 0.6%, 1.2%, and 0.9%, respectively. The improvements are also statistically significant as measured by p-values.

2017

pdf abs
Leveraging Eventive Information for Better Metaphor Detection and Classification
I-Hsuan Chen | Yunfei Long | Qin Lu | Chu-Ren Huang
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Metaphor detection has been both challenging and rewarding in natural language processing applications. This study offers a new approach based on eventive information in detecting metaphors by leveraging the Chinese writing system, which is a culturally bound ontological system organized according to the basic concepts represented by radicals. As such, the information represented is available in all Chinese text without pre-processing. Since metaphor detection is another culturally based conceptual representation, we hypothesize that sub-textual information can facilitate the identification and classification of the types of metaphoric events denoted in Chinese text. We propose a set of syntactic conditions crucial to event structures to improve the model based on the classification of radical groups. With the proposed syntactic conditions, the model achieves an F-score of 0.8859, a 1.7% improvement over the same classifier with only bag-of-words features. Results show that eventive information can improve the effectiveness of metaphor detection. Event information is rooted in every language, and thus this approach has a high potential to be applied to metaphor detection in other languages.

pdf
Stylometric Studies based on Tone and Word Length Motifs
Renkui Hou | Chu-Ren Huang
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

pdf
Multi-dimensional Meanings of Subjective Adverbs - Case Study of Mandarin Chinese Adverb Pianpian
Mi Zhou | Yao Yao | Chu-Ren Huang
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

pdf
Lexicalization, Separation and Transitivity: A comparative study of Mandarin VO compound Variations
Menghan Jiang | Chu-Ren Huang
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

pdf abs
Fake News Detection Through Multi-Perspective Speaker Profiles
Yunfei Long | Qin Lu | Rong Xiang | Minglei Li | Chu-Ren Huang
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Automatic fake news detection is an important, yet very challenging topic. Traditional methods using lexical features have only very limited success. This paper proposes a novel method to incorporate speaker profiles into an attention based LSTM model for fake news detection. Speaker profiles contribute to the model in two ways. One is to include them in the attention model. The other includes them as additional input data. By adding speaker profiles such as party affiliation, speaker title, location and credit history, our model outperforms the state-of-the-art method by 14.5% in accuracy using a benchmark fake news detection dataset. This proves that speaker profiles provide valuable information to validate the credibility of news articles.

pdf abs
A Cognition Based Attention Model for Sentiment Analysis
Yunfei Long | Qin Lu | Rong Xiang | Minglei Li | Chu-Ren Huang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Attention models are proposed in sentiment analysis because some words are more important than others. However, most existing methods either use local context based text information or user preference information. In this work, we propose a novel attention model trained by cognition grounded eye-tracking data. A reading prediction model is first built using eye-tracking data as dependent data and other features in the context as independent data. The predicted reading time is then used to build a cognition based attention (CBA) layer for neural sentiment analysis. As a comprehensive model, it can capture the attention of words in sentences as well as of sentences in documents. Different attention mechanisms can also be incorporated to capture other aspects of attention. Evaluations show the CBA based method outperforms the state-of-the-art local context based attention methods significantly. This brings insight into how cognition grounded data can be brought into NLP tasks.

2016

pdf
Endurant vs Perdurant: Ontological Motivation for Language Variations
Chu-Ren Huang
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Keynote Speeches and Invited Talks

pdf
Testing APSyn against Vector Cosine on Similarity Estimation
Enrico Santus | Emmanuele Chersoni | Alessandro Lenci | Chu-Ren Huang | Philippe Blache
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf
Transitivity in Light Verb Variations in Mandarin Chinese – A Comparable Corpus-based Statistical Approach
Menghan Jiang | Dingxu Shi | Chu-Ren Huang
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

pdf
The Synaesthetic and Metaphorical Uses of 味 wei ‘taste’ in Chinese Buddhist Suttas
Jiajuan Xiong | Chu-Ren Huang
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

pdf
The use of body part terms in Taiwan and China: Analyzing 血 xue ‘blood’ and 骨 gu ‘bone’ in Chinese Gigaword v. 2.0
Ren-feng Duann | Chu-Ren Huang
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

pdf
Representing Verbs with Rich Contexts: an Evaluation on Verb Similarity
Emmanuele Chersoni | Enrico Santus | Alessandro Lenci | Philippe Blache | Chu-Ren Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf abs
Selective Annotation of Sentence Parts: Identification of Relevant Sub-sentential Units
Ge Xu | Xiaoyan Yang | Chu-Ren Huang
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Many NLP tasks involve sentence-level annotation yet the relevant information is not encoded at sentence level but at some relevant parts of the sentence. Such tasks include but are not limited to: sentiment expression annotation, product feature annotation, and template annotation for Q&A systems. However, annotation of the full corpus sentence by sentence is resource intensive. In this paper, we propose an approach that iteratively extracts frequent parts of sentences for annotating, and compresses the set of sentences after each round of annotation. Our approach can also be used in preparing training sentences for binary classification (domain-related vs. noise, subjectivity vs. objectivity, etc.), assuming that sentence-type annotation can be predicted by annotation of the most relevant sub-sentences. Two experiments are performed to test our proposal and evaluated in terms of time saved and agreement of annotation.

pdf abs
A lexicon of perception for the identification of synaesthetic metaphors in corpora
Francesca Strik Lievers | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Synaesthesia is a type of metaphor associating linguistic expressions that refer to two different sensory modalities. Previous studies, based on the analysis of poetic texts, have shown that synaesthetic transfers tend to go from the lower toward the higher senses (e.g., sweet music vs. musical sweetness). In non-literary language synaesthesia is rare, and finding a sufficient number of examples manually would be too time-consuming. In order to verify whether the directionality also holds for conventional synaesthesia found in non-literary texts, an automatic procedure for the identification of instances of synaesthesia is therefore highly desirable. In this paper, we first focus on the preliminary step of this procedure, that is, the creation of a controlled lexicon of perception. Next, we present the results of a small pilot study that applies the extraction procedure to English and Italian corpus data.

pdf abs
Database of Mandarin Neighborhood Statistics
Karl Neergaard | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the design of controlled experiments with language stimuli, researchers from psycholinguistic, neurolinguistic, and related fields require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word level statistics for words and nonwords of Mandarin Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ASCII phonetic transcription (SAMPA), lexical tone, syllable structure, dominant PoS, and syllable, segment and pinyin lengths for each phonological word. It is designed for researchers particularly concerned with language processing of isolated words and made to accommodate multiple existing hypotheses concerning the structure of the Mandarin syllable. The database is divided into multiple files according to the desired search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.

pdf abs
Nine Features in a Random Forest to Learn Taxonomical Semantic Relations
Enrico Santus | Alessandro Lenci | Tin-Shing Chiu | Qin Lu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

ROOT9 is a supervised system for the classification of hypernyms, co-hyponyms and random words that is derived from the already introduced ROOT13 (Santus et al., 2016). It relies on a Random Forest algorithm and nine unsupervised corpus-based features. We evaluate it with a 10-fold cross validation on 9,600 pairs, equally distributed among the three classes and involving several Parts-Of-Speech (i.e. adjectives, nouns and verbs). When all the classes are present, ROOT9 achieves an F1 score of 90.7%, against a baseline of 57.2% (vector cosine). When the classification is binary, ROOT9 achieves the following results against the baseline: hypernyms-co-hyponyms 95.7% vs. 69.8%, hypernyms-random 91.8% vs. 64.1% and co-hyponyms-random 97.8% vs. 79.4%. In order to compare the performance with the state-of-the-art, we have also evaluated ROOT9 in subsets of the Weeds et al. (2014) datasets, proving that it is in fact competitive. Finally, we investigated whether the system learns the semantic relation or simply learns the prototypical hypernyms, as claimed by Levy et al. (2015). The second possibility seems to be the more likely, even though ROOT9 can be trained on negative examples (i.e., switched hypernyms) to drastically reduce this bias.

pdf abs
What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets
Enrico Santus | Alessandro Lenci | Tin-Shing Chiu | Qin Lu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we claim that Vector Cosine, which is generally considered one of the most efficient unsupervised measures for identifying word similarity in Vector Space Models, can be outperformed by a completely unsupervised measure that evaluates the extent of the intersection among the most associated contexts of two target words, weighting such intersection according to the rank of the shared contexts in the dependency ranked lists. This claim comes from the hypothesis that similar words do not simply occur in similar contexts, but share a larger portion of their most relevant contexts compared to other related words. To prove it, we describe and evaluate APSyn, a variant of Average Precision that, independently of the adopted parameters, outperforms the Vector Cosine and the co-occurrence on the ESL and TOEFL test sets. In the best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on the TOEFL dataset, thereby beating the non-English US college applicants (whose average, as reported in the literature, is 64.50%) and several state-of-the-art approaches.

pdf abs
EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs
Liu Hongchao | Karl Neergaard | Enrico Santus | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Distributional semantic models (DSMs) are currently being used in the measurement of word relatedness and word similarity. One shortcoming of DSMs is that they do not provide a principled way to discriminate different semantic relations. Several approaches have been adopted that rely on annotated data either in the training of the model or later in its evaluation. In this paper, we introduce a dataset for training and evaluating DSMs on semantic relations discrimination between words in Mandarin Chinese. The construction of the dataset followed EVALution 1.0, which is an English dataset for the training and evaluation of DSMs. The dataset contains 360 relation pairs, distributed in five different semantic relations, including antonymy, synonymy, hypernymy, meronymy and near-synonymy. All relation pairs were checked manually to estimate their quality. In the 360 word relation pairs, there are 373 relata. They were all extracted and subsequently manually tagged according to their semantic type. The frequency of the relata was calculated in a combined corpus of Sinica and Chinese Gigaword. To the best of our knowledge, EVALution-MAN is the first of its kind for Mandarin Chinese.

2015

pdf
LLT-PolyU: Identifying Sentiment Intensity in Ironic Tweets
Hongzhi Xu | Enrico Santus | Anna Laszlo | Chu-Ren Huang
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
Shichang Wang | Chu-Ren Huang | Yao Yao | Angel Chan
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

pdf
EVALution 1.0: an Evolving Semantic Dataset for Training and Evaluation of Distributional Semantic Models
Enrico Santus | Frances Yung | Alessandro Lenci | Chu-Ren Huang
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf
What You Need to Know about Chinese for Chinese Language Processing
Chu-Ren Huang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts

pdf
Mechanical Turk-based Experiment vs Laboratory-based Experiment: A Case Study on the Comparison of Semantic Transparency Rating Data
Shichang Wang | Chu-Ren Huang | Yao Yao | Angel Chan
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
Sentiment Analyzer with Rich Features for Ironic and Sarcastic Tweets
Piyoros Tungthamthiti | Enrico Santus | Hongzhi Xu | Chu-Ren Huang | Kiyoaki Shirai
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
The Invertible Construction in Chinese
Yan Cong | Chu-Ren Huang | Lian-Hee Wee
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
Auditory Synaesthesia and Near Synonyms: A Corpus-Based Analysis of sheng1 and yin1 in Mandarin Chinese
Qingqing Zhao | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
When Embodiment Meets Generative Lexicon: The Human Body Part Metaphors in Sinica Corpus
Ren-feng Duann | Chu-Ren Huang
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
De-verbalization and Nominal Categories in Mandarin Chinese: A corpus-driven study in both Mainland Mandarin and Taiwan Mandarin
Jiajuan Xiong | Chu-Ren Huang
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
Graph Theoretic Features of the Adult Mental Lexicon Predict Language Production in Mandarin: Clustering Coefficient
Karl Neergaard | Chu-Ren Huang
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

2014

pdf abs
Annotating Events in an Emotion Corpus
Sophia Lee | Shoushan Li | Chu-Ren Huang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the development of a Chinese event-based emotion corpus. It specifically describes the corpus design, collection and annotation. The proposed annotation scheme provides a consistent way of identifying some emotion-associated events (namely pre-events and post-events). Corpus data show that there are significant interactions between emotions and pre-events as well as between emotions and post-events. We believe that emotion as a pivot event underlies an innovative approach towards a linguistic model of emotion as well as automatic emotion detection and classification.

pdf
Taking Antonymy Mask off in Vector Space
Enrico Santus | Qin Lu | Alessandro Lenci | Chu-Ren Huang
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf
On the Argument Structures of the Transitive Verb ‘annoy; be annoyed; bother to do’: A Study Based on Two Comparable Corpora
Jiajuan Xiong | Chu-Ren Huang
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf bib
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)
Michael Zock | Reinhard Rapp | Chu-Ren Huang
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf
Exploring Mental Lexicon in an Efficient and Economic Way: Crowdsourcing Method for Linguistic Experiments
Shichang Wang | Chu-Ren Huang | Yao Yao | Angel Chan
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf bib
Corpus-based Study and Identification of Mandarin Chinese Light Verb Variations
Chu-Ren Huang | Jingxia Lin | Menghan Jiang | Hongzhi Xu
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf
Annotation and Classification of Light Verbs and Light Verb Variations in Mandarin Chinese
Jingxia Lin | Hongzhi Xu | Menghan Jiang | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf
Building a Semantic Transparency Dataset of Chinese Nominal Compounds: A Practice of Crowdsourcing Methodology
Shichang Wang | Chu-Ren Huang | Yao Yao | Angel Chan
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf
Annotate and Identify Modalities, Speech Acts and Finer-Grained Event Types in Chinese Text
Hongzhi Xu | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

2013

pdf
Joint Modeling of News Reader’s and Comment Writer’s Emotions
Huanhuan Liu | Shoushan Li | Guodong Zhou | Chu-Ren Huang | Peifeng Li
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Metaphor and Qualia: Embodiment or Eventuality
Chu-Ren Huang | Kathleen Ahrens | Francesca Quattri
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013)

pdf
Primitives of Events and the Semantic Representation
Hongzhi Xu | Chu-Ren Huang
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013)

pdf
A Rule System for Chinese Time Entity Recognition by Comprehensive Linguistic Study
Hongzhi Xu | Chu-Ren Huang
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
以中文十億詞語料庫為基礎之兩岸詞彙對比研究 (Cross-Strait Lexical Differences: A Comparative Study based on Chinese Gigaword Corpus) [In Chinese]
Jia-Fei Hong | Chu-Ren Huang
International Journal of Computational Linguistics & Chinese Language Processing, Volume 18, Number 2, June 2013-Special Issue on Chinese Lexical Resources: Theories and Applications

2012

pdf
Compositionality of NN Compounds: A Case Study on [N1+Artifactual-Type Event Nouns]
Shan Wang | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf
The Headedness of Mandarin Chinese Serial Verb Constructions: A Corpus-Based Study
Jingxia Lin | Chu-Ren Huang | Huarui Zhang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf
Type Construction of Event Nouns in Mandarin Chinese
Shan Wang | Chu-Ren Huang
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf
Active Learning for Chinese Word Segmentation
Shoushan Li | Guodong Zhou | Chu-Ren Huang
Proceedings of COLING 2012: Posters

pdf
Sourcing the Crowd for a Few Good Ones: Event Type Detection
Tommaso Caselli | Chu-Ren Huang
Proceedings of COLING 2012: Posters

pdf
SMR-Cmp: Square-Mean-Root Approach to Comparison of Monolingual Contrastive Corpora
HuaRui Zhang | Chu-Ren Huang | Francesca Quattri
Proceedings of COLING 2012: Demonstration Papers

pdf abs
A Grammar-informed Corpus-based Sentence Database for Linguistic and Computational Studies
Hongzhi Xu | Helen Kaiyun Chen | Chu-Ren Huang | Qin Lu | Dingxu Shi | Tin-Shing Chiu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We adopt a corpus-informed approach to example sentence selection for the construction of a reference grammar. In the process, a database of sentences carefully selected by linguistic experts, covering the full range of linguistic facts in an authoritative Chinese Reference Grammar, is constructed and structured according to the reference grammar. A search engine system is developed to help users find the most typical examples they need to study a linguistic problem or prove their hypotheses. The database can also be used as a training corpus by computational linguists to train models for Chinese word segmentation, POS tagging and sentence parsing.

2011

pdf
Compound Event Nouns of the ‘Modifier-head’ Type in Mandarin Chinese
Shan Wang | Chu-Ren Huang
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

pdf
The Co-occurrence of Two Delimiters: An Investigation of Mandarin Chinese Resultatives
Jingxia Lin | Chu-Ren Huang
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

2010

pdf
Employing Personal/Impersonal Views in Supervised and Semi-Supervised Sentiment Classification
Shoushan Li | Chu-Ren Huang | Guodong Zhou | Sophia Yat Mei Lee
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf
Using Corpus-based Linguistic Approaches in Sense Prediction Study
Jia-Fei Hong | Sue-Jin Ker | Chu-Ren Huang | Kathleen Ahrens
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf
Incorporate Credibility into Context for the Best Social Media Answers
Qi Su | Helen Kai-yun Chen | Chu-Ren Huang
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf
Adjectival Modification to Nouns in Mandarin Chinese: Case Studies on “cháng+noun” and “adjective+tú shū guǎn”
Shan Wang | Chu-Ren Huang
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf
Compositional Operations of Mandarin Chinese Perception Verb “kàn”: A Generative Lexicon Approach
Shan Wang | Chu-Ren Huang
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf
Cross-sortal Predication and Polysemy
Petr Šimon | Chu-Ren Huang
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf
A Text-driven Rule-based System for Emotion Cause Detection
Sophia Yat Mei Lee | Ying Chen | Chu-Ren Huang
Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text

pdf bib
Evidentiality for Text Trustworthiness Detection
Qi Su | Chu-Ren Huang | Kai-yun Chen
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

pdf bib
Textual Emotion Processing From Event Analysis
Chu-Ren Huang | Ying Chen | Sophia Yat Mei Lee
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
The Chinese Persons Name Disambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News
Ying Chen | Peng Jin | Wenjie Li | Chu-Ren Huang
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf bib
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
Chu-Ren Huang | Dan Jurafsky
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf
Emotion Cause Detection with Linguistic Constructions
Ying Chen | Sophia Yat Mei Lee | Shoushan Li | Chu-Ren Huang
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf
Sentiment Classification and Polarity Shifting
Shoushan Li | Sophia Y. M. Lee | Ying Chen | Chu-Ren Huang | Guodong Zhou
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Coling 2010: Posters
Chu-Ren Huang | Dan Jurafsky
Coling 2010: Posters

pdf abs
Emotion Cause Events: Corpus Construction and Analysis
Sophia Yat Mei Lee | Ying Chen | Shoushan Li | Chu-Ren Huang
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Emotion processing has always been a great challenge. Given the fact that an emotion is triggered by cause events and that cause events are an integral part of emotion, this paper constructs a Chinese emotion cause corpus as a first step towards automatic inference of cause-emotion correlation. The corpus focuses on five primary emotions, namely happiness, sadness, fear, anger, and surprise. It is annotated with emotion cause events based on our proposed annotation scheme. Corpus data shows that most emotions are expressed with causes, and that causes mostly occur before the corresponding emotion verbs. We also examine the correlations between emotions and cause events in terms of linguistic cues: causative verbs, perception verbs, epistemic markers, conjunctions, prepositions, and others. Results show that each group of linguistic cues serves as an indicator marking the cause events in different structures of emotional constructions. We believe that the emotion cause corpus will be a useful resource for automatic emotion cause detection as well as emotion detection and classification.

pdf abs
Automatic Acquisition of Chinese Novel Noun Compounds
Meng Wang | Chu-Ren Huang | Shiwen Yu | Weiwei Sun
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Automatic acquisition of novel compounds is notoriously difficult because most novel compounds have relatively low frequency in a corpus. The current study proposes a new method to deal with the novel compound acquisition challenge. We model this task as a two-class classification problem in which a candidate compound is either classified as a compound or a non-compound. A machine learning method using SVM, incorporating two types of linguistically motivated features: semantic features and character features, is applied to identify rare but valid noun compounds. We explore two kinds of training data: one is virtual training data which is obtained by three statistical scores, i.e. co-occurrence frequency, mutual information and dependent ratio, from the frequent compounds; the other is real training data which is randomly selected from the infrequent compounds. We conduct comparative experiments, and the experimental results show that even with limited direct evidence in the corpus for the novel compounds, we can make full use of the typical frequent compounds to help in the discovery of the novel compounds.

2009

pdf
An Integrated Approach to Heterogeneous Data for Information Extraction
Ying Chen | Sophia Y. M. Lee | Chu-Ren Huang
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Are Emotions Enumerable or Decomposable? And its Implications for Emotion Processing
Ying Chen | Sophia Y. M. Lee | Chu-Ren Huang
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Chinese WordNet Domains: Bootstrapping Chinese WordNet with Semantic Domain Labels
Lung-Hao Lee | Yu-Ting Yu | Chu-Ren Huang
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Cause Event Representations for Happiness and Surprise
Sophia Yat Mei Lee | Ying Chen | Chu-Ren Huang
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Sentiment Classification Considering Negation and Contrast Transition
Shoushan Li | Chu-Ren Huang
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Bridging the Gap between Graph Modeling and Developmental Psycholinguistics: An Experiment on Measuring Lexical Proximity in Chinese Semantic Space
Shu-Kai Hsieh | Chun-Han Chang | Ivy Kuo | Hintat Cheung | Chu-Ren Huang | Bruno Gaume
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf
Word Boundary Decision with CRF for Chinese Word Segmentation
Shoushan Li | Chu-Ren Huang
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf bib
Proceedings of the Third Linguistic Annotation Workshop (LAW III)
Manfred Stede | Chu-Ren Huang | Nancy Ide | Adam Meyers
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf bib
A Cognitive-based Annotation System for Emotion Computing
Ying Chen | Sophia Y. M. Lee | Chu-Ren Huang
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf
CWN-LMF: Chinese WordNet in the Lexical Markup Framework
Lung-Hao Lee | Shu-Kai Hsieh | Chu-Ren Huang
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

pdf
A Framework of Feature Selection Methods for Text Categorization
Shoushan Li | Rui Xia | Chengqing Zong | Chu-Ren Huang
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Fundamentals of Chinese Language Processing
Chu-Ren Huang | Qin Lu
Tutorial Abstracts of ACL-IJCNLP 2009

2008

pdf bib
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)
Michael Zock | Chu-Ren Huang
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)

pdf
Multilingual Conceptual Access to Lexicon based on Shared Orthography: An ontology-driven study of Chinese and Japanese
Chu-Ren Huang | Ya-Min Chou | Chiyo Hotani | Sheng-Yi Chen | Wan-Ying Lin
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)

pdf
A Realistic and Robust Model for Chinese Word Segmentation
Chu-Ren Huang | Ting-Shuo Yo | Petr Šimon | Shu-Kai Hsieh
Proceedings of the 20th Conference on Computational Linguistics and Speech Processing

pdf
多領域文件集之詞彙概念擴展與知識架構之建立 (Conceptual Expansion and Ontological Mapping of Multi-domain Documents) [In Chinese]
Yong-Xiang Chen | Xiu-Ling Ke | Keh-Jiann Chen | Chu-Ren Huang
ROCLING 2008 Poster Papers

pdf
An Ontology of Chinese Radicals: Concept Derivation and Knowledge Representation based on the Semantic Symbols of the Four Hoofed-Mammals
Chu-Ren Huang | Ya-Jun Yang | Sheng-Yi Chen
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation

pdf
Contrastive Approach towards Text Source Classification based on Top-Bag-of-Word Similarity
Chu-Ren Huang | Lung-Hao Lee
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation

pdf abs
Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System
Chu-Ren Huang | Lung-Hao Lee | Wei-guang Qu | Jia-Fei Hong | Shiwen Yu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We propose a set of heuristics for improving annotation quality of very large corpora efficiently. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The corpus-based POS tags mapping will serve as the basis of the possible contrast in grammatical systems between PRC and Taiwan. And it can serve as the basic model for mapping between the CKIP and ICL tagging systems for any data.

We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (Kybots). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.

pdf abs
Extracting Concrete Senses of Lexicon through Measurement of Conceptual Similarity in Ontologies
Siaw-Fong Chung | Laurent Prévot | Mingwei Xu | Kathleen Ahrens | Shu-Kai Hsieh | Chu-Ren Huang
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The measurement of conceptual similarity in a hierarchical structure has been proposed by studies such as Wu and Palmer (1994), which have been summarized and evaluated in Budanitsky and Hirst (2006). The present study applies the measurement of conceptual similarity to conceptual metaphor research by comparing the concreteness of ontological resource nodes to several prototypical concrete nodes selected by human subjects. Here, the purpose of comparing conceptual similarity between nodes is to select a concrete sense for a word which is used metaphorically. By using a WordNet-SUMO interface such as SinicaBow (Huang, Chang and Lee, 2004), concrete senses of a lexicon will be selected once its SUMO nodes have been compared in terms of conceptual similarity with the prototypical concrete nodes. This study has strong implications for the interaction of psycholinguistic and computational linguistic fields in conceptual metaphor research.

Corpus-based approaches and statistical approaches have been the mainstream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but there is an insufficient amount of language resources in many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources in new languages. This paper presents the latest development efforts of our project, which aims at creating a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability; and iii) the evaluation platform used to test the entire architectural framework.

pdf abs
The Extended Architecture of Hantology for Japan Kanji
Ya-Min Chou | Chu-Ren Huang | Jia-Fei Hong
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Chinese writing system is used not only by Chinese but also by Japanese. The motivation of this paper is to extend the architecture of Hantology, which describes the features of the Chinese writing system, to integrate Japan Kanji and Chinese characters into the same ontology. The problem is that the Chinese characters adopted by Japan have been changed; thus, a modification of the original architecture of Hantology is needed. An extended architecture consisting of orthographic, pronunciation, sense and derived lexicon dimensions is proposed in this paper. The contribution of this study is that the extended architecture of Hantology provides a platform to analyze the variation of Chinese characters used in Japan. The analytic results of variation for a specific Kanji can be integrated into Hantology, so it is easier to study the variation of Chinese characters systematically.

2007

pdf
Computing Thresholds of Linguistic Saliency
Siaw-Fong Chung | Kathleen Ahrens | Chung-Ping Cheng | Chu-Ren Huang | Petr Šimon
Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation

pdf
The Polysemy of Da3: An ontology-based lexical semantic study
Jia-Fei Hong | Chu-Ren Huang | Kathleen Ahrens
Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation

pdf
Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification
Chu-Ren Huang | Petr Šimon | Shu-Kai Hsieh | Laurent Prévot
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf
Automatic Discovery of Named Entity Variants: Grammar-driven Approaches to Non-Alphabetical Transliterations
Chu-Ren Huang | Petr Šimon | Shu-Kai Hsieh
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf
以中文十億詞語料庫為基礎之兩岸詞彙對比研究 (A Study of Lexical Differences between China and Taiwan based on the Chinese Gigaword Corpus) [In Chinese]
Jia-Fei Hung | Chu-Ren Huang | Ming-Wei Xu
ROCLING 2007 Poster Papers

2006

pdf abs
Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus
Wei-Yun Ma | Chu-Ren Huang
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Tagging, as the most crucial annotation of language resources, can still be challenging when the corpus size is big and when the corpus data is not homogeneous. The Chinese Gigaword Corpus is confounded by both challenges. The corpus contains roughly 1.12 billion Chinese characters from two heterogeneous sources: respective news in Taiwan and in Mainland China. In other words, in addition to its size, the data also contains two variants of Chinese that are known to exhibit substantial linguistic differences. We utilize Chinese Sketch Engine as the corpus query tool, by which grammar behaviours of the two heterogeneous resources could be captured and displayed in a unified web interface. In this paper, we report our answer to the two challenges to effectively tag this large-scale corpus. The evaluation result shows our mechanism of tagging maintains high annotation quality.

pdf abs
Hantology-A Linguistic Resource for Chinese Language Processing and Studying
Ya-Min Chou | Chu-Ren Huang
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Hantology, a character-based Chinese language resource, is created to provide an infrastructure for language processing and research on the writing system. Unlike alphabetic or syllabic writing systems, the ideographic writing system of Chinese poses both a challenge and an opportunity. The challenge is that a totally different resource structure must be created to represent and process speakers' conventionalization of the language. The rare opportunity is that the structure itself is enriched with conceptual classification and can be utilized for ontology building. We describe the contents and possible applications of Hantology in this paper. The applications of Hantology include: (1) an account of the diachronic development of Chinese lexica, (2) character-based language processing, (3) a study of conceptual structure differences in Chinese and English, and (4) comparisons of different ideographic writing systems.

pdf
When Conset Meets Synset: A Preliminary Survey of an Ontological Lexical Resource Based on Chinese Characters
Shu-Kai Hsieh | Chu-Ren Huang
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
中文動詞名物化判斷的統計式模型設計 (A Stochastic Model for Prediction of Deverbal Nouns in Mandarin Chinese) [In Chinese]
Wei-Yun Ma | Chu-Ren Huang
Proceedings of the 18th Conference on Computational Linguistics and Speech Processing

pdf
大規模詞彙語意關係自動標示之初步研究: 以中文詞網(Chinese Wordnet)為例 (A Preliminary Study on Large-scale Automatic Labeling of Lexical Semantic Relations: A Case study of Chinese Wordnet) [In Chinese]
Shu-Kai Hsieh | Petr Šimon | Chu-Ren Huang
Proceedings of the 18th Conference on Computational Linguistics and Speech Processing

pdf
Towards Agent-based Cross-Lingual Interoperability of Distributed Lexical Resources
Claudia Soria | Maurizio Tesconi | Andrea Marchetti | Francesca Bertagna | Monica Monachini | Chu-Ren Huang | Nicoletta Calzolari
Proceedings of the Workshop on Multilingual Language Resources and Interoperability

pdf
Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic Research
Jia-Fei Hong | Chu-Ren Huang
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

pdf
Knowledge-Rich Approach to Automatic Grammatical Information Acquisition: Enriching Chinese Sketch Engine with a Lexical Grammar
Chu-Ren Huang | Wei-Yun Ma | Yi-Ching Wu | Chih-Ming Chiu
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

pdf
Using the Swadesh list for creating a simple common taxonomy
Laurent Prévot | Chu-Ren Huang | I-Li Su
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation
