Qin Lu

Also published as: Q. Lu


2023

Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models
Saleh Soltan | Andy Rosenbaum | Tobias Falke | Qin Lu | Anna Rumshisky | Wael Hamza
Findings of the Association for Computational Linguistics: ACL 2023

Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages; however, training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it underperforms a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match the task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.
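As an illustration of this two-stage recipe, here is a minimal PyTorch sketch: a seq2seq model is warm-started from a pre-trained MLM encoder, trained with the encoder frozen, and the encoder is then unfrozen partway through. Module sizes, names, and the checkpoint path are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Seq2seq model whose encoder is warm-started from an MLM checkpoint."""
    def __init__(self, encoder: nn.Module, d_model=256, vocab=32000):
        super().__init__()
        self.encoder = encoder  # warm-started, not trained from scratch
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, src_emb, tgt_emb):
        memory = self.encoder(src_emb)
        return self.lm_head(self.decoder(tgt_emb, memory))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, nhead=4, batch_first=True), num_layers=2)
# encoder.load_state_dict(torch.load("mlm_encoder.pt"))  # hypothetical checkpoint

model = TinySeq2Seq(encoder)

def set_encoder_trainable(m: TinySeq2Seq, trainable: bool):
    for p in m.encoder.parameters():
        p.requires_grad = trainable

set_encoder_trainable(model, False)  # first: train the decoder only
# ... run seq2seq training steps ...
set_encoder_trainable(model, True)   # partway through: unfreeze the encoder
```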

2021

PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP)
Rong Xiang | Jinghang Gu | Emmanuele Chersoni | Wenjie Li | Qin Lu | Chu-Ren Huang
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this contribution, we describe the system presented by the PolyU CBS-Comp Team at Task 1 of SemEval-2021, where the goal was to estimate the complexity of words in a given sentence context. Our top system, based on a combination of lexical, syntactic, word-embedding and Transformer-derived features fed into a Gradient Boosting Regressor, achieves a correlation score of 0.754 on subtask 1 (single words) and 0.659 on subtask 2 (multiword expressions).
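As an illustration of this kind of pipeline, the sketch below concatenates four feature groups and fits a scikit-learn Gradient Boosting Regressor, scoring with Pearson correlation; all feature values and dimensions are synthetic stand-ins, not the team's actual features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
lexical = rng.random((n, 4))       # e.g. word length, frequency (assumed)
syntactic = rng.random((n, 3))     # e.g. POS / dependency indicators (assumed)
embedding = rng.random((n, 50))    # static word-embedding features (assumed)
transformer = rng.random((n, 50))  # features from a Transformer LM (assumed)
y = rng.random(n)                  # gold complexity scores in [0, 1]

X = np.hstack([lexical, syntactic, embedding, transformer])
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
print(np.corrcoef(pred, y[400:])[0, 1])  # Pearson r, the task's metric
```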

2020

Sina Mandarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource
Rong Xiang | Mingyu Wan | Qi Su | Chu-Ren Huang | Qin Lu
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

The Mandarin Alphabetical Word (MAW) is an indispensable component of Modern Chinese that demonstrates unique code-mixing idiosyncrasies influenced by language exchange. Yet this interesting phenomenon has not been properly addressed and is mostly excluded from the Chinese language system. This paper addresses the core problem of MAW identification and proposes to construct a large collection of MAWs from Sina Weibo (SMAW) using an automatic web-based technique that includes rule-based identification, informatics-based extraction, and Baidu search engine validation. A collection of 16,207 qualified SMAWs is obtained using this technique, along with an annotated corpus of more than 200,000 sentences for linguistic research and application-oriented inquiries.
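The rule-based identification step can be pictured with a short sketch that treats Latin-letter tokens as MAW candidates only when they appear in Chinese context; the pattern and example are illustrative assumptions, not the paper's actual rules.

```python
import re

# A token of Latin letters, optionally continued by digits and a few symbols.
MAW_PATTERN = re.compile(r'[A-Za-z][A-Za-z0-9+\-.]*')
CJK = re.compile(r'[\u4e00-\u9fff]')

def maw_candidates(sentence: str):
    """Return Latin-script tokens that occur inside Chinese context."""
    if not CJK.search(sentence):
        return []  # no Chinese context, so no code-mixing
    return MAW_PATTERN.findall(sentence)

print(maw_candidates("他今天要去KTV唱歌，然后上B站看视频。"))
# -> ['KTV', 'B']
```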

Automatic Learning of Modality Exclusivity Norms with Crosslingual Word Embeddings
Emmanuele Chersoni | Rong Xiang | Qin Lu | Chu-Ren Huang
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Collecting modality exclusivity norms for lexical items has recently become a common practice in psycholinguistics and cognitive research. However, these norms are available only for a relatively small number of languages and often involve a costly and time-consuming collection of ratings. In this work, we aim to learn a mapping between word embeddings and modality norms. Our experiments focus on crosslingual word embeddings, in order to predict modality association scores by training on a high-resource language and testing on a low-resource one. We ran two experiments, one in a monolingual and the other in a crosslingual setting. Results show that modality prediction using off-the-shelf crosslingual embeddings indeed has moderate-to-high correlations with human ratings, even when regression algorithms are trained on an English resource and tested on a completely unseen language.
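A minimal sketch of the mapping, assuming aligned crosslingual embeddings are available as plain dicts: a ridge regressor is fit from English vectors to modality ratings and then applied to a vector of an unseen language. All vectors and scores below are synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
dim = 300
en_vecs = {w: rng.normal(size=dim) for w in ["thunder", "velvet", "lemon"]}
en_norms = {"thunder": 4.6, "velvet": 4.1, "lemon": 3.9}  # e.g. auditory scores

X = np.array([en_vecs[w] for w in en_norms])
y = np.array([en_norms[w] for w in en_norms])
reg = Ridge(alpha=1.0).fit(X, y)

# Predict for a word of another language sharing the embedding space.
it_vec = rng.normal(size=dim)  # stand-in for an Italian word vector
print(reg.predict(it_vec.reshape(1, -1)))
```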

Affection Driven Neural Networks for Sentiment Analysis
Rong Xiang | Yunfei Long | Mingyu Wan | Jinghang Gu | Qin Lu | Chu-Ren Huang
Proceedings of the Twelfth Language Resources and Evaluation Conference

Deep neural network models have played a critical role in sentiment analysis, with promising results over the past decade. One of the essential challenges, however, is how external sentiment knowledge can be effectively utilized. In this work, we propose a novel affection-driven approach to incorporating affective knowledge into neural network models. The affective knowledge is obtained in the form of a lexicon under Affect Control Theory (ACT), which represents each entry by a vector of three-dimensional attributes: Evaluation, Potency, and Activity (EPA). The EPA vectors are mapped to an affective influence value and then integrated into Long Short-Term Memory (LSTM) models to highlight affective terms. Experimental results show a consistent improvement of our approach over conventional LSTM models by 1.0% to 1.5% in accuracy on three large benchmark datasets. Evaluations across a variety of algorithms also demonstrate the effectiveness of leveraging affective terms for deep model enhancement.
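A minimal sketch of the integration idea in PyTorch: each token's three-dimensional EPA vector is mapped to a scalar influence value used to scale the token representation before the LSTM. The mapping chosen here (one plus the vector norm) is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AffectiveLSTM(nn.Module):
    def __init__(self, d_emb=100, d_hid=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_classes)

    def forward(self, emb, epa):
        # emb: (B, T, d_emb); epa: (B, T, 3) from the ACT lexicon (zeros if absent)
        influence = 1.0 + epa.norm(dim=-1, keepdim=True)  # highlight affective terms
        _, (h, _) = self.lstm(emb * influence)
        return self.out(h[-1])

model = AffectiveLSTM()
logits = model(torch.randn(4, 12, 100), torch.rand(4, 12, 3))
print(logits.shape)  # torch.Size([4, 2])
```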

Ciron: a New Benchmark Dataset for Chinese Irony Detection
Rong Xiang | Xuefeng Gao | Yunfei Long | Anran Li | Emmanuele Chersoni | Qin Lu | Chu-Ren Huang
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic Chinese irony detection is a challenging task with a strong impact on linguistic research. However, the field has long lacked labeled benchmark datasets. In this paper, we introduce Ciron, the first Chinese benchmark dataset for irony detection with machine learning models. Ciron includes more than 8.7K posts collected from Weibo, a microblogging platform. Most importantly, Ciron is collected with no pre-conditions, ensuring much wider coverage. Evaluation with seven different machine learning classifiers demonstrates the usefulness of Ciron as an important resource for Chinese irony detection.

2019

Improving Multi-label Emotion Classification by Integrating both General and Domain-specific Knowledge
Wenhao Ying | Rong Xiang | Qin Lu
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Deep learning based general language models have achieved state-of-the-art results in many popular tasks such as sentiment analysis and question answering. Text in domains like social media has its own salient characteristics, and domain knowledge should be helpful in domain-relevant tasks. In this work, we devise a simple method to obtain domain knowledge and further propose a method to integrate it with general knowledge from deep language models to improve the performance of emotion classification. Experiments on Twitter data show that even though a deep language model fine-tuned on target-domain data attains results comparable to those of previous state-of-the-art models, the fine-tuned model can still benefit from our extracted domain knowledge to obtain further improvement. This highlights the importance of making use of domain knowledge in domain-specific applications.

2018

Leveraging Writing Systems Change for Deep Learning Based Chinese Emotion Analysis
Rong Xiang | Yunfei Long | Qin Lu | Dan Xiong | I-Hsuan Chen
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Social media text written in Chinese communities contains mixed scripts: major text written in Chinese, an ideograph-based writing system, and some minor text using Latin letters, an alphabet-based writing system. This phenomenon is called a writing systems change (WSC). Past studies have shown that WSCs can be used to express emotions, particularly where the social and political environment is more conservative. However, because WSCs can break the syntax of the major text, they pose additional challenges for Natural Language Processing (NLP) tasks like emotion classification. In this work, we present a novel deep learning based method that includes WSCs as an effective feature for emotion analysis. The method first identifies all WSC points. The representation of the major text is then learned through an LSTM model, whereas the minor text is modeled by a separate CNN. Emotions in the minor text are further highlighted through an attention mechanism before emotion classification. Performance evaluation shows that incorporating WSC features with deep learning models improves F1-scores compared to the state-of-the-art model.
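The WSC identification step can be sketched as a scan for positions where adjacent characters switch between the CJK and Latin scripts; this heuristic is an illustrative assumption, not the paper's exact procedure.

```python
def script(ch: str) -> str:
    if '\u4e00' <= ch <= '\u9fff':
        return "cjk"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"

def wsc_points(text: str):
    """Indices where the script switches between CJK and Latin."""
    points = []
    for i in range(1, len(text)):
        a, b = script(text[i - 1]), script(text[i])
        if {a, b} == {"cjk", "latin"}:
            points.append(i)
    return points

print(wsc_points("这个plan真的好吗"))  # -> [2, 6]
```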

Dual Memory Network Model for Biased Product Review Classification
Yunfei Long | Mingyu Ma | Qin Lu | Rong Xiang | Chu-Ren Huang
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In sentiment analysis (SA) of product reviews, both user and product information have proven useful. Current approaches handle user-profile and product information in a unified model, which may not learn salient features of users and products effectively. In this work, we propose a dual user and product memory network (DUPMN) model that learns user profiles and product reviews with separate memory networks. The two representations are then used jointly for sentiment prediction. The use of separate models aims to capture user profiles and product information more effectively. Compared to state-of-the-art unified prediction models, evaluations on three benchmark datasets, IMDB, Yelp13, and Yelp14, show that our dual learning model gives performance gains of 0.6%, 1.2%, and 0.9%, respectively. The improvements are also highly significant as measured by p-values.

Food-Related Sentiment Analysis for Cantonese
Natalia Klyueva | Yunfei Long | Chu-Ren Huang | Qin Lu
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 25th Joint Workshop on Linguistics and Language Processing

2017

Leveraging Eventive Information for Better Metaphor Detection and Classification
I-Hsuan Chen | Yunfei Long | Qin Lu | Chu-Ren Huang
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Metaphor detection has been both challenging and rewarding in natural language processing applications. This study offers a new approach to detecting metaphors based on eventive information, leveraging the Chinese writing system, a culturally bound ontological system organized according to the basic concepts represented by radicals. As such, this information is available in all Chinese text without pre-processing. Since metaphor is another culturally based conceptual representation, we hypothesize that such sub-textual information can facilitate the identification and classification of the types of metaphoric events denoted in Chinese text. We propose a set of syntactic conditions crucial to event structures to improve the model based on the classification of radical groups. With the proposed syntactic conditions, the model achieves an F-score of 0.8859, a 1.7% improvement over the same classifier with only bag-of-words features. Results show that eventive information can improve the effectiveness of metaphor detection. Event information is rooted in every language, so this approach has high potential to be applied to metaphor detection in other languages.

Are Manually Prepared Affective Lexicons Really Useful for Sentiment Analysis
Minglei Li | Qin Lu | Yunfei Long
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In this paper, we investigate the effectiveness of different affective lexicons through sentiment analysis of phrases. We examine how phrases can be represented through manually prepared lexicons, lexicons extended with computational methods, or word embeddings. Comparative studies clearly show that word embeddings obtained with unsupervised distributional methods outperform manually prepared lexicons, no matter which affective models underlie the lexicons. Our conclusion is that although different affective lexicons are cognitively backed by theories, they do not show any advantage over automatically obtained word embeddings.

Fake News Detection Through Multi-Perspective Speaker Profiles
Yunfei Long | Qin Lu | Rong Xiang | Minglei Li | Chu-Ren Huang
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Automatic fake news detection is an important yet very challenging topic. Traditional methods using lexical features have had only limited success. This paper proposes a novel method to incorporate speaker profiles into an attention-based LSTM model for fake news detection. Speaker profiles contribute to the model in two ways: one is to include them in the attention model; the other is to include them as additional input data. By adding speaker profiles such as party affiliation, speaker title, location, and credit history, our model outperforms the state-of-the-art method by 14.5% in accuracy on a benchmark fake news detection dataset. This shows that speaker profiles provide valuable information for validating the credibility of news articles.

A Cognition Based Attention Model for Sentiment Analysis
Yunfei Long | Qin Lu | Rong Xiang | Minglei Li | Chu-Ren Huang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Attention models are proposed in sentiment analysis because some words are more important than others. However, most existing methods use either local-context-based text information or user preference information. In this work, we propose a novel attention model trained on cognition-grounded eye-tracking data. A reading prediction model is first built using eye-tracking data as the dependent variable and other contextual features as independent variables. The predicted reading time is then used to build a cognition-based attention (CBA) layer for neural sentiment analysis. The model captures attention over words in sentences as well as over sentences in documents, and different attention mechanisms can be incorporated to capture other aspects of attention. Evaluations show that the CBA-based method significantly outperforms state-of-the-art local-context-based attention methods. This brings insight into how cognition-grounded data can be brought into NLP tasks.
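A minimal sketch of the CBA idea: a regressor trained on eye-tracking data predicts per-word reading times, and a softmax over the predictions yields attention weights for a sentence. The features and data below are synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
# Training: context features of words paired with gold eye-tracking times.
X_train = rng.random((200, 5))  # e.g. length, frequency, surprisal (assumed)
t_train = rng.random(200)       # fixation durations
reader = RandomForestRegressor(n_estimators=100).fit(X_train, t_train)

def cba_weights(word_feats: np.ndarray) -> np.ndarray:
    """Softmax over predicted reading times = attention over words."""
    t = reader.predict(word_feats)
    e = np.exp(t - t.max())
    return e / e.sum()

sentence_feats = rng.random((8, 5))  # features for an 8-word sentence
print(cba_weights(sentence_feats))   # attention weights summing to 1
```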

A Question Answering Approach for Emotion Cause Extraction
Lin Gui | Jiannan Hu | Yulan He | Ruifeng Xu | Qin Lu | Jiachen Du
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Emotion cause extraction aims to identify the reasons behind a certain emotion expressed in text. It is a much more difficult task than emotion classification. Inspired by recent advances in using deep memory networks for question answering (QA), we propose a new approach that treats emotion cause identification as a reading comprehension task in QA. Inspired by convolutional neural networks, we also propose a new mechanism that stores relevant context in different memory slots to model context information. Our approach can extract both word-level sequence features and lexical features. Performance evaluation shows that our method achieves state-of-the-art performance on a recently released emotion cause dataset, outperforming a number of competitive baselines by at least 3.01% in F-measure.

2016

Syllable based DNN-HMM Cantonese Speech to Text System
Timothy Wong | Claire Li | Sam Lam | Billy Chiu | Qin Lu | Minglei Li | Dan Xiong | Roy Shing Yu | Vincent T.Y. Ng
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports our work on building a Cantonese Speech-to-Text (STT) system with a syllable-based acoustic model. It is part of an effort to build an STT system that aids dyslexic students who have cognitive deficiencies in writing skills but have no problem expressing their ideas through speech. For Cantonese speech recognition, the basic unit of acoustic models can be either the conventional Initial-Final (IF) syllables or the Onset-Nucleus-Coda (ONC) syllables, where finals are further split into nucleus and coda to reflect the intra-syllable variations in Cantonese. Using the Kaldi toolkit, our hybrid Deep Neural Network and Hidden Markov Model (DNN-HMM) system is trained with stochastic gradient descent on GPUs, with and without i-vector based speaker adaptive training. In all cases, the DNN input features are those of the same Gaussian Mixture Model with speaker adaptive training (GMM-SAT). Experiments show that ONC-based syllable acoustic modeling with the i-vector based DNN-HMM achieves the best performance, with a word error rate (WER) of 9.66% and a real time factor (RTF) of 1.38812.

Nine Features in a Random Forest to Learn Taxonomical Semantic Relations
Enrico Santus | Alessandro Lenci | Tin-Shing Chiu | Qin Lu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

ROOT9 is a supervised system for the classification of hypernyms, co-hyponyms and random words, derived from the previously introduced ROOT13 (Santus et al., 2016). It relies on a Random Forest algorithm and nine unsupervised corpus-based features. We evaluate it with 10-fold cross validation on 9,600 pairs, equally distributed among the three classes and involving several parts of speech (adjectives, nouns and verbs). When all the classes are present, ROOT9 achieves an F1 score of 90.7%, against a baseline of 57.2% (vector cosine). When the classification is binary, ROOT9 achieves the following results against the baseline: hypernyms vs. co-hyponyms, 95.7% vs. 69.8%; hypernyms vs. random, 91.8% vs. 64.1%; and co-hyponyms vs. random, 97.8% vs. 79.4%. To compare the performance with the state of the art, we also evaluated ROOT9 on subsets of the Weeds et al. (2014) datasets, showing that it is in fact competitive. Finally, we investigated whether the system learns the semantic relation or simply learns prototypical hypernyms, as claimed by Levy et al. (2015). The second possibility seems the most likely, even though ROOT9 can be trained on negative examples (i.e., switched hypernyms) to drastically reduce this bias.
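The experimental setup can be sketched briefly: a Random Forest over nine corpus-based features, evaluated with 10-fold cross validation on 9,600 pairs; the features below are random stand-ins for the paper's nine.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.random((9600, 9))                        # nine features per word pair
y = rng.integers(0, 3, size=9600)                # 0=hyper, 1=co-hypo, 2=random
clf = RandomForestClassifier(n_estimators=100)
print(cross_val_score(clf, X, y, cv=10).mean())  # 10-fold CV, as in the paper
```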

What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets
Enrico Santus | Alessandro Lenci | Tin-Shing Chiu | Qin Lu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we claim that Vector Cosine, generally considered one of the most efficient unsupervised measures for identifying word similarity in Vector Space Models, can be outperformed by a completely unsupervised measure that evaluates the extent of the intersection among the most associated contexts of two target words, weighting that intersection according to the rank of the shared contexts in the dependency-ranked lists. This claim comes from the hypothesis that similar words do not simply occur in similar contexts, but share a larger portion of their most relevant contexts compared to other related words. To prove it, we describe and evaluate APSyn, a variant of Average Precision that, independently of the adopted parameters, outperforms Vector Cosine and co-occurrence on the ESL and TOEFL test sets. In the best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on the TOEFL dataset, therefore beating the non-English US college applicants (whose average, as reported in the literature, is 64.50%) and several state-of-the-art approaches.
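A minimal sketch of APSyn as described: the overlap of the two words' top-ranked context lists is scored, with each shared context weighted by the inverse of its average rank (the paper's exact weighting may differ in detail).

```python
def apsyn(contexts_a, contexts_b):
    """contexts_a/b: lists of contexts, most associated first."""
    rank_a = {c: r for r, c in enumerate(contexts_a, start=1)}
    rank_b = {c: r for r, c in enumerate(contexts_b, start=1)}
    shared = rank_a.keys() & rank_b.keys()
    return sum(1.0 / ((rank_a[c] + rank_b[c]) / 2.0) for c in shared)

# Toy ranked context lists for two target words:
print(apsyn(["eat", "red", "tree", "juice"],
            ["eat", "tree", "green", "peel"]))  # -> 1.4
```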

Event-Driven Emotion Cause Extraction with Corpus Construction
Lin Gui | Dongyin Wu | Ruifeng Xu | Qin Lu | Yu Zhou
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Event Based Emotion Classification for News Articles
Minglei Li | Da Wang | Qin Lu | Yunfei Long
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

2014

Chasing Hypernyms in Vector Spaces with Entropy
Enrico Santus | Alessandro Lenci | Qin Lu | Sabine Schulte im Walde
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Cross-lingual Opinion Analysis via Negative Transfer Detection
Lin Gui | Ruifeng Xu | Qin Lu | Jun Xu | Jian Xu | Bin Liu | Xiaolong Wang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing
Xu Sun | Wenjie Li | Houfeng Wang | Qin Lu
Computational Linguistics, Volume 40, Issue 3 - September 2014

Taking Antonymy Mask off in Vector Space
Enrico Santus | Qin Lu | Alessandro Lenci | Chu-Ren Huang
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

2013

PolyUCOMP-CORE_TYPED: Computing Semantic Textual Similarity using Overlapped Senses
Jian Xu | Qin Lu
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

A Grammar-informed Corpus-based Sentence Database for Linguistic and Computational Studies
Hongzhi Xu | Helen Kaiyun Chen | Chu-Ren Huang | Qin Lu | Dingxu Shi | Tin-Shing Chiu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We adopt a corpus-informed approach to example-sentence selection for the construction of a reference grammar. In the process, a database is constructed of sentences carefully selected by linguistic experts to include the full range of linguistic facts covered in an authoritative Chinese reference grammar, structured according to that grammar. A search engine is developed to help users find the most typical examples they need to study a linguistic problem or test their hypotheses. The database can also be used as a training corpus by computational linguists to train models for Chinese word segmentation, POS tagging and sentence parsing.

PolyUCOMP: Combining Semantic Vectors with Skip bigrams for Semantic Textual Similarity
Jian Xu | Qin Lu | Zhengzhong Liu
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Explore Chinese Encyclopedic Knowledge to Disambiguate Person Names
Jie Liu | Ruifeng Xu | Qin Lu | Jian Xu
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

2011

A Hybrid Extraction Model for Chinese Noun/Verb Synonymous bi-gram Collocations
Wanyin Li | Qin Lu
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

2010

Combining Constituent and Dependency Syntactic Views for Chinese Semantic Role Labeling
Shiqi Li | Qin Lu | Tiejun Zhao | Pengyuan Liu | Hanjing Li
Coling 2010: Posters

A Study on Position Information in Document Summarization
You Ouyang | Wenjie Li | Qin Lu | Renxian Zhang
Coling 2010: Posters

Sentence Ordering with Event-Enriched Semantics and Two-Layered Clustering for Multi-Document News Summarization
Renxian Zhang | Wenjie Li | Qin Lu
Coling 2010: Posters

Proceedings of the 6th Workshop on Ontologies and Lexical Resources
Alessandro Oltramari | Piek Vossen | Qin Lu
Proceedings of the 6th Workshop on Ontologies and Lexical Resources

PKU_HIT: An Event Detection System Based on Instances Expansion and Rich Syntactic Features
Shiqi Li | Pengyuan Liu | Tiejun Zhao | Qin Lu | Hanjing Li
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

An Integrated Multi-document Summarization Approach based on Word Hierarchical Representation
You Ouyang | Wenjie Li | Qin Lu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Chinese Term Extraction Using Different Types of Relevance
Yuhang Yang | Tiejun Zhao | Qin Lu | Dequan Zheng | Hao Yu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Fundamentals of Chinese Language Processing
Chu-Ren Huang | Qin Lu
Tutorial Abstracts of ACL-IJCNLP 2009

2008

PNR2: Ranking Sentences with Positive and Negative Reinforcement for Query-Oriented Update Summarization
Wenjie Li | Furu Wei | Qin Lu | Yanxiang He
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Chinese Term Extraction Using Minimal Resources
Yuhang Yang | Qin Lu | Tiejun Zhao
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Preliminary Chinese Term Classification for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li
Proceedings of the 6th Workshop on Asian Language Resources

Chinese Core Ontology Construction from a Bilingual Term Bank
Yirong Chen | Qin Lu | Wenjie Li | Gaoying Cui
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

A core ontology is a mid-level ontology that bridges the gap between an upper ontology and a domain ontology. Automatic Chinese core ontology construction can help model domain knowledge quickly. A graph-based core ontology construction algorithm (COCA) is proposed to automatically construct a core ontology from an English-Chinese bilingual term bank. The algorithm computes the mapping strength from a selected Chinese term to a WordNet synset associated with an upper-level SUMO concept. The strength is measured using a graph model that integrates several mapping features from multiple information sources, including multiple translation links between the Chinese core term and WordNet, an extended string feature, and a part-of-speech feature. Evaluation of COCA on an English-Chinese bilingual term bank with more than 130K entries shows that the algorithm improves on our previous work and can better serve the semi-automatic construction of a mid-level ontology.

Chinese Term Extraction Based on Delimiters
Yuhang Yang | Qin Lu | Tiejun Zhao
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Existing techniques extract term candidates by looking for internal and contextual information associated with domain-specific terms. These algorithms face the dilemma that fewer features are not enough to distinguish terms from non-terms, whereas more features lead to more conflicts among the selected features. This paper presents a novel approach to term extraction based on delimiters, which are much more stable and domain independent. The proposed approach is not as sensitive to term frequency as previous work. It imposes no strict limits or hard rules and can thus deal with all kinds of terms. It also requires no prior domain knowledge and no additional training to adapt to new domains, so it can be applied to different domains easily and is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques, verifying its effectiveness and domain-independent nature. Experiments on new term extraction indicate that the proposed approach can also serve as an effective tool for domain lexicon expansion.
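The core idea can be sketched simply: treat stable, domain-independent tokens (e.g. function words) as delimiters and take the maximal spans between them as term candidates; the delimiter set here is an illustrative assumption.

```python
DELIMITERS = {"的", "是", "在", "和", "了", "对", "进行"}

def candidates(tokens):
    """Segment a tokenized sentence at delimiter tokens."""
    terms, cur = [], []
    for tok in tokens:
        if tok in DELIMITERS:
            if cur:
                terms.append("".join(cur))
            cur = []
        else:
            cur.append(tok)
    if cur:
        terms.append("".join(cur))
    return terms

print(candidates(["蛋白质", "组学", "是", "重要", "的", "研究", "领域"]))
# -> ['蛋白质组学', '重要', '研究领域']
```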

Corpus Exploitation from Wikipedia for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li | Yirong Chen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Ontology construction usually requires a domain-specific corpus for building the corresponding concept hierarchy, and the corpus must have good coverage of domain knowledge. Wikipedia (Wiki), the world's largest online encyclopaedic knowledge source, is open-content, collaboratively edited, free of charge, covers millions of articles, and keeps expanding. These characteristics make Wiki a good candidate for a domain corpus in ontology construction. However, the selected article collection must have considerable quality and quantity. In this paper, a novel approach is proposed to identify Wiki articles as a domain-specific corpus by using the classification information available in Wiki pages. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wiki; only articles strongly linked to this hierarchy are selected as the domain corpus. The proposed approach uses the linked category information in Wiki pages to produce the hierarchy as a directed graph, from which a set of pages in the same connected branch is obtained. Ranking and filtering are then performed on these pages based on the classification tree generated by the traversal algorithm. The experimental and evaluation results show that Wiki is a good resource for acquiring a relatively high-quality domain-specific corpus for ontology construction.
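A minimal sketch of the selection step: build a directed graph from categories to subcategories, traverse it from a seed domain category, and collect the member pages of the reachable branch. The toy category data and function names are assumptions.

```python
from collections import deque

# category -> subcategories, and category -> member pages (assumed inputs)
subcats = {"Biology": ["Genetics", "Ecology"], "Genetics": ["Genomics"]}
pages = {"Genetics": ["Gene", "Allele"], "Genomics": ["Genome"],
         "Ecology": ["Habitat"], "Music": ["Guitar"]}

def domain_pages(seed: str):
    """BFS over the category graph; keep pages in the connected branch."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue:
        cat = queue.popleft()
        order.extend(pages.get(cat, []))
        for sub in subcats.get(cat, []):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return order

print(domain_pages("Biology"))  # -> ['Gene', 'Allele', 'Habitat', 'Genome']
```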

Exploiting the Role of Position Feature in Chinese Relation Extraction
Peng Zhang | Wenjie Li | Furu Wei | Qin Lu | Yuexian Hou
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Relation extraction is the task of finding pre-defined semantic relations between two entities or entity mentions in text. Many methods, such as feature-based and kernel-based methods, have been proposed in the literature; among them, feature-based methods draw much attention from researchers. However, to the best of our knowledge, existing feature-based methods have not explicitly incorporated the position feature, and no in-depth analysis has been conducted in this regard. In this paper, we define and exploit nine types of position information between two named entity mentions and use them along with other features in a multi-class classification framework for Chinese relation extraction. Experiments on the ACE 2005 data set show that the position feature is more effective than other recognized features such as entity type/subtype and character-based N-gram context. Most importantly, it can be captured easily and does not require as much effort as deep natural language processing.
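To illustrate what a position feature looks like, the sketch below labels the relative position of two mention spans; the labels shown exemplify a nine-way scheme of this kind, not the paper's exact inventory.

```python
def position_feature(a: tuple, b: tuple) -> str:
    """a, b: (start, end) character spans of the two mentions, end exclusive."""
    if a[1] <= b[0]:
        return "adjacent_before" if a[1] == b[0] else "before"
    if b[1] <= a[0]:
        return "adjacent_after" if b[1] == a[0] else "after"
    if a[0] <= b[0] and b[1] <= a[1]:
        return "contains"
    if b[0] <= a[0] and a[1] <= b[1]:
        return "contained"
    return "overlap"

print(position_feature((0, 4), (4, 9)))   # -> adjacent_before
print(position_feature((2, 10), (4, 7)))  # -> contains
```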

A Novel Feature-based Approach to Chinese Entity Relation Extraction
Wenjie Li | Peng Zhang | Furu Wei | Yuexian Hou | Qin Lu
Proceedings of ACL-08: HLT, Short Papers

2007

Extractive Summarization Based on Event Term Clustering
Maofu Liu | Wenjie Li | Mingli Wu | Qin Lu
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

Annotating Chinese Collocations with Multi Information
Ruifeng Xu | Qin Lu | Kam-Fai Wong | Wenjie Li
Proceedings of the Linguistic Annotation Workshop

2006

Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li | Mingli Wu | Qin Lu | Wei Xu | Chunfa Yuan
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

A Comparative Study of the Effect of Word Segmentation On Chinese Terminology Extraction
Luning Ji | Qin Lu | Wenjie Li | YiRong Chen
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

TCtract-A Collocation Extraction Approach for Noun Phrases Using Shallow Parsing Rules and Statistic Models
Wan Yin Li | Qin Lu | James Liu
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

Mining Implicit Entities in Queries
Wei Li | Wenjie Li | Qin Lu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Entities are pivotal in describing events and objects, and are also very important in document summarization. In general, only explicit entities, which can be extracted by a named entity recognizer, are used in real applications. However, implicit entities hidden behind phrases or words, e.g. the entity referred to by the phrase “cross border”, have proved helpful in document summarization. In our experiments, we extract implicit entities from web resources.

A Study on Terminology Extraction Based on Classified Corpora
Yirong Chen | Qin Lu | Wenjie Li | Zhifang Sui | Luning Ji
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Algorithms for automatic term extraction in a specific domain should consider at least two issues, namely unithood and termhood (Kageura, 1996). Unithood refers to the degree to which a string occurs as a word or phrase. Termhood (Chen Yirong, 2005) refers to the degree to which a word or phrase represents a domain-specific concept. Unlike unithood, termhood has not yet been widely studied. In classified corpora, the class information provides a cue to the nature of the data and can be used in termhood calculation. Three algorithms are provided and evaluated to investigate termhood based on classified corpora, based respectively on lexicon-set computation, on term frequency and document frequency, and on the strength of the relation between a term and its document class. Our objective is to investigate the effects of these different termhood features, find which are more effective, and determine how they can be improved to achieve the best performance. Preliminary results show that the first measure can effectively filter out domain-independent terms and terms of general use.
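A minimal sketch of a frequency-based termhood measure in the spirit of the second algorithm: a term's relative frequency within its document class is compared against the rest of the classified corpus. The ratio used here is an illustrative simplification, not the paper's formula.

```python
def termhood(term: str, class_docs, other_docs) -> float:
    """Ratio of the term's relative frequency in-class vs. out-of-class."""
    def rel_freq(docs):
        total = sum(len(d) for d in docs) or 1
        return sum(d.count(term) for d in docs) / total
    in_cls, outside = rel_freq(class_docs), rel_freq(other_docs)
    return in_cls / (outside + 1e-9)

# Toy classified corpus: token lists grouped by document class.
it_docs = [["kernel", "driver", "kernel"], ["compiler", "kernel"]]
news_docs = [["election", "kernel"], ["market", "policy"]]
print(termhood("kernel", it_docs, news_docs))  # >> 1: domain-specific
```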

Interaction between Lexical Base and Ontology with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li | Ruifeng Xu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

An ontology describes conceptual knowledge in a specific domain. A lexical base is a repository of words that gives independent definitions of concepts. In this paper, we propose to use Formal Concept Analysis (FCA) as a tool to help construct an ontology from an existing lexical base. We mainly address two issues: how to select attributes to visualize the relations between lexical terms, and how to revise lexical definitions by analysing the relations in the ontology. The focus is thus on the effect of interaction between a lexical base and an ontology for the purpose of good ontology construction. Finally, experiments are conducted to verify our ideas.

The Design and Construction of A Chinese Collocation Bank
Ruifeng Xu | Qin Lu | Sujian Li
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents an annotated Chinese collocation bank developed at the Hong Kong Polytechnic University. We first discuss a definition of collocation with good linguistic consistency and good computational operability, and present the properties of collocations. Secondly, based on combinations of different properties, collocations are classified into four types. Thirdly, the annotation guideline is presented. Fourthly, implementation issues for collocation bank construction are addressed, including annotation with categorization, dependency and contextual information. Currently, the collocation bank covers 3,643 headwords in a 5-million-word corpus.

2005

Similarity Based Chinese Synonym Collocation Extraction
Wanyin Li | Qin Lu | Ruifeng Xu
International Journal of Computational Linguistics & Chinese Language Processing, Volume 10, Number 1, March 2005

The Design and Construction of the PolyU Shallow Treebank
Ruifeng Xu | Qin Lu | Yin Li | Wanyin Li
International Journal of Computational Linguistics & Chinese Language Processing, Volume 10, Number 3, September 2005: Special Issue on Selected Papers from ROCLING XVI

A Preliminary Work on Classifying Time Granularities of Temporal Questions
Wei Li | Wenjie Li | Qin Lu | Kam-Fai Wong
Second International Joint Conference on Natural Language Processing: Full Papers

CTEMP: A Chinese Temporal Parser for Extracting and Normalizing Temporal Information
Mingli Wu | Wenjie Li | Qin Lu | Baoli Li
Second International Joint Conference on Natural Language Processing: Full Papers

Integrating Collocation Features in Chinese Word Sense Disambiguation
Wanyin Li | Qin Lu | Wenjie Li
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

Experiments of Ontology Construction with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

Using Synonym Relations in Chinese Collocation Extraction
Wanyin Li | Qin Lu | Ruifeng Xu
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing

The Construction of A Chinese Shallow Treebank
Ruifeng Xu | Qin Lu | Yin Li | Wanyin Li
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing

2003

A Unicode Based Adaptive Segmentor
Q. Lu | S. T. Chan | R. F. Xu | T. S. Chiu | B. L. Li | S. W. Yu
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

2002

Decomposition for ISO/IEC 10646 Ideographic Characters
Qin Lu | Shiu Tong Chan | Yin Li | Ngai Ling Li
COLING-02: The 3rd Workshop on Asian Language Resources and International Standardization