Fu Lee Wang


2020

pdf bib
Neural Mixed Counting Models for Dispersed Topic Discovery
Jiemin Wu | Yanghui Rao | Zusheng Zhang | Haoran Xie | Qing Li | Fu Lee Wang | Ziye Chen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Mixed counting models that use the negative binomial distribution as the prior can well model over-dispersed and hierarchically dependent random variables; thus they have attracted much attention in mining dispersed document topics. However, the existing parameter inference method like Monte Carlo sampling is quite time-consuming. In this paper, we propose two efficient neural mixed counting models, i.e., the Negative Binomial-Neural Topic Model (NB-NTM) and the Gamma Negative Binomial-Neural Topic Model (GNB-NTM) for dispersed topic discovery. Neural variational inference algorithms are developed to infer model parameters by using the reparameterization of Gamma distribution and the Gaussian approximation of Poisson distribution. Experiments on real-world datasets indicate that our models outperform state-of-the-art baseline models in terms of perplexity and topic coherence. The results also validate that both NB-NTM and GNB-NTM can produce explainable intermediate variables by generating dispersed proportions of document topics.

2018

pdf bib
Siamese Network-Based Supervised Topic Modeling
Minghui Huang | Yanghui Rao | Yuwei Liu | Haoran Xie | Fu Lee Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Label-specific topics can be widely used for supporting personality psychology, aspect-level sentiment analysis, and cross-domain sentiment classification. To generate label-specific topics, several supervised topic models which adopt likelihood-driven objective functions have been proposed. However, it is hard for them to get a precise estimation on both topic discovery and supervised learning. In this study, we propose a supervised topic model based on the Siamese network, which can trade off label-specific word distributions with document-specific label distributions in a uniform framework. Experiments on real-world datasets validate that our model performs competitive in topic discovery quantitatively and qualitatively. Furthermore, the proposed model can effectively predict categorical or real-valued labels for new documents by generating word embeddings from a label-specific topical space.

2017

pdf bib
A Network Framework for Noisy Label Aggregation in Social Media
Xueying Zhan | Yaowei Wang | Yanghui Rao | Haoran Xie | Qing Li | Fu Lee Wang | Tak-Lam Wong
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper focuses on the task of noisy label aggregation in social media, where users with different social or culture backgrounds may annotate invalid or malicious tags for documents. To aggregate noisy labels at a small cost, a network framework is proposed by calculating the matching degree of a document’s topics and the annotators’ meta-data. Unlike using the back-propagation algorithm, a probabilistic inference approach is adopted to estimate network parameters. Finally, a new simulation method is designed for validating the effectiveness of the proposed framework in aggregating noisy labels.

2006

pdf bib
Towards Unified Chinese Segmentation Algorithm
Fu Lee Wang | Xiaotie Deng | Feng Zou
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

As Chinese is an ideographic character-based language, the words in the texts are not delimited by spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has become the largest corpus that is ideal for Chinese segmentation. Although the search engines do not segment texts into proper words, they maintain huge databases of documents and frequencies of character sequences in the documents. Their databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm by mining web data with the help from search engines. It is the first unified segmentation algorithm for Chinese language from different geographical areas. Experiments have been conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm can effectively deal with the problem of segmentation ambiguity, new word (unknown word) detection, and stop words.

pdf bib
Evaluation of Stop Word Lists in Chinese Language
Feng Zou | Fu Lee Wang | Xiaotie Deng | Song Han
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In modern information retrieval systems, effective indexing can be achieved by removal of stop words. Till now many stop word lists have been developed for English language. However, no standard stop word list has been constructed for Chinese language yet. With the fast development of information retrieval in Chinese language, exploring the evaluation of Chinese stop word lists becomes critical. In this paper, to save the time and release the burden of manual comparison, we propose a novel stop word list evaluation method with a mutual information-based Chinese segmentation methodology. Experiments have been conducted on training texts taken from a recent international Chinese segmentation competition. Results show that effective stop word lists can improve the accuracy of Chinese segmentation significantly.