Wenjie Li


2023

pdf
Multiview Identifiers Enhanced Generative Retrieval
Yongqi Li | Nan Yang | Liang Wang | Furu Wei | Wenjie Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Instead of simply matching a query to pre-existing passages, generative retrieval generates identifier strings of passages as the retrieval target. At a cost, the identifier must be distinctive enough to represent a passage. Current approaches use either a numeric ID or a text piece (such as a title or substrings) as the identifier. However, these identifiers cannot cover a passage’s content well. As such, we are motivated to propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage and could integrate contextualized information that text pieces lack. Furthermore, we simultaneously consider multiview identifiers, including synthetic identifiers, titles, and substrings. These views of identifiers complement each other and facilitate the holistic ranking of passages from multiple perspectives. We conduct a series of experiments on three public datasets, and the results indicate that our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.

pdf
ORGAN: Observation-Guided Radiology Report Generation via Tree Reasoning
Wenjun Hou | Kaishuai Xu | Yi Cheng | Wenjie Li | Jiang Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper explores the task of radiology report generation, which aims at generating free-text descriptions for a set of radiographs. One significant challenge of this task is how to correctly maintain the consistency between the images and the lengthy report. Previous research explored solving this issue through planning-based methods, which generate reports only based on high-level plans. However, these plans usually only contain the major observations from the radiographs (e.g., lung opacity), lacking much necessary information, such as the observation characteristics and preliminary clinical diagnoses. To address this problem, the system should also take the image information into account together with the textual plan and perform stronger reasoning during the generation process. In this paper, we propose an Observation-guided radiology Report Generation framework (ORGan). It first produces an observation plan and then feeds both the plan and radiographs for report generation, where an observation graph and a tree reasoning mechanism are adopted to precisely enrich the plan information by capturing the multi-formats of each observation. Experimental results demonstrate that our framework outperforms previous state-of-the-art methods regarding text quality and clinical efficacy.

pdf
Enhancing Event Causality Identification with Counterfactual Reasoning
Feiteng Mu | Wenjie Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Existing methods for event causality identification (ECI) focus on mining potential causal signals, i.e., causal context keywords and event pairs. However, causal signals are ambiguous, which may lead to the context-keywords bias and the event-pairs bias.To solve this issue, we propose the counterfactual reasoning that explicitly estimates the influence of context keywords and event pairs in training, so that we are able to eliminate the biases in inference.Experiments are conducted on two datasets, the result demonstrates the effectiveness of our method.

pdf
Dialogue Planning via Brownian Bridge Stochastic Process for Goal-directed Proactive Dialogue
Jian Wang | Dongding Lin | Wenjie Li
Findings of the Association for Computational Linguistics: ACL 2023

Goal-directed dialogue systems aim to proactively reach a pre-determined target through multi-turn conversations. The key to achieving this task lies in planning dialogue paths that smoothly and coherently direct conversations towards the target. However, this is a challenging and under-explored task. In this work, we propose a coherent dialogue planning approach that uses a stochastic process to model the temporal dynamics of dialogue paths. We define a latent space that captures the coherence of goal-directed behavior using a Brownian bridge process, which allows us to incorporate user feedback flexibly in dialogue planning. Based on the derived latent trajectories, we generate dialogue paths explicitly using pre-trained language models. We finally employ these paths as natural language prompts to guide dialogue generation. Our experiments show that our approach generates more coherent utterances and achieves the goal with a higher success rate.

pdf
Medical Dialogue Generation via Dual Flow Modeling
Kaishuai Xu | Wenjun Hou | Yi Cheng | Jian Wang | Wenjie Li
Findings of the Association for Computational Linguistics: ACL 2023

Medical dialogue systems (MDS) aim to provide patients with medical services, such as diagnosis and prescription. Since most patients cannot precisely describe their symptoms, dialogue understanding is challenging for MDS. Previous studies mainly addressed this by extracting the mentioned medical entities as critical dialogue history information. In this work, we argue that it is also essential to capture the transitions of the medical entities and the doctor’s dialogue acts in each turn, as they help the understanding of how the dialogue flows and enhance the prediction of the entities and dialogue acts to be adopted in the following turn. Correspondingly, we propose a Dual Flow enhanced Medical (DFMed) dialogue generation framework. It extracts the medical entities and dialogue acts used in the dialogue history and models their transitions with an entity-centric graph flow and a sequential act flow, respectively. We employ two sequential models to encode them and devise an interweaving component to enhance their interactions. Experiments on two datasets demonstrate that our method exceeds baselines in both automatic and manual evaluations.

pdf
Separating Context and Pattern: Learning Disentangled Sentence Representations for Low-Resource Extractive Summarization
Ruifeng Yuan | Shichao Sun | Zili Wang | Ziqiang Cao | Wenjie Li
Findings of the Association for Computational Linguistics: ACL 2023

Extractive summarization aims to select a set of salient sentences from the source document to form a summary. Context information has been considered one of the key factors for this task. Meanwhile, there also exist other pattern factors that can identify sentence importance, such as sentence position or certain n-gram tokens. However, such pattern information is only effective in specific datasets or domains and can not be generalized like the context information when there only exists limited data. In this case, current extractive summarization models may suffer from a performance drop when transferring to a new dataset. In this paper, we attempt to apply disentangled representation learning on extractive summarization, and separate the two key factors for the task, context and pattern, for a better generalization ability in the low-resource setting. To achieve this, we propose two groups of losses for encoding and disentangling sentence representations into context representations and pattern representations. In this case, we can either use only the context information in the zero-shot setting or fine-tune the pattern information in the few-shot setting. Experimental results on three summarization datasets from different domains show the effectiveness of our proposed approach.

2022

pdf
CARE: Causality Reasoning for Empathetic Responses by Conditional Graph Generation
Jiashuo Wang | Yi Cheng | Wenjie Li
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent approaches to empathetic response generation incorporate emotion causalities to enhance comprehension of both the user’s feelings and experiences. However, these approaches suffer from two critical issues. First, they only consider causalities between the user’s emotion and the user’s experiences, and ignore those between the user’s experiences. Second, they neglect interdependence among causalities and reason them independently. To solve the above problems, we expect to reason all plausible causalities interdependently and simultaneously, given the user’s emotion, dialogue history, and future dialogue content. Then, we infuse these causalities into response generation for empathetic responses. Specifically, we design a new model, i.e., the Conditional Variational Graph Auto-Encoder (CVGAE), for the causality reasoning, and adopt a multi-source attention mechanism in the decoder for the causality infusion. We name the whole framework as CARE, abbreviated for CAusality Reasoning for Empathetic conversation. Experimental results indicate that our method achieves state-of-the-art performance.

pdf
MMCoQA: Conversational Question Answering over Text, Tables, and Images
Yongqi Li | Wenjie Li | Liqiang Nie
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The rapid development of conversational assistants accelerates the study on conversational question answering (QA). However, the existing conversational QA systems usually answer users’ questions with a single knowledge source, e.g., paragraphs or a knowledge graph, but overlook the important visual cues, let alone multiple knowledge sources of different modalities. In this paper, we hence define a novel research task, i.e., multimodal conversational question answering (MMCoQA), aiming to answer users’ questions with multimodal knowledge sources via multi-turn conversations. This new task brings a series of research challenges, including but not limited to priority, consistency, and complementarity of multimodal knowledge. To facilitate the data-driven approaches in this area, we construct the first multimodal conversational QA dataset, named MMConvQA. Questions are fully annotated with not only natural language answers but also the corresponding evidence and valuable decontextualized self-contained questions. Meanwhile, we introduce an end-to-end baseline model, which divides this complex research task into question understanding, multi-modal evidence retrieval, and answer extraction. Moreover, we report a set of benchmarking results, and the results indicate that there is ample room for improvement.

pdf
Improving Multi-turn Emotional Support Dialogue Generation with Lookahead Strategy Planning
Yi Cheng | Wenge Liu | Wenjie Li | Jiashuo Wang | Ruihui Zhao | Bang Liu | Xiaodan Liang | Yefeng Zheng
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Providing Emotional Support (ES) to soothe people in emotional distress is an essential capability in social interactions. Most existing researches on building ES conversation systems only considered single-turn interactions with users, which was over-simplified. In comparison, multi-turn ES conversation systems can provide ES more effectively, but face several new technical challenges, including: (1) how to adopt appropriate support strategies to achieve the long-term dialogue goal of comforting the user’s emotion; (2) how to dynamically model the user’s state. In this paper, we propose a novel system MultiESC to address these issues. For strategy planning, drawing inspiration from the A* search algorithm, we propose lookahead heuristics to estimate the future user feedback after using particular strategies, which helps to select strategies that can lead to the best long-term effects. For user state modeling, MultiESC focuses on capturing users’ subtle emotional expressions and understanding their emotion causes. Extensive experiments show that MultiESC significantly outperforms competitive baselines in both dialogue generation and strategy planning.

pdf
Few-shot Query-Focused Summarization with Prefix-Merging
Ruifeng Yuan | Zili Wang | Ziqiang Cao | Wenjie Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Query-focused summarization has been considered as an important extension for text summarization. It aims to generate a concise highlight for a given query. Different from text summarization, query-focused summarization has long been plagued by the problem of lacking high-quality large-scale datasets. In this paper, we investigate the idea that whether we can integrate and transfer the knowledge of text summarization and question answering to assist the few-shot learning in query-focused summarization. Here, we propose prefix-merging, a prefix-based pretraining strategy for few-shot learning in query-focused summarization. Drawn inspiration from prefix-tuning, we are allowed to integrate the task knowledge from text summarization and question answering into a properly designed prefix and apply the merged prefix to query-focused summarization. With only a small amount of trainable parameters, prefix-merging outperforms fine-tuning on query-focused summarization. We further discuss the influence of different prefix designs and propose a visualized explanation for how prefix-merging works.

2021

pdf bib
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
Effect Generation Based on Causal Reasoning
Feiteng Mu | Wenjie Li | Zhipeng Xie
Findings of the Association for Computational Linguistics: EMNLP 2021

Causal reasoning aims to predict the future scenarios that may be caused by the observed actions. However, existing causal reasoning methods deal with causalities on the word level. In this paper, we propose a novel event-level causal reasoning method and demonstrate its use in the task of effect generation. In particular, we structuralize the observed cause-effect event pairs into an event causality network, which describes causality dependencies. Given an input cause sentence, a causal subgraph is retrieved from the event causality network and is encoded with the graph attention mechanism, in order to support better reasoning of the potential effects. The most probable effect event is then selected from the causal subgraph and is used as guidance to generate an effect sentence. Experiments show that our method generates more reasonable effect sentences than various well-designed competitors.

pdf
PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP)
Rong Xiang | Jinghang Gu | Emmanuele Chersoni | Wenjie Li | Qin Lu | Chu-Ren Huang
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this contribution, we describe the system presented by the PolyU CBS-Comp Team at the Task 1 of SemEval 2021, where the goal was the estimation of the complexity of words in a given sentence context. Our top system, based on a combination of lexical, syntactic, word embeddings and Transformers-derived features and on a Gradient Boosting Regressor, achieves a top correlation score of 0.754 on the subtask 1 for single words and 0.659 on the subtask 2 for multiword expressions.

pdf
Event Graph based Sentence Fusion
Ruifeng Yuan | Zili Wang | Wenjie Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence fusion is a conditional generation task that merges several related sentences into a coherent one, which can be deemed as a summary sentence. The importance of sentence fusion has long been recognized by communities in natural language generation, especially in text summarization. It remains challenging for a state-of-the-art neural abstractive summarization model to generate a well-integrated summary sentence. In this paper, we explore the effective sentence fusion method in the context of text summarization. We propose to build an event graph from the input sentences to effectively capture and organize related events in a structured way and use the constructed event graph to guide sentence fusion. In addition to make use of the attention over the content of sentences and graph nodes, we further develop a graph flow attention mechanism to control the fusion process via the graph structure. When evaluated on sentence fusion data built from two summarization datasets, CNN/DaliyMail and Multi-News, our model shows to achieve state-of-the-art performance in terms of Rouge and other metrics like fusion rate and faithfulness.

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2020

pdf
Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT
Ruifeng Yuan | Zili Wang | Wenjie Li
Proceedings of the 28th International Conference on Computational Linguistics

Most current extractive summarization models generate summaries by selecting salient sentences. However, one of the problems with sentence-level extractive summarization is that there exists a gap between the human-written gold summary and the oracle sentence labels. In this paper, we propose to extract fact-level semantic units for better extractive summarization. We also introduce a hierarchical structure, which incorporates the multi-level of granularities of the textual information into the model. In addition, we incorporate our model with BERT using a hierarchical graph mask. This allows us to combine BERT’s ability in natural language understanding and the structural information without increasing the scale of the model. Experiments on the CNN/DaliyMail dataset show that our model achieves state-of-the-art results.

2019

pdf
Jointly Learning Semantic Parser and Natural Language Generator via Dual Information Maximization
Hai Ye | Wenjie Li | Lu Wang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Semantic parsing aims to transform natural language (NL) utterances into formal meaning representations (MRs), whereas an NL generator achieves the reverse: producing an NL description for some given MRs. Despite this intrinsic connection, the two tasks are often studied separately in prior work. In this paper, we model the duality of these two tasks via a joint learning framework, and demonstrate its effectiveness of boosting the performance on both tasks. Concretely, we propose a novel method of dual information maximization (DIM) to regularize the learning process, where DIM empirically maximizes the variational lower bounds of expected joint distributions of NL and MRs. We further extend DIM to a semi-supervision setup (SemiDIM), which leverages unlabeled data of both tasks. Experiments on three datasets of dialogue management and code generation (and summarization) show that performance on both semantic parsing and NL generation can be consistently improved by DIM, in both supervised and semi-supervised setups.

2018

pdf
Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization
Ziqiang Cao | Wenjie Li | Sujian Li | Furu Wei
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most previous seq2seq summarization systems purely depend on the source text to generate summaries, which tends to work unstably. Inspired by the traditional template-based summarization approaches, this paper proposes to use existing summaries as soft templates to guide the seq2seq model. To this end, we use a popular IR platform to Retrieve proper summaries as candidate templates. Then, we extend the seq2seq framework to jointly conduct template Reranking and template-aware summary generation (Rewriting). Experiments show that, in terms of informativeness, our model significantly outperforms the state-of-the-art methods, and even soft templates themselves demonstrate high competitiveness. In addition, the import of high-quality external summaries improves the stability and readability of generated summaries.

pdf
Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach
Jingjing Xu | Xu Sun | Qi Zeng | Xiaodong Zhang | Xuancheng Ren | Houfeng Wang | Wenjie Li
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The goal of sentiment-to-sentiment “translation” is to change the underlying sentiment of a sentence while keeping its content. The main challenge is the lack of parallel data. To solve this problem, we propose a cycled reinforcement learning method that enables training on unpaired data by collaboration between a neutralization module and an emotionalization module. We evaluate our approach on two review datasets, Yelp and Amazon. Experimental results show that our approach significantly outperforms the state-of-the-art systems. Especially, the proposed method substantially improves the content preservation performance. The BLEU score is improved from 1.64 to 22.46 and from 0.56 to 14.06 on the two datasets, respectively.

pdf
Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation
Shuming Ma | Xu Sun | Wei Li | Sujian Li | Wenjie Li | Xuancheng Ren
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Most recent approaches use the sequence-to-sequence model for paraphrase generation. The existing sequence-to-sequence model tends to memorize the words and the patterns in the training dataset instead of learning the meaning of the words. Therefore, the generated sentences are often grammatically correct but semantically improper. In this work, we introduce a novel model based on the encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our proposed model generates the words by querying distributed word representations (i.e. neural word embeddings), hoping to capturing the meaning of the according words. Following previous work, we evaluate our model on two paraphrase-oriented tasks, namely text simplification and short text abstractive summarization. Experimental results show that our model outperforms the sequence-to-sequence baseline by the BLEU score of 6.3 and 5.5 on two English text simplification datasets, and the ROUGE-2 F1 score of 5.7 on a Chinese summarization dataset. Moreover, our model achieves state-of-the-art performances on these three benchmark datasets.

pdf
Variational Autoregressive Decoder for Neural Response Generation
Jiachen Du | Wenjie Li | Yulan He | Ruifeng Xu | Lidong Bing | Xuan Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Combining the virtues of probability graphic models and neural networks, Conditional Variational Auto-encoder (CVAE) has shown promising performance in applications such as response generation. However, existing CVAE-based models often generate responses from a single latent variable which may not be sufficient to model high variability in responses. To solve this problem, we propose a novel model that sequentially introduces a series of latent variables to condition the generation of each word in the response sequence. In addition, the approximate posteriors of these latent variables are augmented with a backward Recurrent Neural Network (RNN), which allows the latent variables to capture long-term dependencies of future tokens in generation. To facilitate training, we supplement our model with an auxiliary objective that predicts the subsequent bag of words. Empirical experiments conducted on Opensubtitle and Reddit datasets show that the proposed model leads to significant improvement on both relevance and diversity over state-of-the-art baselines.

pdf
NEXUS Network: Connecting the Preceding and the Following in Dialogue Generation
Xiaoyu Shen | Hui Su | Wenjie Li | Dietrich Klakow
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Sequence-to-Sequence (seq2seq) models have become overwhelmingly popular in building end-to-end trainable dialogue systems. Though highly efficient in learning the backbone of human-computer communications, they suffer from the problem of strongly favoring short generic responses. In this paper, we argue that a good response should smoothly connect both the preceding dialogue history and the following conversations. We strengthen this connection by mutual information maximization. To sidestep the non-differentiability of discrete natural language tokens, we introduce an auxiliary continuous code space and map such code space to a learnable prior distribution for generation purpose. Experiments on two dialogue datasets validate the effectiveness of our model, where the generated responses are closely related to the dialogue context and lead to more interactive conversations.

2017

pdf
Determining Gains Acquired from Word Embedding Quantitatively Using Discrete Distribution Clustering
Jianbo Ye | Yanran Li | Zhaohui Wu | James Z. Wang | Wenjie Li | Jia Li
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word embeddings have become widely-used in document analysis. While a large number of models for mapping words to vector spaces have been developed, it remains undetermined how much net gain can be achieved over traditional approaches based on bag-of-words. In this paper, we propose a new document clustering approach by combining any word embedding with a state-of-the-art algorithm for clustering empirical distributions. By using the Wasserstein distance between distributions, the word-to-word semantic relationship is taken into account in a principled way. The new clustering method is easy to use and consistently outperforms other methods on a variety of data sets. More importantly, the method provides an effective framework for determining when and how much word embeddings contribute to document analysis. Experimental results with multiple embedding models are reported.

pdf
A Conditional Variational Framework for Dialog Generation
Xiaoyu Shen | Hui Su | Yanran Li | Wenjie Li | Shuzi Niu | Yang Zhao | Akiko Aizawa | Guoping Long
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Deep latent variable models have been shown to facilitate the response generation for open-domain dialog systems. However, these latent variables are highly randomized, leading to uncontrollable generated responses. In this paper, we propose a framework allowing conditional response generation based on specific attributes. These attributes can be either manually assigned or automatically detected. Moreover, the dialog states for both speakers are modeled separately in order to reflect personal features. We validate this framework on two different scenarios, where the attribute refers to genericness and sentiment states respectively. The experiment result testified the potential of our model, where meaningful responses can be generated in accordance with the specified attributes.

pdf
Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization
Shuming Ma | Xu Sun | Jingjing Xu | Houfeng Wang | Wenjie Li | Qi Su
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Current Chinese social media text summarization models are based on an encoder-decoder framework. Although its generated summaries are similar to source texts literally, they have low semantic relevance. In this work, our goal is to improve semantic relevance between source texts and summaries for Chinese social media summarization. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by a decoder. Besides, the similarity score between the representations is maximized during training. Our experiments show that the proposed model outperforms baseline systems on a social media corpus.

pdf
DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset
Yanran Li | Hui Su | Xiaoyu Shen | Wenjie Li | Ziqiang Cao | Shuzi Niu
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We develop a high-quality multi-turn dialog dataset, DailyDialog, which is intriguing in several aspects. The language is human-written and less noisy. The dialogues in the dataset reflect our daily communication way and cover various topics about our daily life. We also manually label the developed dataset with communication intention and emotion information. Then, we evaluate existing approaches on DailyDialog dataset and hope it benefit the research field of dialog systems. The dataset is available on http://yanran.li/dailydialog

2016

pdf
Emotion Corpus Construction Based on Selection from Hashtags
Minglei Li | Yunfei Long | Lu Qin | Wenjie Li
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The availability of labelled corpus is of great importance for supervised learning in emotion classification tasks. Because it is time-consuming to manually label text, hashtags have been used as naturally annotated labels to obtain a large amount of labelled training data from microblog. However, natural hashtags contain too much noise for it to be used directly in learning algorithms. In this paper, we design a three-stage semi-automatic method to construct an emotion corpus from microblogs. Firstly, a lexicon based voting approach is used to verify the hashtag automatically. Secondly, a SVM based classifier is used to select the data whose natural labels are consistent with the predicted labels. Finally, the remaining data will be manually examined to filter out the noisy data. Out of about 48K filtered Chinese microblogs, 39k microblogs are selected to form the final corpus with the Kappa value reaching over 0.92 for the automatic parts and over 0.81 for the manual part. The proportion of automatic selection reaches 54.1%. Thus, the method can reduce about 44.5% of manual workload for acquiring quality data. Experiment on a classifier trained on this corpus shows that it achieves comparable results compared to the manually annotated NLP&CC2013 corpus.

pdf
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
Ziqiang Cao | Wenjie Li | Sujian Li | Furu Wei | Yanran Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Query relevance ranking and sentence saliency ranking are the two main tasks in extractive query-focused summarization. Previous supervised summarization systems often perform the two tasks in isolation. However, since reference summaries are the trade-off between relevance and saliency, using them as supervision, neither of the two rankers could be trained well. This paper proposes a novel summarization system called AttSum, which tackles the two tasks jointly. It automatically learns distributed representations for sentences as well as the document cluster. Meanwhile, it applies the attention mechanism to simulate the attentive reading of human behavior when a query is given. Extensive experiments are conducted on DUC query-focused summarization benchmark datasets. Without using any hand-crafted features, AttSum achieves competitive performance. We also observe that the sentences recognized to focus on the query indeed meet the query need.

pdf
Content-based Influence Modeling for Opinion Behavior Prediction
Chengyao Chen | Zhitao Wang | Yu Lei | Wenjie Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Nowadays, social media has become a popular platform for companies to understand their customers. It provides valuable opportunities to gain new insights into how a person’s opinion about a product is influenced by his friends. Though various approaches have been proposed to study the opinion formation problem, they all formulate opinions as the derived sentiment values either discrete or continuous without considering the semantic information. In this paper, we propose a Content-based Social Influence Model to study the implicit mechanism underlying the change of opinions. We then apply the learned model to predict users’ future opinions. The advantages of the proposed model is the ability to handle the semantic information and to learn two influence components including the opinion influence of the content information and the social relation factors. In the experiments conducted on Twitter datasets, our model significantly outperforms other popular opinion formation models.

pdf
PolyU at CL-SciSumm 2016
Ziqiang Cao | Wenjie Li | Dapeng Wu
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

2015

pdf
Learning to Adapt Credible Knowledge in Cross-lingual Sentiment Analysis
Qiang Chen | Wenjie Li | Yu Lei | Xule Liu | Yanxiang He
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
A Hierarchical Knowledge Representation for Expert Finding on Social Media
Yanran Li | Wenjie Li | Sujian Li
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf
Learning Summary Prior Representation for Extractive Summarization
Ziqiang Cao | Furu Wei | Sujian Li | Wenjie Li | Ming Zhou | Houfeng Wang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf
Component-Enhanced Chinese Character Embeddings
Yanran Li | Wenjie Li | Fei Sun | Sujian Li
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Reinforcing the Topic of Embeddings with Theta Pure Dependence for Text Classification
Ning Xing | Yuexian Hou | Peng Zhang | Wenjie Li | Dawei Song
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Cross-lingual Sentiment Lexicon Learning With Bilingual Word Graph Label Propagation
Dehong Gao | Furu Wei | Wenjie Li | Xiaohua Liu | Ming Zhou
Computational Linguistics, Volume 41, Issue 1 - March 2015

2014

pdf
Text-level Discourse Dependency Parsing
Sujian Li | Liang Wang | Ziqiang Cao | Wenjie Li
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing
Xu Sun | Wenjie Li | Houfeng Wang | Qin Lu
Computational Linguistics, Volume 40, Issue 3 - September 2014

2013

pdf
Generalized Abbreviation Prediction with Negative Full Forms and Its Application on Improving Chinese Web Search
Xu Sun | Wenjie Li | Fanqi Meng | Houfeng Wang
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics
Dehong Gao | Wenjie Li | Renxian Zhang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf
Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
Xu Sun | Houfeng Wang | Wenjie Li
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Implicit Discourse Relation Recognition by Selecting Typical Training Examples
Xun Wang | Sujian Li | Jiwei Li | Wenjie Li
Proceedings of COLING 2012

pdf
Fine-Grained Classification of Named Entities by Fusing Multi-Features
Wenjie Li | Jiwei Li | Ye Tian | Zhifang Sui
Proceedings of COLING 2012: Posters

pdf
Efficient Feedback-based Feature Learning for Blog Distillation as a Terabyte Challenge
Dehong Gao | Wenjie Li | Renxian Zhang
Proceedings of COLING 2012: Demonstration Papers

pdf
Beyond Twitter Text: A Preliminary Study on Twitter Hyperlink and its Application
Dehong Gao | Wenjie Li | Renxian Zhang
Proceedings of COLING 2012: Demonstration Papers

pdf
Towards Scalable Speech Act Recognition in Twitter: Tackling Insufficient Training Data
Renxian Zhang | Dehong Gao | Wenjie Li
Proceedings of the Workshop on Semantic Analysis in Social Media

pdf
The CIPS-SIGHAN CLP 2012 ChineseWord Segmentation onMicroBlog Corpora Bakeoff
Huiming Duan | Zhifang Sui | Ye Tian | Wenjie Li
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

2011

pdf
Simultaneous Clustering and Noise Detection for Theme-based Summarization
Xiaoyan Cai | Renxian Zhang | Dehong Gao | Wenjie Li
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf
Simultaneous Ranking and Clustering of Sentences: A Reinforcement Approach to Multi-Document Summarization
Xiaoyan Cai | Wenjie Li | You Ouyang | Hong Yan
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf
A Study on Position Information in Document Summarization
You Ouyang | Wenjie Li | Qin Lu | Renxian Zhang
Coling 2010: Posters

pdf
Sentence Ordering with Event-Enriched Semantics and Two-Layered Clustering for Multi-Document News Summarization
Renxian Zhang | Wenjie Li | Qin Lu
Coling 2010: Posters

pdf
A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network
Decong Li | Sujian Li | Wenjie Li | Wei Wang | Weiguang Qu
Proceedings of the ACL 2010 Conference Short Papers

pdf
Using Deep Belief Nets for Chinese Named Entity Categorization
Yu Chen | You Ouyang | Wenjie Li | Dequan Zheng | Tiejun Zhao
Proceedings of the 2010 Named Entities Workshop

pdf
Exploring Deep Belief Network for Chinese Relation Extraction
Yu Chen | Wenjie Li | Yan Liu | Dequan Zheng | Tiejun Zhao
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
The Chinese Persons Name Diambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News
Ying Chen | Peng Jin | Wenjie Li | Chu-Ren Huang
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
273. Task 5. Keyphrase Extraction Based on Core Word Identification and Word Expansion
You Ouyang | Wenjie Li | Renxian Zhang
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

pdf
An Integrated Multi-document Summarization Approach based on Word Hierarchical Representation
You Ouyang | Wenjie Li | Qin Lu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf
Co-Feedback Ranking for Query-Focused Summarization
Furu Wei | Wenjie Li | Yanxiang He
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

pdf
PNR2: Ranking Sentences with Positive and Negative Reinforcement for Query-Oriented Update Summarization
Wenjie Li | Furu Wei | Qin Lu | Yanxiang He
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
Extractive Summarization Using Supervised and Semi-Supervised Learning
Kam-Fai Wong | Mingli Wu | Wenjie Li
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
Preliminary Chinese Term Classification for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li
Proceedings of the 6th Workshop on Asian Language Resources

pdf
Chinese Core Ontology Construction from a Bilingual Term Bank
Yirong Chen | Qin Lu | Wenjie Li | Gaoying Cui
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

A core ontology is a mid-level ontology which bridges the gap between an upper ontology and a domain ontology. Automatic Chinese core ontology construction can help quickly model domain knowledge. A graph based core ontology construction algorithm (COCA) is proposed to automatically construct a core ontology from an English-Chinese bilingual term bank. This algorithm computes the mapping strength from a selected Chinese term to WordNet synset with association to an upper-level SUMO concept. The strength is measured using a graph model integrated with several mapping features from multiple information sources. The features include multiple translation feature between Chinese core term and WordNet, extended string feature and Part-of-Speech feature. Evaluation of COCA repeated on an English-Chinese bilingual Term bank with more than 130K entries shows that the algorithm is improved in performance compared with our previous research and can better serve the semi-automatic construction of mid-level ontology.

pdf
Corpus Exploitation from Wikipedia for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li | Yirong Chen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Ontology construction usually requires a domain-specific corpus for building corresponding concept hierarchy. The domain corpus must have a good coverage of domain knowledge. Wikipedia(Wiki), the world’s largest online encyclopaedic knowledge source, is open-content, collaboratively edited, and free of charge. It covers millions of articles and still keeps on expanding continuously. These characteristics make Wiki a good candidate as domain corpus resource in ontology construction. However, the selected article collection must have considerable quality and quantity. In this paper, a novel approach is proposed to identify articles in Wiki as domain-specific corpus by using available classification information in Wiki pages. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wiki. Only articles strongly linked to this hierarchy are selected as the domain corpus. The proposed approach makes use of linked category information in Wiki pages to produce the hierarchy as a directed graph for obtaining a set of pages in the same connected branch. Ranking and filtering are then done on these pages based on the classification tree generated by the traversal algorithm. The experiment and evaluation results show that Wiki is a good resource for acquiring a relative high quality domain-specific corpus for ontology construction.

pdf
Exploiting the Role of Position Feature in Chinese Relation Extraction
Peng Zhang | Wenjie Li | Furu Wei | Qin Lu | Yuexian Hou
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Relation extraction is the task of finding pre-defined semantic relations between two entities or entity mentions from text. Many methods, such as feature-based and kernel-based methods, have been proposed in the literature. Among them, feature-based methods draw much attention from researchers. However, to the best of our knowledge, existing feature-based methods did not explicitly incorporate the position feature and no in-depth analysis was conducted in this regard. In this paper, we define and exploit nine types of position information between two named entity mentions and then use it along with other features in a multi-class classification framework for Chinese relation extraction. Experiments on the ACE 2005 data set show that the position feature is more effective than the other recognized features like entity type/subtype and character-based N-gram context. Most important, it can be easily captured and does not require as much effort as applying deep natural language processing.

pdf
Opinion Annotation in On-line Chinese Product Reviews
Ruifeng Xu | Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the design and construction of a Chinese opinion corpus based on the online product reviews. Based on the observation on the characteristics of opinion expression in Chinese online product reviews, which is quite different from in the formal texts such as news, an annotation framework is proposed to guide the construction of the first Chinese opinion corpus based on online product reviews. The opinionated sentences are manually identified from the review text. Furthermore, for each comment in the opinionated sentence, its 13 describing elements are annotated including the expressions related to the interested product attributes and user opinions as well as the polarity and degree of the opinions. Currently, 12,724 comments are annotated in 10,935 sentences from review text. Through statistical analysis on the opinion corpus, some interesting characteristics of Chinese opinion expression are presented. This corpus is shown helpful to support systematic research on Chinese opinion analysis.

pdf
A Novel Feature-based Approach to Chinese Entity Relation Extraction
Wenjie Li | Peng Zhang | Furu Wei | Yuexian Hou | Qin Lu
Proceedings of ACL-08: HLT, Short Papers

2007

pdf
Extractive Summarization Based on Event Term Clustering
Maofu Liu | Wenjie Li | Mingli Wu | Qin Lu
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf
Annotating Chinese Collocations with Multi Information
Ruifeng Xu | Qin Lu | Kam-Fai Wong | Wenjie Li
Proceedings of the Linguistic Annotation Workshop

2006

pdf
Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li | Mingli Wu | Qin Lu | Wei Xu | Chunfa Yuan
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf
A Phonetic-Based Approach to Chinese Chat Text Normalization
Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf
A Comparative Study of the Effect of Word Segmentation On Chinese Terminology Extraction
Luning Ji | Qin Lu | Wenjie Li | YiRong Chen
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

pdf
Constructing A Chinese Chat Language Corpus with A Two-Stage Incremental Annotation Approach
Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Chat language refers to the special human language widely used in the community of digital network chat. As chat language holds anomalous characteristics in forming words, phrases, and non-alphabetical characters, conventional natural language processing tools are ineffective to handle chat language text. Previous research shows that knowledge based methods perform less effectively in proc-essing unseen chat terms. This motivates us to construct a chat language corpus so that corpus-based techniques of chat language text processing can be developed and evaluated. However, creating the corpus merely by hand is difficult. One, this work is manpower consuming. Second, annotation inconsistency is serious. To minimize manpower and annotation inconsistency, a two-stage incre-mental annotation approach is proposed in this paper in constructing a chat language corpus. Experiments conducted in this paper show that the performance of corpus annotation can be improved greatly with this approach.

pdf
Mining Implicit Entities in Queries
Wei Li | Wenjie Li | Qin Lu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Entities are pivotal in describing events and objects, and also very important in Document Summarization. In general only explicit entities which can be extracted by a Named Entity Recognizer are used in real applications. However, implicit entities hidden behind the phrases or words, e.g. entity referred by the phrase “cross border”, are proved to be helpful in Document Summarization. In our experiment, we extract the implicit entities from the web resources.

pdf
A Study on Terminology Extraction Based on Classified Corpora
Yirong Chen | Qin Lu | Wenjie Li | Zhifang Sui | Luning Ji
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Algorithms for automatic term extraction in a specific domain should consider at least two issues, namely Unithood and Termhood (Kageura, 1996). Unithood refers to the degree of a string to occur as a word or a phrase. Termhood (Chen Yirong, 2005) refers to the degree of a word or a phrase to occur as a domain specific concept. Unlike unithood, study on termhood is not yet widely reported. In classified corpora, the class information provides the cue to the nature of data and can be used in termhood calculation. Three algorithms are provided and evaluated to investigate termhood based on classified corpora. The three algorithms are based on lexicon set computing, term frequency and document frequency, and the strength of the relation between a term and its document class respectively. Our objective is to investigate the effects of these different termhood measurement features. After evaluation, we can find which features are more effective and also, how we can improve these different features to achieve the best performance. Preliminary results show that the first measure can effectively filter out independent terms or terms of general use.

pdf
Interaction between Lexical Base and Ontology with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li | Ruifeng Xu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

An ontology describes conceptual knowledge in a specific domain. A lexical base collects a repository of words and gives independent definition of concepts. In this paper, we propose to use FCA as a tool to help constructing an ontology through an existing lexical base. We mainly address two issues. The first issue is how to select attributes to visualize the relations between lexical terms. The second issue is how to revise lexical definitions through analysing the relations in the ontology. Thus the focus is on the effect of interaction between a lexical base and an ontology for the purpose of good ontology construction. Finally, experiments have been conducted to verify our ideas.

2005

pdf
A Preliminary Work on Classifying Time Granularities of Temporal Questions
Wei Li | Wenjie Li | Qin Lu | Kam-Fai Wong
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
CTEMP: A Chinese Temporal Parser for Extracting and Normalizing Temporal Information
Mingli Wu | Wenjie Li | Qin Lu | Baoli Li
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
Integrating Collocation Features in Chinese Word Sense Disambiguation
Wanyin Li | Qin Lu | Wenjie Li
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

pdf
Experiments of Ontology Construction with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

pdf
Combining Linguistic Features with Weighted Bayesian Classifier for Temporal Reference Processing
Guihong Cao | Wenjie Li | Kam-Fai Wong | Chunfa Yuan
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf
Applying Machine Learning to Chinese Temporal Relation Resolution
Wenjie Li | Kam-Fai Wong | Guihong Cao | Chunfa Yuan
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2002

pdf
An Indexing Method Based on Sentences
Li Li | Chunfa Yuan | K.F. Wong | Wenjie Li
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

2001

pdf
A Model For Processing Temporal References In Chinese
Wenjie Li | Kam-Fai Wong | Chunfa Yuan
Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing

2000

pdf
An Algorithm for Situation Classification of Chinese Verbs
Xiaodan Zhu | Chunfa Yuan | K.F. Wong | Wenjie Li
Second Chinese Language Processing Workshop

1995

pdf
Are Statistics-Based Approaches Good Enough For NLP? A Case Study Of Maximal-Length NP Extraction In Mandarin Chinese
Wenjie Li | Haihua Pan | Ming Zhou | Kam-Fai Wong | Vincent Lum
Proceedings of Rocling VIII Computational Linguistics Conference VIII

Search