2025
pdf
bib
abs
Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration
Shao Zhang
|
Xihuai Wang
|
Wenhao Zhang
|
Chaoran Li
|
Junru Song
|
Tingyu Li
|
Lin Qiu
|
Xuezhi Cao
|
Xunliang Cai
|
Wen Yao
|
Weinan Zhang
|
Xinbing Wang
|
Ying Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent *System 1* and *System 2* methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates *System 1* and *System 2* for efficient real-time simultaneous human-AI collaboration. DPT-Agent’s *System 1* uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent’s *System 2* integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
pdf
bib
abs
DebateCoder: Towards Collective Intelligence of LLMs via Test Case Driven LLM Debate for Code Generation
Jizheng Chen
|
Kounianhua Du
|
Xinyi Dai
|
Weiming Zhang
|
Xihuai Wang
|
Yasheng Wang
|
Ruiming Tang
|
Weinan Zhang
|
Yong Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the impressive reasoning and text generation capabilities of large language models (LLMs), methods leveraging multiple LLMs to debate each other have garnered increasing attention. However, existing debate-based approaches remain limited in effectiveness in structured and detailed domains represented by code generation due to several reasons: 1) Reliance on different instances of the same LLM for debate, neglecting the potential benefits of integrating diverse models with varied internal knowledge for more comprehensive code generation, 2) under-utilization of test cases, and 3) reliance on third-party LLM moderators for result consolidation and decision-making, probably introducing hallucinations and judgment errors. To address these challenges, we propose DebateCoder to collect intelligence of LLMs via test case-driven debate for code generation. In DebateCoder, test cases serve as a medium for models to analyze code and identify bugs, while opposing models generate test cases to challenge each other’s code during the debate process. These test cases, along with their execution results, are elaborately leveraged to refine and enhance the code through a novel contrastive analysis process. Furthermore, DebateCoder leverages test case outcomes to assess code quality and determine convergence criteria. Unlike previous approaches, DebateCoder emphasizes the collaborative improvement of both models through competitive debate and interactive analysis. Abundant experimental results on two datasets demonstrate the effectiveness of DebateCoder.
pdf
bib
abs
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Assistant Scenarios
Jun Wang
|
Jiamu Zhou
|
Xihuai Wang
|
Xiaoyun Mo
|
Haoyu Zhang
|
Qiqiang Lin
|
Jincheng Jincheng
|
Muning Wen
|
Weinan Zhang
|
Qiuying Peng
|
Jun Wang
Findings of the Association for Computational Linguistics: ACL 2025
Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs’ function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
pdf
bib
abs
CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation
Qingyao Li
|
Xinyi Dai
|
Xiangyang Li
|
Weinan Zhang
|
Yasheng Wang
|
Ruiming Tang
|
Yong Yu
Findings of the Association for Computational Linguistics: ACL 2025
Code generation is a critical reasoning task for large language models (LLMs). Recent advancements have focused on optimizing the thought process of code generation, achieving significant improvements. However, such thought process lacks effective process supervision, making it hard to optimize the thoughts. Although Process Reward Models (PRMs) have been widely established in mathematical reasoning, building a code PRM is still not trivial for the gap between thoughts to code. In this paper, we propose CodePRM, a novel approach that leverages the code execution feedback to build a code PRM. Specifically, we first collect a large dataset of thought traces, where each thought step is labeled with their derived code’ pass rates, accompanied by the corresponding code snippets, and execution feedback. During training, we train a PRM to take both the reasoning process and code execution feedback as input to score individual thought steps, enabling it to leverage code execution results to distinguish between high-quality and low-quality thought steps. Finally, to use the PRM during inference, we develop a Generate-Verify-Refine (GVR) pipeline where the CodePRM serves as a process verifier to dynamically identify and correct errors in the thought process during code search. Experimental results demonstrate that CodePRM with the inference algorithm outperforms strong baselines, significantly enhancing code generation performance. Further analysis reveals the key factors for building a code PRM.
pdf
bib
abs
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning
Jiachen Zhu
|
Congmin Zheng
|
Jianghao Lin
|
Kounianhua Du
|
Ying Wen
|
Yong Yu
|
Jun Wang
|
Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2025
While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies the OOD issues including step OOD, arising from differences in reasoning patterns across model types and sizes, and question OOD, due to dataset shifts between training and real-world problems. To address these issues, we introduce Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework designed to tackle these OOD issues. By utilizing a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps for PRM as a warmup to stimulate its potential to judge target steps, improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetreivalPRM model, establishing a new standard for PRM performance.
pdf
bib
abs
Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation
Kounianhua Du
|
Hanjing Wang
|
Jianxing Liu
|
Jizheng Chen
|
Xinyi Dai
|
Yasheng Wang
|
Ruiming Tang
|
Yong Yu
|
Jun Wang
|
Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2025
To address these limitations, we propose BDC, a novel framework that Boosts reasoning exploration via multi-agent collaboration, Disentangles heterogeneous data into specialized experts, and Customizes solutions through dynamic model composition. BDC integrates a Monte Carlo Tree-of-Agents algorithm, where multiple LLMs mutually verify and refine reasoning paths through reflection-guided pruning, enabling efficient exploration of high-quality solutions. To handle data diversity, we cluster problems by latent semantics, train composable LoRA experts on each cluster, and deploy an input-aware hypernetwork to dynamically merge these experts into tailored solvers. Experiments on APPS and CodeContest benchmarks demonstrate BDC’s superiority: it achieves up to 73.8% accuracy on hard problems, outperforming state-of-the-art methods like LATS and RethinkMCTS by 9–15%. This work lays the groundwork for advancing LLM capabilities in complex reasoning tasks, offering a novel System2-to-System1 solution.
2024
pdf
bib
abs
Prove Your Point!: Bringing Proof-Enhancement Principles to Argumentative Essay Generation
Ruiyu Xiao
|
Lei Wu
|
Yuhang Gou
|
Weinan Zhang
|
Ting Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Argumentative essay generation (AEG) aims to generate complete texts on specific controversial topics or debates. Although current AEG methods can generate individual opinions, they often overlook the high-level connections between these opinions. This often leads to the generated results being mired in logical confusion, unable to proof their own arguments effectively. The generated essay may present evidence that contradicts the claims or they may fail to assemble the claims into logical flow. In this paper, we present a unified two-stage framework: Proof-Enhancement and Self-Annotation (PESA) for AEG with a focus on logical enhancement. Specifically, we first construct pseudo-labels for logical information,claims and grounds, using a large language model. We then propose a tree planning approach that introduces proof principles and ensures logical consistency. Extensive experimental results show that, benefiting from proof principle guidance, PESA generates argumentative essays with better logical validity and persuasiveness than strong baseline models.
pdf
bib
abs
Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models
Mingda Li
|
Xinyu Li
|
Yifan Chen
|
Wenfeng Xuan
|
Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2024
Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their superiority in terms of factuality, they do not consistently outperform the original retrieval-free Language Models (LMs). Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LM but also among different retrievers. To understand this phenomenon, we investigate the degeneration behavior of RALMs and theoretically decompose it into four categories. Further analysis based on our decomposition reveals that the innate difference in knowledge sources and the unpredictable degeneration of the reader model contribute most to the inconsistency. Drawing from our analysis, we introduce Ensemble of Retrievers (EoR), a trainable framework that can adaptively retrieve from different knowledge sources and effectively decrease unpredictable reader errors. Our experiments on Open Domain Question Answering show that EoR substantially improves performance over the RALM with a single retriever by considerably reducing inconsistent behaviors.
pdf
bib
abs
A Self-verified Method for Exploring Simile Knowledge from Pre-trained Language Models
Longxuan Ma
|
Changxin Ke
|
Shuhan Zhou
|
Churui Sun
|
Wei-Nan Zhang
|
Ting Liu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Simile tasks are challenging in natural language processing (NLP) because models require adequate world knowledge to produce predictions. In recent years, pre-trained language models (PLMs) have succeeded in NLP since they learn generic knowledge from a large corpus. The knowledge embedded in PLMs can be used for different kinds of Simile tasks. However, previous work usually explored one type of simile knowledge for a specific simile task, how to fully utilize different types of knowledge embedded in the PLMs requires further exploration. This paper proposes a self-verified method for exploring simile knowledge from PLMs, which allows the PLMs to leverage one type of simile knowledge to self-validate another. To this end, we first enhance PLMs with a novel multi-level simile recognition (MLSR) task that trains PLMs to evaluate the quality of similes. Then the PLMs leverage this evaluation score to assist the simile interpretation and generation tasks. In this way, we connect different types of simile knowledge in PLMs and make better use of them. Experiments on different pre-trained models and multiple publicly available datasets show that our method works for different kinds of PLMs and can explore more accurate simile knowledge for PLMs. Our code/data will be released on GitHub.
2023
pdf
bib
abs
A Difference-aware Ensemble Method for Task-oriented Dialogue with Subjective Knowledge
Changxin Ke
|
Churui Sun
|
Longxuan Ma
|
Wei-Nan Zhang
|
Ting Liu
Proceedings of the Eleventh Dialog System Technology Challenge
We participate in the 11th Dialog System Technology Challenges (DSTC) track-5 called Task-oriented Conversational Modeling with Subjective Knowledge. Introducing subjective knowledge into task-oriented dialogue (TOD) can help the DS to understand variables of subjective user needs and to suit more dialogue scenarios. Track-5 includes several sub-tasks: 1) knowledge-seeking turn detection; 2) knowledge entity tracking; 3) knowledge entry selection; and 4) use of the selected knowledge entries for response generation. Besides the challenges of each sub-tasks own, there are two challenges across different sub-tasks. The first is that there are multiple valid knowledge entries for each knowledge-seeking turn, the accuracy of the knowledge entry selection is important for the quality of response generation. The second challenge is how to address the unseen dialogue/entities/entries in the validation and the test set. In this paper, we propose a difference-aware ensemble method to address these sub-tasks and the two challenges mentioned above. Our method helps to obtain more robust results and performs well on unseen instances. Among all the submissions for the test set, our method ranks 1st on the knowledge-seeking turn detection task and achieves 3rd on the overall automatic evaluation score. Our code and data will be released on GitHub.
pdf
bib
abs
I run as fast as a rabbit, can you? A Multilingual Simile Dialogues Datasets
Longxuan Ma
|
Wei-Nan Zhang
|
Shuhan Zhou
|
Churui Sun
|
Changxin Ke
|
Ting Liu
Findings of the Association for Computational Linguistics: ACL 2023
A simile is a figure of speech that compares two different things (called the tenor and the vehicle) via shared properties. The tenor and the vehicle are usually connected with comparator words such as “like” or “as”. The simile phenomena are unique and complex in a real-life dialogue scene where the tenor and the vehicle can be verbal phrases or sentences, mentioned by different speakers, exist in different sentences, or occur in reversed order. However, the current simile research usually focuses on similes in a triplet tuple (tenor, property, vehicle) or a single sentence where the tenor and vehicle are usually entities or noun phrases, which could not reflect complex simile phenomena in real scenarios. In this paper, we propose a novel and high-quality multilingual simile dialogue (MSD) dataset to facilitate the study of complex simile phenomena. The MSD is the largest manually annotated simile data (~21K) and it contains both English and Chinese data. Meanwhile, the MSD data can also be used on dialogue tasks to test the ability of dialogue systems when using similes. We design 3 simile tasks (recognition, interpretation, and generation) and 2 dialogue tasks (retrieval and generation) with MSD. For each task, we provide experimental results from strong pre-trained or state-of-the-art models. The experiments demonstrate the challenge of MSD and we will release the data/code on GitHub.
pdf
bib
abs
Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements
Yushan Qian
|
Weinan Zhang
|
Ting Liu
Findings of the Association for Computational Linguistics: EMNLP 2023
Empathetic dialogue is an indispensable part of building harmonious social relationships and contributes to the development of a helpful AI. Previous approaches are mainly based on fine small-scale language models. With the advent of ChatGPT, the application effect of large language models (LLMs) in this field has attracted great attention. This work empirically investigates the performance of LLMs in generating empathetic responses and proposes three improvement methods of semantically similar in-context learning, two-stage interactive generation, and combination with the knowledge base. Extensive experiments show that LLMs can significantly benefit from our proposed methods and is able to achieve state-of-the-art performance in both automatic and human evaluations. Additionally, we explore the possibility of GPT-4 simulating human evaluators.
pdf
bib
abs
Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue
Yuanxing Liu
|
Weinan Zhang
|
Yifan Chen
|
Yuchi Zhang
|
Haopeng Bai
|
Fan Feng
|
Hengbin Cui
|
Yongbin Li
|
Wanxiang Che
Findings of the Association for Computational Linguistics: EMNLP 2023
E-commerce pre-sales dialogue aims to understand and elicit user needs and preferences for the items they are seeking so as to provide appropriate recommendations. Conversational recommender systems (CRSs) learn user representation and provide accurate recommendations based on dialogue context, but rely on external knowledge. Large language models (LLMs) generate responses that mimic pre-sales dialogues after fine-tuning, but lack domain-specific knowledge for accurate recommendations. Intuitively, the strengths of LLM and CRS in E-commerce pre-sales dialogues are complementary, yet no previous work has explored this. This paper investigates the effectiveness of combining LLM and CRS in E-commerce pre-sales dialogues, proposing two collaboration methods: CRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a real-world dataset of E-commerce pre-sales dialogues. We analyze the impact of two collaborative approaches with two CRSs and two LLMs on four tasks of E-commerce pre-sales dialogue. We find that collaborations between CRS and LLM can be very effective in some cases.
2022
pdf
bib
abs
Nested Named Entity Recognition with Span-level Graphs
Juncheng Wan
|
Dongyu Ru
|
Weinan Zhang
|
Yong Yu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Span-based methods with the neural networks backbone have great potential for the nested named entity recognition (NER) problem. However, they face problems such as degenerating when positive instances and negative instances largely overlap. Besides, the generalization ability matters a lot in nested NER, as a large proportion of entities in the test set hardly appear in the training set. In this work, we try to improve the span representation by utilizing retrieval-based span-level graphs, connecting spans and entities in the training data based on n-gram features. Specifically, we build the entity-entity graph and span-entity graph globally based on n-gram similarity to integrate the information of similar neighbor entities into the span representation. To evaluate our method, we conduct experiments on three common nested NER datasets, ACE2004, ACE2005, and GENIA datasets. Experimental results show that our method achieves general improvements on all three benchmarks (+0.30 ∼ 0.85 micro-F1), and obtains special superiority on low frequency entities (+0.56 ∼ 2.08 recall).
pdf
bib
abs
CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment
Haoyu Song
|
Li Dong
|
Weinan Zhang
|
Ting Liu
|
Furu Wei
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
CLIP has shown a remarkable zero-shot capability on a wide range of vision tasks. Previously, CLIP is only regarded as a powerful visual encoder. However, after being pre-trained by language supervision from a large amount of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. We first evaluate CLIP’s zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task. Then we propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the vqa task. We achieve competitive zero/few-shot results on the visual question answering and visual entailment tasks without introducing any additional pre-training procedure.
pdf
bib
abs
SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation
Longxuan Ma
|
Ziyu Zhuang
|
Weinan Zhang
|
Mingda Li
|
Ting Liu
Proceedings of the 29th International Conference on Computational Linguistics
This paper introduces a novel Self-supervised Fine-grained Dialogue Evaluation framework (SelF-Eval). The core idea is to model the correlation between turn quality and the entire dialogue quality. We first propose a novel automatic data construction method that can automatically assign fine-grained scores for arbitrarily dialogue data. Then we train SelF-Eval with a multi-level contrastive learning schema which helps to distinguish different score levels. Experimental results on multiple benchmarks show that SelF-Eval is highly consistent with human evaluations and better than the state-of-the-art models. We give a detailed analysis of the experiments in this paper. Our code is available on GitHub.
pdf
bib
abs
PAEG: Phrase-level Adversarial Example Generation for Neural Machine Translation
Juncheng Wan
|
Jian Yang
|
Shuming Ma
|
Dongdong Zhang
|
Weinan Zhang
|
Yong Yu
|
Zhoujun Li
Proceedings of the 29th International Conference on Computational Linguistics
While end-to-end neural machine translation (NMT) has achieved impressive progress, noisy input usually leads models to become fragile and unstable. Generating adversarial examples as the augmented data has been proved to be useful to alleviate this problem. Existing methods for adversarial example generation (AEG) are word-level or character-level, which ignore the ubiquitous phrase structure. In this paper, we propose a Phrase-level Adversarial Example Generation (PAEG) framework to enhance the robustness of the translation model. Our method further improves the gradient-based word-level AEG method by adopting a phrase-level substitution strategy. We verify our method on three benchmarks, including LDC Chinese-English, IWSLT14 German-English, and WMT14 English-German tasks. Experimental results demonstrate that our approach significantly improves translation performance and robustness to noise compared to previous strong baselines.
2021
pdf
bib
abs
BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data
Haoyu Song
|
Yan Wang
|
Kaiyan Zhang
|
Wei-Nan Zhang
|
Ting Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Maintaining a consistent persona is essential for dialogue agents. Although tremendous advancements have been brought, the limited-scale of annotated personalized dialogue datasets is still a barrier towards training robust and consistent persona-based dialogue models. This work shows how this challenge can be addressed by disentangling persona-based dialogue generation into two sub-tasks with a novel BERT-over-BERT (BoB) model. Specifically, the model consists of a BERT-based encoder and two BERT-based decoders, where one decoder is for response generation, and another is for consistency understanding. In particular, to learn the ability of consistency understanding from large-scale non-dialogue inference data, we train the second decoder in an unlikelihood manner. Under different limited data settings, both automatic and human evaluations demonstrate that the proposed model outperforms strong baselines in response quality and persona consistency.
pdf
bib
abs
Glancing Transformer for Non-Autoregressive Neural Machine Translation
Lihua Qian
|
Hao Zhou
|
Yu Bao
|
Mingxuan Wang
|
Lin Qiu
|
Weinan Zhang
|
Yong Yu
|
Lei Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Recent work on non-autoregressive neural machine translation (NAT) aims at improving the efficiency by parallel decoding without sacrificing the quality. However, existing NAT methods are either inferior to Transformer or require multiple decoding passes, leading to reduced speedup. We propose the Glancing Language Model (GLM) for single-pass parallel generation models. With GLM, we develop Glancing Transformer (GLAT) for machine translation. With only single-pass parallel decoding, GLAT is able to generate high-quality translation with 8×-15× speedup. Note that GLAT does not modify the network architecture, which is a training method to learn word interdependency. Experiments on multiple WMT language directions show that GLAT outperforms all previous single pass non-autoregressive methods, and is nearly comparable to Transformer, reducing the gap to 0.25-0.9 BLEU points.
pdf
bib
abs
Neural Stylistic Response Generation with Disentangled Latent Variables
Qingfu Zhu
|
Wei-Nan Zhang
|
Ting Liu
|
William Yang Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Generating open-domain conversational responses in the desired style usually suffers from the lack of parallel data in the style. Meanwhile, using monolingual stylistic data to increase style intensity often leads to the expense of decreasing content relevance. In this paper, we propose to disentangle the content and style in latent space by diluting sentence-level information in style representations. Combining the desired style representation and a response content representation will then obtain a stylistic response. Our approach achieves a higher BERT-based style intensity score and comparable BLEU scores, compared with baselines. Human evaluation results show that our approach significantly improves style intensity and maintains content relevance.
pdf
bib
abs
Technical Report on Shared Task in DialDoc21
Jiapeng Li
|
Mingda Li
|
Longxuan Ma
|
Wei-Nan Zhang
|
Ting Liu
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)
We participate in the DialDoc Shared Task sub-task 1 (Knowledge Identification). The task requires identifying the grounding knowledge in form of a document span for the next dialogue turn. We employ two well-known pre-trained language models (RoBERTa and ELECTRA) to identify candidate document spans and propose a metric-based ensemble method for span selection. Our methods include data augmentation, model pre-training/fine-tuning, post-processing, and ensemble. On the submission page, we rank 2nd based on the average of normalized F1 and EM scores used for the final evaluation. Specifically, we rank 2nd on EM and 3rd on F1.
pdf
bib
abs
Learning Logic Rules for Document-Level Relation Extraction
Dongyu Ru
|
Changzhi Sun
|
Jiangtao Feng
|
Lin Qiu
|
Hao Zhou
|
Weinan Zhang
|
Yong Yu
|
Lei Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Document-level relation extraction aims to identify relations between entities in a whole document. Prior efforts to capture long-range dependencies have relied heavily on implicitly powerful representations learned through (graph) neural networks, which makes the model less transparent. To tackle this challenge, in this paper, we propose LogiRE, a novel probabilistic model for document-level relation extraction by learning logic rules. LogiRE treats logic rules as latent variables and consists of two modules: a rule generator and a relation extractor. The rule generator is to generate logic rules potentially contributing to final predictions, and the relation extractor outputs final predictions based on the generated logic rules. Those two modules can be efficiently optimized with the expectation-maximization (EM) algorithm. By introducing logic rules into neural networks, LogiRE can explicitly capture long-range dependencies as well as enjoy better interpretation. Empirical results show that significantly outperforms several strong baselines in terms of relation performance and logical consistency. Our code is available at 
https://github.com/rudongyu/LogiRE.
pdf
bib
What Did You Refer to? Evaluating Co-References in Dialogue
Wei-Nan Zhang
|
Yue Zhang
|
Hanlin Tang
|
Zhengyu Zhao
|
Caihai Zhu
|
Ting Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
pdf
bib
abs
Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation
Haoyu Song
|
Yan Wang
|
Wei-Nan Zhang
|
Xiaojiang Liu
|
Ting Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Maintaining a consistent personality in conversations is quite natural for human beings, but is still a non-trivial task for machines. The persona-based dialogue generation task is thus introduced to tackle the personality-inconsistent problem by incorporating explicit persona text into dialogue generation models. Despite the success of existing persona-based models on generating human-like responses, their one-stage decoding framework can hardly avoid the generation of inconsistent persona words. In this work, we introduce a three-stage framework that employs a generate-delete-rewrite mechanism to delete inconsistent words from a generated response prototype and further rewrite it to a personality-consistent one. We carry out evaluations by both human and automatic metrics. Experiments on the Persona-Chat dataset show that our approach achieves good performance.
pdf
bib
abs
Counterfactual Off-Policy Training for Neural Dialogue Generation
Qingfu Zhu
|
Wei-Nan Zhang
|
Ting Liu
|
William Yang Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Open-domain dialogue generation suffers from the data insufficiency problem due to the vast size of potential responses. In this paper, we propose to explore potential responses by counterfactual reasoning. Given an observed response, the counterfactual reasoning model automatically infers the outcome of an alternative policy that could have been taken. The resulting counterfactual response synthesized in hindsight is of higher quality than the response synthesized from scratch. Training on the counterfactual responses under the adversarial learning framework helps to explore the high-reward area of the potential response space. An empirical study on the DailyDialog dataset shows that our approach significantly outperforms the HRED model as well as the conventional adversarial learning approaches.
pdf
bib
abs
Profile Consistency Identification for Open-domain Dialogue Agents
Haoyu Song
|
Yan Wang
|
Wei-Nan Zhang
|
Zhengyu Zhao
|
Ting Liu
|
Xiaojiang Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Maintaining a consistent attribute profile is crucial for dialogue agents to naturally converse with humans. Existing studies on improving attribute consistency mainly explored how to incorporate attribute information in the responses, but few efforts have been made to identify the consistency relations between response and attribute profile. To facilitate the study of profile consistency identification, we create a large-scale human-annotated dataset with over 110K single-turn conversations and their key-value attribute profiles. Explicit relation between response and profile is manually labeled. We also propose a key-value structure information enriched BERT model to identify the profile consistency, and it gained improvements over strong baselines. Further evaluations on downstream tasks demonstrate that the profile consistency identification model is conducive for improving dialogue consistency.
pdf
bib
abs
A Compare Aggregate Transformer for Understanding Document-grounded Dialogue
Longxuan Ma
|
Wei-Nan Zhang
|
Runxin Sun
|
Ting Liu
Findings of the Association for Computational Linguistics: EMNLP 2020
Unstructured documents serving as external knowledge of the dialogues help to generate more informative responses. Previous research focused on knowledge selection (KS) in the document with dialogue. However, dialogue history that is not related to the current dialogue may introduce noise in the KS processing. In this paper, we propose a Compare Aggregate Transformer (CAT) to jointly denoise the dialogue context and aggregate the document information for response generation. We designed two different comparison mechanisms to reduce noise (before and during decoding). In addition, we propose two metrics for evaluating document utilization efficiency based on word overlap. Experimental results on the CMU_DoG dataset show that the proposed CAT model outperforms the state-of-the-art approach and strong baselines.
pdf
bib
abs
Active Sentence Learning by Adversarial Uncertainty Sampling in Discrete Space
Dongyu Ru
|
Jiangtao Feng
|
Lin Qiu
|
Hao Zhou
|
Mingxuan Wang
|
Weinan Zhang
|
Yong Yu
|
Lei Li
Findings of the Association for Computational Linguistics: EMNLP 2020
Active learning for sentence understanding aims at discovering informative unlabeled data for annotation and therefore reducing the demand for labeled data. We argue that the typical uncertainty sampling method for active learning is time-consuming and can hardly work in real-time, which may lead to ineffective sample selection. We propose adversarial uncertainty sampling in discrete space (AUSDS) to retrieve informative unlabeled samples more efficiently. AUSDS maps sentences into latent space generated by the popular pre-trained language models, and discover informative unlabeled text samples for annotation via adversarial attack. The proposed approach is extremely efficient compared with traditional uncertainty sampling with more than 10x speedup. Experimental results on five datasets show that AUSDS outperforms strong baselines on effectiveness.
pdf
bib
abs
CycleGT: Unsupervised Graph-to-Text and Text-to-Graph Generation via Cycle Training
Qipeng Guo
|
Zhijing Jin
|
Xipeng Qiu
|
Weinan Zhang
|
David Wipf
|
Zheng Zhang
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)
Two important tasks at the intersection of knowledge graphs and natural language processing are graph-to-text (G2T) and text-tograph (T2G) conversion. Due to the difficulty and high cost of data collection, the supervised data available in the two fields are usually on the magnitude of tens of thousands, for example, 18K in the WebNLG 2017 dataset after preprocessing, which is far fewer than the millions of data for other tasks such as machine translation. Consequently, deep learning models for G2T and T2G suffer largely from scarce training data. We present CycleGT, an unsupervised training method that can bootstrap from fully non-parallel graph and text data, and iteratively back translate between the two forms. Experiments on WebNLG datasets show that our unsupervised model trained on the same number of data achieves performance on par with several fully supervised models. Further experiments on the non-parallel GenWiki dataset verify that our method performs the best among unsupervised baselines. This validates our framework as an effective approach to overcome the data scarcity problem in the fields of G2T and T2G.
2019
pdf
bib
abs
Exploring Diverse Expressions for Paraphrase Generation
Lihua Qian
|
Lin Qiu
|
Weinan Zhang
|
Xin Jiang
|
Yong Yu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Paraphrasing plays an important role in various natural language processing (NLP) tasks, such as question answering, information retrieval and sentence simplification. Recently, neural generative models have shown promising results in paraphrase generation. However, prior work mainly focused on single paraphrase generation, while ignoring the fact that diversity is essential for enhancing generalization capability and robustness of downstream applications. Few works have been done to solve diverse paraphrase generation. In this paper, we propose a novel approach with two discriminators and multiple generators to generate a variety of different paraphrases. A reinforcement learning algorithm is applied to train our model. Our experiments on two real-world datasets demonstrate that our model not only gains a significant increase in diversity but also improves generation quality over several state-of-the-art baselines.
pdf
bib
abs
TripleNet: Triple Attention Network for Multi-Turn Response Selection in Retrieval-Based Chatbots
Wentao Ma
|
Yiming Cui
|
Nan Shao
|
Su He
|
Wei-Nan Zhang
|
Ting Liu
|
Shijin Wang
|
Guoping Hu
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
We consider the importance of different utterances in the context for selecting the response usually depends on the current query. In this paper, we propose the model TripleNet to fully model the task with the triple <context, query, response> instead of <context, response > in previous works. The heart of TripleNet is a novel attention mechanism named triple attention to model the relationships within the triple at four levels. The new mechanism updates the representation of each element based on the attention with the other two concurrently and symmetrically. We match the triple <C, Q, R> centered on the response from char to context level for prediction. Experimental results on two large-scale multi-turn response selection datasets show that the proposed model can significantly outperform the state-of-the-art methods.
pdf
bib
abs
Retrieval-Enhanced Adversarial Training for Neural Response Generation
Qingfu Zhu
|
Lei Cui
|
Wei-Nan Zhang
|
Furu Wei
|
Ting Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Dialogue systems are usually built on either generation-based or retrieval-based approaches, yet they do not benefit from the advantages of different models. In this paper, we propose a Retrieval-Enhanced Adversarial Training (REAT) method for neural response generation. Distinct from existing approaches, the REAT method leverages an encoder-decoder framework in terms of an adversarial training paradigm, while taking advantage of N-best response candidates from a retrieval-based system to construct the discriminator. An empirical study on a large scale public available benchmark dataset shows that the REAT method significantly outperforms the vanilla Seq2Seq model as well as the conventional adversarial training approach.
pdf
bib
abs
Dynamically Fused Graph Network for Multi-hop Reasoning
Lin Qiu
|
Yunxuan Xiao
|
Yanru Qu
|
Hao Zhou
|
Lei Li
|
Weinan Zhang
|
Yong Yu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Text-based question answering (TBQA) has been studied extensively in recent years. Most existing approaches focus on finding the answer to a question within a single paragraph. However, many difficult questions require multiple supporting evidence from scattered text among two or more documents. In this paper, we propose Dynamically Fused Graph Network (DFGN), a novel method to answer those questions requiring multiple scattered evidence and reasoning over them. Inspired by human’s step-by-step reasoning behavior, DFGN includes a dynamic fusion layer that starts from the entities mentioned in the given query, explores along the entity graph dynamically built from the text, and gradually finds relevant supporting entities from the given documents. We evaluate DFGN on HotpotQA, a public TBQA dataset requiring multi-hop reasoning. DFGN achieves competitive results on the public board. Furthermore, our analysis shows DFGN produces interpretable reasoning chains.
2018
pdf
bib
abs
Zero Pronoun Resolution with Attention-based Neural Network
Qingyu Yin
|
Yu Zhang
|
Weinan Zhang
|
Ting Liu
|
William Yang Wang
Proceedings of the 27th International Conference on Computational Linguistics
Recent neural network methods for zero pronoun resolution explore multiple models for generating representation vectors for zero pronouns and their candidate antecedents. Typically, contextual information is utilized to encode the zero pronouns since they are simply gaps that contain no actual content. To better utilize contexts of the zero pronouns, we here introduce the self-attention mechanism for encoding zero pronouns. With the help of the multiple hops of attention, our model is able to focus on some informative parts of the associated texts and therefore produces an efficient way of encoding the zero pronouns. In addition, an attention-based recurrent neural network is proposed for encoding candidate antecedents by their contents. Experiment results are encouraging: our proposed attention-based model gains the best performance on the Chinese portion of the OntoNotes corpus, substantially surpasses existing Chinese zero pronoun resolution baseline systems.
pdf
bib
abs
Context-Sensitive Generation of Open-Domain Conversational Responses
Weinan Zhang
|
Yiming Cui
|
Yifa Wang
|
Qingfu Zhu
|
Lingzhi Li
|
Lianqiang Zhou
|
Ting Liu
Proceedings of the 27th International Conference on Computational Linguistics
Despite the success of existing works on single-turn conversation generation, taking the coherence in consideration, human conversing is actually a context-sensitive process. Inspired by the existing studies, this paper proposed the static and dynamic attention based approaches for context-sensitive generation of open-domain conversational responses. Experimental results on two public datasets show that the proposed static attention based approach outperforms all the baselines on automatic and human evaluation.
pdf
bib
abs
Label-Aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition
Zhenghui Wang
|
Yanru Qu
|
Liheng Chen
|
Jian Shen
|
Weinan Zhang
|
Shaodian Zhang
|
Yimei Gao
|
Gen Gu
|
Ken Chen
|
Yong Yu
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
We study the problem of named entity recognition (NER) from electronic medical records, which is one of the most fundamental and critical problems for medical text mining. Medical records which are written by clinicians from different specialties usually contain quite different terminologies and writing styles. The difference of specialties and the cost of human annotation makes it particularly difficult to train a universal medical NER system. In this paper, we propose a label-aware double transfer learning framework (La-DTL) for cross-specialty NER, so that a medical NER system designed for one specialty could be conveniently applied to another one with minimal annotation efforts. The transferability is guaranteed by two components: (i) we propose label-aware MMD for feature representation transfer, and (ii) we perform parameter transfer with a theoretical upper bound which is also label aware. We conduct extensive experiments on 12 cross-specialty NER tasks. The experimental results demonstrate that La-DTL provides consistent accuracy improvement over strong baselines. Besides, the promising experimental results on non-medical NER scenarios indicate that La-DTL is potential to be seamlessly adapted to a wide range of NER tasks.
pdf
bib
abs
Deep Reinforcement Learning for Chinese Zero Pronoun Resolution
Qingyu Yin
|
Yu Zhang
|
Wei-Nan Zhang
|
Ting Liu
|
William Yang Wang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent neural network models for Chinese zero pronoun resolution gain great performance by capturing semantic information for zero pronouns and candidate antecedents, but tend to be short-sighted, operating solely by making local decisions. They typically predict coreference links between the zero pronoun and one single candidate antecedent at a time while ignoring their influence on future decisions. Ideally, modeling useful information of preceding potential antecedents is crucial for classifying later zero pronoun-candidate antecedent pairs, a need which leads traditional models of zero pronoun resolution to draw on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to deal with the task. With the help of the reinforcement learning agent, our system learns the policy of selecting antecedents in a sequential manner, where useful information provided by earlier predicted antecedents could be utilized for making later coreference decisions. Experimental results on OntoNotes 5.0 show that our approach substantially outperforms the state-of-the-art methods under three experimental settings.
2017
pdf
bib
abs
Chinese Zero Pronoun Resolution with Deep Memory Network
Qingyu Yin
|
Yu Zhang
|
Weinan Zhang
|
Ting Liu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Existing approaches for Chinese zero pronoun resolution typically utilize only syntactical and lexical features while ignoring semantic information. The fundamental reason is that zero pronouns have no descriptive information, which brings difficulty in explicitly capturing their semantic similarities with antecedents. Meanwhile, representing zero pronouns is challenging since they are merely gaps that convey no actual content. In this paper, we address this issue by building a deep memory network that is capable of encoding zero pronouns into vector representations with information obtained from their contexts and potential antecedents. Consequently, our resolver takes advantage of semantic information by using these continuous distributed representations. Experiments on the OntoNotes 5.0 dataset show that the proposed memory network could substantially outperform the state-of-the-art systems in various experimental settings.
pdf
bib
abs
Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution
Ting Liu
|
Yiming Cui
|
Qingyu Yin
|
Wei-Nan Zhang
|
Shijin Wang
|
Guoping Hu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Most existing approaches for zero pronoun resolution are heavily relying on annotated data, which is often released by shared task organizers. Therefore, the lack of annotated data becomes a major obstacle in the progress of zero pronoun resolution task. Also, it is expensive to spend manpower on labeling the data for better performance. To alleviate the problem above, in this paper, we propose a simple but novel approach to automatically generate large-scale pseudo training data for zero pronoun resolution. Furthermore, we successfully transfer the cloze-style reading comprehension neural network model into zero pronoun resolution task and propose a two-step training mechanism to overcome the gap between the pseudo training data and the real one. Experimental results show that the proposed approach significantly outperforms the state-of-the-art systems with an absolute improvements of 3.1% F-score on OntoNotes 5.0 data.
pdf
bib
Benben: A Chinese Intelligent Conversational Robot
Wei-Nan Zhang
|
Ting Liu
|
Bing Qin
|
Yu Zhang
|
Wanxiang Che
|
Yanyan Zhao
|
Xiao Ding
Proceedings of ACL 2017, System Demonstrations
2012
pdf
bib
The Use of Dependency Relation Graph to Enhance the Term Weighting in Question Retrieval
Weinan Zhang
|
Zhaoyan Ming
|
Yu Zhang
|
Liqiang Nie
|
Ting Liu
|
Tat-Seng Chua
Proceedings of COLING 2012