Non-collaborative dialogue involves two participants with conflicting interests engaging in a multi-round dialogue to achieve their own goals. Strategy planning is key to guiding both participants towards a consensus. Most LLM-based methods rely on stimulus prompts or external strategy planners for strategy planning. However, stimulus prompts fail to teach LLMs to plan dialogue strategies explicitly. Moreover, training external strategy planners does not fully account for adversarial interactions, limiting their effectiveness against tough resisters. In this paper, to mitigate the above issues, we propose GAIA, a Game-based Adversarial self-play InterActive training paradigm, which constructs an adversarial two-player (a persuader and a resister) zero-sum game and guides the game towards an approximate Nash Equilibrium (NE) via reinforcement learning (RL) for non-collaborative dialogue. First, we design a Chain-of-Mind prompt that reasons about the resister's dialogue acts step by step to plan persuasive strategies. Second, to adversarially improve the persuader, we construct diverse resistant planners and theoretically improve the persuader's optimal lower bound. Finally, we iteratively optimize the two policies via adversarial self-play interactive RL and design an ε-NE verification algorithm to approximate the game's NE. Experiments on three datasets show that our model achieves state-of-the-art performance.
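The abstract leaves the ε-NE verification step abstract. As a concrete illustration, the sketch below checks whether a pair of policies is an ε-NE of a two-player zero-sum game via exploitability, i.e., the total payoff the players could gain by deviating to best responses. The toy payoff matrix and uniform policies are illustrative assumptions, not the paper's persuader/resister setup.

```python
# Illustrative epsilon-NE check for a two-player zero-sum matrix game.
# The payoff matrix and policies below are toy stand-ins, not the
# paper's persuader/resister RL policies.
import numpy as np

def exploitability(payoff, row_policy, col_policy):
    """Total payoff both players could gain by deviating to a best pure
    response; it is 0 exactly at a Nash Equilibrium."""
    value = row_policy @ payoff @ col_policy       # row player's expected payoff
    gain_row = np.max(payoff @ col_policy) - value # row player's best deviation
    gain_col = np.max(-payoff.T @ row_policy) + value  # column player's best deviation
    return gain_row + gain_col

def is_epsilon_ne(payoff, row_policy, col_policy, eps=1e-2):
    """Verify the epsilon-NE condition: no player gains more than eps by deviating."""
    return exploitability(payoff, row_policy, col_policy) <= eps

# Matching pennies: the uniform mixed strategies form the unique NE.
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
print(is_epsilon_ne(payoff, uniform, uniform))  # True
```

In the paper's setting the policies would be the persuader's and resister's learned dialogue strategies rather than mixed strategies over a payoff matrix, but the verification logic is the same: stop self-play once neither player can improve by more than ε.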
Hate speech (HS) on social media exacerbates misinformation and baseless prejudices. Evidence-supported counterspeech (CS) is crucial for correcting misinformation and reducing prejudice with facts. Existing methods for generating evidence-supported CS often lack a core claim to guide the organization of evidence, and they do not adequately address factuality and faithfulness hallucinations in CS within anti-hate contexts. In this paper, to mitigate these issues, we propose F2RL, a Factuality and Faithfulness Reinforcement Learning framework for generating claim-guided and evidence-supported CS. First, we generate counter-claims based on the hate speech and design a self-evaluation mechanism to select the most appropriate one. Second, we propose a coarse-to-fine evidence retrieval method: it initially generates broad queries to ensure the diversity of evidence, then carefully reranks the retrieved evidence to ensure its relevance to the claim. Finally, we design a reinforcement learning method with a triplet-based factuality reward model and a multi-aspect faithfulness reward model. These rewards encourage the generator towards greater factuality, more accurate refutation of hate speech, consistency with the claim, and better utilization of evidence. Extensive experiments on three benchmark datasets demonstrate that the proposed framework achieves excellent performance in CS generation, with strong factuality and faithfulness.
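To make the reward design concrete, the sketch below shows one way the two reward signals described above could be combined into a scalar RL reward. The triplet-matching heuristic, the faithfulness score fields, and the weights w_fact/w_faith are hypothetical placeholders for illustration, not F2RL's trained reward models.

```python
# Illustrative composition of a triplet-based factuality reward and a
# multi-aspect faithfulness reward into one scalar for RL fine-tuning.
# All scorers and weights here are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class FaithfulnessScores:
    refutes_hate: float       # does the CS actually refute the hate speech?
    claim_consistency: float  # does it stay consistent with the counter-claim?
    evidence_usage: float     # does it ground its statements in the evidence?

def factuality_reward(cs_triplets, evidence_triplets) -> float:
    """Fraction of (subject, relation, object) triplets in the CS that are
    supported by triplets extracted from the retrieved evidence."""
    if not cs_triplets:
        return 0.0
    supported = sum(t in evidence_triplets for t in cs_triplets)
    return supported / len(cs_triplets)

def total_reward(fact_r: float, faith: FaithfulnessScores,
                 w_fact: float = 0.5, w_faith: float = 0.5) -> float:
    """Scalar reward fed back to the generator during RL training."""
    faith_r = (faith.refutes_hate + faith.claim_consistency
               + faith.evidence_usage) / 3.0
    return w_fact * fact_r + w_faith * faith_r

# Example: two of the three CS triplets appear in the evidence.
cs = [("HS-claim", "is", "false"), ("group", "contributes", "society"),
      ("stat", "shows", "decline")]
ev = [("HS-claim", "is", "false"), ("group", "contributes", "society")]
print(total_reward(factuality_reward(cs, ev),
                   FaithfulnessScores(0.9, 0.8, 0.7)))
```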
Counterspeech is an effective way to combat online hate speech. Given the multifaceted nature of online hate speech, counterspeech with varying intents (e.g., denouncing or empathy) has significant potential to mitigate hate speech effectively. Recently, controlled approaches based on large language models (LLMs) have been explored to generate intent-specific counterspeech. However, because LLMs pay little attention to intent-specific information during decoding, those methods cater more to semantic information than to matching the desired intents. Further, quantitative evaluation of how effectively counterspeech with different intents mitigates hate speech remains limited. In this paper, to address the above issues, we propose DART, an LLM-based DuAl-discRiminaTor guided framework for counterspeech generation. We employ an intent-aware discriminator and a hate-mitigating discriminator to jointly guide the decoding preferences of LLMs, steering the model towards generating counterspeech that caters to the specific intent and mitigates hate. We train the discriminators with a maximum-margin relative objective, which leverages the distance between counterspeech aligned with the desired target (such as a specific intent or effectiveness in hate mitigation) and undesired counterspeech as an effective learning signal. Extensive experiments show that DART achieves excellent performance in matching the desired intent and mitigating hate.
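The sketch below illustrates the two ideas in this abstract: a maximum-margin relative objective for training a discriminator, and decoding guidance that combines an LM's next-token log-probabilities with weighted log-scores from the two discriminators. The toy scorers, candidate tokens, and weights alpha/beta are assumptions for illustration, not DART's trained components.

```python
# Illustrative max-margin discriminator objective and dual-discriminator
# guided decoding step; the scorers below are toy stand-ins for trained
# intent-aware and hate-mitigating discriminators.
import math

def max_margin_loss(score_desired: float, score_undesired: float,
                    margin: float = 1.0) -> float:
    """Maximum-margin relative objective: push the score of counterspeech
    with the desired property above the undesired one by at least `margin`."""
    return max(0.0, margin - (score_desired - score_undesired))

def guided_pick(lm_logprobs, candidates, prefix,
                intent_score, mitigation_score, alpha=1.0, beta=1.0):
    """Greedy next-token choice under the LM log-prob plus weighted
    log-scores from the two discriminators."""
    best, best_tok = -math.inf, None
    for tok, lp in zip(candidates, lm_logprobs):
        hyp = prefix + tok
        score = (lp + alpha * math.log(intent_score(hyp) + 1e-9)
                    + beta * math.log(mitigation_score(hyp) + 1e-9))
        if score > best:
            best, best_tok = score, tok
    return best_tok

# Toy scorers standing in for the trained discriminators.
intent = lambda text: 0.9 if "empathy" in text else 0.2
mitigate = lambda text: 0.8 if "understand" in text else 0.3
print(guided_pick([-1.2, -0.7], [" empathy", " anger"], "I",
                  intent, mitigate))  # picks " empathy"
```

The discriminators thus reshape the LLM's decoding preferences at each step rather than relying on the prompt alone, which is what lets generation cater to the desired intent while still reading fluently.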