Peidong Wang


2026

Existing fraud detection methods predominantly rely on transcribed text, suffering from ASR errors and missing crucial acoustic cues like vocal tone and environmental context. This limits their effectiveness against complex deceptive strategies. To address these challenges, we first propose **SAFE-QAQ**, an end-to-end comprehensive framework for audio-based slow-thinking fraud detection. First, the SAFE-QAQ framework eliminates the impact of transcription errors on detection performance. Secondly, we propose rule-based slow-thinking reward mechanisms that systematically guide the system to identify fraud-indicative patterns by accurately capturing fine-grained audio details, through hierarchical reasoning processes. Besides, our framework introduces a dynamic risk assessment framework during live calls, enabling early detection and prevention of fraud. Experiments on the TeleAntiFraud-Bench demonstrate that SAFE-QAQ achieves dramatic improvements over existing methods in multiple key dimensions, including accuracy, inference efficiency, and real-time processing capabilities. Currently deployed and analyzing over 70,000 calls daily, SAFE-QAQ effectively automates complex fraud detection, reducing human workload and financial losses. Code: https://anonymous.4open.science/r/SAFE-QAQ.
Harmful memes convey offensive intent through implicit associations between visual symbols and text, requiring a broad understanding of cultural stereotypes and visual metaphors. Small-scale Multimodal Large Language Models (MLLMs) often lack the knowledge required to identify such implicit hate, whereas Large-scale MLLMs, despite their broader knowledge, exhibit systematic labeling bias. To address these challenges, we propose DR-HM, a Distill-then-Reinforce training framework with cognition-aware data synthesis for harmful meme detection, which aims to transfer knowledge from closed-source models while mitigating their biases. DR-HM introduces a six-step structured data synthesis scheme with self-refinement that decomposes meme analysis into a progressive, human-inspired reasoning process from entity recognition to harmfulness judgment. Based on the synthesized reasoning data, we further adopt a Distill-then-Reinforce training strategy. This approach combines a two-stage Supervised Fine-Tuning (SFT) with an Adaptive Group Relative Policy Optimization (A-GRPO) algorithm, which incorporates class-ratio-aware reward weighting and dynamic KL coefficients. Experiments on three benchmark datasets show that the proposed approach consistently outperforms existing methods and achieves an accuracy of 84.7% on the FHM dataset, approaching the reported performance of human annotators.
Large Reasoning Models (LRMs) often suffer from overthinking, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose NEAT, a Neuron-based Early reAsoning exiT framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing any additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that, for each model, NEAT achieves an average token reduction of 22% to 28% when averaged over the four benchmarks, while maintaining accuracy.

2025

Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers’ mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose **AnnaAgent**, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator’s configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on [https://github.com/sci-m-wang/AnnaAgent](https://github.com/sci-m-wang/AnnaAgent).
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their further evolution is often hampered by the scarcity of high-quality training data and the heavy reliance of traditional methods on expert-labeled data. This reliance sets a ceiling on LLM performance and is particularly challenging in low data resource scenarios where extensive supervision is unavailable. To address this issue, we propose a novel paradigm named LANCE (**LAN**guage models as **C**ontinuous self-**E**volving data engineers) that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of post-training data construction. Through iterative fine-tuning on Qwen2 series models, we validate the effectiveness of LANCE across various tasks, showing that it can maintain high-quality data generation and continuously improve model performance. Across multiple benchmark dimensions, LANCE results in an average score enhancement of **3.64** for Qwen2-7B and **1.75** for Qwen2-7B-Instruct. This autonomous data construction paradigm not only lessens reliance on human experts or external models but also ensures data aligns with human preferences, offering a scalable path for LLM self-improvement, especially in contexts with limited supervisory data. Code is available at: https://github.com/Control-derek/LANCE.
Generative Error Correction (GEC) has emerged as a powerful post-processing method to boost the performance of Automatic Speech Recognition (ASR) systems. In this paper, we first show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon amplifies with named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. First, we augment the GEC training dataset with synthetic data generated using foundational generative models, thereby simulating additional errors from which the model can learn from. For out-of-domain scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle NEs, we introduce retrieval-augmented correction wherein we augment the model input with entities retrieved from a datastore of NEs. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8%–30% relative WER improvements in ID and 10%–33% improvements in OOD settings.

2024

Stickers, while widely recognized for enhancing empathetic communication in online interactions, remain underexplored in current empathetic dialogue research, notably due to the challenge of a lack of comprehensive datasets. In this paper, we introduce the Agent for STICKERCONV (Agent4SC), which uses collaborative agent interactions to realistically simulate human behavior with sticker usage, thereby enhancing multimodal empathetic communication. Building on this foundation, we develop a multimodal empathetic dialogue dataset, STICKERCONV, comprising 12.9K dialogue sessions, 5.8K unique stickers, and 2K diverse conversational scenarios. This dataset serves as a benchmark for multimodal empathetic generation. To advance further, we propose PErceive and Generate Stickers (PEGS), a multimodal empathetic response generation framework, complemented by a comprehensive set of empathy evaluation metrics based on LLM. Our experiments demonstrate PEGS’s effectiveness in generating contextually relevant and emotionally resonant multimodal empathetic responses, contributing to the advancement of more nuanced and engaging empathetic dialogue systems.
Responding with multimodal content has been recognized as one of the essential functionalities of intelligent conversational agents. However, existing research on multimodal dialogues primarily focuses on two topics: (1) textual response generation that ground the conversation on a given image; and (2) visual response selection based on the dialogue context. In light of the aforementioned gap, we propose mulTImodal GEnerator for dialogue Response (TIGER), a unified generative model framework for multimodal dialogue response generation. Through extensive experiments, TIGER has demonstrated new state-of-the-art results, providing users with an enhanced conversational experience. A multimodal dialogue system based on TIGER is available at https://github.com/friedrichor/TIGER. A video demonstrating the system is available at https://www.youtube.com/watch?v=Kd0CMwDs8Rk.