Qingsong Wen - ACL Anthology

Qingsong Wen

2026

HearSay Benchmark: Do Audio LLMs Leak What They Hear?
Jin Wang | Kaiwen Luo | Liang Lin | Weiliu Wang | Yitian Chen | Moayad Aloqaily | Xuehai Tang | Zhenhong Zhou | Kun Wang | Li Sun | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026

While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings: (1) Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. (2) Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. (3) Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The code and dataset are available at https://github.com/JinWang79/HearSay_Benchmark

Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback
Yiyuan Yang | Zichuan Liu | Lei Song | Kai Ying | Stephen Wang | Joshua Thomas Bamford | Svitlana Vyetrenko | Jiang Bian | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026

Time series anomaly detection (TSAD) has traditionally focused on binary classification and often lacks the fine-grained categorization and explanatory reasoning required for transparent decision-making. To address these limitations, we propose Time-series Reasoning for Anomaly (Time-RA), a novel task that reformulates TSAD from a discriminative into a generative, reasoning-intensive paradigm. To facilitate this, we introduce RATs40K, the first real-world large-scale multimodal benchmark with ~40,000 samples across 10 domains, integrating raw time series, textual context, and visual plots with structured reasoning annotations. Extensive benchmarking shows that while supervised fine-tuning and visual representations boost diagnostic accuracy and reasoning consistency, performance varies across complex scenarios. Notably, fine-tuned models demonstrate strong "plug-and-play" transferability, outperforming traditional baselines on unseen real-world datasets. Our work establishes a foundation for interpretable, multimodal time series analysis. All code and the RATs40K dataset are fully open-sourced to facilitate future research.

ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code
Jian Xie | Zhendong Chu | Aoxiao Zhong | Kai Zhang | Mingzhe Han | Xing Fan | Jialie Shen | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026

Large Reasoning Models (LRMs) often suffer from the “over-thinking” problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal settings. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Yibo Yan | Shen Wang | Jiahao Huo | Jingheng Ye | Zhendong Chu | Xuming Hu | Philip S. Yu | Carla P Gomes | Bart Selman | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026

Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, **this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology**. We highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. However, challenges such as multimodal alignment, data diversity, and reasoning depth remain obstacles to achieving their full potential. To address these challenges, we propose actionable suggestions for the near future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with valuable insights for achieving Artificial General Intelligence (AGI).

As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to handle mathematical reasoning tasks is promising, as they can handle multimodal questions via cross-modal understanding capabilities, unlike text-only LLMs. Current mathematical benchmarks predominantly focus on evaluating MLLMs’ problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, which would enhance reasoning capability in complicated settings. To fill this gap, we formally formulate a new task, multimodal error detection, and introduce **ErrorRadar, the first benchmark designed to assess MLLMs’ capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization**, providing a framework for evaluating MLLMs’ complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with expert-based annotation and metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that challenges remain: GPT-4o, the best-performing model, is still around 10% behind human evaluation.

Backdoor Collapse: Eliminating Unknown Threats Via Known Backdoor Aggregation In Language Models
Liang Lin | Miao Yu | Moayad Aloqaily | Zhenhong Zhou | Kun Wang | Linsey Pang | Prakhar Mehrotra | Qingsong Wen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. Locphylax is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. Locphylax leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) Locphylax reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1%–69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios. Our code is available at https://anonymous.4open.science/r/Locphylax.

Scaling Law for Multimodal Large Language Model Supervised Fine-Tuning
YiFan Zhang | Tao Yu | Feng Li | Chaoyou Fu | Yibo Hu | Kun Wang | Qingsong Wen | Zhang Zhang | Liang Wang | Rong Jin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The supervised fine-tuning (SFT) stage is crucial for multimodal large language models (MLLMs), yet a comprehensive scaling law to guide the optimal model-data configuration remains lacking. In this paper, we make an initial attempt to address this gap. First, we theoretically demonstrate that directly computing the optimal computation frontier for MLLM-SFT, as we can for traditional LLMs, is a challenging task. This complexity arises because MLLM-SFT is influenced by a broader range of factors, including model size, LLM pre-training tokens, and MLLM SFT tokens. To tackle this issue, we propose two scaling laws based on LLM paradigms: one applicable when training data volumes are well defined by researchers, and another for cases where models are sourced from open communities with unknown training data. Through theoretical modeling and approximations, we provide researchers with valuable recommendations for optimal resource allocation. Furthermore, we establish a strong correlation (R² = 0.98) between training loss and downstream performance, enabling accurate performance estimation without the need for exhaustive benchmarking. To validate our scaling laws, we construct a testbed of 60 models ranging from 50 million to 8 billion parameters, totaling 1,560 checkpoints. Each checkpoint is evaluated on more than 10 MLLM benchmarks, ensuring robust fitting of our formulations.

2025

LLM Agents for Education: Advances and Applications
Zhendong Chu | Shen Wang | Jian Xie | Tinghui Zhu | Yibo Yan | Jingheng Ye | Aoxiao Zhong | Xuming Hu | Jing Liang | Philip S. Yu | Qingsong Wen
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Model (LLM) agents are transforming education by automating complex pedagogical tasks and enhancing both teaching and learning processes. In this survey, we present a systematic review of recent advances in applying LLM agents to address key challenges in educational settings, such as feedback comment generation, curriculum design, etc. We analyze the technologies enabling these agents, including representative datasets, benchmarks, and algorithmic frameworks. Additionally, we highlight key challenges in deploying LLM agents in educational settings, including ethical issues, hallucination and overreliance, and integration with existing educational ecosystems. Beyond the core technical focus, we include in Appendix A a comprehensive overview of domain-specific educational agents, covering areas such as science learning, language learning, and professional development.

A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
Yibo Yan | Jiamin Su | Jianxiang He | Fangteng Fu | Xu Zheng | Yuanhuiyi Lyu | Kun Wang | Shen Wang | Qingsong Wen | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025

Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides **the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs)**. We review over 200 studies published since 2021, and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into future directions for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.

NetSafe: Exploring the Topological Safety of Multi-agent System
Miao Yu | Shilong Wang | Guibin Zhang | Junyuan Mao | Chenlong Yin | Qijiong Liu | Kun Wang | Qingsong Wen | Yang Wang
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) have fueled significant progress in intelligent Multi-agent Systems (MAS), with expanding academic and industrial applications. However, safeguarding these systems from malicious queries receives relatively little attention, while methods for single-agent safety are challenging to transfer. In this paper, we explore MAS safety from a topological perspective, aiming to identify structural properties that enhance security. To this end, we propose the NetSafe framework, unifying diverse MAS workflows via iterative RelCom interactions to enable generalized analysis. We identify several critical phenomena for MAS under attacks (misinformation, bias, and harmful content), termed Agent Hallucination, Aggregation Safety, and Security Bottleneck. Furthermore, we verify that highly connected and larger systems are more vulnerable to adversarial spread, with task performance in a Star Graph Topology decreasing by 29.7%. In conclusion, our work introduces a new perspective on MAS safety and discovers unreported phenomena, offering insights and posing challenges to the community.

Position: LLMs Can be Good Tutors in English Education
Jingheng Ye | Shen Wang | Deqing Zou | Yibo Yan | Kun Wang | Hai-Tao Zheng | Ruitong Liu | Zenglin Xu | Irwin King | Philip S. Yu | Qingsong Wen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

While recent efforts have begun integrating large language models (LLMs) into English education, they often rely on traditional approaches to learning tasks without fully embracing educational methodologies, thus lacking adaptability to language learning. To address this gap, we argue that **LLMs have the potential to serve as effective tutors in English Education**. Specifically, LLMs can play three critical roles: (1) as data enhancers, improving the creation of learning materials or serving as student simulations; (2) as task predictors, supporting learner assessment or optimizing learning pathways; and (3) as agents, enabling personalized and inclusive education. We encourage interdisciplinary research to explore these roles, fostering innovation while addressing challenges and risks, ultimately advancing English Education through the thoughtful integration of LLMs.

DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
Hanjun Luo | Yingbin Jin | Yiran Wang | Xinfeng Li | Tong Shang | Xuecheng Liu | Ruizhe Chen | Kun Wang | Hanan Salam | Qingsong Wen | Zuozhu Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The advancements of Large Language Models (LLMs) have spurred a growing interest in their application to Named Entity Recognition (NER) methods. However, existing datasets are primarily designed for traditional machine learning methods and are inadequate for LLM-based methods, in terms of corpus selection and overall dataset design logic. Moreover, the prevalent fixed and relatively coarse-grained entity categorization in existing datasets fails to adequately assess the superior generalization and contextual understanding capabilities of LLM-based methods, thereby hindering a comprehensive demonstration of their broad application prospects. To address these limitations, we propose DynamicNER, the first NER dataset designed for LLM-based methods with dynamic categorization, introducing various entity types and entity type lists for the same entity in different contexts, to better leverage the generalization ability of LLM-based NER. The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning a diverse range of domains. Furthermore, we introduce CascadeNER, a novel NER method based on a two-stage strategy and lightweight LLMs, achieving higher accuracy on fine-grained tasks while requiring fewer computational resources. Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods. We also analyze traditional methods and LLM-based methods on our dataset. Our code and dataset are openly available at https://github.com/Astarojth/DynamicNER.

UniEDU: Toward Unified and Efficient Large Multimodal Models for Educational Tasks
Zhendong Chu | Jian Xie | Shen Wang | Zichao Wang | Qingsong Wen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Education materials for K-12 students often consist of multiple modalities, such as text and images, posing challenges for models to fully understand nuanced information in these materials. In this paper, we propose a unified language and vision assistant UniEDU designed for various educational applications, including knowledge recommendation, knowledge tracing, time cost prediction, and user answer prediction, all within a single model. Unlike conventional task-specific models, UniEDU offers a unified solution that excels across multiple educational tasks while maintaining strong generalization capabilities. Its adaptability makes it well-suited for real-world deployment in diverse learning environments. Furthermore, UniEDU is optimized for industry-scale deployment by significantly reducing computational overhead—achieving approximately a 300% increase in efficiency—while maintaining competitive performance with minimal degradation compared to fully fine-tuned models. This work represents a significant step toward creating versatile AI systems tailored to the evolving demands of education.

Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
Hang Li | Tianlong Xu | Kaiqi Yang | Yucheng Chu | Yanling Chen | Yichi Song | Qingsong Wen | Hui Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.

Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement
Yaxuan Kong | Yiyuan Yang | Yoontae Hwang | Wenjie Du | Stefan Zohren | Zhangyang Wang | Ming Jin | Qingsong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Time series data are foundational in finance, healthcare, and energy domains. However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks: numerical analytical tasks and open-ended question answering with reasoning. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing ~200k question-answer pairs derived from diverse time series spanning environment, traffic, etc. This comprehensive resource covers various time series lengths and promotes robust model development. We further demonstrate how continually pre-training large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on the TSQA dataset enhances time series reasoning capabilities, moving beyond mere numeric tasks and enabling more advanced and intuitive interactions with temporal data. The complete TSQA dataset, models, user study questionnaires for evaluation, and other related materials have been open-sourced.

MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Philip S. Yu | Xuming Hu | Qingsong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of **identifying and categorizing student errors in multimodal mathematical contexts**. Therefore, we introduce **MathAgent, a novel Mixture-of-Math-Agent framework** specifically designed to address these challenges. Our approach decomposes error detection into three phases with specialized agents: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of multimodal mathematical content by explicitly modeling the relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Furthermore, MathAgent has been successfully deployed in an educational platform serving over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.

2024

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation
Xuanwang Zhang | Yun-Ze Song | Yidong Wang | Shuyun Tang | Xinfeng Li | Zhengran Zeng | Zhen Wu | Wei Ye | Wenyuan Xu | Yue Zhang | Xinyu Dai | Shikun Zhang | Qingsong Wen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention. However, even the most advanced LLMs face challenges such as hallucinations and real-time updating of their knowledge. Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG). However, two key issues constrain the development of RAG. First, there is a growing lack of comprehensive and fair comparisons between novel RAG algorithms. Second, open-source tools such as LlamaIndex and LangChain employ high-level abstractions, which results in a lack of transparency and limits the ability to develop novel algorithms and evaluation metrics. To close this gap, we introduce RAGLAB, a modular and research-oriented open-source library. RAGLAB reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms. Leveraging RAGLAB, we conduct a fair comparison of 6 RAG algorithms across 10 benchmarks. With RAGLAB, researchers can efficiently compare the performance of various algorithms and develop novel algorithms.

Co-authors

Moayad Aloqaily 2

Zhenhong Zhou 2

Joshua Thomas Bamford 1

Carla P Gomes 1

Yoontae Hwang 1

Yuanhuiyi Lyu 1

Prakhar Mehrotra 1

Svitlana Vyetrenko 1

Zhangyang Wang 1

Zhengran Zeng 1

Xuanwang Zhang 1

Hai-Tao Zheng 1

Stefan Zohren 1

Venues