2025
Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
Hang Li | Tianlong Xu | Kaiqi Yang | Yucheng Chu | Yanling Chen | Yichi Song | Qingsong Wen | Hui Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions to a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples from GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques such as chain-of-thought prompting.
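The abstract describes the idea but not the implementation; the following is a minimal sketch of an ask-before-detect loop, assuming a generic `llm(prompt) -> str` completion function and illustrative prompt wording, neither of which is taken from the paper.

```python
# Minimal sketch of the ask-before-detect idea: generate an adaptive
# reference solution first, then detect errors against it. The `llm`
# callable and all prompt wording are hypothetical stand-ins.

def ask_before_detect(problem: str, student_solution: str, llm) -> str:
    # Step 1 ("ask"): have the LLM produce a reference solution that
    # follows the same overall approach as the student, so alternative
    # but valid strategies are not penalized (mitigating conformity bias).
    reference = llm(
        f"Problem: {problem}\n"
        f"Student solution: {student_solution}\n"
        "Solve the problem yourself, step by step, using the same "
        "overall approach the student took."
    )
    # Step 2 ("detect"): compare the student solution against the
    # adaptive reference, with chain-of-thought reasoning.
    return llm(
        f"Problem: {problem}\n"
        f"Reference solution: {reference}\n"
        f"Student solution: {student_solution}\n"
        "Think step by step, then state the first erroneous step, "
        "or 'no error' if the solution is correct."
    )
```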
Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement
Yaxuan Kong | Yiyuan Yang | Yoontae Hwang | Wenjie Du | Stefan Zohren | Zhangyang Wang | Ming Jin | Qingsong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Time series data are foundational in finance, healthcare, and energy domains. However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks, covering both numerical analytical tasks and open-ended question answering with reasoning. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing ~200k question-answer pairs derived from diverse time series spanning domains such as environment and traffic. This comprehensive resource covers various time series lengths and promotes robust model development. We further demonstrate how continually pre-training large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on the TSQA dataset enhances time series reasoning capabilities, moving beyond mere numeric tasks and enabling more advanced and intuitive interactions with temporal data. The complete TSQA dataset, models, user study questionnaires for evaluation, and other related materials have been open-sourced.
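As a rough illustration of how a question-answer pair over a raw series might be serialized into text for such continual pre-training, here is a sketch; the template and field names are assumptions for illustration, not the released TSQA schema.

```python
# Sketch of serializing a TSQA-style example into a training prompt.
# The "Context" line reflects the paper's context-enhancement idea;
# the exact template is an assumption, not the dataset's format.

def format_tsqa_example(series: list[float], context: str, question: str,
                        answer: str | None = None) -> str:
    values = ", ".join(f"{v:.2f}" for v in series)
    prompt = (
        f"Context: {context}\n"        # metadata describing the series
        f"Time series: [{values}]\n"   # raw values serialized as text
        f"Question: {question}\n"
        "Answer:"
    )
    # During training the target answer is appended; at inference it is left blank.
    return prompt if answer is None else f"{prompt} {answer}"

example = format_tsqa_example(
    series=[12.1, 13.4, 15.0, 14.2],
    context="Hourly traffic volume (in thousands) at a highway sensor.",
    question="Is there an upward trend in this window?",
    answer="Yes, values rise from 12.1 to a peak of 15.0 before a slight dip.",
)
```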
MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Philip S. Yu | Xuming Hu | Qingsong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of identifying and categorizing student errors in multimodal mathematical contexts. Therefore, we introduce MathAgent, a novel Mixture-of-Math-Agent framework specifically designed to address these challenges. Our approach decomposes error detection into three phases with specialized agents: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of multimodal mathematical content by explicitly modeling the relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Furthermore, MathAgent has been successfully deployed in an educational platform serving over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.
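A minimal sketch of the three-phase decomposition described above, with a hypothetical `llm(prompt) -> str` interface and illustrative prompts; the paper's actual agent prompts and orchestration are not shown in this listing.

```python
# Sketch of the three-phase agent pipeline named in the abstract:
# consistency validator -> visual semantic interpreter -> error analyzer.
# Prompts and the `llm` interface are hypothetical stand-ins.

def math_agent(problem_text: str, image_caption: str,
               steps: list[str], llm) -> str:
    # Phase 1: image-text consistency validator.
    consistency = llm(
        f"Problem text: {problem_text}\nFigure description: {image_caption}\n"
        "Do the text and figure agree? Note any mismatches."
    )
    # Phase 2: visual semantic interpreter extracts the mathematical
    # content of the figure in textual form.
    visual_facts = llm(
        f"Figure description: {image_caption}\n"
        "List the mathematical facts (quantities, relations) shown."
    )
    # Phase 3: integrative error analyzer relates the problem, the
    # extracted facts, and each student step to locate and label the error.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    return llm(
        f"Problem: {problem_text}\nConsistency notes: {consistency}\n"
        f"Visual facts: {visual_facts}\nStudent steps:\n{numbered}\n"
        "Identify the first erroneous step (if any) and its error category."
    )
```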
NetSafe: Exploring the Topological Safety of Multi-agent System
Miao Yu | Shilong Wang | Guibin Zhang | Junyuan Mao | Chenlong Yin | Qijiong Liu | Kun Wang | Qingsong Wen | Yang Wang
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have fueled significant progress in intelligent Multi-agent Systems (MAS), with expanding academic and industrial applications. However, safeguarding these systems from malicious queries has received relatively little attention, and methods for single-agent safety are challenging to transfer. In this paper, we explore MAS safety from a topological perspective, aiming to identify structural properties that enhance security. To this end, we propose the NetSafe framework, which unifies diverse MAS workflows via iterative RelCom interactions to enable generalized analysis. We identify several critical phenomena for MAS under attacks (misinformation, bias, and harmful content), termed Agent Hallucination, Aggregation Safety, and Security Bottleneck. Furthermore, we verify that highly connected and larger systems are more vulnerable to adversarial spread, with task performance in a Star Graph Topology decreasing by 29.7%. In conclusion, our work introduces a new perspective on MAS safety and discovers previously unreported phenomena, offering insights and posing challenges to the community.
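To make the topological setup concrete, here is a sketch of agents placed on a star graph and exchanging messages over iterative rounds, in the spirit of the iterative interactions described above; the `llm` interface, prompts, and round structure are assumptions, not the paper's implementation.

```python
# Sketch of agents on a communication graph revising answers over
# iterative rounds, so influence (including adversarial content)
# propagates only along edges. Pure-Python adjacency representation;
# the `llm` callable and prompts are hypothetical.

def star_topology(n_agents: int) -> dict[int, list[int]]:
    # Agent 0 is the hub; every other agent connects only to it.
    edges = {0: list(range(1, n_agents))}
    for i in range(1, n_agents):
        edges[i] = [0]
    return edges

def run_rounds(query: str, topology: dict[int, list[int]],
               llm, rounds: int = 3) -> dict[int, str]:
    # Each agent drafts an initial answer independently.
    answers = {i: llm(f"Answer the question: {query}") for i in topology}
    for _ in range(rounds):
        updated = {}
        for i, neighbors in topology.items():
            peer_views = "\n".join(answers[j] for j in neighbors)
            # An agent sees only its neighbors' messages before revising,
            # so graph structure governs how far an attack can spread.
            updated[i] = llm(
                f"Question: {query}\nYour answer: {answers[i]}\n"
                f"Neighbor answers:\n{peer_views}\nRevise your answer."
            )
        answers = updated
    return answers
```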
A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
Yibo Yan | Jiamin Su | Jianxiang He | Fangteng Fu | Xu Zheng | Yuanhuiyi Lyu | Kun Wang | Shen Wang | Qingsong Wen | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025
Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs). We review over 200 studies published since 2021 and examine state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into future directions for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.
2024
RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation
Xuanwang Zhang | Yun-Ze Song | Yidong Wang | Shuyun Tang | Xinfeng Li | Zhengran Zeng | Zhen Wu | Wei Ye | Wenyuan Xu | Yue Zhang | Xinyu Dai | Shikun Zhang | Qingsong Wen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention. However, even the most advanced LLMs face challenges such as hallucinations and keeping their knowledge up to date. Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval-Augmented Generation (RAG). However, two key issues constrain the development of RAG. First, there is a growing lack of comprehensive and fair comparisons between novel RAG algorithms. Second, open-source tools such as LlamaIndex and LangChain employ high-level abstractions, which results in a lack of transparency and limits the ability to develop novel algorithms and evaluation metrics. To close this gap, we introduce RAGLAB, a modular and research-oriented open-source library. RAGLAB reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms. Leveraging RAGLAB, we conduct a fair comparison of 6 RAG algorithms across 10 benchmarks. With RAGLAB, researchers can efficiently compare the performance of various algorithms and develop novel ones.
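RAGLAB's actual API is not reproduced in this listing; as a generic illustration of the retrieve-then-generate loop that such a library modularizes, here is a sketch assuming stand-in `retriever.search` and `llm` components (neither is RAGLAB's interface).

```python
# Generic retrieve-then-generate loop of the kind a modular RAG library
# decomposes into swappable components. This is NOT RAGLAB's API; the
# `retriever.search(query, top_k=...)` and `llm(prompt)` interfaces are
# illustrative assumptions.

def naive_rag(question: str, retriever, llm, k: int = 5) -> str:
    # Retrieve the k passages most relevant to the question.
    passages = retriever.search(question, top_k=k)
    context = "\n\n".join(passages)
    # Condition generation on the retrieved evidence.
    return llm(
        f"Use the following passages to answer.\n\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Keeping the retriever, generator, and prompt template as separate arguments is what makes algorithms comparable under identical conditions, which is the kind of fair-comparison setup the abstract describes.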