Hang Li
2026
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Hang Li | Boyan Li | Jiamin Su | Xiong Gao | YiFan Zhang | Tianlong Xu | Zhendong Chu | Aoxiao Zhong | Kun Wang | Hui Xiong | Philip S. Yu | Xuming Hu | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026
As Multimodal Large Language Models (MLLMs) continue to evolve, they show promise for mathematical reasoning because, unlike text-only LLMs, they can handle multimodal questions through cross-modal understanding. Current mathematical benchmarks predominantly evaluate MLLMs’ problem-solving ability, leaving a crucial gap in more complex scenarios such as error detection, which matter for enhancing reasoning capability in complicated settings. To fill this gap, we formulate a new task, multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs’ capabilities in this task. ErrorRadar evaluates two sub-tasks, error step identification and error categorization, providing a framework for evaluating MLLMs’ complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with expert-based annotation and metadata such as problem type and error category. Through extensive experiments, we evaluate representative open-source and closed-source MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that challenges remain: GPT-4o, the best-performing model, is still around 10% behind human evaluation.
2025
Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
Hang Li | Tianlong Xu | Kaiqi Yang | Yucheng Chu | Yanling Chen | Yichi Song | Qingsong Wen | Hui Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.
2024
Are Large Language Models (LLMs) Good Social Predictors?
Kaiqi Yang | Hang Li | Hongzhi Wen | Tai-Quan Peng | Jiliang Tang | Hui Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
With the recent advancement of Large Language Models (LLMs), efforts have been made to leverage LLMs in crucial social science study methods, including predicting human features of social life such as presidential voting. Existing works suggest that LLMs are capable of generating human-like responses. Nevertheless, it is unclear how well LLMs work and where the plausible predictions derive from. This paper critically examines the performance of LLMs as social predictors, pointing out the source of correct predictions and their limitations. Based on the notion of mutability that classifies social features, we design three realistic settings and a novel social prediction task, in which the LLMs make predictions with input features of the same mutability and accessibility as the response feature. We find that the promising performance reported in previous studies stems from input features that act as shortcuts to the response and are hard to capture in reality; performance degrades dramatically to near-random once the shortcuts are removed. Through comprehensive investigations of various LLMs, we reveal that LLMs struggle to work as expected on social prediction when given ordinarily available input features without shortcuts. We further investigate possible reasons for this phenomenon and suggest potential ways to enhance LLMs for social prediction.