Jungang Li
2026
Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs
Yubo Gao | Haotian Wu | Hong Chen | Junquan Huang | Yibo Yan | Jungang Li | Zihao Dongfang | Sicheng Tao | PS Tan | Jie Zhang | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2026
Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to “overthinking”: generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularities: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.
2025
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang | Mengxi Gao | Yibo Yan | Xin Zou | Yanggan Gu | Jungang Li | Jingyu Wang | Peijie Jiang | Aiwei Liu | Jia Liu | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. However, existing studies have concentrated mainly on visual–textual misalignment, leaving largely unexplored the ability of MLLMs to preserve an originally correct answer when confronted with misleading information. We reveal a response uncertainty phenomenon: across nine standard datasets, twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue. To systematically quantify this vulnerability, we propose a two-stage evaluation pipeline: (1) elicit each model’s original response on unperturbed inputs; (2) inject explicit (false-answer hints) and implicit (contextual contradictions) misleading instructions, and compute the misleading rate, defined as the fraction of correct-to-incorrect flips. Leveraging the most susceptible examples, we curate the Multimodal Uncertainty Benchmark (MUB), a collection of image–question pairs stratified into low, medium, and high difficulty based on how many of twelve state-of-the-art MLLMs they mislead. Extensive evaluation on twelve open-source and five closed-source models reveals high uncertainty: average misleading rates exceed 86%, with explicit cues over 67.19% and implicit cues over 80.67%. To mitigate this, we then fine-tune all open-source MLLMs on a compact 2,000-sample mixed-instruction dataset, reducing misleading rates to 6.97% (explicit) and 32.77% (implicit), boosting consistency by nearly 29.37% on highly deceptive inputs, and slightly improving accuracy on standard benchmarks.
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Song Dai | Yibo Yan | Jiamin Su | Zihao Dongfang | Yubo Gao | Yonghua Hei | Jungang Li | Junyan Zhang | Sicheng Tao | Zhuoran Gao | Xuming Hu
Findings of the Association for Computational Linguistics: EMNLP 2025
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.