Aoxiao Zhong
2026
ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code
Jian Xie | Zhendong Chu | Aoxiao Zhong | Kai Zhang | Mingzhe Han | Xing Fan | Jialie Shen | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026
Large Reasoning Models (LRMs) often suffer from the “over-thinking” problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal settings. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Hang Li | Boyan Li | Jiamin Su | Xiong Gao | YiFan Zhang | Tianlong Xu | Zhendong Chu | Aoxiao Zhong | Kun Wang | Hui Xiong | Philip S. Yu | Xuming Hu | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026
As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to handle mathematical reasoning tasks is promising, as they can handle multimodal questions via cross-modal understanding capabilities, unlike text-only LLMs. Current mathematical benchmarks predominantly focus on evaluating MLLMs’ problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, which matter for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate a new task, multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs’ capabilities in such a task. ErrorRadar evaluates two sub-tasks, error step identification and error categorization, providing a framework for evaluating MLLMs’ complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with expert-based annotation and metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that challenges remain: GPT-4o, the best-performing model, is still around 10% behind human evaluation.
2025
LLM Agents for Education: Advances and Applications
Zhendong Chu | Shen Wang | Jian Xie | Tinghui Zhu | Yibo Yan | Jingheng Ye | Aoxiao Zhong | Xuming Hu | Jing Liang | Philip S. Yu | Qingsong Wen
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Model (LLM) agents are transforming education by automating complex pedagogical tasks and enhancing both teaching and learning processes. In this survey, we present a systematic review of recent advances in applying LLM agents to address key challenges in educational settings, such as feedback comment generation and curriculum design. We analyze the technologies enabling these agents, including representative datasets, benchmarks, and algorithmic frameworks. Additionally, we highlight key challenges in deploying LLM agents in educational settings, including ethical issues, hallucination and overreliance, and integration with existing educational ecosystems. Beyond the core technical focus, we include in Appendix A a comprehensive overview of domain-specific educational agents, covering areas such as science learning, language learning, and professional development.
2023
An Empirical Analysis of Leveraging Knowledge for Low-Resource Task-Oriented Semantic Parsing
Mayank Kulkarni | Aoxiao Zhong | Nicolas Guenon des Mesnards | Sahar Movaghati | Mukund Sridhar | He Xie | Jianhua Lu
Findings of the Association for Computational Linguistics: ACL 2023
Task-oriented semantic parsing has drawn a lot of interest from the NLP community, and especially the voice assistant industry, as it enables representing the meaning of user requests with arbitrarily nested semantics, including multiple intents and compound entities. SOTA models are large seq2seq transformers and require hundreds of thousands of annotated examples to be trained. However, annotating such data to bootstrap new domains or languages is expensive and error-prone, especially for requests made of nested semantics. In addition, large models easily break the tight latency constraints imposed in a user-facing production environment. As part of this work, we explore leveraging external knowledge to improve model accuracy in low-resource and low-compute settings. We demonstrate that using knowledge-enhanced encoders inside seq2seq models does not result in performance gains by itself, but jointly learning to uncover entities in addition to the parse generation is a simple yet effective way of improving performance across the board. We show this is especially true in the low-compute scarce-data setting and for entity-rich domains, with relative gains up to 74.48% on the TOPv2 dataset.
Co-authors
- Zhendong Chu 3
- Qingsong Wen 3
- Xuming Hu 2
- Shen Wang 2
- Jian Xie 2
- Yibo Yan 2
- Philip S. Yu 2
- Xing Fan 1
- Xiong Gao 1
- Nicolas Guenon des Mesnards 1
- Mingzhe Han 1
- Jiahao Huo 1
- Mayank Kulkarni 1
- Hang Li 1
- Boyan Li 1
- Jing Liang 1
- Jianhua Lu 1
- Sahar Movaghati 1
- Jialie Shen 1
- Mukund Sridhar 1
- Jiamin Su 1
- Kun Wang 1
- He Xie 1
- Hui Xiong 1
- Tianlong Xu 1
- Jingheng Ye 1
- Kai Zhang 1
- Yifan Zhang 1
- Tinghui Zhu 1