Aonan Zhang
2025
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Guoli Yin | Haoping Bai | Shuang Ma | Feng Nan | Yanchao Sun | Zhaoyang Xu | Shen Ma | Jiarui Lu | Xiang Kong | Aonan Zhang | Dian Ang Yap | Yizhe Zhang | Karsten Ahnert | Vik Kamath | Mathias Berglund | Dominic Walsh | Tobias Gindele | Juergen Wiest | Zhengfeng Lai | Xiaoming Simon Wang | Jiulong Shan | Meng Cao | Ruoming Pang | Zirui Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and irreproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, which features comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains (Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics) and five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 20 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance.
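As a rough illustration of the capability-level breakdown described above, the following minimal Python sketch aggregates offline task results by domain and capability. This is not the MMAU evaluation harness; the record format, task names, and data are hypothetical stand-ins for the paper's categories.

```python
# Illustrative sketch only: the domain/capability labels and records below are
# hypothetical examples, not taken from the MMAU release.
from collections import defaultdict

# Each record: (domain, capability, prompt_id, correct?)
results = [
    ("tool_use", "planning", "p-001", True),
    ("tool_use", "self_correction", "p-002", False),
    ("dag_qa", "reasoning", "p-003", True),
    ("math", "problem_solving", "p-004", True),
]

def aggregate(records):
    """Compute mean accuracy per capability and per domain from offline task results."""
    by_cap, by_dom = defaultdict(list), defaultdict(list)
    for domain, capability, _prompt_id, correct in records:
        by_cap[capability].append(correct)
        by_dom[domain].append(correct)
    mean = lambda xs: sum(xs) / len(xs)
    return ({c: mean(v) for c, v in by_cap.items()},
            {d: mean(v) for d, v in by_dom.items()})

cap_scores, dom_scores = aggregate(results)
print("per-capability accuracy:", cap_scores)
print("per-domain accuracy:", dom_scores)
```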
2024
Divide-or-Conquer? Which Part Should You Distill Your LLM?
Zhuofeng Wu | Richard He Bai | Aonan Zhang | Jiatao Gu | V.G.Vinod Vydiswaran | Navdeep Jaitly | Yizhe Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper, we devise a similar strategy that breaks reasoning tasks down into a problem-decomposition phase and a problem-solving phase, and we show that this strategy outperforms a single-stage solution. Further, we hypothesize that decomposition should be easier to distill into a smaller model than problem solving, because the latter requires large amounts of domain knowledge while the former only requires learning general problem-solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost. We find that we can distill the problem-decomposition phase and at the same time achieve good generalization across tasks, datasets, and models. However, it is harder to distill the problem-solving capability without losing performance, and the resulting distilled model struggles with generalization. These results indicate that by combining smaller, distilled problem-decomposition models with problem-solving LLMs, we can achieve reasoning with cost-efficient inference and local adaptation.
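To make the decompose-then-solve idea concrete, here is a minimal Python sketch of a two-stage pipeline, assuming a generic text-completion callable. The `call_model` stub, the prompts, and the model names are hypothetical and do not reproduce the paper's implementation.

```python
# Minimal sketch of a two-stage decompose-then-solve pipeline.
# `call_model`, the prompts, and the model names below are placeholders, not the
# paper's actual setup.
def call_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM call (e.g., a small distilled model vs. a large solver)."""
    raise NotImplementedError

def decompose(question: str, decomposer: str = "small-distilled-decomposer") -> list[str]:
    # Stage 1: a small, distilled model breaks the question into numbered subquestions.
    plan = call_model(decomposer, f"Break this problem into numbered subquestions:\n{question}")
    return [line.split(".", 1)[1].strip() for line in plan.splitlines() if "." in line]

def solve(question: str, solver: str = "large-solver-llm") -> str:
    # Stage 2: a larger model answers each subquestion, then composes a final answer.
    subquestions = decompose(question)
    answers = [call_model(solver, f"Answer concisely: {sq}") for sq in subquestions]
    context = "\n".join(f"{sq}\n-> {a}" for sq, a in zip(subquestions, answers))
    return call_model(solver, f"Using these intermediate results:\n{context}\nAnswer: {question}")
```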
Co-authors
- Yizhe Zhang 2
- Karsten Ahnert 1
- Richard He Bai 1
- Haoping Bai 1
- Mathias Berglund 1
- Meng Cao 1
- Tobias Gindele 1
- Jiatao Gu 1
- Navdeep Jaitly 1
- Vik Kamath 1
- Xiang Kong 1
- Zhengfeng Lai 1
- Jiarui Lu 1
- Shuang Ma 1
- Shen Ma 1
- Feng Nan 1
- Ruoming Pang 1
- Jiulong Shan 1
- Yanchao Sun 1
- V.G.Vinod Vydiswaran 1
- Dominic Walsh 1
- Xiaoming Simon Wang 1
- Zirui Wang 1
- Juergen Wiest 1
- Zhuofeng Wu 1
- Zhaoyang Xu 1
- Dian Ang Yap 1
- Guoli Yin 1