AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
Abstract
Evaluating multimodal large language models (MLLMs) is becoming increasingly expensive as benchmarks grow in scale and cross-modality complexity. Inspired by structuralism in cognitive psychology, we tackle this difficulty with an adaptive evaluation framework for efficient benchmarking, namely **AutoJudger**. Instead of passively scoring on a fixed test set, AutoJudger treats evaluation as an interview-like process: it maintains a hypothesized ability structure of the evaluated model and actively selects informative questions to refine the boundaries of those abilities. Specifically, AutoJudger has three core components: **ability decomposition** to organize evaluation along meaningful capability dimensions, **ability estimation** to maintain an up-to-date quantitative profile of the model's competence, and **adaptive question selection** to choose the most informative questions. To operationalize this paradigm, we introduce **A2-Judger**, a novel MLLM-based **A**gentic instantiation of **A**uto**Judger** equipped with semantic-aware retrieval and dynamic memory. Experiments on four representative multimodal benchmarks show that A2-Judger significantly improves sample efficiency while maintaining reliable evaluation results.
- Anthology ID: 2026.acl-long.685
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 15009–15034
- URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.685/
- Cite (ACL): Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, and Zhongyu Wei. 2026. AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15009–15034, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs (Ding et al., ACL 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.685.pdf
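
The interview-like loop described in the abstract (decompose abilities, estimate them, pick the next informative question) can be illustrated with a small sketch. This is not the authors' method: the abstract does not specify an estimator or a selection rule, so the sketch assumes a Rasch (1PL) response model, pre-calibrated question difficulties, and a Fisher-information selection rule, and it omits the MLLM agent, semantic-aware retrieval, and dynamic memory of A2-Judger entirely. The names `Question`, `AbilityProfile`, `select_question`, and `run_interview` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Question:
    qid: int
    dimension: str     # capability dimension (hypothetical decomposition)
    difficulty: float  # assumed to be calibrated in advance


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) probability of a correct answer; an assumed response model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


@dataclass
class AbilityProfile:
    abilities: dict                              # dimension -> current estimate
    history: dict = field(default_factory=dict)  # dimension -> [(difficulty, correct)]

    def update(self, q: Question, correct: bool, steps: int = 25, lr: float = 0.5) -> None:
        """Re-estimate one dimension by gradient ascent on the log-posterior."""
        records = self.history.setdefault(q.dimension, [])
        records.append((q.difficulty, correct))
        theta = self.abilities[q.dimension]
        for _ in range(steps):
            # Likelihood gradient plus a weak zero-mean Gaussian prior, which
            # keeps the estimate finite when every answer so far agrees.
            grad = sum(int(c) - p_correct(theta, d) for d, c in records) - 0.25 * theta
            theta += lr * grad / len(records)
        self.abilities[q.dimension] = theta


def select_question(pool, profile, asked):
    """From the least-tested dimension, pick the question with maximal Fisher information."""
    def score(q):
        p = p_correct(profile.abilities[q.dimension], q.difficulty)
        return (-len(profile.history.get(q.dimension, [])), p * (1.0 - p))
    return max((q for q in pool if q.qid not in asked), key=score)


def run_interview(pool, answer_fn, budget):
    """Ask `budget` adaptively chosen questions and return the profile and asked ids."""
    profile = AbilityProfile({q.dimension: 0.0 for q in pool})
    asked = []
    for _ in range(budget):
        q = select_question(pool, profile, asked)
        profile.update(q, answer_fn(q))
        asked.append(q.qid)
    return profile, asked


# Toy demo: a simulated model that answers correctly whenever its (made-up)
# true ability exceeds the question difficulty.
dims = ("perception", "reasoning")
pool = [Question(i * 13 + j, dim, -3.0 + 0.5 * j)
        for i, dim in enumerate(dims) for j in range(13)]
true_ability = {"perception": 1.5, "reasoning": -1.0}
profile, asked = run_interview(
    pool, lambda q: true_ability[q.dimension] > q.difficulty, budget=12
)
print(profile.abilities)
```

In this toy setup only 12 of the 26 questions are asked, and the questions chosen for each dimension cluster near the current ability estimate, which is the sample-efficiency idea the abstract points to. The dimensions, difficulties, and simulated abilities are invented for illustration and do not come from the paper.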