Zhengfeng Lai
2026
Putting Captions to the Test: Evaluating Video Caption Quality through Multiple-Choice Question Answering
Zizhen Wang | Bo Feng | Zhengfeng Lai | Shiyu Li | Yang Lu | Meng Cao | Ping Huang | Xiaoming Simon Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating video captioning remains a critical challenge for Visual Large Language Models (VLLMs). Existing metrics primarily rely on matching generated text against ground-truth references. This paradigm suffers from the “one-to-many” nature of video description, where high-quality captions are often penalized for lexical mismatches or valid shifts in visual focus. Furthermore, such assessments are typically one-dimensional, failing to provide a fine-grained analysis of caption quality. To address this, we redefine caption quality through the lens of information fidelity: A caption must maximize the coverage of salient visual information while ensuring strict factuality. We introduce CapQuiz, a novel reference-free benchmark that assesses captions based on their utility in answering human-verified, fine-grained, multiple-choice questions derived from the video. CapQuiz features a hierarchical taxonomy of 10 question types (spanning Descriptive and Inferential categories) across 24 diverse video domains. Extensive experiments demonstrate that CapQuiz correlates significantly better with human judgments than existing metrics and offers interpretable insights into model performance. We will release the benchmark to facilitate reproducible research.
2025
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Guoli Yin | Haoping Bai | Shuang Ma | Feng Nan | Yanchao Sun | Zhaoyang Xu | Shen Ma | Jiarui Lu | Xiang Kong | Aonan Zhang | Dian Ang Yap | Yizhe Zhang | Karsten Ahnert | Vik Kamath | Mathias Berglund | Dominic Walsh | Tobias Gindele | Juergen Wiest | Zhengfeng Lai | Xiaoming Simon Wang | Jiulong Shan | Meng Cao | Ruoming Pang | Zirui Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, namely Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 20 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance.