Peter Grabowski

2025

Chatbot Arena Estimate: towards a generalized performance benchmark for LLM capabilities
Lucas Spangher | Tianle Li | William F. Arnold | Nick Masiewicki | Xerxes Dotiwalla | Rama Kumar Pasumarthi | Peter Grabowski | Eugene Ie | Daniel Gruhl
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

In industrial LLM development, evaluating large language models (LLMs) is critical for tasks like benchmarking internal models and detecting regressions during fine-tuning, but existing benchmark aggregation methods, such as Elo-based systems, can be resource-intensive, public-facing, and time-consuming. Here, we describe Chatbot Arena Estimate (CAE), a practical framework for aggregating performance across diverse benchmarks. The framework, developed and widely adopted within our organization, addresses the need for quick, accurate, and cost-efficient evaluations of LLMs. CAE generates two primary metrics: a “Goodness” score (answer accuracy) and a “Fastness” score (cost or queries per second, QPS). These metrics allow for model ranking both overall and within specific subdomains, enabling informed decisions during model iteration and deployment. We demonstrate CAE’s effectiveness by comparing it with existing benchmarks, including the full Chatbot Arena and the MMLU leaderboard. Notably, CAE scores correlate more strongly (Pearson) with Chatbot Arena Elo scores than MMLU scores do, validating its reliability for real-world LLM evaluation.

2024

Multi-agent debate has proven effective in improving the quality of large language models on reasoning and factuality tasks. While various role-playing strategies in multi-agent debates have been explored, existing approaches handle communication among agents by brute force: each agent can communicate with all other agents. In this paper, we systematically investigate the effect of communication connectivity in multi-agent systems. Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging a sparse communication topology can achieve comparable or superior performance while significantly reducing computational costs. Furthermore, we extend the multi-agent debate framework to multi-modal reasoning and alignment labeling tasks, showcasing its broad applicability and effectiveness. Our findings underscore the importance of communication connectivity in enhancing the efficiency and effectiveness of the “society of minds” approach.

Co-authors

Le Hou 1

Rama Kumar Pasumarthi 1

Lucas Spangher 1

Jiageng Zhang 1

Venues

findings 1
naacl 1
