Language Models can Categorize System Inputs for Performance Analysis
Dominic Sobhani, Ruiqi Zhong, Edison Marrese-Taylor, Keisuke Sakaguchi, Yutaka Matsuo
Abstract
Language model systems are used to process diverse categories of input requests, ranging from improving creative writing to solving programming challenges. It would be useful to know which categories they are good at. However, existing evaluations compare model performance on pre-defined categories, failing to reflect a system’s performance on finer-grained or novel ones. We propose to automatically search for finer-grained categories based on inputs where a system performs well or poorly, and describe them in natural language. To search for these categories, we propose a large number of candidate category descriptions, e.g. “Communication Improvement”, find the subset of inputs that match the category descriptions, and calculate the performance on these categories; then we sort these categories based on their performance, thereby highlighting those that score high or low. As one application, we apply our method to compare LLaMA 3-70B and Claude 3 Opus, which have similar Elo-ratings on Chatbot Arena; our method finds the former is weaker at making text more professional and humorous while better at providing psychological insights, depicting a more nuanced picture of model performance.
- Anthology ID:
- 2025.naacl-long.317
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, New Mexico
- Editors:
- Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6241–6257
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.317/
- Cite (ACL):
- Dominic Sobhani, Ruiqi Zhong, Edison Marrese-Taylor, Keisuke Sakaguchi, and Yutaka Matsuo. 2025. Language Models can Categorize System Inputs for Performance Analysis. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6241–6257, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Language Models can Categorize System Inputs for Performance Analysis (Sobhani et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.317.pdf
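The search procedure described in the abstract (propose candidate category descriptions, find the inputs that match each one, compute performance per category, then sort) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `rank_categories` and `keyword_matches`, the `min_support` parameter, and the toy inputs and scores are all hypothetical. In particular, the paper uses a language model to decide whether an input matches a category description; the keyword lookup here is only a stand-in so the example runs without any model.

```python
from statistics import mean

def rank_categories(inputs, scores, categories, matches, min_support=1):
    """Score each candidate category by mean performance on its matching inputs.

    `matches(category, text)` is a caller-supplied predicate (hypothetical);
    in the paper this decision is made by a language model.
    """
    ranked = []
    for category in categories:
        # Collect the scores of the inputs that belong to this category.
        member_scores = [s for x, s in zip(inputs, scores) if matches(category, x)]
        # Skip categories with too few matching inputs to be informative.
        if len(member_scores) >= min_support:
            ranked.append((category, mean(member_scores), len(member_scores)))
    # Sort from best to worst so both ends of the list stand out.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked

# Toy keyword table standing in for a language-model matcher (assumption).
KEYWORDS = {
    "Communication Improvement": ["professional"],
    "Programming Help": ["python"],
    "Cooking": ["recipe"],
}

def keyword_matches(category, text):
    return any(keyword in text.lower() for keyword in KEYWORDS[category])

# Made-up inputs and per-input scores (1.0 = good response, 0.0 = poor).
inputs = [
    "Make this email sound more professional",
    "Rewrite my cover letter to be more professional",
    "Fix the bug in this Python function",
    "Why does my Python loop never end?",
]
scores = [0.0, 1.0, 1.0, 1.0]

ranked = rank_categories(inputs, scores, list(KEYWORDS), keyword_matches)
for category, avg, n in ranked:
    print(f"{category}: mean score {avg:.2f} over {n} inputs")
```

Categories at the top and bottom of `ranked` are the ones such an analysis would surface as strengths and weaknesses; a category with no matching inputs, such as "Cooking" here, is dropped rather than scored.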