Prithvi Balehannina

2025

Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks—an order of magnitude more than prior work—where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.

Co-authors

Davis Brown 1
Hamed Hassani 1
Shreya Havaldar 1
Helen Jin 1
Eric Wong 1

Venues

EMNLP 1
