Advait Gosai
2026
PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
Afra Feyza Akyürek | Advait Gosai | Chen Bo Calvin Zhang | Vipul Gupta | Jaehwan Jeong | Anisha Gunjal | Tahseen Rabbani | Maria Mazzone | David Randolph IV | Mohammad Mahmoudi Meymand | Gurshaan Chattha | Paula Rodriguez | Diego A. Mares Buendia | Pavit Singh | Michael Liu | Subodh Chawla | Peter Cline | Lucy Ogaz | Ernesto Gabriel Hernández Montoya | Zihao Wang | Pavi Bhatter | Marcos Ayestaran | Bing Liu | Yunzhong He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Frontier model progress is often measured using academic benchmarks that provide a limited view of performance on open-ended, economically consequential tasks in high-stakes professional domains where practical returns matter most. We introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it the largest public, rubric-based benchmark for both the legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contribute questions inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in model reliability for professional adoption.
Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction
Advait Gosai | Tyler Vuong | Utkarsh Tyagi | Steven Li | Wenjia You | Miheer Bavare | Arda Uçar | Zhongwang Fang | Brian Jang | Bing Liu | Yunzhong He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
End-to-end (E2E) spoken dialogue systems are replacing cascaded pipelines for voice-based human-AI interaction. Existing benchmarks primarily evaluate these systems on synthetic speech and single-turn tasks, leaving multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate these systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis, Voice Editing, that tests robustness to mid-utterance speech repairs and backtracking. We extend each axis to the audio modality, for example by introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid pipeline that exposes model failures at scale while preserving natural disfluencies found in unscripted human speech. Our evaluation reveals that even frontier models struggle on our benchmark, with our highest-performing model achieving a 54.65% pass rate. Error analysis shows that models are not sufficiently robust to human speech when tracking instructions, edits, and audio cues, highlighting the need for improved audio-native multi-turn interaction capabilities.
Co-authors
- Yunzhong He 2
- Bing Liu 2
- Afra Feyza Akyürek 1
- Marcos Ayestaran 1
- Miheer Bavare 1
- Pavi Bhatter 1
- Diego A. Mares Buendia 1
- Gurshaan Chattha 1
- Subodh Chawla 1
- Peter Cline 1
- Zhongwang Fang 1
- Anisha Gunjal 1
- Vipul Gupta 1
- David Randolph IV 1
- Brian Jang 1
- Jaehwan Jeong 1
- Steven Li 1
- Michael Liu 1
- Maria Mazzone 1
- Mohammad Mahmoudi Meymand 1
- Ernesto Gabriel Hernández Montoya 1
- Lucy Ogaz 1
- Tahseen Rabbani 1
- Paula Rodriguez 1
- Pavit Singh 1
- Utkarsh Tyagi 1
- Arda Uçar 1
- Tyler Vuong 1
- Zihao Wang 1
- Wenjia You 1
- Chen Bo Calvin Zhang 1
Venues
- ACL 2