Anisha Gunjal
2026
Agentic Rubrics as Contextual Verifiers for SWE Agents
Mohit Raghavendra | Anisha Gunjal | Bing Liu | Yunzhong He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
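The rubric-based selection described in this abstract can be illustrated with a minimal sketch. It assumes each rubric item has already been judged pass/fail for each candidate patch (for example by an LLM grader, which is not modeled here), and that candidates are ranked by a weighted pass fraction. The names `RubricItem`, `rubric_score`, and `select_best`, and the use of per-item weights, are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One checklist criterion (hypothetical structure, not the paper's schema)."""
    description: str
    weight: float = 1.0


def rubric_score(items, verdicts):
    """Weighted fraction of rubric items that a patch satisfies.

    `verdicts` is a list of booleans, one per item, assumed to come from
    an external grader.
    """
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    passed = sum(item.weight for item, ok in zip(items, verdicts) if ok)
    return passed / total


def select_best(items, candidate_verdicts):
    """Score every candidate patch and return (best_index, scores).

    Ties are broken in favor of the earliest candidate.
    """
    scores = [rubric_score(items, v) for v in candidate_verdicts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    rubric = [
        RubricItem("Fix is applied in the function named in the issue", 2.0),
        RubricItem("Existing public signatures are unchanged", 1.0),
        RubricItem("Edge case from the issue report is handled", 1.0),
    ]
    verdicts = [
        [True, False, False],
        [False, True, True],
        [True, True, False],
    ]
    print(select_best(rubric, verdicts))
```

No test execution is involved in this sketch: selection depends only on the per-item verdicts, which mirrors the execution-free scoring the abstract describes.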
PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
Afra Feyza Akyürek | Advait Gosai | Chen Bo Calvin Zhang | Vipul Gupta | Jaehwan Jeong | Anisha Gunjal | Tahseen Rabbani | Maria Mazzone | David Randolph IV | Mohammad Mahmoudi Meymand | Gurshaan Chattha | Paula Rodriguez | Diego A. Mares Buendia | Pavit Singh | Michael Liu | Subodh Chawla | Peter Cline | Lucy Ogaz | Ernesto Gabriel Hernández Montoya | Zihao Wang | Pavi Bhatter | Marcos Ayestaran | Bing Liu | Yunzhong He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Frontier model progress is often measured using academic benchmarks that provide a limited view of performance on open-ended, economically consequential tasks in high-stakes professional domains where practical returns matter most. We introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed questions inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in the models' reliability for professional adoption.
2024
Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
Anisha Gunjal | Greg Durrett
Findings of the Association for Computational Linguistics: EMNLP 2024
Automatic factuality verification of large language model (LLM) generations is increasingly used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack the context needed to interpret them correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontextuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology for generating molecular facts automatically, aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.
Co-authors
- Yunzhong He 2
- Bing Liu 2
- Afra Feyza Akyürek 1
- Marcos Ayestaran 1
- Pavi Bhatter 1
- Diego A. Mares Buendia 1
- Gurshaan Chattha 1
- Subodh Chawla 1
- Peter Cline 1
- Greg Durrett 1
- Advait Gosai 1
- Vipul Gupta 1
- David Randolph IV 1
- Jaehwan Jeong 1
- Michael Liu 1
- Maria Mazzone 1
- Mohammad Mahmoudi Meymand 1
- Ernesto Gabriel Hernández Montoya 1
- Lucy Ogaz 1
- Tahseen Rabbani 1
- Mohit Raghavendra 1
- Paula Rodriguez 1
- Pavit Singh 1
- Zihao Wang 1
- Chen Bo Calvin Zhang 1