Darsh Agrawal
2026
PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Atharva Naik | Prakam | Yash Mathur | Darsh Agrawal | Manav Nitin Kapadnis | Yuwei An | Clayton Marr | Carolyn Rose | David R. Mortensen
Findings of the Association for Computational Linguistics: ACL 2026
While many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs), few isolate reasoning as a capability independent of domain knowledge. We introduce a new benchmark for inductive reasoning inspired by Sound Law Induction (SLI) in historical linguistics and formulated in a simple multi-step Programming by Example (PBE) framework. The task requires inducing a cascade of string rewrite programs that transform inputs into target outputs. We present PBEBench, a fully automated evaluation approach that generates such problems with controllable difficulty and ordering constraints, enabling scalable and contamination-resistant evaluation of sequential inductive reasoning. Using this approach, we construct three datasets that show a large gap between models that leverage test-time compute or long chain-of-thought reasoning and those that do not. Although recent models such as GPT-5 and gpt-oss-120b show promise, solve rates remain below 5% on hard PBEBench instances with long program cascades, even under computationally expensive scaling strategies. Finally, we show that PBEBench scores are more predictive of performance on real SLI than are other inductive reasoning benchmarks. We will release code and data to support further research.
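To make the task format concrete, here is a minimal sketch of a cascade of string rewrite programs applied in order. It assumes each program is a plain substring replacement, which is a simplification for illustration; the names `apply_cascade` and `solves` and the toy examples are hypothetical and not taken from the PBEBench release.

```python
from typing import List, Tuple

# Assumption: one rewrite program is modeled as a (source, target) substring pair.
Rule = Tuple[str, str]


def apply_cascade(word: str, cascade: List[Rule]) -> str:
    """Apply each rewrite in order; later rules see the output of earlier ones."""
    for source, target in cascade:
        word = word.replace(source, target)
    return word


def solves(cascade: List[Rule], examples: List[Tuple[str, str]]) -> bool:
    """A candidate cascade solves a problem if it maps every input to its output."""
    return all(apply_cascade(inp, cascade) == out for inp, out in examples)


# Toy input/output pairs (hypothetical data).
examples = [("ab", "x"), ("bab", "xb")]

# The first rule creates the "bb" sequence that the second rule consumes.
cascade = [("a", "b"), ("bb", "x")]

print(solves(cascade, examples))        # correct order
print(solves(cascade[::-1], examples))  # same rules, reversed order
```

In this toy case the same two rules succeed in one order and fail in the other, which is the kind of ordering constraint the abstract refers to.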
PRiSM: Benchmarking Phone Realization in Speech Models
Shikhar Bharadwaj | Chin-Jou Li | Yoonjae Kim | Kwanghee Choi | Eunjung Yeo | Ryan Soh-Eun Shim | Hanyu Zhou | Brendon Boldt | Karen Rosero | Kalvin Chang | Darsh Agrawal | Keer Xu | Chao-Han Huck Yang | Jian Zhu | Shinji Watanabe | David R. Mortensen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR systems still outperform LALMs. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability.
2025
Programming by Example meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction
Atharva Naik | Darsh Agrawal | Hong Sng | Clayton Marr | Kexun Zhang | Nathaniel Romney Robinson | Kalvin Chang | Rebecca Byrnes | Aravind Mysore | Carolyn Rose | David R. Mortensen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Historical linguists have long written “programs” that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate in this paper as Programming by Examples (PBE) with Large Language Models (LLMs). While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a “similar distribution” for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results, we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.
Co-authors
- David R. Mortensen 3
- Kalvin Chang 2
- Clayton Marr 2
- Atharva Naik 2
- Carolyn Rose 2
- Yuwei An 1
- Shikhar Bharadwaj 1
- Brendon Boldt 1
- Rebecca Byrnes 1
- Kwanghee Choi 1
- Manav Nitin Kapadnis 1
- Yoonjae Kim 1
- Chin-Jou Li 1
- Yash Mathur 1
- Aravind Mysore 1
- Prakam 1
- Nathaniel Romney Robinson 1
- Karen Rosero 1
- Ryan Soh-Eun Shim 1
- Hong Sng 1
- Shinji Watanabe 1
- Keer Xu 1
- Chao-Han Huck Yang 1
- Eunjung Yeo 1
- Kexun Zhang 1
- Hanyu Zhou 1
- Jian Zhu 1