PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Atharva Naik, Prakam, Yash Mathur, Darsh Agrawal, Manav Nitin Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, David R. Mortensen
Abstract
While many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs), few isolate reasoning as a capability independent of domain knowledge. We introduce a new benchmark for inductive reasoning inspired by Sound Law Induction (SLI) in historical linguistics and formulated in a simple multi-step Programming by Example (PBE) framework. The task requires inducing a cascade of string rewrite programs that transform inputs into target outputs. We present PBEBench, a fully automated evaluation approach that generates such problems with controllable difficulty and ordering constraints, enabling scalable and contamination-resistant evaluation of sequential inductive reasoning. Using this approach, we construct three datasets that show a large gap between models that leverage test-time compute or long chain-of-thought reasoning and those that do not. Although recent models such as GPT-5 and gpt-oss-120b show promise, solve rates remain below 5% on hard PBEBench instances with long program cascades, even under computationally expensive scaling strategies. Finally, we show that PBEBench scores are more predictive of performance on real SLI than are other inductive reasoning benchmarks. We will release code and data to support further research.
- Anthology ID:
- 2026.findings-acl.432
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8877–8918
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.432/
- Cite (ACL):
- Atharva Naik, Prakam, Yash Mathur, Darsh Agrawal, Manav Nitin Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, and David R. Mortensen. 2026. PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics. In Findings of the Association for Computational Linguistics: ACL 2026, pages 8877–8918, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics (Naik et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.432.pdf
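The task named in the abstract, inducing a cascade of string rewrite programs that maps inputs to target outputs, can be illustrated with a minimal sketch. The abstract does not specify the program format, so everything here is an assumption made for illustration: each program is modeled as a plain `(old, new)` pair applied with Python's `str.replace`, and the names `apply_cascade` and `solves` are hypothetical, not part of PBEBench.

```python
from typing import List, Tuple

# Assumption: a "program" is modeled as one (old, new) substring rewrite.
# The benchmark's real program format may differ.
Program = Tuple[str, str]


def apply_cascade(cascade: List[Program], text: str) -> str:
    """Apply each rewrite in order; later programs see earlier programs' output."""
    for old, new in cascade:
        text = text.replace(old, new)
    return text


def solves(cascade: List[Program], examples: List[Tuple[str, str]]) -> bool:
    """A candidate cascade is correct if it maps every input to its target."""
    return all(apply_cascade(cascade, src) == tgt for src, tgt in examples)


# Toy input/output pairs (invented for illustration, not benchmark data).
examples = [("pater", "fader"), ("tres", "dres")]
candidate = [("p", "f"), ("t", "d")]
print(solves(candidate, examples))

# Order matters: the same two programs give different results when swapped.
print(apply_cascade([("a", "b"), ("b", "c")], "ab"))
print(apply_cascade([("b", "c"), ("a", "b")], "ab"))
```

In this toy setting the second and third calls differ because the first rewrite can create new material for the second one to match, which is one way ordering constraints between programs can arise.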