PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics

Atharva Naik, Prakam, Yash Mathur, Darsh Agrawal, Manav Nitin Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, David R. Mortensen


Abstract
While many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs), few isolate reasoning as a capability independent of domain knowledge. We introduce a new benchmark for inductive reasoning inspired by Sound Law Induction (SLI) in historical linguistics and formulated in a simple multi-step Programming by Example (PBE) framework. The task requires inducing a cascade of string rewrite programs that transform inputs into target outputs. We present PBEBench, a fully automated evaluation approach that generates such problems with controllable difficulty and ordering constraints, enabling scalable and contamination-resistant evaluation of sequential inductive reasoning. Using this approach, we construct three datasets that show a large gap between models that leverage test-time compute or long chain-of-thought reasoning and those that do not. Although recent models such as GPT-5 and gpt-oss-120b show promise, solve rates remain below 5% on hard PBEBench instances with long program cascades, even under computationally expensive scaling strategies. Finally, we show that PBEBench scores are more predictive of performance on real SLI than are other inductive reasoning benchmarks. We will release code and data to support further research.
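The task described in the abstract, inducing an ordered cascade of string rewrite programs that maps each input to its target, can be illustrated with a minimal sketch. It assumes each program is a plain substring replacement; the abstract does not specify the benchmark's actual program format, and the names `Rule`, `apply_cascade`, and `solves`, as well as the toy examples, are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: a rule is (substring to match, replacement).
Rule = Tuple[str, str]


def apply_cascade(word: str, cascade: List[Rule]) -> str:
    """Apply each rewrite rule in order; later rules see earlier rules' output."""
    for old, new in cascade:
        word = word.replace(old, new)
    return word


def solves(cascade: List[Rule], examples: List[Tuple[str, str]]) -> bool:
    """A cascade is a solution if it maps every input to its target output."""
    return all(apply_cascade(src, cascade) == tgt for src, tgt in examples)


# Toy input/output pairs (invented for illustration, not taken from the paper).
examples = [("abba", "cca"), ("cab", "cc")]

# Candidate cascade: first delete "a" before "b", then turn every "b" into "c".
cascade = [("ab", "b"), ("b", "c")]

print(solves(cascade, examples))                  # correct order
print(solves(list(reversed(cascade)), examples))  # same rules, wrong order
```

In this toy case the same two rules fail when reversed, because rewriting "b" first removes the "ab" context the other rule depends on. This is one way an ordering constraint between programs can arise.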
Anthology ID:
2026.findings-acl.432
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8877–8918
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.432/
Cite (ACL):
Atharva Naik, Prakam, Yash Mathur, Darsh Agrawal, Manav Nitin Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, and David R. Mortensen. 2026. PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics. In Findings of the Association for Computational Linguistics: ACL 2026, pages 8877–8918, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics (Naik et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.432.pdf
Checklist:
 2026.findings-acl.432.checklist.pdf