From Conversation to Evaluation: Benchmarking LLMs on Development Knowledge via SimpleDevQA

Jing Zhang; Lianghong Guo; Yanlin Wang; Terry Yue Zhuo; Yong Wang; Mingwei Liu; Jiachi Chen; Ensheng Shi; Yuchi Ma; Hongyu Zhang; Zibin Zheng

Abstract

The Development Knowledge Question Answering (Dev Knowledge QA) task aims to provide accurate natural language answers to knowledge-seeking questions that arise during software development. To investigate the importance of Dev Knowledge QA in AI-assisted software development and the extent to which it has been explored, we conduct a preliminary analysis of real user–LLM dialogues from WildChat. Our findings indicate that Dev Knowledge QA plays a significant role in real-world software development scenarios, but that these raw dialogues cannot be used directly to construct a Dev Knowledge QA benchmark. Existing Dev Knowledge QA benchmarks cover a limited scope of development knowledge and are often not built from real user queries. To bridge this gap, we design a three-phase pipeline that transforms real-world dialogues into simple development knowledge-seeking QA pairs. Through this pipeline, we introduce SimpleDevQA, a multilingual Dev Knowledge QA benchmark inspired by real user dialogues. The dataset covers three languages (English, Chinese, and Russian) and focuses on questions with unique, short, and verifiable answers, which makes evaluation simpler and more accurate. Extensive experiments with 18 mainstream LLMs show that closed-source models generally perform best on SimpleDevQA. We also find that RAG-based knowledge injection improves accuracy, and that Dev Knowledge QA performance correlates with both model confidence and code-generation capability. To facilitate replication, we have released our data and code at: https://github.com/DeepSoftwareAnalytics/SimpleDevQA.

Anthology ID: 2026.findings-acl.1877
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 37649–37663
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1877/
Cite (ACL): Jing Zhang, Lianghong Guo, Yanlin Wang, Terry Yue Zhuo, Yong Wang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, and Zibin Zheng. 2026. From Conversation to Evaluation: Benchmarking LLMs on Development Knowledge via SimpleDevQA. In Findings of the Association for Computational Linguistics: ACL 2026, pages 37649–37663, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): From Conversation to Evaluation: Benchmarking LLMs on Development Knowledge via SimpleDevQA (Zhang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1877.pdf
Checklist: 2026.findings-acl.1877.checklist.pdf