Yuchi Ma


2026

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness relies heavily on supervised training with extensive labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. IPC comprises four stages (problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement) that probe the internal knowledge and confidence patterns of LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation, and uses them to train UCoder (a coder trained with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve performance competitive with supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
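The abstract above does not spell out how self-consistency is used to pick reliable code candidates. As a rough illustration only, the sketch below shows one common form of the idea: run every sampled candidate on the same inputs, group candidates by their observed outputs, and keep the largest agreeing group. The function names, the `add` task, and the agreement score are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def run_candidate(source, func_name, inputs):
    """Execute one candidate and return a hashable signature of its outputs.

    Hypothetical helper: real pipelines should sandbox untrusted code.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return None  # candidate does not even define the function
    outputs = []
    for args in inputs:
        try:
            outputs.append(repr(func(*args)))
        except Exception as exc:
            outputs.append("error:" + type(exc).__name__)
    return tuple(outputs)


def select_by_consistency(candidates, func_name, inputs):
    """Return the largest group of behaviourally identical candidates
    and the fraction of all candidates that belong to it."""
    clusters = defaultdict(list)
    for source in candidates:
        signature = run_candidate(source, func_name, inputs)
        if signature is not None:
            clusters[signature].append(source)
    if not clusters:
        return [], 0.0
    best = max(clusters.values(), key=len)
    return best, len(best) / len(candidates)


# Toy data: three sampled solutions for a hypothetical "add" task.
CANDIDATES = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return b + a\n",
    "def add(a, b):\n    return a - b\n",  # disagrees with the majority
]
INPUTS = [(1, 2), (5, 3), (0, 0)]

if __name__ == "__main__":
    selected, agreement = select_by_consistency(CANDIDATES, "add", INPUTS)
    print(len(selected), round(agreement, 2))
```

Agreement among samples is only a proxy for correctness: a majority of candidates can share the same bug, which is presumably why the abstract pairs self-consistency with a second, representation-based quality signal.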
The Development Knowledge Question Answering (Dev Knowledge QA) task aims to provide accurate natural language answers to knowledge-seeking questions during software development. To investigate the importance of Dev Knowledge QA in AI-assisted software development and the extent to which it has been explored, we conduct a preliminary analysis of real user–LLM dialogues from WildChat. Our findings indicate that Dev Knowledge QA plays a significant role in real-world software development scenarios, but that these raw dialogues cannot be directly used to construct a Dev Knowledge QA benchmark. Existing Dev Knowledge QA benchmarks are limited in the scope of development knowledge they cover and are often not built from real user queries. To bridge this gap, we design a three-phase pipeline that transforms real-world dialogues into simple development knowledge-seeking QA pairs. Through this pipeline, we introduce SimpleDevQA, a multilingual Dev Knowledge QA benchmark derived from real user dialogues. The dataset covers three languages (English, Chinese, and Russian) and focuses on questions with unique, short, and verifiable answers, making evaluation simpler and more accurate. Extensive experiments with 18 mainstream LLMs show that closed-source models generally perform best on SimpleDevQA. We also find that RAG-based knowledge injection improves accuracy, and that Dev Knowledge QA performance correlates with both model confidence and code-generation capability. To facilitate replication, we have released our data and code at: https://github.com/DeepSoftwareAnalytics/SimpleDevQA.
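To see why unique, short answers simplify evaluation, consider the minimal sketch below, which scores predictions by normalized exact match. This is an assumption made for illustration: the abstract does not state how SimpleDevQA grades answers, and the benchmark's released code may use a different procedure. The helper names and sample answers are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Casefold, unify Unicode forms, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware for str patterns
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical model outputs and gold answers.
    preds = ["HTTP 404", "O(n log n)", "npm"]
    golds = ["http 404", "O(n)", "NPM"]
    print(exact_match_accuracy(preds, golds))
```

Because `\w` matches letters in any script, the same normalization applies to English, Chinese, and Russian answers, although exact match will still miss correct paraphrases.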

2025

Large language models (LLMs) have made significant strides in code acceleration (CA) tasks. Current works typically fine-tune LLMs using slow-fast code pairs mined from online programming platforms. Although these methods are widely recognized for their effectiveness, the training data often lack clear code acceleration patterns and offer only limited speed improvements. Moreover, existing training methods, such as direct instruction fine-tuning (IFT), tend to overlook the hierarchical relationships among acceleration patterns. In this work, we introduce BITE, a novel training paradigm designed to improve LLMs’ CA capabilities through two key innovations: (1) Bidirectional tree editing, which generates high-quality training data by incrementally transforming given code into both its most efficient and least efficient variants, and (2) Progressive code acceleration learning, which enables LLMs to internalize multi-level CA strategies by learning increasingly sophisticated acceleration patterns. Additionally, we introduce a new CA evaluation benchmark and metric for comprehensive assessment of model performance on CA tasks. Extensive experiments on both our benchmark and existing benchmarks demonstrate the effectiveness of our approach. Notably, BITE enables Qwen-1.5B to outperform prompt-enhanced GPT-4 and current training-based methods on average across five programming languages.