Jia Li
2026
EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning for LLMs
Huanyu Liu | Jia Li | Yihong Dong | Chang Yu | Taozhi Chen | Lecheng Wang | Yongding Tao | Bin Gu | Ge Li
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on teacher models for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens CoT steps to expand the space in a controlled way. The framework enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
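The "gradually shortens CoT steps" idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `shortening_curriculum`, the prompt layout, and the choice to drop steps from the end of the trajectory are all assumptions made for illustration. It only shows how revealing fewer verified steps at each stage would widen the space the model has to explore on its own.

```python
from typing import List


def shortening_curriculum(question: str, cot_steps: List[str]) -> List[str]:
    """Build one prompt per curriculum stage (hypothetical helper).

    Stage 0 reveals every verified CoT step; each later stage reveals
    one step fewer, and the final stage reveals only the question.
    Dropping steps from the end is an assumption, not the paper's rule.
    """
    prompts = []
    for kept in range(len(cot_steps), -1, -1):
        revealed = cot_steps[:kept]
        # Join the question with whatever steps are still revealed.
        prompts.append("\n".join([question] + revealed))
    return prompts


stages = shortening_curriculum("Q: 2 + 3 * 4 = ?", ["3 * 4 = 12", "2 + 12 = 14"])
for stage in stages:
    print(repr(stage))
```

Each successive prompt leaves more of the reasoning to the model, so early stages give dense reward signal and later stages approach the original unassisted problem.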
VulAgent: Hypothesis-Validation Driven Multi-Agent Architecture for Vulnerability Detection
Ziliang Wang | Ge Li | Jia Li | Hao Zhu | Zhi Jin
Findings of the Association for Computational Linguistics: ACL 2026
Vulnerability detection with language models is challenging: it requires (i) precisely localizing security-sensitive code and (ii) reasoning about potential vulnerability conditions under complex, partially observed program context. We present VulAgent, a multi-agent vulnerability detection framework based on hypothesis validation. Our design is inspired by how human auditors review code: when noticing a sensitive operation, they form a hypothesis about a possible vulnerability, consider potential trigger paths, and then verify the hypothesis against the project context. Given a code unit, VulAgent first applies multi-view analyzers to identify and localize security-sensitive operations from complementary perspectives. For each sensitive operation, it then constructs an explicit vulnerability hypothesis—including triggering (or exploitation) preconditions and a candidate trigger path—and validates the hypothesis using project context together with the model’s general knowledge of commonly used APIs and security patterns. This validation-oriented design reduces speculative reports and substantially lowers false positives. Across PrimeVul and SVEN, VulAgent improves accuracy by 6.6 percentage points on average, increases vulnerable–fixed pair identification by up to 4.5x (2.46x on average), and reduces false positive rate by 36% relative to recent LLM-based baselines.
2025
Benchmarking Long-Context Language Models on Long Code Understanding
Jia Li | Xuyuan Guo | Lei Li | Kechi Zhang | Ge Li | Jia Li | Zhengwei Tao | Fang Liu | Chongyang Tao | Yuqi Zhu | Zhi Jin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current advanced long-context language models (LCLMs) offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To bridge this gap, we propose LongCodeU, a long code understanding benchmark that covers four aspects (8 tasks) to evaluate the long code understanding ability LCLMs need for practical applications: code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LongCodeU (i.e., 6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs’ capabilities for long code understanding. In particular, the performance of LCLMs drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K to 1M context windows. Of the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points
Kechi Zhang | Ge Li | Jia Li | Yihong Dong | Jia Li | Zhi Jin
Findings of the Association for Computational Linguistics: ACL 2025
Code generation models have shown significant potential for automating programming tasks. However, the challenge of generating accurate and reliable code persists due to the highly complex and long-reasoning nature of the task. Even state-of-the-art models often fail in code generation due to small errors, which can drastically affect the overall functionality of code. Our study identifies that current models tend to produce errors concentrated at specific error-prone points, which significantly impacts the accuracy of the generated code. To address this issue, we introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards these critical error-prone areas. This approach builds on Direct Preference Optimization, emphasizing accuracy in parts prone to errors. Additionally, we develop a method called Error-Point Identification, which constructs a dataset that targets these problematic points without requiring costly human annotations. Our experiments on benchmarks such as HumanEval(+), MBPP(+), and LiveCodeBench demonstrate that Focused-DPO significantly improves the precision and reliability of code generation, reducing common errors and enhancing overall code quality. By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
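For context, the abstract above says Focused-DPO builds on Direct Preference Optimization. The standard DPO objective is reproduced below as background only; it is not the Focused-DPO objective, whose weighting toward error-prone points is not specified in this abstract. Here $x$ is a prompt, $y_w$ and $y_l$ are the preferred and dispreferred completions, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\sigma$ is the logistic function, and $\beta$ is a temperature hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Because each sequence log-probability is a sum over tokens, one plausible reading of "focused" optimization is that the token-level terms near error-prone points receive more emphasis; that reading is an assumption and should be checked against the paper itself.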