Tianran Sun
2025
Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Dylan Zhang | Justin Wang | Tianran Sun
Findings of the Association for Computational Linguistics: ACL 2025
Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process of proof-oriented programming. We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems to build proficiency in the language, incorporating diverse coding data to elicit reasoning capability, and creating new proof and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B-parameter model, PoPilot, can exceed the performance of GPT-4o in project-level proof-oriented programming by a 64% relative margin, and can improve GPT-4o's performance by 54% by repairing its outputs, compared with GPT-4o's self-repair.
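A minimal sketch of the three-stream data mixture the abstract describes (basic synthetic problems, diverse coding data, and repository-level repair data). All function names, the toy mutation, and the verifier error message are illustrative assumptions, not the released PoPilot pipeline.

```python
"""Hedged sketch: assemble a training mixture from the three data streams
named in the abstract. Everything here is a stand-in for illustration."""

import random
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str
    target: str
    source: str  # which of the three data streams it came from


def synthesize_basic_problems(n: int) -> list[TrainingExample]:
    # Stream 1: small, self-contained F*-style exercises for language proficiency.
    return [TrainingExample(prompt=f"(* exercise {i} *) val f{i} : nat -> nat",
                            target=f"let f{i} x = x", source="basic")
            for i in range(n)]


def load_diverse_coding_data(n: int) -> list[TrainingExample]:
    # Stream 2: general-purpose coding data to elicit reasoning capability.
    return [TrainingExample(prompt=f"# general coding task {i}",
                            target="pass", source="diverse")
            for i in range(n)]


def make_repair_examples(repo_functions: list[str]) -> list[TrainingExample]:
    # Stream 3: project-level repair data. Corrupt a verified definition and
    # pair the broken version (plus an assumed verifier error) with the
    # original as the repair target.
    examples = []
    for fn in repo_functions:
        broken = fn.replace("<=", "<", 1)  # toy mutation for illustration only
        examples.append(TrainingExample(
            prompt=f"Broken:\n{broken}\nError: proof obligation failed (assumed)",
            target=fn, source="repair"))
    return examples


mixture = (synthesize_basic_problems(2)
           + load_diverse_coding_data(2)
           + make_repair_examples(["let max a b = if a <= b then b else a"]))
random.shuffle(mixture)
for ex in mixture:
    print(ex.source, "|", ex.prompt.splitlines()[0])
```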
LastingBench: Defend Benchmarks Against Knowledge Leakage
Yixiong Fang | Tianran Sun | Yuling Shi | Min Wang | Xiaodong Gu
Findings of the Association for Computational Linguistics: EMNLP 2025
The increasing size and complexity of large language models (LLMs) raise concerns about their ability to “cheat” on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While existing methods detect such leakage, they fail to address the long-term challenge of mitigating it. In this paper, we introduce LastingBench, a novel approach to reinforce and safeguard existing benchmarks against knowledge leakage. Our method involves identifying leakage points through perturbation-based detection, followed by counterfactual rewriting to disrupt memorization while preserving the benchmark’s original evaluative intent. We demonstrate that our approach significantly reduces memorization effects in long-context QA benchmarks, providing a more accurate assessment of model reasoning and generalization abilities. Our experiments show that LastingBench not only uncovers substantial leakage in benchmarks like HotpotQA but also yields a more reliable evaluation of state-of-the-art models, ensuring that benchmarks remain effective and resilient over time.
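A minimal sketch of the detect-then-rewrite loop the abstract describes, assuming a simple leakage test (the model answers correctly even with the supporting evidence withheld). `query_model` and `counterfactual_rewrite` are stand-ins, not the paper's implementation.

```python
"""Hedged sketch of perturbation-based leakage detection followed by
counterfactual rewriting. The model call and the rewrite rule are placeholders."""


def query_model(question: str, context: str) -> str:
    # Placeholder for an LLM call; returns a canned answer for the demo below.
    return "1991" if "Python" in question else "unknown"


def counterfactual_rewrite(evidence: str, original_answer: str) -> str:
    # Placeholder: swap the leaked fact for a counterfactual one so a
    # memorizing model is no longer rewarded for pure recall.
    return evidence.replace(original_answer, "2003")


def harden_example(question: str, evidence: str, gold: str) -> dict:
    # Perturbation-based detection: if the model answers correctly even with
    # the supporting evidence removed, the fact is likely memorized (leaked).
    answer_without_evidence = query_model(question, context="")
    leaked = answer_without_evidence.strip() == gold
    if not leaked:
        return {"question": question, "context": evidence, "answer": gold}
    # Counterfactual rewriting: keep the question's evaluative intent, but make
    # the answer derivable only from the rewritten context.
    new_evidence = counterfactual_rewrite(evidence, gold)
    new_gold = "2003"  # assumed counterfactual answer paired with the rewrite
    return {"question": question, "context": new_evidence, "answer": new_gold}


example = harden_example(
    question="When was Python first released?",
    evidence="Python was first released in 1991.",
    gold="1991",
)
print(example)
```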