Chenbin Chenbin
2025
OASIS: Order-Augmented Strategy for Improved Code Search
Gao Zuchen | Zizheng Zhan | Xianming Li | Erxin Yu | Haotian Zhang | Chenbin Chenbin | Yuqun Zhang | Jing Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models that focus solely on major positive-negative differences. This underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
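The InfoNCE objective described above can be sketched as follows. This is a generic illustration of contrastive training with in-batch negatives, not the paper's released code; the array shapes and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(nl_emb, code_emb, temperature=0.05):
    """Generic InfoNCE over in-batch negatives (illustrative sketch,
    not the OASIS implementation). Row i of nl_emb pairs with row i of
    code_emb as the positive; all other rows serve as negatives."""
    # L2-normalize so dot products become cosine similarities
    nl = nl_emb / np.linalg.norm(nl_emb, axis=1, keepdims=True)
    code = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = nl @ code.T / temperature           # (B, B) similarity matrix
    # Softmax cross-entropy with the diagonal (positive pairs) as targets
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When each NL embedding is already aligned with its paired code embedding and orthogonal to the rest, the loss approaches zero; OASIS additionally supervises the *ordering* of similarities among the negative pairs, which plain InfoNCE leaves unconstrained.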
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?
Qingyuan Liang | Zhao Zhang | Zeyu Sun | Zheng Lin | Qi Luo | Xiao Yueyi | Yizhou Chen | Yuqun Zhang | Haotian Zhang | Lu Zhang | Chenbin Chenbin | Yingfei Xiong
Findings of the Association for Computational Linguistics: ACL 2025
Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval(+) and MBPP(+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs’ ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.
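As a toy illustration of the general idea of grammar-aware code representations (not GrammarCoder's actual scheme), a snippet can be serialized as a sequence of syntax-tree node types rather than raw tokens. The sketch below uses Python's standard `ast` module purely as a stand-in for a grammar-rule sequence:

```python
import ast

def node_type_sequence(source):
    """Serialize a Python snippet as a breadth-first sequence of AST
    node-type names — a toy stand-in for grammar-rule-based code
    representations (illustrative only; not GrammarCoder's scheme)."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]
```

For example, `node_type_sequence("x = 1 + 2")` yields a sequence containing `Module`, `Assign`, and `BinOp` nodes; two snippets that differ only in a minor token (e.g. `+` vs `-`) produce sequences that differ in exactly the operator node, which is the kind of structural signal the paper argues helps models discern subtle code differences.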
Co-authors
- Haotian Zhang 2
- Yuqun Zhang 2
- Yizhou Chen 1
- Xianming Li 1
- Jing Li (李婧) 1