Jiaming Wang

2026

The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full-dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module preserves type correctness, suppresses static-analysis warnings, and improves stealth. Extensive experiments across multiple CodeLLMs and code-generation benchmarks show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness. FunPoison also remains robust against advanced code sanitization techniques, including detection, purification, rewriting, static-analysis, and formatting defenses.
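As a rough illustration of the idea in the abstract, the sketch below injects a short, side-effect-free fragment into the body of a function in roughly 10% of a dataset's samples and confirms that each modified sample still compiles. It is a minimal sketch under stated assumptions, not FunPoison itself: the fragment text, the names `FRAGMENT`, `inject_fragment`, and `poison_dataset`, the random sample selection, and the use of Python source as the target language are all hypothetical. The abstract does not specify the template design, automatic repair, safety checking, or type-aware synthesis, so none of those are modeled here.

```python
import ast
import random

# Hypothetical stand-in for a "weak-use" fragment: it defines a fresh local
# and reads it once, without touching any existing program state.
FRAGMENT = "_fp_tmp = 0\n_fp_tmp = _fp_tmp + 0"


def inject_fragment(source: str, fragment: str = FRAGMENT) -> str:
    """Insert `fragment` at the start of the first function body found."""
    tree = ast.parse(source)
    new_stmts = ast.parse(fragment).body
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Keep a leading docstring first so it remains the docstring.
            pos = 1 if ast.get_docstring(node) is not None else 0
            node.body[pos:pos] = new_stmts
            break
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def poison_dataset(samples, rate=0.10, seed=0):
    """Return a copy of `samples` with about `rate` of them modified."""
    rng = random.Random(seed)
    k = round(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), k))
    out = []
    for i, src in enumerate(samples):
        if i in chosen:
            candidate = inject_fragment(src)
            # Compilability check: raises SyntaxError if the edit broke the code.
            compile(candidate, "<poisoned>", "exec")
            out.append(candidate)
        else:
            out.append(src)
    return out


if __name__ == "__main__":
    data = [f"def f{i}(x):\n    return x + {i}\n" for i in range(10)]
    for sample in poison_dataset(data):
        if "_fp_tmp" in sample:
            print(sample)
```

Because the stand-in fragment only assigns to a new local name, the modified functions return the same values as the originals. A realistic system would need the stronger guarantees the abstract describes, such as checking that the chosen name does not collide with an existing identifier.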

Co-authors

Jun Sun 1

Venues

Findings 1
