Shing-Chi Cheung
2026
Across Programming Language Silos: A Study on Cross-Lingual Retrieval-Augmented Code Generation
Qiming Zhu | Jialun Cao | Xuanang Chen | Weili Zhang | Yaojie Lu | Hongyu Lin | Xianpei Han | Le Sun | Shing-Chi Cheung
Findings of the Association for Computational Linguistics: ACL 2026
Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) has largely focused on single-language settings, leaving cross-lingual effectiveness underexplored. Multilingual RACG systems are increasingly important for migrating and reusing code across programming languages (PLs), a common yet challenging task in modern software development. To systematically study cross-lingual code knowledge transfer in RACG, we construct a dataset covering 13 PLs with nearly 14K instances. Our experiments reveal three key insights: (1) Knowledge transfer in RACG across PLs is non-trivial, even with direct injection. (2) RACG exhibits unequal cross-lingual knowledge transfer, and its efficacy depends on the linguistic affinity of the PL pair and the diversity of the LLM's pretraining corpus. (3) RACG shows limited reliance on natural language information embedded in code when equipped with a code-specific retriever. These findings provide practical guidance for designing effective multilingual RACG systems. https://github.com/icip-cas/Cross-Lingual-RACG
2025
CRUXEVAL-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution
Ruiyang Xu | Jialun Cao | Yaojie Lu | Ming Wen | Hongyu Lin | Xianpei Han | Ben He | Shing-Chi Cheung | Le Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code benchmarks such as HumanEval are widely adopted to evaluate the coding capabilities of Large Language Models (LLMs). However, existing code benchmarks have a significant programming-language bias: over 95% of code generation benchmarks are dominated by Python, leaving LLMs’ capabilities in other programming languages such as Java and C/C++ unknown. Coding-task bias is a further problem. Most benchmarks focus on code generation, while benchmarks for code reasoning (predicting the output for a given input, and the input for a given output), an essential coding capability, are scarce. Constructing multilingual benchmarks can be expensive and labor-intensive, and code from contest websites such as Leetcode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multilingual code reasoning benchmark that covers 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. The construction pipeline of CRUXEVAL-X is fully automated and test-guided: it iteratively generates and repairs code based on execution feedback. To cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulated various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket correlates less with other languages. More interestingly, even a model trained solely on Python can achieve up to 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.
From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs
Jialun Cao | Yaojie Lu | Meiziniu Li | Haoyang Ma | Haokun Li | Mengda He | Cheng Wen | Le Sun | Hongyu Zhang | Shengchao Qin | Shing-Chi Cheung | Cong Tian
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Research in AI-based formal mathematical reasoning has grown rapidly. These studies have excelled in mathematical competitions such as the IMO and have made significant progress. However, they intertwine multiple skills (problem-solving, reasoning, and writing formal specifications), making it hard to precisely identify LLMs’ strengths and weaknesses in each task. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five mainstream formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) in six tasks by distilling gpt-4o, and evaluated against ten open-source LLMs, including the recently popular DeepSeek-R1. We found that LLMs are good at writing proof segments when given either the code or a detailed description of the proof steps. Fine-tuning yielded up to a nearly threefold improvement. Interestingly, we also observed that fine-tuning with formal data enhances abilities in mathematics, reasoning, and coding. We hope our findings inspire further research.