KE Gao

Also published as: Ke Gao

2026

The problem of surface-level pattern mapping represents a critical yet underexplored failure mode in large language model (LLM) reasoning, and is particularly acute in cross-architecture code migration of high-performance libraries. On low-resource, low-level code, insufficient coverage in pretraining data often leads LLMs to rely on superficial name- or type-based correspondences, rather than principled refactorization and reasoning grounded in core functional semantics and architecture-specific optimization intents. This tendency severely hampers the effectiveness of LLMs in complex migration scenarios. To address these challenges, we propose FSCM, a multi-agent framework for cross-architecture migration. FSCM decouples complex implementation details through functional mining and code refactoring, guiding LLMs to focus on invariant semantic anchors across architectures. By mitigating surface-level pattern traps, FSCM improves both functional correctness and performance when targeting emerging architectures. Extensive experiments on the challenging real-world OpenCV library migration tasks demonstrate substantial improvements over state-of-the-art baselines, achieving up to 22% higher correctness rates over Copilot and 43.04x speedup on RISC-V platforms. Code and data are available at: https://anonymous.4open.science/r/code-F8D4.

2025

The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown considerable promise in code generation tasks but struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator or utilize low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic from low-level implementation on GPUs, and to enhance LLMs’ understanding of the attention operator. Along with a two-stage reasoning workflow, TL-Code generation and translation, LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving a speedup of up to 35.16×. Moreover, our method not only surpasses human-optimized libraries (cuDNN and the official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.