2025
ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation
Minghua He | Yue Chen | Fangkai Yang | Pu Zhao | Wenjie Yin | Yu Kang | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed code executability and unreliable automated code translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation that leverages executability representations such as functional semantics, syntax structures, and variable dependencies to enhance the code translation capabilities of LLMs. To evaluate the effectiveness of ExeCoder, we manually enhanced the widely used benchmark TransCoder-test, resulting in a benchmark called TransCoder-test-X tailored to LLMs. Evaluation on TransCoder-test-X indicates that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by 10.88% to 38.78% and by 27.44% to 42.97% on two metrics, and even outperforms the renowned closed-source LLM GPT-4o. Code is available at https://aka.ms/execoder.
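As a rough illustration of what an "executability representation" might look like, the sketch below extracts variable dependencies from Python source with the standard ast module. The function name and output format are hypothetical and are not taken from ExeCoder's actual pipeline; this only shows the flavor of structural information the abstract refers to.

```python
# Illustrative sketch only (not ExeCoder's implementation): derive a simple
# variable-dependency map, one kind of "executability representation".
import ast
from collections import defaultdict

def extract_variable_dependencies(source: str) -> dict[str, set[str]]:
    """Map each assigned variable to the names its value depends on."""
    deps: dict[str, set[str]] = defaultdict(set)
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            used = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for target in node.targets:
                for name in ast.walk(target):
                    if isinstance(name, ast.Name):
                        deps[name.id] |= used
    return dict(deps)

print(extract_variable_dependencies("a = 1\nb = a + 2\nc = a * b\n"))
# {'a': set(), 'b': {'a'}, 'c': {'a', 'b'}}
```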
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang | Junhao Wang | Shilin He | Chaoyun Zhang | Yu Kang | Bowen Li | Jiaheng Wen | Chengxing Xie | Maoquan Wang | Yufan Huang | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, to identify the internal components and external packages required for a repository to run successfully. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability at dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
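For intuition only, here is a toy dependency-inference baseline and a simple textual metric in the spirit of what DI-BENCH evaluates. DI-BENCH itself uses curated repositories and execution-based checks; nothing below reflects its actual harness, and the function names are made up for illustration.

```python
# Illustrative sketch: guess a Python repo's external packages by scanning
# imports and dropping stdlib modules, then score against a gold set.
import ast
import sys
from pathlib import Path

def infer_external_packages(repo: Path) -> set[str]:
    pkgs: set[str] = set()
    for py in repo.rglob("*.py"):
        try:
            tree = ast.parse(py.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                pkgs.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                pkgs.add(node.module.split(".")[0])
    return {p for p in pkgs if p not in sys.stdlib_module_names}

def f1(predicted: set[str], gold: set[str]) -> float:
    """Textual metric: F1 between predicted and gold dependency sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```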
Curr-ReFT: Overcoming Training Bottlenecks in Small-scale Vision-Language Models via Curriculum Reinforcement Finetuning
Huilin Deng | Ding Zou | Xinghao Zhao | Rui Ma | Yanming Guo | Yang Cao | Yu Kang
Findings of the Association for Computational Linguistics: EMNLP 2025
State-of-the-art vision-language models (VLMs) require massive scaling that limits practical deployment. Small-scale VLMs offer a practical alternative but face out-of-domain (OOD) collapse when trained with traditional supervised fine-tuning (SFT). Through GeneralPoints experiments, we identify that OOD collapse is due to SFT’s tendency to induce visual hallucinations under distribution shifts, whereas Reinforcement Learning’s (RL) bidirectional reward-driven mechanism with iterative error correction refines visual perception. Although RL-based post-training effectively mitigates OOD degradation, it faces a critical sparse reward dilemma in complex visual reasoning tasks. To this end, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), comprising two sequential stages: (1) Structured Curriculum Reinforcement Learning, which progressively evolves task formats and reward functions to match models’ growing capabilities; and (2) Rejected Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality examples. Extensive experiments demonstrate that Curr-ReFT achieves state-of-the-art performance across various visual tasks in both in- and out-of-domain settings and benchmarks.
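A minimal sketch of how a staged, difficulty-aware reward might evolve across a curriculum is shown below. The concrete stages, task formats, and reward shaping in Curr-ReFT may differ; treat the functions as illustrative of the general idea of easing reward sparsity as training progresses.

```python
# Illustrative sketch only: a reward that changes with the curriculum stage,
# moving from strict binary rewards to denser partial credit.
def curriculum_reward(stage: int, prediction: str, answer: str) -> float:
    prediction, answer = prediction.strip().lower(), answer.strip().lower()
    if stage == 1:                      # easy format: binary judgement / choice
        return 1.0 if prediction == answer else 0.0
    if stage == 2:                      # open-ended answer, still exact containment
        return 1.0 if answer in prediction else 0.0
    # stage 3: dense partial credit (token overlap) to soften sparse rewards
    pred_tokens, gold_tokens = set(prediction.split()), set(answer.split())
    return len(pred_tokens & gold_tokens) / max(len(gold_tokens), 1)

def stage_for_step(step: int, steps_per_stage: int = 1000) -> int:
    """Advance through three curriculum stages as training proceeds."""
    return min(1 + step // steps_per_stage, 3)
```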
Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Xing Zhang | Jiaheng Wen | Fangkai Yang | Yu Kang | Pu Zhao | Junhao Wang | Maoquan Wang | Yufan Huang | Shengyu Fu | Elsie Nallipogu | Qingwei Lin | Yingnong Dang | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Code translation benchmarks are essential for evaluating the accuracy and efficiency of LLM-based systems. Existing benchmarks mainly target individual functions, overlooking repository-level challenges like inter-module coherence and dependency management. Recent repository-level efforts exist but suffer from poor maintainability and coarse evaluation granularity. We introduce Skeleton-Guided-Translation, a framework for benchmarking Java-to-C# translation at the repository level, featuring fine-grained quality evaluation. It follows a two-step process: first translating repository “skeletons”, then refining the entire repository guided by these skeletons. Based on this, we present TRANSREPO-BENCH, the first test-driven benchmark of high-quality Java repositories paired with C# skeletons, unit tests, and build configurations. Our adaptive unit tests support multiple and incremental translations without manual tuning, enhancing automation and scalability. We also propose fine-grained metrics that evaluate translation quality per test case, overcoming limitations of binary metrics in distinguishing build failures. Evaluations using TRANSREPO-BENCH reveal issues like broken cross-file references, showing that our structured approach reduces dependency errors and preserves interface consistency.
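The snippet below sketches the fine-grained scoring idea: rate a translated repository by the fraction of individual unit tests that pass rather than by a single binary outcome. The data structures are hypothetical and are not TRANSREPO-BENCH's actual result format.

```python
# Illustrative sketch: per-test-case scoring instead of a binary pass/fail,
# so a build failure in one module does not zero out the whole repository.
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool
    build_ok: bool  # whether the test's target even compiled

def fine_grained_score(results: list[TestResult]) -> float:
    """Fraction of unit tests that both built and passed."""
    if not results:
        return 0.0
    return sum(r.passed and r.build_ok for r in results) / len(results)

results = [TestResult("testAdd", True, True),
           TestResult("testParse", False, True),
           TestResult("testIo", False, False)]
print(fine_grained_score(results))  # 0.333..., rather than a binary 0
```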
UFO: A UI-Focused Agent for Windows OS Interaction
Chaoyun Zhang | Liqun Li | Shilin He | Xu Zhang | Bo Qiao | Si Qin | Minghua Ma | Yu Kang | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We introduce UFO, a UI-Focused agent designed to fulfill user requests tailored to Windows OS applications by observing and analyzing the GUI and control information of these applications. UFO utilizes a hierarchical dual-agent framework that decomposes user requests using a divide-and-conquer approach, enabling seamless navigation and addressing of sub-tasks across multiple applications. It also incorporates a control interaction module tailored for Windows OS, which detects control elements effectively and allows for fully automated execution. As a result, UFO simplifies complex and time-consuming processes into tasks that can be completed with natural language commands. We conducted testing of UFO across 9 popular Windows applications, encompassing a variety of scenarios. The results, derived from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO stands as the first UI agent specifically tailored for task completion within the Windows OS.
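The following schematic (not UFO's real API or implementation) illustrates the dual-agent division of labor described above: a host agent splits a request into per-application sub-tasks, and an application agent maps each sub-task onto detected GUI controls. All names and callbacks here are placeholders.

```python
# Illustrative sketch of a hierarchical dual-agent loop; the planning,
# control detection, and action functions are supplied by the caller.
from typing import Callable

def host_agent(request: str, plan: Callable[[str], list[tuple[str, str]]]):
    """Divide-and-conquer: yield (application, sub_task) pairs."""
    yield from plan(request)

def app_agent(application: str, sub_task: str,
              detect_controls: Callable[[str], list[str]],
              act: Callable[[str, str], None]) -> None:
    """Pick a control in the target application and execute one step."""
    controls = detect_controls(application)
    target = controls[0] if controls else "unknown-control"  # placeholder policy
    act(target, sub_task)

def run(request, plan, detect_controls, act):
    for application, sub_task in host_agent(request, plan):
        app_agent(application, sub_task, detect_controls, act)

# Mock usage: two applications handled from one natural-language request.
run("email the chart in report.xlsx",
    plan=lambda r: [("Excel", "copy the chart"), ("Outlook", "paste and send")],
    detect_controls=lambda app: [f"{app}:MainWindow"],
    act=lambda control, step: print(f"{control} -> {step}"))
```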
2021
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Hang Li | Wenbiao Ding | Yu Kang | Tianqiao Liu | Zhongqin Wu | Zitao Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Existing audio-language task-specific predictive approaches focus on building complicated late-fusion mechanisms. However, these models face challenges of overfitting with limited labels and poor generalization. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially designed fusion mechanism that can be used in the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies showing that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results. The code and pre-trained models are available at https://github.com/tal-ai/CTAL_EMNLP2021.
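To make the masked cross-modal acoustic modeling proxy task concrete, here is a toy PyTorch sketch of frame masking and a reconstruction loss over masked positions. The actual masking scheme, span lengths, and loss used by CTAL are defined in the paper and repository; this only conveys the shape of the idea.

```python
# Illustrative sketch: hide a fraction of audio frames and reconstruct them,
# scoring only the masked positions.
import torch

def mask_frames(features: torch.Tensor, mask_prob: float = 0.15):
    """features: (batch, time, dim). Returns masked input and a boolean mask."""
    mask = torch.rand(features.shape[:2]) < mask_prob   # (batch, time)
    masked = features.clone()
    masked[mask] = 0.0                                   # zero out masked frames
    return masked, mask

def masked_acoustic_loss(reconstructed: torch.Tensor,
                         original: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss computed only on masked frames."""
    if mask.sum() == 0:
        return torch.tensor(0.0)
    return torch.nn.functional.l1_loss(reconstructed[mask], original[mask])
```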