Hailong Sun

2026

Beyond Superficial Tests: Adversarial Refinement for Reliable Property-Based Testing
Xiao Li | Runlin Liu | Zhe Zhang | Xiang Gao | Hailong Sun
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation, yet their application to Property-Based Testing (PBT) remains fraught with a superficiality gap. While LLMs can readily generate syntactically correct tests, they often struggle to bridge the semantic gap between code implementation and its intended invariant logic, resulting in weak properties that provide a false sense of security. To address this, we introduce PROBE, an agentic framework that hardens software properties through Adversarial Refinement. Unlike traditional generation approaches, PROBE treats test generation as a game of semantic asymmetry: it employs a Validator agent to actively generate counter-implementations, which are semantically incorrect codes that satisfy the generated property, to expose loopholes in the specification. Furthermore, PROBE constructs a cross-functional semantic graph to capture deep dependencies often missed by local analysis. Extensive evaluation reveals that PROBE increases mutation scores by 9.79% over baselines. In real-world deployment, PROBE identified 45 previously unknown bugs in top-tier libraries that have been confirmed by developers, demonstrating its ability to uncover deep semantic defects.

pdf bib abs

HIPO: A Hierarchical Prompt Optimization Framework with Task Awareness and Fine-Grained Debugging
Lu Qi | Lei Chai | Hongrui Yu | Binhang Qi | Hailong Sun
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their performance often hinges on carefully designed prompts, whose creation requires substantial human effort. While numerous automatic prompt optimization techniques have been proposed, existing methods typically apply the same prompt across all samples within a dataset, ignoring variation in sample difficulty. To address these limitations, we propose HIPO, a HIerarchical Prompt Optimization framework that shifts the paradigm from dataset-level to sample-level optimization. Our framework first employs a lightweight router model, trained offline, to predict the difficulty of each sample at test time. Based on this prediction, HIPO dynamically selects a prompt from a five-tiered hierarchy, tailoring complexity to sample difficulty. Furthermore, two refinement stages—Task Description Prompt Refine and Attribution-Based Prompt Refine—enhance generalizability and fine-grained optimization. Extensive experiments on 27 tasks demonstrate that HIPO outperforms all baselines, achieving state-of-the-art performance on 25% more tasks than the strongest baseline. Cost analysis further demonstrates substantial efficiency gains, reducing API calls, token consumption, and overall cost by 1.2× to 80×. Our implementation is publicly available at https://github.com/LuQiCode/HIPO.

2025

pdf bib abs

Mobile task automation is an emerging technology that leverages AI to automatically execute routine tasks by users’ commands on mobile devices like Android, thus enhancing efficiency and productivity. While large language models (LLMs) excel at general mobile tasks through training on massive datasets, they struggle with app-specific workflows. To solve this problem, we designed UI Map, a structured representation of target app’s UI information. We further propose a UI Map-guided LLM-based approach UICompass to automate mobile tasks. Specifically, UICompass first leverages static analysis and LLMs to automatically build UI Map from either source codes of apps or byte codes (i.e., APK packages). During task execution, UICompass mines the task-relevant information from UI Map to feed into the LLMs, generate a planned paths, and adaptively adjust the path based on the actual app state and action history. Experimental results demonstrate that UICompass achieves a 15.87% higher task executing success rate than SOTA approaches. Even when only APK is available, UICompass maintains superior performance, demonstrating its applicability to closed-source apps.

Co-authors

Xiao Li 1

Lu Qi 1

He Rui 1

Venues

Findings2
EMNLP1

Fix author