2025
pdf
bib
abs
ReasonerRank: Redefining Language Model Evaluation with Ground-Truth-Free Ranking Frameworks
Jiamu Zhang
|
Jiayi Yuan
|
Andrew Wen
|
Hoang Anh Duy Le
|
Yu-Neng Chuang
|
Soo-Hyun Choi
|
Rui Chen
|
Xia Hu
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) are increasingly adopted across real-world applications, yet traditional evaluations rely on expensive, domain-specific ground-truth labels that are often unavailable or infeasible. We introduce a ground-truth-free evaluation framework focused on reasoning consistency and instruction following, shifting the emphasis from correctness—which is elusive without labels—to transparent, coherent, evidence-based reasoning. Each model response must include a direct answer, a structured multi-step explanation, and supporting evidence, all assessed via semantic similarity and output adherence checks. We further propose TopK-ReRank, which refines rankings by constructing a consensus answer from the most reliable models, reducing ambiguity across diverse reasoning styles. Experiments show that our framework outperforms existing label-free methods, including majority voting, triplet ranking, and peer-review approaches, providing a more interpretable and efficient alternative for evaluating LLMs in the absence of ground-truth labels.
pdf
bib
abs
LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem
Hongyi Liu
|
Shaochen Zhong
|
Xintong Sun
|
Minghao Tian
|
Mohsen Hariri
|
Zirui Liu
|
Ruixiang Tang
|
Zhimeng Jiang
|
Jiayi Yuan
|
Yu-Neng Chuang
|
Li Li
|
Soo-Hyun Choi
|
Rui Chen
|
Vipin Chaudhary
|
Xia Hu
Findings of the Association for Computational Linguistics: EMNLP 2025
Backdoor attacks are powerful and effective, but distributing LLMs without a proven track record like ‘meta-llama‘ or ‘qwen‘ rarely gains community traction. We identify LoRA sharing as a unique scenario where users are more willing to try unendorsed assets, since such shared LoRAs allow them to enjoy personalized LLMs with negligible investment. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can distribute malicious LoRAs to an undefended community. Despite the high-risk potential, no prior art has comprehensively explored LoRA’s attack surface under the downstream-enhancing share-and-play context. In this paper, we investigate how backdoors can be injected into task-enhancing LoRAs and examine the mechanisms of such infections. We find that with a simple, efficient, yet specific recipe, **a backdoor LoRA can be trained once and then seamlessly merged (in a training-free fashion) with multiple task-enhancing LoRAs, retaining both its malicious backdoor and benign downstream capabilities.** This allows attackers to scale the distribution of compromised LoRAs with minimal effort by leveraging the rich pool of existing shared LoRA assets. We note that such merged LoRAs are particularly *infectious* — because their malicious intent is cleverly concealed behind improved downstream capabilities, creating a strong incentive for voluntary download — and *dangerous* — because under local deployment, no safety measures exist to intervene when things go wrong. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs, highlighting the urgent need for heightened security awareness in the LoRA ecosystem. **Warning: This paper contains offensive content and involves a real-life tragedy.**
pdf
bib
abs
Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration
Songyuan Sui
|
Hongyi Liu
|
Serena Liu
|
Li Li
|
Soo-Hyun Choi
|
Rui Chen
|
Xia Hu
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Extensive experiments across four models and five widely used benchmarks demonstrate that CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior generic LLM-based, SQL-aided, and hybrid baselines, confirming its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.