2025
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Yanwei Yue | Guibin Zhang | Boyang Liu | Guancheng Wan | Kun Wang | Dawei Cheng | Yiyan Qi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single-agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. MasRouter employs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high-performing, achieving a 1.8% improvement over the state-of-the-art method on MBPP; (2) economical, reducing overhead by up to 52.07% compared to SOTA methods on HumanEval; and (3) plug-and-play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by 17.21% via customized routing.
VLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training
Zhanpeng Chen | Chengjin Xu | Yiyan Qi | Xuhui Jiang | Jian Guo
Findings of the Association for Computational Linguistics: EMNLP 2025
Vision-language Models (VLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities. However, a significant drawback of VLMs is their reliance on static training data, leading to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate and up-to-date responses, particularly in dynamic or rapidly evolving contexts. To address these limitations, we propose RagVL, a novel framework with knowledge-enhanced reranking and noise-injected training. We instruction-tune the VLM with a simple yet effective instruction template to induce its ranking ability and use it as a reranker to precisely filter the top-k retrieved images. For generation, we inject visual noise during training at the data and token levels to enhance the generator’s robustness. Extensive experiments on four datasets verify the effectiveness of our method. Code and models are available at https://anonymous.4open.science/r/RagVL-F694.
Beyond Function-Level Search: Repository-Aware Dual-Encoder Code Retrieval with Adversarial Verification
Aofan Liu | Song Shiyuan | Haoxuan Li | Cehao Yang | Yiyan Qi
Findings of the Association for Computational Linguistics: EMNLP 2025
The escalating complexity of modern codebases has intensified the need for code retrieval systems capable of interpreting cross-component change intents, a capability fundamentally absent in conventional function-level search paradigms. While recent research has improved alignment between queries and code snippets, retrieving contextually relevant code for a given change request remains underexplored. To bridge this gap, we present RepoAlignBench, the first benchmark designed to evaluate repository-level code retrieval for change request-driven scenarios, encompassing 52k columns. The benchmark shifts the paradigm from function-centric retrieval to holistic repository analysis. In addition, we propose ReflectCode, an adversarial reflection-augmented dual-tower architecture featuring disentangled code_encoder and doc_encoder towers. Our framework dynamically integrates syntactic patterns, function dependency, and semantic expansion intent through an LLM. Comprehensive evaluations demonstrate that ReflectCode achieves 12.2% Top-5 Accuracy and 7.1% Recall improvements over state-of-the-art baselines.
Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
Xiaojun Wu | Junxi Liu | Huan-Yi Su | Zhouchi Lin | Yiyan Qi | Chengjin Xu | Jiajun Su | Jiajie Zhong | Fuwei Wang | Saizhuo Wang | Fengrui Hua | Jia Li | Jian Guo
Findings of the Association for Computational Linguistics: EMNLP 2025
As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models’ language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides future development and optimization. The source code for Golden Touchstone and the model weights of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.
Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion
Muzhi Li | Cehao Yang | Chengjin Xu | Xuhui Jiang | Yiyan Qi | Jian Guo | Ho-fung Leung | Irwin King
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The Knowledge Graph Completion (KGC) task aims to infer the missing entity from an incomplete triple. Existing embedding-based methods rely solely on triples in the KG, which leaves them vulnerable to specious relation patterns and long-tail entities. Text-based methods, on the other hand, struggle with the semantic gap between KG triples and natural language. Apart from triples, entity contexts (e.g., labels, descriptions, aliases) also play a significant role in augmenting KGs. To address these limitations, we propose KGR3, a context-enriched framework for KGC. KGR3 is composed of three modules. First, the Retrieval module gathers supporting triples from the KG, collects plausible candidate answers from a base embedding model, and retrieves context for each related entity. Then, the Reasoning module employs a large language model to generate potential answers for each query triple. Finally, the Re-ranking module combines the candidate answers from the Retrieval and Reasoning modules and fine-tunes an LLM to provide the best answer. Extensive experiments on widely used datasets demonstrate that KGR3 consistently improves various KGC methods. Specifically, the best variant of KGR3 achieves absolute Hits@1 improvements of 12.3% and 5.6% on the FB15k237 and WN18RR datasets.
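For readers unfamiliar with the Hits@1 metric cited in the abstract above, the following is a minimal sketch of how Hits@k is commonly computed for entity prediction. It is not taken from the KGR3 paper: the function name `hits_at_k` and the toy data are hypothetical, and the sketch assumes one gold entity per query with no filtering of other valid answers.

```python
from typing import Sequence


def hits_at_k(
    ranked_candidates: Sequence[Sequence[str]],
    gold_answers: Sequence[str],
    k: int = 1,
) -> float:
    """Return the fraction of queries whose gold entity is in the top-k candidates.

    ranked_candidates[i] is the candidate list for query i, best first.
    gold_answers[i] is the single correct entity for query i (an assumption
    of this sketch; real KGC evaluation often uses a filtered setting).
    """
    if len(ranked_candidates) != len(gold_answers):
        raise ValueError("each query needs exactly one gold answer")
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_answers)
        if gold in candidates[:k]
    )
    return hits / len(gold_answers)


# Toy example with three queries (hypothetical data).
rankings = [
    ["paris", "lyon"],
    ["berlin", "munich"],
    ["rome", "milan"],
]
gold = ["paris", "munich", "turin"]

hits1 = hits_at_k(rankings, gold, k=1)  # only the first query is correct at rank 1
hits2 = hits_at_k(rankings, gold, k=2)  # the second query's gold entity is at rank 2
```

An "absolute improvement" in Hits@1, as reported in the abstract, is the difference between two such scores expressed in percentage points.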