Yihong Huang

2026

Retrieval-Augmented Generation (RAG) is a powerful tool for NLP applications. Yet it is challenging to encode large knowledge bases as compact offline structures while simultaneously achieving accurate, low-latency online retrieval. We propose **ZoomRAG**, a coarse-to-fine, hierarchical graph inference method that tackles these challenges. ZoomRAG formulates the retrieval task as random walks across multi-scale relational graphs. *At the coarse level*, it constructs a global relational graph and performs a query-initiated random walk to quickly locate a few relevant documents in the entire corpus. *At the finer level*, it “zooms into” the selected documents to capture fine-grained semantic and temporal relations, and conducts a second random walk to pinpoint salient evidence chunks for generation. This coarse-to-fine strategy substantially reduces offline indexing costs and accelerates online retrieval. Moreover, random-walk-based topological reasoning over rich, multi-scale relational structures enables ZoomRAG to effectively aggregate multi-hop evidence while suppressing noise. Finally, we address the difficulty of handling concurrent RAG queries with an **algorithm-parallel ZoomRAG**. Overall, ZoomRAG avoids building expensive knowledge graphs while achieving absolute accuracy gains of 2.2% to 4.9% over state-of-the-art RAG models, with an average online retrieval latency as low as 0.019 seconds per query when processing hundreds of queries concurrently.
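The two-stage retrieval described in this abstract can be illustrated with a generic random walk with restart. The sketch below is not the authors' implementation: the graph construction, the edge weights, the restart probability `alpha`, and the names `random_walk_with_restart` and `zoom_retrieve` are all assumptions made for illustration, and the abstract does not specify how the relational graphs are built.

```python
def random_walk_with_restart(adj, restart, alpha=0.15, iters=50):
    """Score nodes by power iteration of a random walk with restart.

    adj: dict node -> dict neighbor -> non-negative edge weight
         (every neighbor must also be a key of adj)
    restart: dict node -> restart probability (values sum to 1)
    """
    nodes = list(adj)
    score = {n: restart.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: alpha * restart.get(n, 0.0) for n in nodes}
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                # Dangling node: send its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += (1 - alpha) * score[u] * restart.get(n, 0.0)
                continue
            for v, w in adj[u].items():
                nxt[v] += (1 - alpha) * score[u] * w / total
        score = nxt
    return score


def top_k(score, candidates, k):
    return sorted(candidates, key=lambda n: score[n], reverse=True)[:k]


def zoom_retrieve(doc_graph, chunk_graphs, query, k_docs=1, k_chunks=2):
    # Coarse level: walk over the query + document graph, keep the top documents.
    coarse = random_walk_with_restart(doc_graph, {query: 1.0})
    docs = top_k(coarse, [n for n in doc_graph if n != query], k_docs)

    # Fine level: merge the chunk graphs of the selected documents only.
    fine_graph = {query: {}}
    for d in docs:
        for u, nbrs in chunk_graphs[d].items():
            fine_graph.setdefault(u, {}).update(nbrs)
    fine = random_walk_with_restart(fine_graph, {query: 1.0})
    chunks = top_k(fine, [n for n in fine_graph if n != query], k_chunks)
    return docs, chunks


# Hypothetical toy data: edge weights stand in for query/document similarity.
doc_graph = {
    "q": {"d1": 0.9, "d2": 0.1},
    "d1": {"q": 0.9, "d2": 0.2},
    "d2": {"q": 0.1, "d1": 0.2},
}
chunk_graphs = {
    "d1": {
        "q": {"d1_c1": 0.8, "d1_c2": 0.1},
        "d1_c1": {"q": 0.8, "d1_c2": 0.5},
        "d1_c2": {"q": 0.1, "d1_c1": 0.5, "d1_c3": 0.5},
        "d1_c3": {"d1_c2": 0.5},
    },
    "d2": {"q": {"d2_c1": 0.1}, "d2_c1": {"q": 0.1}},
}

docs, chunks = zoom_retrieve(doc_graph, chunk_graphs, "q")
print(docs, chunks)
```

Because the second walk only touches chunks of the documents selected by the first, the fine-grained graph stays small regardless of corpus size, which is the intuition behind the reduced indexing and retrieval cost claimed above.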


Large Language Models exhibit degraded performance when extrapolating beyond their training context lengths. Existing training-free methods such as positional reuse or interpolation can alleviate this issue efficiently. However, these strategies are semantics-agnostic: they consider only relative token distances, which can indiscriminately blur semantically relevant and irrelevant tokens alike. To address this, we introduce an adaptive positional zooming method called **Relevance-Informed Positional Resource Allocation (RiPRA)**. RiPRA formulates positional encoding as a constrained resource allocation problem, in which a fixed positional budget is distributed across tokens in a longer context based on their semantic relevance to the query: relevant tokens get higher positional resolution, while irrelevant tokens (positions) are compressed. In this way, RiPRA enables dynamic, nonparametric positional zooming in which the positional resolution is adaptively modulated across queries and network layers, effectively improving long-range context modeling and retrieval capacity. In addition, isotonic smoothing is used to enforce a global linear ordering that preserves stability and generalization, together with a chunk-based hierarchical approximation that further reduces inference overhead. Extensive experiments across comprehensive benchmarks including LongBench, L-Eval, Passkey Retrieval, and PG19 demonstrate that RiPRA consistently outperforms existing training-free extrapolation methods, showing the value of relevance-conditioned positional encoding for long-context generalization.
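The idea of distributing a fixed positional budget by relevance can be illustrated with a very small sketch. This is an assumption-laden toy, not RiPRA itself: the function name `allocate_positions`, the `floor` parameter, and the proportional allocation rule are hypothetical, and the isotonic smoothing and chunk-based approximation mentioned in the abstract are not reproduced here. Monotone ordering falls out of accumulating non-negative steps.

```python
def allocate_positions(relevance, budget, floor=0.05):
    """Map tokens to positions in (0, budget], giving relevant tokens larger steps.

    relevance: per-token non-negative relevance to the query (hypothetical scores)
    budget: largest position available (e.g. the trained context length)
    floor: minimum step so irrelevant tokens are compressed, not collapsed
    """
    steps = [floor + max(r, 0.0) for r in relevance]
    scale = budget / sum(steps)
    positions, pos = [], 0.0
    for s in steps:
        pos += s * scale  # non-negative steps keep the token order intact
        positions.append(pos)
    return positions


# Six tokens squeezed into a budget of 4 positions; tokens 2 and 3 are relevant.
positions = allocate_positions([0, 0, 1, 1, 0, 0], budget=4.0)
print([round(p, 3) for p in positions])
```

In this toy allocation the gap between the two relevant tokens is much wider than the gap between neighboring irrelevant tokens, which is the "higher positional resolution" effect the abstract describes.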


Low-Rank Adaptation (LoRA) has achieved remarkable progress in improving the fine-tuning efficiency and downstream performance of large language models (LLMs). Although prior work has recognized that different weight update matrices ΔW exhibit varying importance and should therefore be allocated different ranks, parameters within the same update matrix are still typically constrained to a uniform rank configuration, neglecting fine-grained parameter-level heterogeneity. To address this limitation, we propose G-LoRA (Global-Local Decoupled LoRA), which decomposes each update matrix into global and local adapters. The key idea is to reorganize the rows and columns of the update matrix using a first-order Taylor approximation of parameter importance, such that highly influential parameters are clustered into a local sub-block of ΔW. During training, the local adapter focuses on this high-importance sub-region and is allocated a higher rank, whereas the global adapter captures the residual updates for the entire update matrix with a relatively lower rank. By allocating higher representational capacity to more critical parameters, G-LoRA enables more efficient utilization of model resources. Extensive evaluations on benchmarks spanning commonsense reasoning, mathematical reasoning, and code generation demonstrate that G-LoRA achieves up to 2.7% absolute accuracy improvement over LoRA and its variants, validating its effectiveness for LLM fine-tuning.
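A rough sketch of the block-selection idea follows. It assumes the common first-order Taylor importance score |w · g| (weight times gradient) and aggregates it by row and column sums to pick the local sub-block; both the aggregation rule and the function names are hypothetical, since the abstract does not give the exact procedure. The parameter counts at the end use made-up dimensions and ranks purely to show the accounting.

```python
def taylor_importance(weights, grads):
    """Per-parameter importance |w * g| (first-order Taylor score)."""
    return [[abs(w * g) for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grads)]


def select_block(importance, n_rows, n_cols):
    """Pick the rows and columns with the largest summed importance."""
    row_score = [sum(row) for row in importance]
    col_score = [sum(col) for col in zip(*importance)]
    rows = sorted(range(len(row_score)), key=lambda i: row_score[i], reverse=True)[:n_rows]
    cols = sorted(range(len(col_score)), key=lambda j: col_score[j], reverse=True)[:n_cols]
    return sorted(rows), sorted(cols)


def adapter_params(d_out, d_in, rank):
    """Trainable parameters of one low-rank adapter B (d_out x r) @ A (r x d_in)."""
    return rank * (d_out + d_in)


# Toy 3x3 weights and gradients (hypothetical values).
W = [[1.0, 2.0, 3.0], [0.1, 0.1, 0.1], [1.0, 4.0, 2.0]]
G = [[0.1, 1.0, 1.0], [0.1, 0.1, 0.1], [0.1, 1.0, 1.0]]
rows, cols = select_block(taylor_importance(W, G), n_rows=2, n_cols=2)
print(rows, cols)

# Parameter budget: low-rank global adapter plus higher-rank local adapter,
# compared against a uniform-rank adapter (all sizes are illustrative).
global_local = adapter_params(4096, 4096, rank=4) + adapter_params(512, 512, rank=16)
uniform = adapter_params(4096, 4096, rank=8)
print(global_local, uniform)
```

Permuting the selected rows and columns to the front of the matrix would gather the high-importance entries into one contiguous sub-block, which is where the higher-rank local adapter would be applied.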

Venues: Findings (3)