Xinyan Guan

2025

Understanding the mechanisms underlying Large Language Model (LLM) behavior in Retrieval-Augmented Generation (RAG) systems is critical for enhancing reliability. In this paper, we leverage Sparse Autoencoders (SAEs) within the LLaMA Scope to uncover sparse, interpretable latents that govern RAG behaviors. Through systematic analysis of SAE activations, we identify specific latents associated with two fundamental RAG decisions: (1) context versus memory prioritization, and (2) response generation versus query rejection. Intervention experiments demonstrate that these latents enable precise control over model behavior and maintain generalizability across various experimental settings. Mechanistic analysis reveals that manipulating these latents influences model behavior by reconfiguring attention patterns of retrieval heads. Our findings establish SAEs as a principled tool for understanding and controlling RAG behaviors, demonstrating capabilities in precise behavior steering without architectural modifications.
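The intervention step can be illustrated with a minimal sketch. The abstract does not specify how the latents are manipulated, so this assumes a common form of SAE steering: encode a hidden state, then shift it along the chosen latent's decoder direction until the latent reaches a target activation. The function names, the toy weights, and the ReLU encoder are all assumptions for illustration, not the paper's implementation.

```python
def sae_encode(hidden, w_enc, b_enc):
    """Toy SAE encoder: ReLU(W_enc @ hidden + b_enc), using plain lists."""
    latents = []
    for row, bias in zip(w_enc, b_enc):
        pre_activation = sum(w * x for w, x in zip(row, hidden)) + bias
        latents.append(max(pre_activation, 0.0))
    return latents


def steer_latent(hidden, w_enc, b_enc, w_dec, latent_idx, target):
    """Shift `hidden` along one decoder direction (hypothetical helper).

    The shift equals (target - current activation) times the decoder row
    of the chosen latent; all other directions are left untouched.
    """
    current = sae_encode(hidden, w_enc, b_enc)[latent_idx]
    delta = target - current
    direction = w_dec[latent_idx]
    return [x + delta * d for x, d in zip(hidden, direction)]


# Toy example with identity encoder/decoder weights (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
steered = steer_latent([1.0, 2.0], identity, [0.0, 0.0], identity,
                       latent_idx=0, target=3.0)
```

In a real model the steered vector would replace the residual-stream activation at the layer the SAE was trained on; that wiring is omitted here.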

Automated Alignment refers to a set of algorithms designed to align Large Language Models (LLMs) with human intentions and values while minimizing manual intervention. However, it faces challenges such as algorithmic diversity and excessively convoluted workflows. We present AutoAlign, an open-source toolkit that offers: (1) a unified framework integrating mainstream automated algorithms through a consistent interface, and (2) an accessible workflow supporting one-click execution for prompt synthesis, automatic alignment signal construction, and iterative model training. Our toolkit enables easy reproduction of existing results through extensive benchmarks and facilitates the development of novel approaches via modular components. It includes implementations for both highly efficient inference and training, as well as low-resource training. By standardizing automated alignment methodologies and providing accessible implementations, AutoAlign lowers the barriers to building customized aligned models and supports academic research.

Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained feedback at the statement level is then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks validate RLFH’s effectiveness in hallucination mitigation.
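The conversion from statement-level feedback to token-level rewards can be sketched as follows. The abstract does not say how scores are mapped onto tokens, so this sketch assumes the simplest scheme: each statement's score is credited to the last token of the span it covers. The function name and the `(start, end, score)` span format are hypothetical.

```python
def statement_to_token_rewards(num_tokens, statements):
    """Spread statement-level scores over a token-level reward vector.

    `statements` is a list of (start, end, score) tuples, where
    [start, end) is the token span of one atomic statement (assumed format).
    The score is placed on the final token of the span; every other
    position keeps a reward of 0.0.
    """
    rewards = [0.0] * num_tokens
    for start, end, score in statements:
        if not (0 <= start < end <= num_tokens):
            raise ValueError(f"invalid span ({start}, {end})")
        rewards[end - 1] += score
    return rewards


# Two statements in a six-token response: one judged correct, one not.
rewards = statement_to_token_rewards(6, [(0, 3, 1.0), (3, 6, -0.5)])
```

Placing the reward where a statement ends gives the policy a signal close to the tokens that produced it, rather than a single score at the end of the whole response.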

2024

Manually annotating instruction data for large language models is difficult, costly, and hard to scale. Meanwhile, current automatic annotation methods typically rely on distilling synthetic data from proprietary LLMs, which not only limits the upper bound of the quality of the instruction data but also raises potential copyright issues. In this paper, we propose REInstruct, a simple and scalable method to automatically build instruction data from an unlabeled corpus without heavy reliance on proprietary LLMs and human annotation. Specifically, REInstruct first selects a subset of unlabeled texts that potentially contain well-structured, helpful, and insightful content and then generates instructions for these texts. To generate accurate and relevant responses for effective and robust training, REInstruct further proposes a rewriting-based approach to improve the quality of the generated instruction data. By training Llama-7b on a combination of 3k seed data and 32k synthetic data from REInstruct, the fine-tuned model achieves a 65.41% win rate on the AlpacaEval leaderboard against text-davinci-003, outperforming other open-source, non-distilled instruction data construction methods. The code is publicly available at https://github.com/cs32963/REInstruct.

2022

Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change
Zhaochen Su | Zecheng Tang | Xinyan Guan | Lijun Wu | Min Zhang | Juntao Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability, i.e., a language model pre-trained on static data from past years performs worse over time on emerging data. Existing methods mainly perform continual training to mitigate such a misalignment. While effective to some extent, the problem is far from being solved on both language modeling and downstream tasks. In this paper, we empirically observe that temporal generalization is closely affiliated with lexical semantic change, which is one of the essential phenomena of natural languages. Based on this observation, we propose a simple yet effective lexical-level masking strategy to post-train a converged language model. Experiments on two pre-trained language models, two different classification tasks, and four benchmark datasets demonstrate the effectiveness of our proposed method over existing temporal adaptation methods, i.e., continual training with new data. Our code is available at https://github.com/zhaochen0110/LMLM.
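A lexical-level masking strategy can be illustrated with a minimal sketch. The abstract does not describe how words are chosen, so this assumes a per-word semantic-change score is already available and masks the highest-scoring tokens first, in place of the uniform random masking used in standard masked language modeling. The function name, the score dictionary, and the example scores are hypothetical.

```python
def mask_by_semantic_change(tokens, change_scores, mask_ratio=0.15,
                            mask_token="[MASK]"):
    """Mask the tokens with the highest (assumed) semantic-change scores.

    Returns (masked_tokens, labels). `labels` holds the original token at
    masked positions and None elsewhere, mirroring the usual MLM setup.
    Tokens missing from `change_scores` are treated as having score 0.0.
    """
    if not tokens:
        return [], []
    n_mask = max(1, round(len(tokens) * mask_ratio))
    # Rank positions by descending score; ties are broken by position.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (-change_scores.get(tokens[i], 0.0), i))
    chosen = set(ranked[:n_mask])
    masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels


# Hypothetical scores: "viral" has shifted in meaning more than "virus".
masked, labels = mask_by_semantic_change(
    ["the", "virus", "went", "viral", "online"],
    {"viral": 0.9, "virus": 0.4},
    mask_ratio=0.4,
)
```

The intent of such a scheme is to concentrate the post-training signal on words whose usage has drifted since the original pre-training data was collected.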