2025
MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper
Runjia Zeng | Guangyan Sun | Qifan Wang | Tong Geng | Sohail Dianat | Xiaotian Han | Raghuveer Rao | Xueling Zhang | Cheng Han | Lifu Huang | Dongfang Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Considering deep neural networks as manifold mappers, the pretrain-then-fine-tune paradigm can be interpreted as a two-stage process: pretraining establishes a broad knowledge base, and fine-tuning adjusts the model parameters to activate specific neural pathways to align with the target manifold. Although prior fine-tuning approaches demonstrate success, their rigid parameter space limits their ability to dynamically activate appropriate neural pathways, rendering them ill-equipped to adapt flexibly to diverse and evolving data distributions. In light of this view, we propose a novel approach, Mixture of Expert Prompt Tuning (MEPT), as an effective and efficient manifold-mapping framework. MEPT leverages the Mixture of Experts architecture by integrating multiple prompt experts to adaptively learn diverse and non-stationary data distributions. Empirical evaluations demonstrate that MEPT outperforms several state-of-the-art parameter-efficient baselines on SuperGLUE, achieving notable improvements in mean accuracy (e.g., 1.94%) while significantly reducing activated prompts by 79.25%. The effectiveness of MEPT is further supported by theoretical insights from manifold learning and validated through neural activation pathway visualization results. Our code is available at https://runjia.tech/emnlp_mept/.
When Truthful Representations Flip Under Deceptive Instructions?
Xianxuan Long | Yao Fu | Runchao Li | Mu Sheng | Haotian Yu | Xiaotian Han | Pan Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLMs compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations “flip”, such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find that the model’s instructed True/False output is predictable from its internal representations via linear probes across all conditions. Further, we use Sparse Autoencoders (SAEs) to show that Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are similar to each other), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instructions and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces.
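As a rough illustration of the linear-probe idea mentioned in the abstract above, the following minimal sketch fits a logistic-regression probe on fixed feature vectors. It is not the authors' code: the function names (`train_linear_probe`, `probe_predict`), the hyperparameters, and the two-dimensional toy vectors standing in for hidden states are all assumptions made for this example.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights on frozen feature vectors.

    Hypothetical helper: a linear probe has no hidden layers, so it can only
    succeed if the label is linearly decodable from the features.
    """
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label == 1)
            err = p - y                     # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Full-batch gradient descent step.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def probe_predict(weights, bias, x):
    """Return 1 if the probe places x on the positive side of the hyperplane."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Toy stand-ins for hidden states (assumed data, not real activations).
toy_features = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                [-1.0, -1.0], [-2.0, -1.0], [-1.0, -2.0]]
toy_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_probe(toy_features, toy_labels)
accuracy = sum(
    probe_predict(weights, bias, x) == y
    for x, y in zip(toy_features, toy_labels)
) / len(toy_labels)
print(accuracy)  # prints 1.0 on this linearly separable toy set
```

In practice the feature vectors would be activations extracted from a chosen layer, and the probe would be evaluated on held-out examples rather than on its training set.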
Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
Yao Fu | Xianxuan Long | Runchao Li | Haotian Yu | Mu Sheng | Xiaotian Han | Yu Yin | Pan Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments by significantly reducing memory and computation costs. While quantized LLMs often maintain performance on perplexity and zero-shot tasks, the impact of quantization on truthfulness (whether models generate truthful or deceptive responses) remains largely unexplored. In this work, we introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs across three dimensions: (1) Truthfulness on Logical Reasoning; (2) Truthfulness on Common Sense; and (3) Truthfulness on Imitative Falsehoods. Using this framework, we examine mainstream quantization techniques (ranging from 4-bit to extreme 2-bit) across several open-source LLMs. Surprisingly, we find that while quantized models retain internally truthful representations, they are more susceptible to producing false outputs under misleading prompts. To probe this vulnerability, we test 15 rephrased variants of “honest”, “neutral” and “deceptive” prompts and observe that “deceptive” prompts can override truth-consistent behavior, whereas “honest” and “neutral” prompts maintain stable outputs. Further, layer-wise probing and PCA visualizations reveal that quantized models “know” the truth internally yet still produce false outputs when guided by “deceptive” prompts. Our findings provide insights into future designs of quantization-aware alignment and truthfulness interventions.
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Van Yang | Hongye Jin | Shaochen Zhong | Song Jiang | Qifan Wang | Vipin Chaudhary | Xiaotian Han
Findings of the Association for Computational Linguistics: ACL 2025
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM should enable its users to effortlessly handle many otherwise exhausting tasks (e.g., digesting a long-form document to find answers vs. directly asking an LLM about it). However, existing real-task-based long-context evaluation benchmarks have a few major shortcomings. For instance, some Needle-in-a-Haystack-like benchmarks are too synthetic and therefore do not represent the real-world usage of LLMs. While some real-task-based benchmarks like LongBench avoid this problem, such benchmarks are often formed in a way where each data sample has a fixed sequence length, which not only makes them suitable only for models with a certain range of context windows, but also leaves no proxy for knowing at what length the model or method of interest would fail. Finally, most benchmarks do not provide proper metrics to separate long-context performance from the model’s baseline ability, so in a cross-model or cross-recipe comparison, this conflation leaves the user unable to understand how exactly one model or recipe excels at the long-context task relative to its baseline ability. To address these issues, we introduce a length-controllable, real-life reflective benchmark with a novel metric that disentangles baseline knowledge from long-context capabilities. Experiments demonstrate the superiority of our datasets in effectively evaluating LLMs. All assets are available at https://github.com/uservan/100-LongBench.git.
CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation
Nengbo Wang | Xiaotian Han | Jagdip Singh | Jing Ma | Vipin Chaudhary
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across multiple metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
2024
PokeMQA: Programmable knowledge editing for Multi-hop Question Answering
Hengrui Gu | Kaixiong Zhou | Xiaotian Han | Ninghao Liu | Ruobing Wang | Xin Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-hop question answering (MQA) is one of the challenging tasks for evaluating machines’ comprehension and reasoning abilities, where large language models (LLMs) have widely achieved human-comparable performance. Because knowledge facts in the real world change over time, knowledge editing has been explored to update models with up-to-date facts while avoiding expensive re-training or fine-tuning. Starting from the edited fact, the updated model needs to propagate cascading changes through the chain of MQA. Prior art simply adopts a mix-up prompt to instruct LLMs to conduct multiple reasoning tasks sequentially, including question decomposition, answer generation, and conflict checking via comparison with edited facts. However, the coupling of these functionally diverse reasoning tasks inhibits LLMs’ advantages in comprehending and answering questions while disturbing them with the unskilled task of conflict checking. We thus propose a framework, Programmable knowledge editing for Multi-hop Question Answering (PokeMQA), to decouple the jobs. Specifically, we prompt LLMs to decompose knowledge-augmented multi-hop questions, while interacting with a detached trainable scope detector to modulate LLMs’ behavior depending on an external conflict signal. Experiments on three LLM backbones and two benchmark datasets validate the superiority of our approach in knowledge editing for MQA: it outperforms all competitors by a large margin in almost all settings and consistently produces a reliable reasoning process.
InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
Haogeng Liu | Quanzeng You | Yiqi Wang | Xiaotian Han | Bohan Zhai | Yongfei Liu | Wentao Chen | Yiren Jian | Yunzhe Tao | Jianbo Yuan | Ran He | Hongxia Yang
Findings of the Association for Computational Linguistics: ACL 2024
In this work, we present InfiMM, an advanced Multimodal Large Language Model that adapts to intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through the utilization of large-scale training data, comprehensive training strategies, and diverse large language models. This approach ensures the preservation of Flamingo’s foundational strengths while simultaneously introducing augmented capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM’s remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.
2017
DMGroup at EmoInt-2017: Emotion Intensity Using Ensemble Method
Song Jiang | Xiaotian Han
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
In this paper, we present a novel ensemble learning architecture for emotion intensity analysis. The ensemble method has two stages, and each stage includes several individual machine learning models. In stage 1, we employ both linear and nonlinear regression models to obtain a more diverse emotion intensity representation. In stage 2, we use two regression models: linear regression and XGBoost. The output of stage 1 serves as the input of stage 2, so the two different types of models (linear and nonlinear) in stage 2 can describe the input from two opposite aspects. We also add a method for analyzing and splitting multi-word hashtags and appending them to the emotion intensity corpus before feeding it to our model. Our model achieves a Pearson measure of 0.571 averaged over four emotions.
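The two-stage structure described in the abstract above can be sketched roughly as follows. This is an illustrative toy, not the authors' system: it uses a single numeric feature, substitutes a k-nearest-neighbour regressor for the nonlinear models (including XGBoost), averages the stage 1 outputs into one meta-feature, and averages the stage 2 outputs into a final prediction. All of these choices, along with every function name and the toy data, are assumptions made for the example.

```python
from statistics import mean

def fit_linear(xs, ys):
    """One-feature ordinary least squares; returns a predictor function."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regressor: an assumed stand-in for the nonlinear models."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict

def two_stage_predict(xs, ys):
    """Stage 1 models feed stage 2 models; returns predictions for xs."""
    # Stage 1: one linear and one nonlinear model on the raw feature.
    stage1 = [fit_linear(xs, ys), fit_knn(xs, ys)]
    # Assumption: collapse stage 1 outputs into a single meta-feature by averaging.
    meta = [mean(model(x) for model in stage1) for x in xs]
    # Stage 2: again one linear and one nonlinear model, trained on stage 1 output.
    stage2 = [fit_linear(meta, ys), fit_knn(meta, ys)]
    # Assumption: the final prediction is the average of the stage 2 models.
    return [mean(model(z) for model in stage2) for z in meta]

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Toy data (assumed): a curved relationship with targets in [0, 1).
xs = [i / 10 for i in range(20)]
ys = [(x / 2) ** 2 for x in xs]
preds = two_stage_predict(xs, ys)
print(round(pearson(preds, ys), 3))
```

The sketch fits and scores on the same points for brevity; a realistic stacking setup would train stage 2 on held-out stage 1 predictions to avoid leakage.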