Xin Zou
2026
Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Chenfei Liao | Wensong Wang | Zichen Wen | Xu Zheng | Yiyu Wang | Haocong He | Yuanhuiyi Lyu | Lutao Jiang | Xin Zou | Yuqian Fu | Bin Ren | Linfeng Zhang | Xuming Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have largely focused on visual token compression. The effectiveness of these methods is commonly evaluated by measuring the accuracy drop on existing MLLM benchmarks before and after compression. However, these benchmarks are originally designed to assess general perception and reasoning abilities, rather than the specific challenges posed by visual token compression, leading to a fundamental task mismatch. In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks. Through a comprehensive empirical study spanning eight popular benchmarks and multiple state-of-the-art compression techniques, we show that (i) current benchmarks contain substantial noise (task-irrelevant samples) for evaluating visual token compression, and (ii) downsampling can act as an effective data filter that distinguishes between simple and difficult samples with respect to compression sensitivity. Motivated by these findings, we propose VTC-Bench, an evaluation framework that explicitly leverages downsampling as a discriminator to denoise existing benchmarks, enabling a fairer and more meaningful additional assessment of visual token compression methods.
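The idea of using downsampling as a discriminator can be illustrated with a minimal sketch. The abstract does not state the exact filtering criterion, so the rule below (a sample counts as "difficult" when the downsampled baseline answers it incorrectly) is an assumption, and the function names `split_by_downsampling` and `accuracy` are hypothetical.

```python
def split_by_downsampling(sample_ids, downsample_correct):
    """Split samples by whether a downsampling baseline answers them correctly.

    Assumption: "simple" = baseline correct, "difficult" = baseline wrong.
    """
    simple = [s for s in sample_ids if downsample_correct[s]]
    difficult = [s for s in sample_ids if not downsample_correct[s]]
    return simple, difficult


def accuracy(sample_ids, method_correct):
    """Accuracy of a compression method restricted to the given samples."""
    if not sample_ids:
        return 0.0
    return sum(1 for s in sample_ids if method_correct[s]) / len(sample_ids)


# Toy per-sample correctness flags (illustrative, not real results).
ids = ["a", "b", "c", "d"]
downsample_ok = {"a": True, "b": True, "c": False, "d": False}
method_ok = {"a": True, "b": True, "c": True, "d": False}

simple, difficult = split_by_downsampling(ids, downsample_ok)
overall = accuracy(ids, method_ok)
on_difficult = accuracy(difficult, method_ok)
```

Reporting accuracy on the difficult subset alongside the overall score separates samples that downsampling already handles from those that may actually stress a compression method.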
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan | Mingdong Ou | Yi Cao | Xin Zou | Jiahao Huo | Shuliang Liu | James Kwok | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2026
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce **Prune-then-Merge**, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. **Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.**
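A two-stage prune-then-merge pipeline might be sketched as follows. This is not the authors' implementation: the mean-score pruning threshold, the greedy most-similar-pair merging, and all function names are assumptions chosen to keep the example short.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune(embeddings, scores):
    """Stage 1: keep patches scoring at least the mean (assumed threshold)."""
    threshold = sum(scores) / len(scores)
    return [e for e, s in zip(embeddings, scores) if s >= threshold]


def merge(embeddings, target):
    """Stage 2: repeatedly average the most similar pair until `target` remain."""
    clusters = [(list(e), 1) for e in embeddings]  # (centroid, member count)
    while len(clusters) > max(target, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        (a, na), (b, nb) = clusters[i], clusters[j]
        # Size-weighted mean keeps the centroid equal to the mean of all members.
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(a, b)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, na + nb))
    return [c[0] for c in clusters]


def prune_then_merge(embeddings, scores, target):
    """Prune low-score patches first, then merge the survivors."""
    return merge(prune(embeddings, scores), target)


# Toy patch embeddings with illustrative informativeness scores.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
scores = [0.9, 0.8, 0.7, 0.1]
compressed = prune_then_merge(patches, scores, target=2)
```

Pruning before merging means the low-score patch never contributes to a merged centroid, which is the feature-dilution concern the abstract raises for single-stage merging.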
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Shuliang Liu | Songbo Yang | Dong Fang | Sihang Jia | Yuqi Tang | Lingfeng Su | Ruoshui Peng | Yibo Yan | Xin Zou | Xuming Hu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Object hallucination critically undermines the reliability of Multimodal Large Language Models (MLLMs), often stemming from a fundamental failure in cognitive introspection—where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
2025
Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models
Yanggan Gu | Junzhuo Li | Sirui Huang | Xin Zou | Zhenghua Li | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025
Aligning small language models (SLMs) with human values typically involves distilling preference knowledge from large language models (LLMs). However, existing distillation methods model preference knowledge in teacher LLMs by comparing pairwise responses, overlooking the extent of difference between responses. This limitation hinders student SLMs from capturing the nuanced preferences for multiple responses. In this paper, we propose a Preference-Aligned Distillation (PAD) framework, which models the teacher’s preference knowledge as a probability distribution over all potential preferences, thereby providing more nuanced supervisory signals. Our insight in developing PAD is rooted in the demonstration that language models can serve as reward functions, reflecting their intrinsic preferences. Based on this, PAD comprises three key steps: (1) sampling diverse responses using high-temperature sampling; (2) computing rewards for both teacher and student to construct their intrinsic preferences; and (3) training the student’s intrinsic preference distribution to align with the teacher’s. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches, achieving over 20% improvement on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human preferences. Notably, on MT-Bench, using the Gemma model family, the student trained by PAD surpasses its teacher, further validating the effectiveness of PAD.
Empowering Persuasion Detection in Slavic Texts through Two-Stage Generative Reasoning
Xin Zou | Chuhan Wang | Dailin Li | Yanan Wang | Jian Wang | Hongfei Lin
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
This paper presents our submission to Subtask 2 (multi-label classification of persuasion techniques) of the Shared Task on Detection and Classification of Persuasion Techniques in Slavic Languages at SlavNLP 2025. Our method leverages a teacher–student framework based on large language models (LLMs): a Qwen3 32B teacher model generates natural language explanations for annotated persuasion techniques, and a Qwen2.5 32B student model is fine-tuned to replicate both the teacher’s rationales and the final label predictions. We train our models on the official shared task dataset, supplemented by annotated resources from SemEval 2023 Task 3 and CLEF 2024 Task 3 covering English, Russian, and Polish to improve cross-lingual robustness. Our final system ranks 4th on BG, SI, and HR, and 5th on PL in terms of micro-F1 score among all participating teams.
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang | Mengxi Gao | Yibo Yan | Xin Zou | Yanggan Gu | Jungang Li | Jingyu Wang | Peijie Jiang | Aiwei Liu | Jia Liu | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. However, existing studies have concentrated mainly on visual–textual misalignment, leaving largely unexplored the MLLMs’ ability to preserve an originally correct answer when confronted with misleading information. We reveal a response uncertainty phenomenon: across nine standard datasets, twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue. To systematically quantify this vulnerability, we propose a two-stage evaluation pipeline: (1) elicit each model’s original response on unperturbed inputs; (2) inject explicit (false-answer hints) and implicit (contextual contradictions) misleading instructions, and compute the misleading rate—the fraction of correct-to-incorrect flips. Leveraging the most susceptible examples, we curate the Multimodal Uncertainty Benchmark (MUB), a collection of image–question pairs stratified into low, medium, and high difficulty based on how many of twelve state-of-the-art MLLMs they mislead. Extensive evaluation on twelve open-source and five closed-source models reveals a high uncertainty: average misleading rates exceed 86%, with explicit cues over 67.19% and implicit cues over 80.67%. To reduce the misleading rate, we then fine-tune all open-source MLLMs on a compact 2,000-sample mixed-instruction dataset, reducing misleading rates to 6.97% (explicit) and 32.77% (implicit), boosting consistency by nearly 29.37% on highly deceptive inputs, and slightly improving accuracy on standard benchmarks.
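The misleading rate described above can be computed with a short sketch. The abstract defines it as the fraction of correct-to-incorrect flips. Taking the denominator to be the set of initially correct answers is an assumption here, and the function name `misleading_rate` is hypothetical.

```python
def misleading_rate(original_correct, perturbed_correct):
    """Share of initially correct answers that become incorrect after a cue.

    Assumption: only samples answered correctly before perturbation count
    toward the denominator.
    """
    initially_correct = 0
    flips = 0
    for before, after in zip(original_correct, perturbed_correct):
        if before:
            initially_correct += 1
            if not after:
                flips += 1
    return flips / initially_correct if initially_correct else 0.0


# Toy correctness flags for four samples (illustrative, not real results).
stage_one = [True, True, False, True]    # answers on unperturbed inputs
stage_two = [False, True, False, False]  # answers after the misleading cue
rate = misleading_rate(stage_one, stage_two)
```

In this toy case three answers start out correct and two of them flip, so the rate is 2/3.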
Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
Haoming Huang | Yibo Yan | Jiahao Huo | Xin Zou | Xinfeng Li | Kun Wang | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce **PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing.** By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the function of key components in the circuit and how the attention pattern dynamics contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation. Our code can be found at https://github.com/halfmorepiece/PhantomCircuit.
Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Kening Zheng | Junkai Chen | Yibo Yan | Xin Zou | Huiyu Zhou | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025
Hallucination issues continue to affect multimodal large language models (MLLMs), with existing research mainly addressing object-level or attribute-level hallucinations, neglecting the more complex relation hallucinations that require advanced reasoning. Current benchmarks for relation hallucinations lack detailed evaluation and effective mitigation, and their datasets often suffer from biases due to systematic annotation processes. To address these challenges, we introduce Reefknot, a comprehensive benchmark targeting relation hallucinations, comprising over 20,000 real-world samples. We provide a systematic definition of relation hallucinations, integrating perceptive and cognitive perspectives, and construct a relation-based corpus using the Visual Genome scene graph dataset. Our comparative evaluation reveals significant limitations in current MLLMs’ ability to handle relation hallucinations. Additionally, we propose a novel confidence-based mitigation strategy, which reduces the hallucination rate by an average of 9.75% across three datasets, including Reefknot. Our work offers valuable insights for achieving trustworthy multimodal intelligence. The dataset and code are released at https://github.com/JackChen-seu/Reefknot.
MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Jiahao Huo | Yibo Yan | Xu Zheng | Yuanhuiyi Lyu | Xin Zou | Zhihua Wei | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025
Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to **reformulate the task of multimodal MU in the era of MLLMs**, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we **develop a novel geometry-constrained gradient ascent method MMUnlearner**. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that fine-tune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code will be released upon acceptance.
2024
CoT-based Data Augmentation Strategy for Persuasion Techniques Detection
Dailin Li | Chuhan Wang | Xin Zou | Junlong Wang | Peng Chen | Jian Wang | Liang Yang | Hongfei Lin
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Detecting persuasive communication is an important topic in Natural Language Processing (NLP), as it can be useful in identifying fake information on social media. We have developed a system to identify applied persuasion techniques in text fragments across four languages: English, Bulgarian, North Macedonian, and Arabic. Our system uses data augmentation methods and employs an ensemble strategy that combines the strengths of both RoBERTa and DeBERTa models. Due to limited resources, we concentrated solely on task 1, and our solution achieved the top ranking in the English track during the official assessments. We also analyse the impact of architectural decisions, data construction, and training strategies.
Co-authors
- Xuming Hu 8
- Yibo Yan 6
- Jiahao Huo 3
- Yanggan Gu 2
- Dailin Li 2
- Hongfei Lin (林鸿飞) 2
- Shuliang Liu 2
- Yuanhuiyi Lyu 2
- Chuhan Wang 2
- Jian Wang 2
- Yi Cao 1
- Junkai Chen 1
- Peng Chen 1
- Yunkai Dang 1
- Dong Fang 1
- Yuqian Fu 1
- Mengxi Gao 1
- Haocong He 1
- Sirui Huang 1
- Haoming Huang 1
- Sihang Jia 1
- Lutao Jiang 1
- Peijie Jiang 1
- James Kwok 1
- Junzhuo Li 1
- Zhenghua Li (李正华) 1
- Jungang Li 1
- Xinfeng Li 1
- Chenfei Liao 1
- Aiwei Liu 1
- Jia Liu 1
- Mingdong Ou 1
- Ruoshui Peng 1
- Bin Ren 1
- Lingfeng Su 1
- Yuqi Tang 1
- Yanan Wang 1
- Wensong Wang 1
- Yiyu Wang 1
- Jingyu Wang 1
- Kun Wang 1
- Junlong Wang 1
- Zhihua Wei 1
- Zichen Wen 1
- Songbo Yang 1
- Liang Yang (杨亮) 1
- Linfeng Zhang 1
- Xu Zheng 1
- Kening Zheng 1
- Xu Zheng 1
- Huiyu Zhou 1