Zhipeng Wang
2026
You Only Need One Single Token to Refine Safety Alignment
Wenqian Yu | Shuo Chen | Zhijiang Li | Zhipeng Wang | Jindong Gu
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) face a critical alignment challenge: balancing safety with helpfulness. Excessive safety can lead to over-refusal, where models reject harmful-looking yet benign queries, severely limiting utility. Existing training-free interventions offer an efficient way to mitigate over-refusal without re-training, but suffer from high inference overhead and architecture dependency. Our work explores a complementary direction: rather than applying post-hoc corrections to model outputs, our goal is to intrinsically reshape the distributions of harmful and benign samples within the model’s decision space. In this paper, we argue that a lightweight training-based approach can more effectively distinguish between harmful and benign samples. We propose Single Token Alignment (STA), which optimizes only a single-token prefix (e.g., 4,096 parameters) while keeping the base model frozen. To address the inherent challenge of achieving robust refinement through such a minimal parameter interface, STA employs a mixed weighting mechanism integrated with its optimization objective. This mechanism incorporates hard weighting via stringent data filtering to provide clear, unbiased learning signals, and soft weighting through a focal mechanism to prioritize challenging cases. Extensive experiments across 9 models and 10 datasets demonstrate that STA achieves a superior safety-helpfulness balance for LLMs, MLLMs, and reasoning models, offering a highly efficient and generalizable solution for refining safety alignment.
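The "soft weighting through a focal mechanism" mentioned above can be illustrated with the standard focal-loss weighting, in which examples the model already classifies confidently are down-weighted. This is only a sketch under that assumption: the abstract does not give STA's actual objective, and the function names and the default `gamma` below are hypothetical.

```python
import math


def focal_weight(p_correct: float, gamma: float = 2.0) -> float:
    """Standard focal modulating factor (1 - p)^gamma.

    p_correct is the probability the model assigns to the correct label.
    Assumption: this generic form stands in for STA's focal mechanism.
    """
    return (1.0 - p_correct) ** gamma


def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by the focal weight, so hard cases dominate."""
    eps = 1e-12  # guard against log(0)
    return -focal_weight(p_correct, gamma) * math.log(max(p_correct, eps))


# An easy example (p = 0.9) contributes far less than a hard one (p = 0.1).
easy = focal_loss(0.9)
hard = focal_loss(0.1)
```

With `gamma = 0` the weight is 1 and the loss reduces to plain cross-entropy; larger values of `gamma` shift more of the training signal onto the challenging cases.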
AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives
Yanxi Chen | Wenhui Zhu | Xiwen Chen | Zhipeng Wang | Xin Li | Peijie Qiu | Hao Wang | Xuanzhao Dong | Yujian Xiong | Anderson Schneider | Yuriy Nevmyvaka | Yalin Wang
Findings of the Association for Computational Linguistics: ACL 2026
Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g., generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming the latest SOTA methods.
CoDA: Restoring Contextual Dominance via Copy-Encouraged Attention Intervention for Mitigating RAG Hallucinations
JinWei Shi | Qizhuo Xie | Qianzi Hou | Zhipeng Wang | Wanting Su | Jianhua Zhao | Tao Zheng | Tieke He
Findings of the Association for Computational Linguistics: ACL 2026
Retrieval-augmented generation reduces hallucination by grounding model outputs in external evidence, yet hallucinations can still occur even when the retrieved context is accurate and sufficient. From the perspective of information routing in the residual stream, this reflects an imbalance where internal parametric knowledge overwhelms external context during generation. We present an attention-centric analysis of RAG hallucination under valid evidence, showing that hallucinated and factual tokens diverge in mid-to-late Transformer layers as context-selective attention routing weakens, allowing parametric influence to dominate the residual stream. Motivated by prior studies showing that some attention heads—often referred to as copying heads—exhibit stronger information transport capacity, we aim to extend similar evidence-carrying behavior to a broader set of attention heads. To this end, we introduce CoDA, a lightweight inference-time attention intervention that amplifies evidence-aligned value states, enabling more attention heads to transport reliable external evidence in a copy-encouraged manner. Experiments demonstrate that CoDA improves contextual faithfulness, reduces hallucination, and remains robust under long and noisy contexts with modest and stable inference overhead.
Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation
Wei-Rui Chen | Vignesh Kothapalli | Ata Fatahibaarzi | Hejian Sang | Shao Tang | Qingquan Song | Zhipeng Wang | Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: ACL 2026
Distilling the capabilities of a large reasoning model (LRM) into a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) sections makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different sections (P, CoT, A) affects student performance. Our analysis shows that selective KD over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that beyond a specific length, longer training sequences provide marginal returns for downstream performance but require substantially higher memory and FLOPs. To this end, training on only the first 50% of tokens of every training sequence can retain, on average, ≈91% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. Code is available at https://github.com/weiruichen01/distilling-the-essence.
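The truncation idea described above, training on only the first 50% of tokens of each sequence, can be sketched as a simple prefix cut. This is a minimal illustration, not the authors' released code: the function name, the `keep_fraction` parameter, and the rounding rule are assumptions.

```python
from typing import List


def truncate_sequence(token_ids: List[int], keep_fraction: float = 0.5) -> List[int]:
    """Keep only the leading fraction of a tokenized training sequence.

    Assumption: the cut is a plain prefix by token count, rounded down,
    with at least one token kept for any non-empty sequence.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not token_ids:
        return []
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]


# A 10-token sequence is reduced to its first 5 tokens.
shortened = truncate_sequence(list(range(10)))
```

Because the cut is applied per sequence, the token count seen in training scales with `keep_fraction`, which is consistent with the roughly 50% savings in time, memory, and FLOPs that the abstract reports for a 50% cut.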
2025
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
Kayhan Behdin | Ata Fatahibaarzi | Qingquan Song | Yun Dai | Aman Gupta | Zhipeng Wang | Hejian Sang | Shao Tang | Gregory Dexter | Sirou Zhu | Siyu Zhu | Tejas Dharamsi | Vignesh Kothapalli | Zhoutong Fu | Yihan Cao | Pin-Lun Hsu | Fedor Borisyuk | Natesh S. Pillai | Luke Simon | Rahul Mazumder
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
2023
BIT’s System for Multilingual Track
Zhipeng Wang | Yuhang Guo | Shuoying Chen
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes the system we submitted to the IWSLT 2023 multilingual speech translation track, with input being English speech and output being text in 10 target languages. Our system combines a CNN and a Transformer: the convolutional layers downsample speech features and extract local information, while the Transformer extracts global features and outputs the final results. In our system, we use speech recognition tasks to pre-train the encoder parameters, and then use a speech translation corpus to train the multilingual speech translation model. We have also adopted other methods to optimize the model, such as data augmentation and model ensembling. Our system obtains satisfactory results on the test sets of 10 languages in the MUST-C corpus.
2021
BIT’s system for AutoSimulTrans2021
Mengge Liu | Shuoying Chen | Minqin Li | Zhipeng Wang | Yuhang Guo
Proceedings of the Second Workshop on Automatic Simultaneous Translation
In this paper we introduce our Chinese-English simultaneous translation system participating in AutoSimulTrans2021. In simultaneous translation, translation quality and delay are both important. To reduce the translation delay, we cut the streaming-input source sentence into segments and translate the segments before the full sentence is received. To obtain high-quality translations, we pre-train a translation model on an adequate corpus and fine-tune the model with domain adaptation and sentence length adaptation. The experimental results on the evaluation data show that our system performs better than the baseline system.
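The segment-then-translate idea above can be sketched as a generator that emits a segment as soon as enough source tokens have arrived, so translation can start before the sentence ends. This is only an illustration: the abstract does not say how segment boundaries are chosen, so the fixed `segment_size` policy and the function name are assumptions.

```python
from typing import Iterable, Iterator, List


def stream_segments(tokens: Iterable[str], segment_size: int = 3) -> Iterator[List[str]]:
    """Yield fixed-size segments from a token stream as they fill up.

    Assumption: boundaries fall every `segment_size` tokens; a real system
    may instead use punctuation or a learned segmentation model.
    """
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == segment_size:
            yield buffer  # ready to translate before the sentence is complete
            buffer = []
    if buffer:
        yield buffer  # flush the remainder once the stream ends


segments = list(stream_segments(["a", "b", "c", "d", "e"], segment_size=2))
```

Smaller segments lower the delay but give the translation model less context per segment, which is the quality-versus-delay tradeoff the abstract describes.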
Co-authors
- Shuoying Chen 2
- Ata Fatahibaarzi 2
- Yuhang Guo (郭宇航) 2
- Vignesh Kothapalli 2
- Hejian Sang 2
- Qingquan Song 2
- Shao Tang 2
- Muhammad Abdul-Mageed 1
- Kayhan Behdin 1
- Fedor Borisyuk 1
- Yihan Cao 1
- Shuo Chen 1
- Yanxi Chen 1
- Xiwen Chen 1
- Wei-Rui Chen 1
- Yun Dai 1
- Gregory Dexter 1
- Tejas Dharamsi 1
- Xuanzhao Dong 1
- Zhoutong Fu 1
- Jindong Gu 1
- Aman Gupta 1
- Tieke He 1
- Qianzi Hou 1
- Pin-Lun Hsu 1
- Zhijiang Li 1
- Xin Li 1
- Minqin Li 1
- Mengge Liu 1
- Rahul Mazumder 1
- Yuriy Nevmyvaka 1
- Natesh S. Pillai 1
- Peijie Qiu 1
- Anderson Schneider 1
- JinWei Shi 1
- Luke Simon 1
- Wanting Su 1
- Hao Wang 1
- Yalin Wang 1
- Qizhuo Xie 1
- Yujian Xiong 1
- Wenqian Yu 1
- Jianhua Zhao 1
- Tao Zheng 1
- Wenhui Zhu 1
- Sirou Zhu 1
- Siyu Zhu 1