Feifei Feng
2025
On Support Samples of Next Word Prediction
Yuqian Li | Yupei Du | Yufang Liu | Feifei Feng | Mou Xiao Feng | Yuanbin Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates data-centric interpretability in language models, focusing on the next-word prediction task. Using the representer theorem, we identify two types of support samples: those that either promote or deter specific predictions. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation formation. These insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.
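As a rough illustration of the decomposition this abstract alludes to, the representer theorem for a model with a linear, L2-regularized output layer lets a next-word logit be written as a weighted sum of training-sample contributions; the sign of each contribution separates samples that promote a prediction from those that deter it, and (near-)zero weights mark non-support samples. The sketch below follows the standard representer-point formulation; the symbols (φ for features, λ for the regularization strength, ℓ for the per-sample loss, n for the training-set size) are assumptions and not necessarily the paper's notation.

```latex
% Hedged sketch of a representer-theorem decomposition of the logit f_w(x_t)
% for word w on test context x_t, assuming a linear output layer trained with
% L2 regularization (strength \lambda) over n training samples.
\[
  f_w(x_t) \;=\; \sum_{i=1}^{n} \alpha_{i,w}\, \phi(x_i)^{\top} \phi(x_t),
  \qquad
  \alpha_{i,w} \;=\; -\frac{1}{2\lambda n}\,
  \frac{\partial \ell\big(f(x_i), y_i\big)}{\partial f_w(x_i)} .
\]
% A positive contribution \alpha_{i,w}\,\phi(x_i)^{\top}\phi(x_t) promotes
% predicting w, a negative one deters it, and samples whose contribution is
% (near-)zero correspond to the non-support samples discussed above.
```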
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
Zhongyi Zhou | Yichen Zhu | Minjie Zhu | Junjie Wen | Ning Liu | Zhiyuan Xu | Weibin Meng | Yaxin Peng | Chaomin Shen | Feifei Feng | Yi Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can’t large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action (VLA) models, we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art VLA methods on multimodal understanding benchmarks. Notably, it achieves six times higher performance on MMMU and scores 47.2% on MMStar with a more parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates superior performance on 25 real-world robot manipulation tasks compared to existing VLA methods such as OpenVLA. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
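The Mixture-of-Experts idea mentioned in the abstract can be pictured, very roughly, as separate feed-forward experts for understanding and for control that a learned router weighs per token, so the two objectives share a backbone while interfering less. The following is a minimal, hypothetical sketch under that assumption; module names, sizes, and the two-expert softmax routing are illustrative choices, not ChatVLA's actual architecture or code.

```python
# Hedged, minimal sketch of a two-expert Mixture-of-Experts layer:
# one expert per task family (multimodal understanding vs. robot control),
# mixed per token by a learned router. Illustrative only.
import torch
import torch.nn as nn


class TwoExpertMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        # One feed-forward expert per task family.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(2)
        )
        # Router produces per-token mixture weights over the two experts.
        self.router = nn.Linear(d_model, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        weights = torch.softmax(self.router(x), dim=-1)                  # (B, T, 2)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, d_model, 2)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)          # weighted sum of experts


if __name__ == "__main__":
    layer = TwoExpertMoE()
    tokens = torch.randn(2, 16, 512)
    print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```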