Zoe Wanying He
2026
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
Yujie Chen | Tailai Chen | Yifeng Gao | Zoe Wanying He | Yijue Xu | Shaobo Wang | Linfeng Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yujie Chen | Tailai Chen | Yifeng Gao | Zoe Wanying He | Yijue Xu | Shaobo Wang | Linfeng Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git .
2025
Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
Zoe Wanying He | Sean Trott | Meenakshi Khosla
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Zoe Wanying He | Sean Trott | Meenakshi Khosla
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent studies show that deep vision-only and language-only models—trained on disjoint modalities—nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of _where_ in each network this convergence emerges, _what_ visual or linguistic cues support it, _whether_ it captures human preferences in many-to-many image-text scenarios, and _how_ aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice “Pick-a-Pic” task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.