Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

Yumeng Shi, Quanyu Long, Wenya Wang


Abstract
Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To alleviate this, existing works propose to compress video inputs, but they often overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. We propose a novel token selection strategy, explore-then-select, that adaptively adjusts static and dynamic information based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. It then employs a query-aware attention-based metric to select the optimal token combination without model updates. Our framework is plug-and-play and can be seamlessly integrated into diverse video language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) on multiple video question answering benchmarks. Our code is available at https://github.com/ANDgate99/Explore-Then-Select.
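The two stages named in the abstract (explore token allocations, then select one with a query-aware score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate ratios, and the use of summed per-token attention as the selection metric are all assumptions made for illustration, and the per-token scores are assumed to be given rather than computed by a model.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in ascending index order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def explore_then_select(
    key_scores: Sequence[float],
    delta_scores: Sequence[float],
    budget: int,
    key_ratios: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Tuple[float, List[int], List[int]]:
    """Try several key/delta splits of a token budget and keep the best one.

    key_scores / delta_scores: hypothetical query-to-token attention scores
    for key-frame (static) and delta-frame (dynamic) tokens.
    The candidate metric here (sum of attention over kept tokens) is an
    assumption; the paper's actual metric may differ.
    """
    best = None
    for ratio in key_ratios:
        # Explore: split the budget between static and dynamic tokens.
        n_key = min(round(budget * ratio), len(key_scores))
        n_delta = min(budget - n_key, len(delta_scores))
        key_idx = top_k_indices(key_scores, n_key)
        delta_idx = top_k_indices(delta_scores, n_delta)
        # Score the candidate by the attention mass it retains.
        value = sum(key_scores[i] for i in key_idx)
        value += sum(delta_scores[i] for i in delta_idx)
        # Select: keep the highest-scoring candidate seen so far.
        if best is None or value > best[0]:
            best = (value, ratio, key_idx, delta_idx)
    return best[1], best[2], best[3]
```

Calling `explore_then_select(key_scores, delta_scores, budget)` returns the chosen key-frame ratio together with the indices of the retained key and delta tokens. No model weights are touched, which mirrors the abstract's claim that selection happens without model updates.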
Anthology ID:
2025.emnlp-main.545
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10770–10782
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.545/
Cite (ACL):
Yumeng Shi, Quanyu Long, and Wenya Wang. 2025. Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10770–10782, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering (Shi et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.545.pdf
Checklist:
 2025.emnlp-main.545.checklist.pdf