@inproceedings{kulkarni-fazli-2025-videopasta,
    title = "{V}ideo{PASTA}: 7{K} Preference Pairs That Matter for Video-{LLM} Alignment",
    author = "Kulkarni, Yogesh and
      Fazli, Pooyan",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1647/",
    pages = "32342--32367",
    isbn = "979-8-89176-332-6",
    abstract = "Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully crafted adversarial examples that deliberately violate spatial, temporal, or cross-frame relationships. With only 7,020 preference pairs and Direct Preference Optimization, VideoPASTA enables models to learn robust representations that capture fine-grained spatial details and long-range temporal dynamics. Experiments demonstrate that VideoPASTA is model agnostic and significantly improves performance, for example, achieving gains of up to +3.8 percentage points on LongVideoBench, +4.1 on VideoMME, and +4.0 on MVBench, when applied to various state-of-the-art Video-LLMs. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without any human annotation or captioning, relying solely on 32-frame sampling. This efficiency makes our approach a scalable plug-and-play solution that seamlessly integrates with existing models while preserving their original capabilities."
}
Markdown (Informal)
[VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment](https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1647/) (Kulkarni & Fazli, EMNLP 2025)
ACL