Xuande Feng
2025
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction
Hammad Ayyubi | Xuande Feng | Junzhang Liu | Xudong Lin | Zhecan Wang | Shih-Fu Chang
Findings of the Association for Computational Linguistics: NAACL 2025
The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combinatorially combine information from different clues, a web retriever to get external knowledge if the task can’t be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets – TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
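The abstract describes an expert pipeline built from distinct modules (perceiver, reasoner, combiner, web retriever, noise filter). The sketch below is only a rough, hypothetical illustration of how such stages could be composed; all class and function names are invented for this example and are not taken from the paper or its released code.

```python
# Hypothetical sketch of a modular expert pipeline in the spirit of PuzzleGPT.
# All names and heuristics below are illustrative stubs, not the authors' implementation.
from dataclasses import dataclass


@dataclass
class Clue:
    """A single visual clue (e.g. signage, clothing, architecture)."""
    kind: str
    description: str


@dataclass
class Candidate:
    """A candidate time or location prediction with a rough confidence score."""
    value: str
    score: float


def perceive(image_path: str) -> list[Clue]:
    """Perceiver: extract visual clues from the image (stubbed)."""
    return [Clue("text", "banner mentioning a city festival"),
            Clue("object", "vintage automobile on the street")]


def reason(clues: list[Clue]) -> list[Candidate]:
    """Reasoner: deduce prediction candidates from each clue (stubbed)."""
    return [Candidate("1950s", 0.6), Candidate("North America", 0.5)]


def combine(candidates: list[Candidate]) -> list[Candidate]:
    """Combiner: merge candidates from different clues, rewarding agreement."""
    merged: dict[str, float] = {}
    for c in candidates:
        merged[c.value] = merged.get(c.value, 0.0) + c.score
    return sorted((Candidate(v, s) for v, s in merged.items()),
                  key=lambda c: c.score, reverse=True)


def filter_noise(candidates: list[Candidate], min_score: float = 0.2) -> list[Candidate]:
    """Noise filter: drop low-confidence candidates for robustness."""
    return [c for c in candidates if c.score >= min_score]


def retrieve_if_needed(best: Candidate, threshold: float = 0.8) -> Candidate:
    """Web retriever: would consult external knowledge when local confidence is low."""
    # A real system would issue a web query here; this stub returns the candidate unchanged.
    return best


def predict(image_path: str) -> Candidate:
    clues = perceive(image_path)
    candidates = filter_noise(combine(reason(clues)))
    return retrieve_if_needed(candidates[0])


if __name__ == "__main__":
    print(predict("street_scene.jpg"))
```

In a full system, each stub would delegate to a vision model or an LLM prompt; the value of the design in the abstract lies in keeping the stages separate so the pipeline stays zero-shot and interpretable.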
2024
VIEWS: Entity-Aware News Video Captioning
Hammad Ayyubi | Tianqi Liu | Arsha Nagrani | Xudong Lin | Mingda Zhang | Anurag Arnab | Feng Han | Yukun Zhu | Xuande Feng | Kevin Zhang | Jialu Liu | Shih-Fu Chang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Existing popular video captioning benchmarks and models often produce generic captions for videos that lack specific identification of individuals, locations, or organizations (named entities). However, in the case of news videos, the setting is more demanding, requiring the inclusion of such named entities for meaningful summarization. Therefore, we introduce the task of directly summarizing news videos into captions that are entity-aware. To facilitate research in this area, we have collected a large-scale dataset named VIEWS (VIdeo NEWS). Within this task, we face challenges inherent to recognizing named entities and navigating diverse, dynamic contexts, all while relying solely on visual cues. To address these challenges, we propose a model-agnostic approach that enriches visual information extracted from videos with context sourced from external knowledge, enabling the generation of entity-aware captions. We validate the effectiveness of our approach across three video captioning models. Additionally, we conduct a critical analysis of our methodology to gain insights into the complexity of the task, the challenges it presents, and potential avenues for future research.
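The abstract describes a model-agnostic approach that enriches visual information from the video with externally retrieved context before captioning. As a loose illustration only, the sketch below shows that flow with stubbed components; the function names (detect_entities, retrieve_context, caption_model) are hypothetical and not from the paper.

```python
# Hypothetical, model-agnostic sketch of entity-aware news video captioning
# in the spirit of VIEWS. All functions are illustrative stubs.
def detect_entities(video_frames: list[str]) -> list[str]:
    """Recognize candidate named entities from visual cues alone (stubbed)."""
    return ["Eiffel Tower", "Paris"]


def retrieve_context(entities: list[str]) -> str:
    """Gather external knowledge about the detected entities (stubbed)."""
    return "The Eiffel Tower is a landmark in Paris, France."


def caption_model(visual_summary: str, context: str) -> str:
    """Stand-in for any off-the-shelf video captioning model."""
    return f"{visual_summary} Context: {context}"


def entity_aware_caption(video_frames: list[str]) -> str:
    entities = detect_entities(video_frames)
    context = retrieve_context(entities)
    visual_summary = f"News footage featuring {', '.join(entities)}."
    return caption_model(visual_summary, context)


if __name__ == "__main__":
    print(entity_aware_caption(["frame_001.jpg", "frame_002.jpg"]))
```

Because the enrichment step only changes the input handed to the captioner, the same wrapper can sit in front of different captioning models, which matches the model-agnostic claim in the abstract.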