Arpan Phukan


2025

VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation
Arpan Phukan | Anupam Pandey | Deepjyoti Bodo | Asif Ekbal
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain’s strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model’s ability to generate coherent, contextually grounded, and reasoning-intensive questions. To facilitate future research, we publicly release our code and dataset.
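The abstract describes a modified BART backbone augmented with video embeddings. A minimal sketch of one way such conditioning could be wired up is shown below, assuming pre-extracted per-segment video features are projected into BART's hidden size and prepended to the token embeddings; the projection layer, feature dimension, and prepend-fusion strategy are illustrative assumptions, not the released VideoChain implementation:

```python
# Minimal sketch: conditioning BART on video features by prepending projected
# video embeddings to the token embeddings. Fusion strategy, dimensions, and
# the projection layer are illustrative assumptions, not VideoChain's code.
import torch
import torch.nn as nn
from transformers import BartTokenizer, BartForConditionalGeneration

class VideoConditionedBart(nn.Module):
    def __init__(self, video_dim=2048, model_name="facebook/bart-base"):
        super().__init__()
        self.bart = BartForConditionalGeneration.from_pretrained(model_name)
        hidden = self.bart.config.d_model
        # Project pre-extracted video features into BART's embedding space.
        self.video_proj = nn.Linear(video_dim, hidden)

    def forward(self, input_ids, attention_mask, video_feats, labels=None):
        # Token embeddings from BART's shared embedding matrix.
        tok_emb = self.bart.model.shared(input_ids)
        vid_emb = self.video_proj(video_feats)            # (B, n_segments, hidden)
        inputs_embeds = torch.cat([vid_emb, tok_emb], dim=1)
        vid_mask = torch.ones(vid_emb.shape[:2], dtype=attention_mask.dtype,
                              device=attention_mask.device)
        attn = torch.cat([vid_mask, attention_mask], dim=1)
        return self.bart(inputs_embeds=inputs_embeds,
                         attention_mask=attn, labels=labels)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = VideoConditionedBart()
enc = tokenizer(["segment 1 subtitles ; segment 2 subtitles"],
                return_tensors="pt", padding=True)
video_feats = torch.randn(1, 2, 2048)   # two temporally separated segments
out = model(enc.input_ids, enc.attention_mask, video_feats,
            labels=enc.input_ids)
print(out.loss)
```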

2024

ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos
Arpan Phukan | Manish Gupta | Asif Ekbal
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending “People Also Ask” questions, video-based chatbots, and fact-checking. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals. Further, to the best of our knowledge, there does not exist a large-scale dataset for this task. Most video question generation datasets are on TV shows, movies, or human activities or lack entity-centric information-seeking questions. Hence, we contribute a diverse dataset of YouTube videos, VideoQuestions, consisting of 411 videos with 2265 manually annotated questions. We further propose a model architecture combining Transformers, rich context signals (titles, transcripts, captions, embeddings), and a combination of cross-entropy and contrastive loss function to encourage entity-centric question generation. Our best method yields BLEU, ROUGE, CIDEr, and METEOR scores of 71.3, 78.6, 7.31, and 81.9, respectively, demonstrating practical usability. We make the code and dataset publicly available.
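The abstract mentions combining cross-entropy with a contrastive loss to encourage entity-centric question generation. A small PyTorch sketch of such a combined objective follows; the temperature, loss weight, and in-batch-negative scheme are assumptions for illustration rather than the paper's exact formulation:

```python
# Sketch of a combined cross-entropy + contrastive objective.
# Weighting, temperature, and in-batch-negative scheme are illustrative
# assumptions, not the ECIS-VQG paper's exact loss.
import torch
import torch.nn.functional as F

def combined_loss(gen_logits, target_ids, question_emb, entity_emb,
                  temperature=0.07, alpha=0.5, pad_id=0):
    # Standard token-level cross-entropy for question generation.
    ce = F.cross_entropy(gen_logits.view(-1, gen_logits.size(-1)),
                         target_ids.view(-1), ignore_index=pad_id)

    # InfoNCE-style contrastive term: each question embedding should be
    # closest to its own entity embedding, with the other entities in the
    # batch acting as negatives.
    q = F.normalize(question_emb, dim=-1)        # (B, d)
    e = F.normalize(entity_emb, dim=-1)          # (B, d)
    logits = q @ e.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    contrastive = F.cross_entropy(logits, targets)

    return ce + alpha * contrastive

# Toy usage with random tensors.
B, T, V, d = 4, 12, 30522, 256
loss = combined_loss(torch.randn(B, T, V), torch.randint(1, V, (B, T)),
                     torch.randn(B, d), torch.randn(B, d))
print(loss.item())
```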

Hope ‘The Paragraph Guy’ explains the rest : Introducing MeSum, the Meme Summarizer
Anas Anwarul Haq Khan | Tanik Saikh | Arpan Phukan | Asif Ekbal
Findings of the Association for Computational Linguistics: EMNLP 2024

2023

QeMMA: Quantum-Enhanced Multi-Modal Sentiment Analysis
Arpan Phukan | Asif Ekbal
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Multi-modal data analysis presents formidable challenges, as developing effective methods to capture correlations among different modalities remains an ongoing pursuit. In this study, we address multi-modal sentiment analysis through a novel quantum perspective. We propose that quantum principles, such as superposition, entanglement, and interference, offer a more comprehensive framework for capturing not only the cross-modal interactions between text, acoustics, and visuals but also the intricate relations within each modality. To empirically evaluate our approach, we employ the CMU-MOSEI dataset as our testbed and utilize Qiskit by IBM to run our experiments on a quantum computer. Our proposed Quantum-Enhanced Multi-Modal Analysis Framework (QeMMA) showcases its significant potential by surpassing the baseline by 3.52% and 10.14% in terms of accuracy and F1 score, respectively, highlighting the promise of quantum-inspired methodologies.
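As a toy illustration of how superposition and entanglement can be applied to tri-modal features, the Qiskit sketch below angle-encodes one scalar per modality on its own qubit and entangles the qubits before reading out basis-state probabilities; the encoding, circuit depth, and readout are assumptions and do not reproduce QeMMA's actual circuit:

```python
# Toy Qiskit sketch: angle-encode one scalar per modality (text, audio, video)
# on its own qubit, entangle the qubits, and read out the joint statistics.
# The encoding, circuit depth, and readout are illustrative assumptions,
# not QeMMA's actual architecture.
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def multimodal_circuit(text_feat, audio_feat, video_feat):
    qc = QuantumCircuit(3)
    # Superposition + angle encoding: map each normalized modality feature
    # to a rotation about the Y axis.
    for qubit, feat in enumerate((text_feat, audio_feat, video_feat)):
        qc.h(qubit)
        qc.ry(np.pi * feat, qubit)
    # Entangle modalities so their amplitudes interfere jointly.
    qc.cx(0, 1)
    qc.cx(1, 2)
    return qc

qc = multimodal_circuit(0.8, 0.3, 0.5)        # features scaled to [0, 1]
probs = Statevector.from_instruction(qc).probabilities_dict()
# The probabilities over the 3-qubit basis states could feed a classical
# classifier head for the final sentiment prediction.
print(probs)
```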