Mehrab Tanjim


2025

pdf bib
VISIAR: Empower MLLM for Visual Story Ideation
Zhaoyang Xia | Somdeb Sarkhel | Mehrab Tanjim | Stefano Petrangeli | Ishita Dasgupta | Yuxiao Chen | Jinxuan Xu | Di Liu | Saayan Mitra | Dimitris N. Metaxas
Findings of the Association for Computational Linguistics: ACL 2025

Ideation, the process of forming ideas from concepts, is a big part of the content creation process. However, the noble goal of helping visual content creators by suggesting meaningful sequences of visual assets from a limited collection is challenging. It requires a nuanced understanding of visual assets and the integration of open-world knowledge to support creative exploration. Despite its importance, this task has yet to be explored fully in existing literature. To fill this gap, we propose Visual Story Ideation, a novel and underexplored task focused on the automated selection and arrangement of visual assets into coherent sequences that convey expressive storylines.We also present VISIAR, Visual Ideation through Sequence Integration and Asset Rearrangement, a robust framework leveraging Multimodal Large Language Models (MLLMs), and a novel Story Graph mechanism. Our framework operates in three key stages: visual content understanding, candidate asset selection, and asset rearrangement via MLLMs. In addition, we curated a new benchmark dataset, called VTravel, to evaluate our methods both qualitatively and quantitatively.User studies and GPT-as-the-judge evaluation show that our approach surpasses GPT-4o based baseline by an average of 33.5% and 18.5% across three different metrics, demonstrating the effectiveness of our framework for generating compelling visual stories.

pdf bib
GUI Agents: A Survey
Dang Nguyen | Jian Chen | Yu Wang | Gang Wu | Namyong Park | Zhengmian Hu | Hanjia Lyu | Junda Wu | Ryan Aponte | Yu Xia | Xintong Li | Jing Shi | Hongjie Chen | Viet Dac Lai | Zhouhang Xie | Sungchul Kim | Ruiyi Zhang | Tong Yu | Mehrab Tanjim | Nesreen K. Ahmed | Puneet Mathur | Seunghyun Yoon | Lina Yao | Branislav Kveton | Jihyung Kil | Thien Huu Nguyen | Trung Bui | Tianyi Zhou | Ryan A. Rossi | Franck Dernoncourt
Findings of the Association for Computational Linguistics: ACL 2025

Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.

pdf bib
Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering
Yeonjun In | Sungchul Kim | Ryan A. Rossi | Mehrab Tanjim | Tong Yu | Ritwik Sinha | Chanyoung Park
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The retrieval augmented generation (RAG) framework addresses an ambiguity in user queries in QA systems by retrieving passages that cover all plausible interpretations and generating comprehensive responses based on the passages. However, our preliminary studies reveal that a single retrieval process often suffers from low-quality results, as the retrieved passages frequently fail to capture all plausible interpretations. Although the iterative RAG approach has been proposed to address this problem, it comes at the cost of significantly reduced efficiency. To address these issues, we propose the diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved passages to encompass diverse interpretations. Subsequently, DIVA verifies the quality of the passages and adapts the most suitable approach tailored to their quality. This approach improves the QA systems’ accuracy and robustness by handling low quality retrieval issue in ambiguous questions, while enhancing efficiency.

pdf bib
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes
Isabel O. Gallegos | Ryan Aponte | Ryan A. Rossi | Joe Barrow | Mehrab Tanjim | Tong Yu | Hanieh Deilamsalehy | Ruiyi Zhang | Sungchul Kim | Franck Dernoncourt | Nedim Lipka | Deonna Owens | Jiuxiang Gu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Large language models (LLMs) have shown remarkable advances in language generation and understanding but are also prone to exhibiting harmful social biases. While recognition of these behaviors has generated an abundance of bias mitigation techniques, most require modifications to the training data, model parameters, or decoding strategy, which may be infeasible without access to a trainable model. In this work, we leverage the zero-shot capabilities of LLMs to reduce stereotyping in a technique we introduce as zero-shot self-debiasing. With two approaches, self-debiasing via explanation and self-debiasing via reprompting, we show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups while relying only on the LLM itself and a simple prompt, with explanations correctly identifying invalid assumptions and reprompting delivering the greatest reductions in bias. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.