Vernon Toh

2025

pdf bib abs
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
Qi Sun | Pengfei Hong | Tej Deep Pala | Vernon Toh | U-Xuan Tan | Deepanway Ghosal | Soujanya Poria
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, EMMA-X. EMMA-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination in grounding subtask reasoning generation. Experimental results demonstrate that EMMA-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.

pdf bib abs
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
Tej Deep Pala | Vernon Toh | Rishabh Bhardwaj | Soujanya Poria
Findings of the Association for Computational Linguistics: EMNLP 2025

As large language models (LLMs) are increasingly integrated into real-world applications, ensuring their safety and robustness is critical. Automated red-teaming methods generate adversarial attacks to identify vulnerabilities, but existing approaches often face challenges like slow performance, limited categorical diversity, and high resource demands. We propose Ferret, a novel method that enhances the baseline, Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and ranking them using scoring functions such as reward models, Llama Guard, and LLM-as-a-judge. Ferret achieves a 95% attack success rate (ASR), a 46% improvement over baseline, and reduces time to a 90% ASR by 15.2%. Additionally, it generates transferable adversarial prompts effective on larger LLMs. Our code is available at https://github.com/declare-lab/ferret

pdf bib abs
AlgoPuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Algorithmic Multimodal Puzzles
Deepanway Ghosal | Vernon Toh | Yew Ken Chia | Soujanya Poria
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate both visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, search, etc., aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculations. It ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that multimodal language models such as GPT4V and Gemini exhibit limited performance in puzzle-solving tasks. We find that their performance is near random in a multi-choice question-answering setup for a significant number of puzzles. The findings emphasize the challenges of integrating visual, language, and algorithmic knowledge for solving complex reasoning problems.

2024

pdf bib abs
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
Yew Ken Chia | Vernon Toh | Deepanway Ghosal | Lidong Bing | Soujanya Poria
Findings of the Association for Computational Linguistics: ACL 2024

Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of 2000 puzzle instances based on abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they are not able to generalize well to simple abstract patterns. Notably, GPT-4V achieves a score of 46.4% on single-concept puzzles, which shows that state-of-the-art models struggle on our dataset. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future.

Co-authors

Lidong Bing 1

Pengfei Hong 1

Qi Sun 1

U-Xuan Tan 1

Venues

Fix data

Vernon Toh

Fixing paper assignments

2025

2024

Co-authors

Venues