Jiayi Kuang
2025
Browsing Like Human: A Multimodal Web Agent with Experiential Fast-and-Slow Thinking
Haohao Luo | Jiayi Kuang | Wei Liu | Ying Shen | Jian Luan | Yang Deng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automating web navigation, which aims to build a web agent that follows user instructions to complete tasks such as booking flights by interacting with websites, has received increasing attention due to its practical value. Although existing web agents are mostly equipped with visual perception, planning, and memory abilities, their reasoning processes still deviate from human cognition. In this work, we study human thought patterns to empower agents with more human-like abilities in web navigation. To tackle this problem, we propose a novel multimodal web agent framework called WebExperT, which is designed to emulate the human planning process of “thinking fast and slow” to effectively decompose complex user instructions. Furthermore, WebExperT leverages experiential learning, reflecting on failures to continuously refine its planning and decision-making outcomes. Experimental results on the Mind2Web benchmark demonstrate the superiority of WebExperT in both supervised and unsupervised settings.
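Below is a minimal, hypothetical sketch of the “fast-and-slow” planning loop with reflection from failure that the abstract describes. The function and class names (fast_plan, slow_plan, execute_step, ExperienceMemory) and the toy success criterion are illustrative assumptions, not WebExperT's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperienceMemory:
    """Stores reflections on failed attempts so later plans can condition on them."""
    reflections: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.reflections.append(note)


def fast_plan(instruction: str) -> List[str]:
    """'Fast thinking': a cheap, heuristic decomposition of the instruction."""
    return [part.strip() for part in instruction.split(",") if part.strip()]


def slow_plan(instruction: str, memory: ExperienceMemory) -> List[str]:
    """'Slow thinking': deliberate re-planning that takes stored failures into account."""
    constraints = [f"note: {r}" for r in memory.reflections]
    return constraints + fast_plan(instruction)


def execute_step(step: str, attempt: int) -> bool:
    """Stand-in for interacting with the website; here a step involving 'book'
    only succeeds after at least one round of reflection, to exercise the loop."""
    return attempt > 0 or "book" not in step


def run_agent(instruction: str, max_retries: int = 2) -> bool:
    memory = ExperienceMemory()
    plan = fast_plan(instruction)                 # first pass: fast thinking
    for attempt in range(max_retries + 1):
        if all(execute_step(step, attempt) for step in plan):
            return True                           # task completed
        memory.add(f"attempt {attempt} failed with plan {plan}")
        plan = slow_plan(instruction, memory)     # re-plan slowly after reflecting
    return False


if __name__ == "__main__":
    print(run_agent("open airline site, search flights, book the cheapest fare"))
```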
Express What You See: Can Multimodal LLMs Decode Visual Ciphers with Intuitive Semiosis Comprehension?
Jiayi Kuang | Yinghui Li | Chen Wang | Haohao Luo | Ying Shen | Wenhao Jiang
Findings of the Association for Computational Linguistics: ACL 2025
Bridging the gap between vision and language remains a pivotal challenge for the multimodal community. Traditional VQA benchmarks suffer from a modality gap and an over-reliance on language priors, whereas human cognition excels at intuitive semiosis, associating abstract visual symbols with linguistic semantics. Inspired by this neurocognitive mechanism, we focus on emojis, visual ciphers that convey abstract textual semantics. Specifically, we propose a novel task of generating abstract linguistic expressions from emoji sequence images; such reasoning underpins critical applications in cryptography and thus challenges MLLMs to decode the complex semantics of visual ciphers. We introduce eWe-bench (Express What you SeE) to assess MLLMs’ capability for human-like intuitive semiosis. Our data construction framework ensures high visual sensitivity and data quality and can be extended to future data enhancement. Evaluation results on advanced MLLMs highlight critical deficiencies in visual intuitive symbolic reasoning. We believe our insights into advancing visual semiosis in MLLMs will pave the way for cryptographic analysis and high-level intuitive cognitive intelligence in MLLMs.
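A minimal sketch of how an eWe-bench-style evaluation could be scored is given below: a multimodal model is asked to express the phrase encoded by an emoji-sequence image, and its output is compared against a reference. The item format, the placeholder model call, and the surface-similarity metric are assumptions for illustration, not the benchmark's actual protocol.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def decode_emoji_image(image_path: str) -> str:
    """Placeholder for querying a multimodal LLM with an emoji-sequence image,
    e.g. mllm.generate(image=image_path, prompt='What phrase do these emojis express?')."""
    return "break the ice"


def phrase_similarity(prediction: str, reference: str) -> float:
    """Toy surface-level metric; a real evaluation would use task-specific scoring."""
    return SequenceMatcher(None, prediction.lower(), reference.lower()).ratio()


def evaluate(items: List[Dict[str, str]]) -> float:
    """Average similarity between decoded phrases and gold references."""
    scores = []
    for item in items:
        prediction = decode_emoji_image(item["image"])
        scores.append(phrase_similarity(prediction, item["reference"]))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    benchmark = [{"image": "hammer_ice.png", "reference": "break the ice"}]
    print(f"mean similarity: {evaluate(benchmark):.2f}")
```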