Yaonan Gu
2025
IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web
Hongcheng Guo | Wei Zhang | Junhao Chen | Yaonan Gu | Jian Yang | Junjia Du | Shaosheng Cao | Binyuan Hui | Tianyu Liu | Jianxin Ma | Chang Zhou | Zhoujun Li
Findings of the Association for Computational Linguistics: ACL 2025
Recent advances in large multimodal models have substantially improved image comprehension. Even so, there is no robust benchmark specifically for assessing how well these models convert images to web code. Such a benchmark must check the integrity of the generated web elements, which fall into visible and invisible categories. Previous evaluation methods (e.g., BLEU) are highly sensitive to invisible elements, whose presence can change scores substantially. It is also crucial to measure the layout information of web pages, i.e., the positional relationships between elements, which prior work has overlooked. To address these challenges, we have curated and aligned a benchmark of images and corresponding web code (IW-Bench). Specifically, we propose Element Accuracy, which tests the completeness of elements by parsing the Document Object Model (DOM) tree. We also introduce Layout Accuracy, which analyzes positional relationships by converting the DOM tree into a common subsequence. In addition, we design a five-hop multimodal Chain-of-Thought prompting strategy for improved performance, consisting of: 1) SoM prompt injection, 2) inferring elements, 3) inferring layout, 4) inferring web code, and 5) reflection. Our benchmark comprises 1,200 image-code pairs with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, providing insights into their performance and identifying areas for improvement in the image-to-web domain.
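The two metrics named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact definitions, so the code below assumes that element completeness can be approximated by the fraction of reference tags that also appear in the generated page, and that layout agreement can be approximated by the longest common subsequence (LCS) of the two tag sequences in document order. The names `tag_sequence`, `element_recall`, `lcs_length`, and `layout_order_score` are hypothetical and chosen only for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order (a flattened DOM walk)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def element_recall(reference, generated):
    """Assumed proxy for element completeness: the share of reference
    tags (counted as a multiset) that the generated page also contains."""
    ref = Counter(tag_sequence(reference))
    gen = Counter(tag_sequence(generated))
    total = sum(ref.values())
    if total == 0:
        return 1.0
    matched = sum((ref & gen).values())  # multiset intersection
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def layout_order_score(reference, generated):
    """Assumed proxy for layout agreement: LCS of the two tag sequences,
    normalized by the length of the reference sequence."""
    ref = tag_sequence(reference)
    gen = tag_sequence(generated)
    if not ref:
        return 1.0
    return lcs_length(ref, gen) / len(ref)


if __name__ == "__main__":
    reference = "<div><h1>T</h1><p>a</p><img src='x.png'></div>"
    generated = "<div><p>a</p><h1>T</h1></div>"
    print(element_recall(reference, generated))      # 0.75
    print(layout_order_score(reference, generated))  # 0.5
```

In the usage example, the generated page drops the `img` element and swaps the heading and paragraph. The element score only penalizes the missing tag, while the order-based score also penalizes the swap, which is the kind of distinction between element completeness and layout that the abstract describes.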