See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers

Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi


Abstract
Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language models (VLMs) for perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer’s outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. The results further indicate that the improvements generalize across modalities and that VLM evaluations are reasonably aligned with human preferences in our controlled settings, supporting the robustness and effectiveness of See2Refine for scalable action design.
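The closed loop described in the abstract (propose an action, have an evaluator judge it, feed the judgment back to the designer) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `Feedback`, `refine`, `toy_designer`, `toy_evaluator`, the 0-to-1 score scale, and the stopping threshold do not come from the paper. The abstract also does not say whether revision happens at inference time or through training, so this sketch simply assumes an inference-time loop, with plain functions standing in for the LLM designer and the VLM evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    score: float   # assumed scale: 0 (inappropriate) to 1 (appropriate)
    comment: str   # free-text critique returned by the evaluator


# Hypothetical interfaces: a designer maps (context, past attempts) to an action,
# an evaluator maps (context, action) to feedback.
History = List[Tuple[str, Feedback]]
Designer = Callable[[str, History], str]
Evaluator = Callable[[str, str], Feedback]


def refine(context: str, designer: Designer, evaluator: Evaluator,
           max_rounds: int = 3, target: float = 0.9):
    """Propose, evaluate, and revise until the score reaches `target`."""
    history: History = []
    best_action, best_fb = None, None
    for _ in range(max_rounds):
        action = designer(context, history)   # designer sees earlier feedback
        fb = evaluator(context, action)       # stand-in for the VLM judgment
        history.append((action, fb))
        if best_fb is None or fb.score > best_fb.score:
            best_action, best_fb = action, fb
        if fb.score >= target:
            break
    return best_action, best_fb, history


def toy_designer(context: str, history: History) -> str:
    # Stand-in for an LLM: changes its proposal once it has received feedback.
    if not history:
        return "lightbar: steady"
    return "lightbar: sweep toward pedestrian"


def toy_evaluator(context: str, action: str) -> Feedback:
    # Stand-in for a VLM: a fixed rule instead of judging rendered visuals.
    if "sweep" in action:
        return Feedback(0.95, "direction of yielding is clear")
    return Feedback(0.4, "intent is ambiguous")


action, fb, history = refine("pedestrian waiting at crosswalk",
                             toy_designer, toy_evaluator)
```

With these toy stand-ins, the first proposal scores below the threshold, the second one passes, and the loop stops after two rounds. In a real system the evaluator would presumably score a rendered view of the eHMI action rather than its text description.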
Anthology ID:
2026.acl-long.1044
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
22805–22822
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1044/
Cite (ACL):
Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, and Takeo Igarashi. 2026. See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22805–22822, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers (Xia et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1044.pdf
Checklist:
 2026.acl-long.1044.checklist.pdf