Hannah Brown
2026
ComicVQA: A Benchmark for Visual Reasoning in Multimodal LLMs
Esther Gan | Hannah Brown | David Herel | Kenji Kawaguchi | Min-Yen Kan | Michael Qizhe Shieh
Findings of the Association for Computational Linguistics: ACL 2026
We introduce Comic Visual Question Answering (ComicVQA), a comics-based benchmark for evaluating MLLMs on visual reasoning. ComicVQA comprises (i) Missing Panel Prediction, which tests fine-grained visual grounding, and (ii) Panel Sorting, which evaluates sequential narrative understanding. Proprietary models achieve up to 62.6% on Missing Panel Prediction and 46.4% on Panel Sorting, whereas open-source models reach only 47.7% and 26.9%, respectively. In contrast, human annotators achieve over 83% accuracy on both tasks, revealing a large gap between current models and human-level multimodal understanding in comics. Through controlled ordering ablations and a detailed error taxonomy, we show that current MLLMs rely primarily on coarse temporal cues and struggle with fine-grained visual reasoning. These findings establish ComicVQA as a diagnostic benchmark for advancing multimodal visual reasoning in comics.
2024
Prompt Optimization via Adversarial In-Context Learning
Xuan Long Do | Yiran Zhao | Hannah Brown | Yuxi Xie | James Xu Zhao | Nancy F. Chen | Kenji Kawaguchi | Michael Shieh | Junxian He
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompts for in-context learning (ICL). Inspired by adversarial learning, adv-ICL is implemented as a two-player game between a generator and a discriminator, with LLMs acting as both. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator then classifies the generator’s input-output pair as model-generated or real data. Based on the discriminator’s loss, a prompt modifier LLM proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that applying adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open- and closed-source models on 13 generation and classification tasks, including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and BIG-Bench Hard benchmarks. In addition, our method is computationally efficient, easily extensible to other LLMs and tasks, and effective in low-resource settings.
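The round structure described in the adv-ICL abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: a GAN-style objective, a minimax update order (discriminator prompt first, then generator prompt), and toy stand-ins for the three LLM roles. All names here (`adversarial_loss`, `adv_icl_round`, `toy_generate`, `toy_p_real`, `toy_propose`) are hypothetical.

```python
import math

def adversarial_loss(gen_prompt, disc_prompt, real_pairs, generate, p_real):
    """Assumed GAN-style objective: mean of log D(real) + log(1 - D(generated))."""
    eps = 1e-6  # clamp probabilities so the logs stay finite
    total = 0.0
    for x, y_real in real_pairs:
        y_fake = generate(gen_prompt, x)
        d_real = min(max(p_real(disc_prompt, x, y_real), eps), 1 - eps)
        d_fake = min(max(p_real(disc_prompt, x, y_fake), eps), 1 - eps)
        total += math.log(d_real) + math.log(1 - d_fake)
    return total / len(real_pairs)

def adv_icl_round(gen_prompt, disc_prompt, real_pairs, generate, p_real, propose_edits):
    """One round: keep the prompt edits that most improve each player's objective."""
    # Assumption: the discriminator prompt is chosen to maximize the loss ...
    disc_candidates = [disc_prompt] + propose_edits(disc_prompt)
    disc_prompt = max(
        disc_candidates,
        key=lambda d: adversarial_loss(gen_prompt, d, real_pairs, generate, p_real),
    )
    # ... and the generator prompt is then chosen to minimize it.
    gen_candidates = [gen_prompt] + propose_edits(gen_prompt)
    gen_prompt = min(
        gen_candidates,
        key=lambda g: adversarial_loss(g, disc_prompt, real_pairs, generate, p_real),
    )
    return gen_prompt, disc_prompt

# Toy stand-ins for the generator, discriminator, and prompt-modifier LLMs.
def toy_generate(prompt, x):
    # Pretend the generator only solves the task once its prompt says "uppercase".
    return x.upper() if "uppercase" in prompt else x

def toy_p_real(prompt, x, y):
    # Pretend the discriminator only becomes informative once told to "check case".
    if "check case" in prompt:
        return 0.9 if y == x.upper() else 0.1
    return 0.5

def toy_propose(prompt):
    # A real prompt modifier would be an LLM call; here it is two fixed edits.
    return [prompt + " uppercase", prompt + " check case"]

real_pairs = [("abc", "ABC"), ("xy", "XY")]
new_gen, new_disc = adv_icl_round(
    "Answer:", "Real or fake?", real_pairs, toy_generate, toy_p_real, toy_propose
)
print(new_gen)
print(new_disc)
```

In this toy setting the discriminator first adopts the edit that lets it separate real from generated outputs, and the generator then adopts the edit that makes its outputs indistinguishable from the real ones. In practice each stand-in would be replaced by an actual LLM call, and the candidate edits would cover both task instructions and exemplars.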