Winston H. Hsu
2026
ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints
Pei-An Chen | Yongching Liang | Jia-Fong Yeh | Hung-Ting Su | Yi-Ting Chen | Min Sun | Winston H. Hsu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT (Affordance-Driven Adaptive Planning and Task execution), a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions
Hung-Ting Su | Ting-Jun Wang | Jia-Fong Yeh | Min Sun | Winston H. Hsu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Conventional Vision-and-Language Navigation (VLN) benchmarks assume instructions are feasible and the referenced target exists, leaving agents ill-equipped to handle false-premise goals. We introduce VLN-NF, a benchmark with false-premise instructions where the target is absent from the specified area and agents must navigate, gather evidence through in-room exploration, and explicitly output . VLN-NF is constructed via a scalable pipeline that rewrites VLN instructions using an LLM and verifies target absence with a VLM, producing plausible yet factually incorrect goals. We further propose REV-SPL to jointly evaluate room reaching, exploration coverage, and decision correctness. To address this challenge, we present ROAM, a two-stage hybrid that combines supervised room-level navigation with LLM/VLM-driven in-room exploration guided by a free-space clearance prior. ROAM achieves the best REV-SPL among compared methods, while baselines often under-explore and terminate prematurely under unreliable instructions. Code and data will be released upon acceptance.
2025
MovieCORE: COgnitive REasoning in Movies
Gueter Josmy Faure | Min-Hung Chen | Jia-Fong Yeh | Ying Cheng | Hung-Ting Su | Yung-Hao Tang | Shang-Hong Lai | Winston H. Hsu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.
Attention Tracker: Detecting Prompt Injection Attacks in LLMs
Kuo-Han Hung | Ching-Yun Ko | Ambrish Rawat | I-Hsin Chung | Winston H. Hsu | Pin-Yu Chen
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring its original instructions and executing a designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
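The detection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `focus_score` and `is_injection`, the choice of averaging over heads, the threshold of 0.5, and the toy attention values are all assumptions made for illustration. The sketch assumes that, for each selected "important" head, we already have the attention weights from the final token to every input position, and it flags an input when the share of attention on the original instruction drops.

```python
from typing import List, Sequence, Tuple


def focus_score(
    attn: Sequence[Sequence[float]],
    instruction_span: Tuple[int, int],
    heads: Sequence[int],
) -> float:
    """Average attention mass that the selected heads place on the instruction.

    attn[h][j] is the (hypothetical) attention weight from the last token
    to input position j for head h. instruction_span is a half-open
    (start, end) range of token positions covering the original instruction.
    """
    start, end = instruction_span
    per_head: List[float] = [sum(attn[h][start:end]) for h in heads]
    return sum(per_head) / len(per_head)


def is_injection(
    attn: Sequence[Sequence[float]],
    instruction_span: Tuple[int, int],
    heads: Sequence[int],
    threshold: float = 0.5,  # assumed value; would need tuning per model
) -> bool:
    """Flag the input when attention on the instruction falls below threshold."""
    return focus_score(attn, instruction_span, heads) < threshold


# Toy example: positions 0-1 hold the instruction, positions 2-3 hold the data.
benign = [[0.7, 0.2, 0.1, 0.0], [0.6, 0.2, 0.1, 0.1]]
attacked = [[0.1, 0.1, 0.4, 0.4], [0.05, 0.05, 0.5, 0.4]]

print(is_injection(benign, (0, 2), heads=[0, 1]))
print(is_injection(attacked, (0, 2), heads=[0, 1]))
```

In the toy data the attacked input shifts most attention away from the instruction positions, which is the distraction effect the abstract describes. How the important heads are chosen and how the threshold is set are left open here.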
2024
Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses
Hung-Ting Su | Ya-Ching Hsu | Xudong Lin | Xiang-Qian Shi | Yulei Niu | Han-Yuan Hsu | Hung-yi Lee | Winston H. Hsu
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) equipped with chain-of-thought (CoT) prompting have shown significant multi-step reasoning capabilities on factual content such as mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4’s performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT’s heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.