Jincai Huang
2026
SAM3-I: Segment Anything with Instructions
Jingjing Li | Yue Feng | Yuchen Guo | Jincai Huang | Wei Ji | Qi Bi | Yongri Piao | Miao Zhang | Xiaoqi Zhao | Qiang Chen | Shihao Zou | Huchuan Lu | Li Cheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3’s vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.
2025
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space
Yong Zhao | Kai Xu | Zhengqiu Zhu | Yue Hu | Zhiheng Zheng | Yingfeng Chen | Yatai Ji | Chen Gao | Yong Li | Jincai Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings (spanning environment, action, and perception) largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming frontier-based baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/tsinghua-fib-lab/CityEQA.git.