Seokgyu Jang
2026
Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems
Namyeong So | Seokgyu Jang | Taeuk Kim
Findings of the Association for Computational Linguistics: ACL 2026
Adaptive multi-agent systems (MAS) are increasingly adopted as solutions to complex problems. However, their optimization for narrow task ranges leaves it unclear whether they can function as general-purpose systems. To fill this gap, we conduct an extensive empirical study on adaptive MAS, revealing two key findings: (1) they are prone to topological overfitting, defined as failures in domain transfer; and (2) they exhibit illusory coordination, where surface-level accuracy is high but underlying agent coordination deviates from ideal MAS behavior, raising concerns about their practical effectiveness. These observations highlight the urgent need to prioritize generalization in MAS development and motivate more thorough evaluation beyond correctness of the final answer.
OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning
Seunghee Kim | Ingyu Bang | Seokgyu Jang | Changhyeon Kim | Sanghwan Bae | Jihun Choi | Richeng Xuan | Taeuk Kim
Findings of the Association for Computational Linguistics: ACL 2026
Multimodal Large Language Models (MLLMs) have increasingly supported omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.