Yutong Bai
2026
Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio
Kaixiong Gong | Kaituo Feng | Bohao Li | Yibing Wang | Mofan Cheng | Shijia Yang | Jiaming Han | Benyou Wang | Yutong Bai | Zhuoran Yang | Xiangyu Yue
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5/2.5 Pro, and Reka Core, have advanced audio-visual reasoning capabilities, achieving strong performance in tasks like cross-modal understanding and generation. However, our DeafTest uncovers unanticipated failures: most of the state-of-the-art MLLMs struggle with very simple audio tasks, such as distinguishing louder sounds or sound counting. This raises a fundamental question—does a deficiency in low-level audio perception constrain higher-level audio-visual reasoning? To address this, we introduce AV-Odyssey Bench—a comprehensive benchmark of 4,555 meticulously designed problems that integrate text, audio, and visual modalities. Each task requires models to unify cross-modal reasoning, leveraging synchronized audio-visual cues to infer solutions. By structuring questions as multiple-choice, we ensure objective, reproducible evaluations without reliance on subjective human or LLM-based judgments. Through comprehensive benchmarking of closed-source and open-source models, we showcase: (i) current MLLMs lack robust audio-visual integration ability and (ii) performance on DeafTest (Pearson’s r = 0.945) strongly correlates with AV-Odyssey accuracy. These findings challenge assumptions about models’ multimodal proficiency and highlight fundamental audio perception as a reasoning bottleneck. We believe that our results provide concrete guidance for future dataset design, alignment strategies, and architectures.
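The correlation reported in the abstract above is a Pearson coefficient between per-model scores on two benchmarks. A minimal sketch of that computation follows, assuming one accuracy value per model on each benchmark. The function name `pearson_r` and the numbers in the usage lines are hypothetical placeholders, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model accuracies (placeholders, NOT from the paper).
low_level_scores = [40.0, 55.0, 62.0, 70.0]
benchmark_scores = [26.0, 30.0, 33.0, 35.0]
r = pearson_r(low_level_scores, benchmark_scores)
```

A value of r near 1 would indicate that models scoring higher on the low-level test also tend to score higher on the benchmark; it does not by itself establish causation.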
2025
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Junyu Zhang | Runpei Dong | Han Wang | Xuying Ning | Haoran Geng | Peihao Li | Xialin He | Yutong Bai | Jitendra Malik | Saurabh Gupta | Huan Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper presents AlphaOne (𝛼1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. 𝛼1 first introduces the 𝛼 moment, which represents the scaled thinking phase with a universal parameter 𝛼. Within this scaled pre-𝛼 moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the 𝛼 moment, 𝛼1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate 𝛼1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/.
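The two-phase schedule described in the abstract above can be illustrated with a toy simulation. This is a rough sketch under stated assumptions, not the authors' implementation: the linearly decaying probability, the step granularity, the `p_start` default, and the names `linear_anneal` and `plan_transitions` are all hypothetical. Only the overall shape (Bernoulli draws before the 𝛼 moment, a deterministic end of thinking afterwards) comes from the abstract.

```python
import random


def linear_anneal(t, alpha_moment, p_start=0.5):
    """Hypothetical schedule: insertion probability decays linearly to 0."""
    return p_start * max(0.0, 1.0 - t / alpha_moment)


def plan_transitions(alpha_moment, seed=0, p_start=0.5):
    """Return one action per step up to and including the alpha moment.

    Before the alpha moment, each step draws a Bernoulli sample that decides
    whether a slow-thinking transition token would be inserted ("SLOW") or
    not ("CONTINUE"). At the alpha moment the plan ends deterministically
    with "END_THINKING", standing in for the end-of-thinking token.
    """
    if alpha_moment <= 0:
        raise ValueError("alpha_moment must be positive")
    rng = random.Random(seed)
    plan = []
    for t in range(alpha_moment):
        p = linear_anneal(t, alpha_moment, p_start)
        plan.append("SLOW" if rng.random() < p else "CONTINUE")
    plan.append("END_THINKING")
    return plan


plan = plan_transitions(alpha_moment=10)
```

In an actual decoder the steps would correspond to generated tokens or reasoning segments and the actions would edit the token stream; here they are only labels so the control flow can be inspected.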
2024
Learning Dynamic Multi-attribute Interest for Personalized Product Search
Yutong Bai | Zhicheng Dou | Ji-Rong Wen
Findings of the Association for Computational Linguistics: EMNLP 2024
Personalized product search aims to learn personalized preferences from search logs and adjust the ranking lists returned by engines. Previous studies have extensively explored extracting valuable features to build accurate interest profiles. However, they overlook that the user's attention varies across product attributes (e.g., brand, category). Users may especially prefer specific attributes or switch their preferences between attributes dynamically. Existing approaches instead mix all attribute features together and let the model automatically extract useful ones from rather complex scenarios. To solve this problem, we propose a dynamic multi-attribute interest learning model that captures how attributes influence user interests. Specifically, we design two interest profiling modules: attribute-centered and attribute-aware profiling. The former focuses on capturing the user's preferences on a single attribute, while the latter focuses on the interests correlated with multiple attributes within the search history. In addition, we devise a dynamic contribution weights strategy that sends explicit signals to the model to better determine the impacts of different attributes. Experimental results on large-scale datasets show that our model significantly improves over existing methods.