Yuhao Li
2026
HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
Chengyu Du | Xintao Wang | Aili Chen | Weiyuan Li | Rui Xu | Junteng Liu | Zishan Huang | Rong Tian | Zijun Sun | Yuhao Li | Liheng Feng | Deming Ding | Pengyu Zhao | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
LLM role-playing, i.e., using large language models (LLMs) to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a non-trivial challenge. Towards cognitive simulation in LLM role-play, previous efforts have mainly suffered from two critical deficiencies: the lack of high-quality datasets with explicit reasoning traces and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER (Human Emulation Reasoning), a unified framework for cognitive-level persona simulation. HER introduces a dual-layer thinking mechanism that strictly distinguishes characters’ first-person thinking processes from LLMs’ third-person reasoning. To bridge the aforementioned gaps, we curate a reasoning-augmented role-playing dataset via a reverse engineering strategy for supervised learning, and construct human-aligned evaluation principles and preference-based reward models for role-play reinforcement learning. Leveraging these resources, we train HER models based on the Qwen3-32B backbone via a hybrid paradigm of supervised learning (SL) and reinforcement learning from human feedback (RLHF). Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26% improvement on the CoSER benchmark and a 14.97% improvement on the MiniMax Benchmark. Our datasets, evaluation principles, and trained models will be released to facilitate future research in cognitive-level LLM role-playing.
Example Quality Matters: Multi-Aspects Example Augmentation for Private Library Programming
Yuhao Li | Haifeng Sun | Xuesong Zhang | Shu Yao | Haoyu Zheng | Yvchuan Wang | Huazheng Wang | Zirui Zhuang | Qi Qi | Jianxin Liao | Jingyu Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in large language models (LLMs) have significantly improved code-generation capabilities, particularly through retrieval-augmented generation (RAG) for private libraries. While RAG leverages API documentation to address the scarcity of private code corpora, its performance critically depends on the quality of retrieved examples. Existing approaches often overlook the intrinsic characteristics of these examples, particularly how factors such as complexity, readability, and correctness impact their effectiveness. In this study, we systematically investigate these three critical aspects—complexity, readability, and correctness—and find that optimal examples should exhibit moderate complexity, semantic correctness, and step-by-step execution patterns. Based on these findings, we propose ComboPrompt, a novel example enhancement method that strategically combines existing API examples to improve complexity, refines code structure for readability, and incorporates automated validation ensuring correctness. Extensive evaluations across five private library benchmarks and different LLMs demonstrate that ComboPrompt achieves up to 22% accuracy improvement over baseline approaches. Code is available at [Anonymous Github](https://github.com/FireAndWin/ComboPrompt_ExampleQualityMatters).
2025
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar | Dinura Dissanayake | Ketan Pravin More | Ritesh Thawkar | Ahmed Heakl | Noor Ahsan | Yuhao Li | Ilmuz Zaman Mohammed Zumri | Jean Lahoud | Rao Muhammad Anwer | Hisham Cholakkal | Ivan Laptev | Mubarak Shah | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025
Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Bhuiyan Sanjid Shafique | Ashmal Vayani | Muhammad Maaz | Hanoona Abdul Rasheed | Dinura Dissanayake | Mohammed Irfan Kurpath | Yahya Hmaiti | Go Inoue | Jean Lahoud | Md. Safirur Rashid | Shadid Intisar Quasem | Maheen Fatima | Franco Vidal | Mykola Maslych | Ketan Pravin More | Sanoojan Baliah | Hasindri Watawana | Yuhao Li | Fabian Farestam | Leon Schaller | Roman Tymtsiv | Simon Weber | Hisham Cholakkal | Ivan Laptev | Shin’ichi Satoh | Michael Felsberg | Mubarak Shah | Salman Khan | Fahad Shahbaz Khan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs support only English. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: Arabic, Bengali, Chinese, English, French, German, Hindi, Japanese, Russian, Sinhala, Spanish, Swedish, Tamil, and Urdu. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM, along with a large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released.
Co-authors
- Hisham Cholakkal 2
- Dinura Dissanayake 2
- Fahad Shahbaz Khan 2
- Salman Khan 2
- Jean Lahoud 2
- Ivan Laptev 2
- Ketan Pravin More 2
- Mubarak Shah 2
- Noor Ahsan 1
- Rao Muhammad Anwer 1
- Sanoojan Baliah 1
- Aili Chen 1
- Deming Ding 1
- Chengyu Du 1
- Fabian Farestam 1
- Maheen Fatima 1
- Michael Felsberg 1
- Liheng Feng 1
- Ahmed Heakl 1
- Yahya Hmaiti 1
- Zishan Huang 1
- Go Inoue 1
- Mohammed Irfan Kurpath 1
- Weiyuan Li 1
- Jianxin Liao 1
- Junteng Liu 1
- Muhammad Maaz 1
- Mykola Maslych 1
- Qi Qi 1
- Shadid Intisar Quasem 1
- Hanoona Abdul Rasheed 1
- Md. Safirur Rashid 1
- Shin’ichi Satoh 1
- Leon Schaller 1
- Bhuiyan Sanjid Shafique 1
- Zijun Sun 1
- Haifeng Sun 1
- Omkar Thawakar 1
- Ritesh Thawkar 1
- Rong Tian 1
- Roman Tymtsiv 1
- Ashmal Vayani 1
- Franco Vidal 1
- Xintao Wang 1
- Yvchuan Wang 1
- Huazheng Wang 1
- Jingyu Wang 1
- Hasindri Watawana 1
- Simon Weber 1
- Yanghua Xiao 1
- Rui Xu 1
- Shu Yao 1
- Xuesong Zhang 1
- Pengyu Zhao 1
- Haoyu Zheng 1
- Zirui Zhuang 1
- Ilmuz Zaman Mohammed Zumri 1