Xuefeng Chen
2026
ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants
Pei Wang | Yanan Wu | Xiaoshuai Song | Weixun Wang | Gengru Chen | Zhongwen Li | Kezhong Yan | Qi Liu | Ken Deng | Shuaibing Zhao | Shaopan Xiong | Xuepeng Liu | Xuefeng Chen | Wanxi Deng | Wenbo Su | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language model (LLM)-based agents are increasingly deployed in e-commerce shopping. To perform thorough, user-tailored product searches, agents should interpret personal preferences, engage in multi-turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and it focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large-scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best-performing models achieve a full-success rate below 40%. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and fail to engage effectively with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine-tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements.
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation
Yanghao Zhou | Haitian Li | Rexar Lin | Heyan Huang | Jinxing Zhou | Changsen Yuan | Tian Lan | Ziqin Zhou | Yudong Li | Jiajun Xu | Jingyun Liao | YiMing Cheng | Xuefeng Chen | Xian-Ling Mao | Yousheng Feng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos with multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or single-speaker settings. As a result, structural failures in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively diagnosed. To address this issue, we introduce MTAVG-Bench, a failure-driven diagnostic benchmark for multi-talker dialogue-centric audio-video generation. MTAVG-Bench is built via a semi-automatic pipeline, where 1.8k videos are generated using mainstream T2AV models with carefully designed prompts, yielding 2.4k manually annotated QA pairs for fine-grained failure diagnosis. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. Built on a hierarchical failure taxonomy and a targeted QA protocol, MTAVG-Bench is primarily designed to evaluate whether proprietary and open-source omni-models can reliably identify failure modes in multi-speaker T2AV outputs. We benchmark 12 proprietary and open-source omni-models on MTAVG-Bench, with Gemini 3 Pro achieving the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement.
Co-authors
- Gengru Chen 1
- Yiming Cheng 1
- Ken Deng 1
- Wanxi Deng 1
- Yousheng Feng 1
- He-Yan Huang (黄河燕) 1
- Tian Lan 1
- Zhongwen Li 1
- Haitian Li 1
- Yudong Li 1
- Jingyun Liao 1
- Rexar Lin 1
- Qi Liu 1
- Xuepeng Liu 1
- Xian-Ling Mao 1
- Xiaoshuai Song 1
- Wenbo Su 1
- Pei Wang 1
- Weixun Wang 1
- Yanan Wu 1
- Shaopan Xiong 1
- Jiajun Xu 1
- Kezhong Yan 1
- Changsen Yuan 1
- Shuaibing Zhao 1
- Bo Zheng 1
- Yanghao Zhou (周杨浩) 1
- Jinxing Zhou 1
- Ziqin Zhou 1
Venues
- ACL 2