Ye Shen
2026
One Battle After Another: Probing LLMs’ Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
Qi Jia | Ye Shen | Xiujie Song | Kaiwei Zhang | Shibo Wang | Dun Pei | Xiangyang Zhu | Guangtao Zhai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating LLMs’ instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, susceptible to saturation and failing to account for users’ interactive experience. In this work, we propose a novel framework featuring a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Grounded in Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only upon exhausting user patience. Leveraging this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Our analysis reveals deficiencies in failure recovery and fine-grained instruction following, with performance stratification becoming evident as conversational depth increases. GPT-5 demonstrates the most sustained resilience, maintaining a 66.40% stability score, outperforming Gemini-3-Pro by 5.59%, while other models lag behind.
2025
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
Ammar Khairi | Daniel D’souza | Ye Shen | Julia Kreutzer | Sara Hooker
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute—improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. While existing work has focused on English and specific domains, we study how to robustly scale inference-time compute in a multilingual, multi-task setting: spanning open-ended generations, math and translation tasks, for open models at 8B and 111B scale, across seven languages. Our findings highlight the need for tailored sampling and selection strategies. We propose novel solutions tailored for this multi-faceted inference scenario, demonstrating notable gains across languages and tasks. Our methods achieve an average +6.8 jump in win-rates for 8B models on m-ArenaHard-v2.0 prompts in non-English languages against proprietary models like Gemini. At larger scale, our 111B model shows a +9.0 improvement with just five samples compared to single-sample decoding. These results emphasize the importance of language- and task-aware approaches to democratize inference-time improvements.