Yudai Yamazaki
2025
When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following
Keno Harada | Yudai Yamazaki | Masachika Taniguchi | Edison Marrese-Taylor | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Findings of the Association for Computational Linguistics: EMNLP 2025
As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where following multiple instructions is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with these benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, since evaluating every possible combination of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on unseen instruction combinations and on numbers of instructions not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict the performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We also show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
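As a concrete illustration of the estimation approach described in the abstract, the sketch below fits a logistic regression with instruction count as the sole explanatory variable and queries it for pass rates at each instruction count. The evaluation records and the per-instruction decay rate are synthetic placeholders; only the modeling idea (instruction count as a predictor of the probability that all instructions are followed) comes from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical evaluation records: for each prompt, the number of attached
# instructions (1..10, mirroring ManyIFEval) and whether the model satisfied
# all of them (1 = yes, 0 = no). The 0.9-per-instruction decay is invented.
rng = np.random.default_rng(0)
n_instr = rng.integers(1, 11, size=500)
followed_all = (rng.random(500) < 0.9 ** n_instr).astype(int)

# Fit instruction count -> probability of following every instruction.
model = LogisticRegression()
model.fit(n_instr.reshape(-1, 1), followed_all)

# Estimate pass rates at each instruction count from the fitted curve.
for k in range(1, 11):
    p = model.predict_proba([[k]])[0, 1]
    print(f"{k:2d} instructions -> estimated pass rate {p:.2f}")
```

Because instruction count is the only feature, the same fitted curve applies to instruction combinations never evaluated directly, which is what makes the estimate cheap compared to an exhaustive sweep.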
2024
Zero-shot Persuasive Chatbots with LLM-Generated Strategies and Information Retrieval
Kazuaki Furumai | Roberto Legaspi | Julio Cesar Vizcarra Romero | Yudai Yamazaki | Yasutaka Nishimura | Sina Semnani | Kazushi Ikeda | Weiyan Shi | Monica Lam
Findings of the Association for Computational Linguistics: EMNLP 2024
Persuasion plays a pivotal role in a wide range of applications, from health intervention to the promotion of social good. Persuasive chatbots employed responsibly for social good can be an enabler of positive individual and social change. Existing methods rely on fine-tuning persuasive chatbots with task-specific training data, which is costly, if not infeasible, to collect, and they employ only a handful of pre-defined persuasion strategies. We propose PersuaBot, a zero-shot chatbot based on Large Language Models (LLMs) that is factual and more persuasive by leveraging many more nuanced strategies. PersuaBot uses an LLM to first generate a natural response, from which the strategies used are extracted. To combat LLM hallucination, PersuaBot replaces any unsubstantiated claims in the response with retrieved facts supporting the extracted strategies. We applied PersuaBot to three significantly different domains requiring persuasion skills: donation solicitation, recommendations, and health intervention. Our experiments on simulated and human conversations show that our zero-shot approach is more persuasive than prior work, while achieving factual accuracy surpassing state-of-the-art knowledge-oriented chatbots.
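To make the pipeline in this abstract concrete, here is a minimal structural sketch of one chatbot turn: draft a natural response, extract the strategies it used, then ground unsubstantiated claims in retrieved facts. Every helper function and return value below is a hypothetical placeholder, not the authors' implementation; a real system would back each step with LLM prompts and a retrieval index.

```python
# Structural sketch (not the authors' code) of the pipeline the abstract
# describes. All helpers are hypothetical placeholders with canned outputs.

def generate_draft(history: list[str]) -> str:
    """Placeholder for the LLM call that drafts a persuasive reply."""
    return "Your $10 can make a real difference for children in need."

def extract_strategies(draft: str) -> list[str]:
    """Placeholder for the LLM call that names the strategies the draft used."""
    return ["emotional appeal", "concreteness"]

def find_unsubstantiated_claims(draft: str) -> list[str]:
    """Placeholder for the LLM call that flags claims lacking evidence."""
    return ["Your $10 can make a real difference"]

def retrieve_facts(claim: str, strategies: list[str]) -> list[str]:
    """Placeholder for retrieval over a trusted knowledge source."""
    return ["Charity X reports that $10 provides school supplies for a month"]

def rewrite_with_facts(draft: str, claim: str, facts: list[str]) -> str:
    """Placeholder for the LLM call that grounds the claim in the facts."""
    return draft.replace(claim, facts[0])

def persuabot_turn(history: list[str]) -> str:
    draft = generate_draft(history)                    # 1. draft naturally
    strategies = extract_strategies(draft)             # 2. name its strategies
    for claim in find_unsubstantiated_claims(draft):   # 3. ground each claim
        draft = rewrite_with_facts(
            draft, claim, retrieve_facts(claim, strategies)
        )
    return draft

print(persuabot_turn(["Hi, what is this campaign about?"]))
```

Extracting strategies from a free-form draft, rather than selecting from a fixed strategy list, is what lets the approach remain zero-shot while covering more nuanced strategies than prior work.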