Quoc V Le


2025

BIG-Bench Extra Hard
Mehran Kazemi | Bahare Fatemi | Hritik Bansal | John Palowitch | Chrysovalantis Anastasiou | Sanket Vaibhav Mehta | Lalit K Jain | Virginia Aglietti | Disha Jindal | Peter Chen | Nishanth Dikkala | Gladys Tyen | Xin Liu | Uri Shalit | Silvia Chiappa | Kate Olszewska | Yi Tay | Vinh Q. Tran | Quoc V Le | Orhan Firat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current benchmarks for large language model (LLM) reasoning predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One notable exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to a diverse set of challenging tasks that allow for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and on its harder version, BIG-Bench Hard (BBH): state-of-the-art models achieve near-perfect scores on many BBH tasks, diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but is significantly more difficult. We evaluate various general-purpose and reasoning-specialized models on BBEH and observe accuracies of 23.9% for the best general-purpose model and 54.2% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
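
The reported 23.9% and 54.2% figures are accuracies aggregated over the replacement tasks. As a rough illustration of how such an exact-match score over a directory of tasks might be computed, the sketch below assumes each task ships as a task.json file with an "examples" list of input/target pairs; this layout, the evaluate_bbeh helper, and my_llm_generate are assumptions for illustration and may not match the released repository's actual format or tooling.

```python
# Minimal sketch of a BBEH-style evaluation loop (format is an assumption,
# not necessarily the layout used in google-deepmind/bbeh).
import json
from pathlib import Path
from typing import Callable


def evaluate_bbeh(task_root: str, model_fn: Callable[[str], str]) -> dict:
    """Compute exact-match accuracy per task plus a micro-averaged overall score."""
    per_task = {}
    correct_total = 0
    count_total = 0
    for task_file in sorted(Path(task_root).glob("*/task.json")):
        examples = json.loads(task_file.read_text())["examples"]
        correct = sum(
            model_fn(ex["input"]).strip().lower() == str(ex["target"]).strip().lower()
            for ex in examples
        )
        per_task[task_file.parent.name] = correct / len(examples)
        correct_total += correct
        count_total += len(examples)
    per_task["overall"] = correct_total / count_total
    return per_task


# Example usage with a placeholder model function:
# scores = evaluate_bbeh("bbeh/", lambda prompt: my_llm_generate(prompt))
```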

Towards Robust Mathematical Reasoning
Thang Luong | Dawsen Hwang | Hoang H Nguyen | Golnaz Ghiasi | Yuri Chervonyi | Insuk Seo | Junsu Kim | Garrett Bingham | Jonathan Lee | Swaroop Mishra | Alex Zhai | Huiyi Hu | Henryk Michalewski | Jimin Kim | Jeonghyun Ahn | Junhwi Bae | Xingyou Song | Trieu Hoang Trinh | Quoc V Le | Junehyuk Jung
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on obtaining correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation of proof-writing capabilities; it includes both basic and advanced IMO problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1,000 human gradings of proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning; we release it at https://github.com/google-deepmind/superhuman/imobench.
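
The abstract notes that Gemini-based autograders correlate well with human evaluations, and that IMO-GradingBench provides 1,000 human gradings of proofs to support such comparisons. The sketch below shows one generic way to quantify agreement between an automatic grader and human graders; the JSON record format, the shared numeric score scale, and the grader_agreement helper are assumptions for illustration, not the benchmark's actual schema or tooling.

```python
# Minimal sketch of checking how well an automatic proof grader tracks human
# grades, in the spirit of IMO-GradingBench. Assumes one human grade and one
# autograder grade per proof on a common numeric scale (e.g. 0-7) -- this
# record format is hypothetical.
import json
from statistics import correlation, mean


def grader_agreement(path: str) -> dict:
    """Compare autograder scores against human gradings of the same proofs."""
    with open(path) as f:
        records = json.load(f)  # e.g. [{"human": 7, "auto": 6}, ...]
    human = [r["human"] for r in records]
    auto = [r["auto"] for r in records]
    return {
        "pearson_r": correlation(human, auto),  # linear agreement (Python 3.10+)
        "mean_abs_error": mean(abs(h - a) for h, a in zip(human, auto)),
    }


# Example usage on a hypothetical export of graded proofs:
# print(grader_agreement("imo_gradingbench_scores.json"))
```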