Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

Daman Arora; Himanshu Singh; Mausam

doi:10.18653/v1/2023.emnlp-main.468

Abstract

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem-solving abilities of LLMs. We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations accurately and failure in retrieving relevant domain-specific concepts. We also observe that by mere prompting, GPT-4 is unable to assess risk introduced by negative marking for incorrect answers. To address this, we develop a post-hoc confidence-thresholding method over self-consistency, which enables effective response selection. We hope that our challenging benchmark will guide future research in problem-solving using LLMs.
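
The abstract describes confidence thresholding over self-consistency only at a high level. As an illustration, the following is a minimal sketch of one way such a selection rule could look; it is not the authors' implementation. The function names, the use of the majority vote share as a confidence estimate, and the marking scheme in the usage note are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional, Sequence


def select_answer(samples: Sequence[str], threshold: float) -> Optional[str]:
    """Pick the majority answer among sampled responses, or abstain.

    `samples` holds the final answers extracted from several independently
    sampled responses (self-consistency). The vote share of the most common
    answer is used as a confidence estimate (an assumption of this sketch).
    Returns None, meaning "leave the question unanswered", when that share
    falls below `threshold`.
    """
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)
    return answer if confidence >= threshold else None


def break_even_threshold(reward: float, penalty: float) -> float:
    """Confidence at which answering has zero expected score.

    Solves p * reward - (1 - p) * penalty = 0 for p, treating the vote
    share as a calibrated probability (a simplifying assumption).
    """
    return penalty / (reward + penalty)
```

For example, under a hypothetical scheme that awards 3 marks for a correct answer and deducts 1 for an incorrect one, `break_even_threshold(3, 1)` gives 0.25, so answering is worthwhile in expectation only when the majority vote share exceeds a quarter. In practice a threshold would more likely be tuned on held-out questions than derived this way.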

Anthology ID: 2023.emnlp-main.468
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7527–7543
URL: https://aclanthology.org/2023.emnlp-main.468
DOI: 10.18653/v1/2023.emnlp-main.468
Cite (ACL): Daman Arora, Himanshu Singh, and Mausam. 2023. Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7527–7543, Singapore. Association for Computational Linguistics.
Cite (Informal): Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models (Arora et al., EMNLP 2023)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2023.emnlp-main.468.pdf
Video: https://preview.aclanthology.org/emnlp-22-attachments/2023.emnlp-main.468.mp4