Abstract
As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this “arms race” by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI’s latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run “unit test”, without resorting to large benchmark suites of complex tasks.
- Anthology ID:
- 2023.findings-acl.666
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10476–10501
- URL:
- https://aclanthology.org/2023.findings-acl.666
- DOI:
- 10.18653/v1/2023.findings-acl.666
- Cite (ACL):
- Avia Efrat, Or Honovich, and Omer Levy. 2023. LMentry: A Language Model Benchmark of Elementary Language Tasks. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10476–10501, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- LMentry: A Language Model Benchmark of Elementary Language Tasks (Efrat et al., Findings 2023)
- PDF:
- https://aclanthology.org/2023.findings-acl.666.pdf
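The abstract above pitches LMentry as a quick, automatic “unit test” for language models. As a concrete illustration, here is a minimal, hypothetical Python sketch of how one such task (“which of two words is longer”) could be scored by pattern matching over a model’s free-text output. The function name and the single-regex rule are illustrative assumptions, not the benchmark’s actual scoring code; the official LMentry implementation defines richer, hand-crafted answer patterns per task.

```python
import re


def score_longer_word(model_output: str, word1: str, word2: str) -> int:
    """Toy scorer for the task 'Which word is longer, word1 or word2?'.

    Returns 1 if the model's output names the longer word (and not the
    shorter one), 0 otherwise. Assumes the two words differ in length,
    so the task has a unique correct answer. This is an illustrative
    sketch, not LMentry's actual scoring code.
    """
    longer, shorter = (word1, word2) if len(word1) > len(word2) else (word2, word1)
    out = model_output.lower()
    # Match whole words only, so e.g. 'cat' does not match inside 'catalog'.
    mentions_longer = re.search(rf"\b{re.escape(longer.lower())}\b", out)
    mentions_shorter = re.search(rf"\b{re.escape(shorter.lower())}\b", out)
    return int(bool(mentions_longer) and not mentions_shorter)


# Hypothetical model outputs, scored without any model or API call:
print(score_longer_word("The longer word is encyclopedia.", "cat", "encyclopedia"))  # 1
print(score_longer_word("cat", "cat", "encyclopedia"))                               # 0
```

Even this naive rule shows why the scoring patterns matter: a correct answer phrased as “encyclopedia is longer than cat” mentions both words and would score 0 here, which is one reason a benchmark of this kind needs multiple accepted answer patterns per task.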