LMentry: A Language Model Benchmark of Elementary Language Tasks
Abstract
As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.
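To illustrate the "unit test" idea, below is a minimal, hypothetical sketch of what an LMentry-style check could look like in Python: it builds a prompt asking which of two words is longer and verifies a model's response with a simple rule-based scorer. The prompt wording, function names, and scoring rule here are illustrative assumptions, not the benchmark's actual implementation.

```python
# A minimal, hypothetical sketch (not the official LMentry code) of scoring an
# LMentry-style task automatically: ask which of two words is longer, then
# check the answer with a simple rule-based verifier.

import re


def make_prompt(word1: str, word2: str) -> str:
    """Build a prompt for the 'which word is longer' task (illustrative wording)."""
    return f'Q: Which word is longer, "{word1}" or "{word2}"?\nA:'


def score_response(response: str, word1: str, word2: str) -> bool:
    """Return True if the model's response names the longer word.

    The gold answer is simply the word with more characters; the check is a
    lenient word-boundary match, so it accepts answers like
    'The word "elephant" is longer.'
    """
    gold = word1 if len(word1) > len(word2) else word2
    other = word2 if gold == word1 else word1
    text = response.lower()
    # Count the response as correct only if it mentions the gold word and not
    # the other word (avoids crediting responses that name both words).
    return (
        re.search(rf"\b{re.escape(gold.lower())}\b", text) is not None
        and re.search(rf"\b{re.escape(other.lower())}\b", text) is None
    )


if __name__ == "__main__":
    prompt = make_prompt("cat", "elephant")
    # In practice the response would come from a language model API call;
    # a canned string is used here to keep the sketch self-contained.
    fake_response = 'The word "elephant" is longer than "cat".'
    print(prompt)
    print("correct:", score_response(fake_response, "cat", "elephant"))
```

Because the gold answer is fully determined by the input words, tasks like this can be scored automatically and deterministically, which is what makes the benchmark quick and easy to run.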