The AI Evaluation Crisis: Why Current Benchmarks Fail and What’s Next
AI benchmarks are growing stale as models are tuned to ace the tests rather than to demonstrate genuine capability. New evaluations like LiveCodeBench Pro and Xbench aim to provide more meaningful measures of what AI systems can actually do.
The Challenge of Measuring AI Performance
As a tech reporter, I often get questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?” These questions sound straightforward, but they hide a harder one: what does it mean for an AI model to be “good”? Most people asking don’t have precise criteria in mind, and that ambiguity is what makes evaluation complicated.
The Limits of Benchmarks
Traditionally, AI models have been assessed with benchmarks: standardized tests that measure how many questions a model answers correctly. But much like the SAT and other standardized exams, these tests don’t always capture deeper intelligence or reasoning ability. New AI models routinely tout improved benchmark scores, yet those gains don’t necessarily translate into genuine improvements in capability.
Why Benchmarks Are Becoming Obsolete
Benchmarks have grown stale for several reasons. First, models can be tuned specifically to do well on known tests, a practice akin to “teaching to the test.” Second, data contamination means a model may have already seen benchmark questions, or their answers, during training, which inflates its score. Third, many benchmarks are saturated: top models exceed 90% accuracy, so further gains on paper tell us little. This is especially pressing for the demanding skills people care most about, such as coding, reasoning, and STEM problem-solving.
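To make the contamination point concrete, here is a minimal sketch in Python of the kind of n-gram overlap check researchers run to flag test questions that leaked into training data. The function names and threshold are illustrative, not any lab’s actual tooling.

```python
# Minimal sketch of an n-gram overlap contamination check.
# Names and the 0.5 threshold are illustrative, not any benchmark's tooling.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, training_doc: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a benchmark question if a large share of its n-grams
    also appear verbatim in a training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    overlap = len(item_grams & ngrams(training_doc, n))
    return overlap / len(item_grams) >= threshold

# A question copied verbatim into a scraped web page trips the check.
question = "What is the capital of France? A) Berlin B) Paris C) Rome D) Madrid"
scraped_page = "Quiz of the day: What is the capital of France? A) Berlin B) Paris C) Rome D) Madrid"
print(is_contaminated(question, scraped_page))  # True
```

Real contamination audits scan billions of documents and use fuzzier matching, but the principle is the same: if the test shows up in the training data, the score stops being informative.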
New Approaches to AI Evaluation
To tackle these issues, new benchmarks are emerging. LiveCodeBench Pro, for example, draws its problems from international programming olympiads. Top AI models currently solve around 53% of the medium-difficulty problems and 0% of the hardest ones, problems that expert human competitors can still crack. The results suggest that while models can plan and execute solutions, the nuanced algorithmic reasoning these problems demand remains difficult for them.
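Coding benchmarks of this kind typically report pass rates such as pass@1 or pass@k, and the percentages above are pass rates of that sort. For intuition, here is the standard unbiased pass@k estimator popularized by OpenAI’s HumanEval, sketched in Python:

```python
# Unbiased pass@k estimator: the probability that at least one of k
# sampled solutions passes, given n samples per problem of which c passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer failures than k: a success is guaranteed in any k picks
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of them correct -> pass@1 ≈ 0.3
print(pass_at_k(n=10, c=3, k=1))
```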
Another perspective focuses on evaluating AI agents based on riskiness rather than just performance, emphasizing the importance of reliability and safety in real-world applications.
Dynamic and Real-World Benchmarks
Some benchmarks, like ARC-AGI, keep part of their test data private so models can’t be tuned to it. Meta’s LiveBench updates its questions every six months, testing how well models handle problems they haven’t seen before. Xbench, developed in China, assesses both technical reasoning and practical usefulness in fields like recruitment and marketing, with plans to expand into finance, law, and design.
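To make the held-out-data idea concrete, here is a toy sketch of how a benchmark operator might keep a private split that is only ever scored server-side. All names are hypothetical; this is not ARC-AGI’s or LiveBench’s actual infrastructure.

```python
# Toy sketch of a "semi-private" benchmark: the public split is released
# for development, the private split stays server-side so it can't end up
# in training data. All names are hypothetical.
import random
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    prompt: str
    answer: str

def split_benchmark(tasks, private_fraction=0.5, seed=0):
    """Shuffle once with a fixed seed, then hold back a private portion."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - private_fraction))
    return shuffled[:cut], shuffled[cut:]   # (public, private)

def score_on_private(model_answers: dict, private: list) -> float:
    """Run only by the benchmark operator; private answers never leave the server."""
    correct = sum(model_answers.get(t.task_id) == t.answer for t in private)
    return correct / len(private)
```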
Beyond Technical Skills: Creativity and Human Preferences
Hardcore reasoning ability doesn’t always translate into writing that people find creative or enjoyable, and there is still little research on how to evaluate AI creativity in writing or art. Human preference platforms like LMarena let users compare two model responses side by side and vote for the one they prefer, but the results can be biased toward answers that are more flattering or agreeable.
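Arena-style platforms turn those head-to-head votes into a leaderboard by fitting an Elo-style rating to them. As a sketch of the underlying idea, here is the classic chess Elo update applied to a single preference vote; production leaderboards typically use sturdier statistical fits such as Bradley-Terry.

```python
# Classic Elo update applied to one pairwise preference vote.
# Shown for intuition only; real leaderboards use more robust fits.

def elo_update(rating_a: float, rating_b: float,
               a_wins: bool, k: float = 32.0) -> tuple:
    """Return updated (rating_a, rating_b) after one human vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# An upset win by the lower-rated model moves both ratings further.
print(elo_update(1000.0, 1200.0, a_wins=True))  # roughly (1024.3, 1175.7)
```

The catch is upstream of the math: if voters systematically reward flattering answers, the rating faithfully ranks models by flattery.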
Cultural Challenges in AI Research
At the CVPR conference, NYU professor Saining Xie criticized the hypercompetitive, "finite game" culture in AI research, where rapid publication and short-term wins dominate over long-term insight. This mindset may also affect how AI evaluation is approached.
The Road Ahead
Currently, no comprehensive scoreboard exists to fully measure an AI model’s capabilities, especially in social, emotional, or interdisciplinary dimensions. However, the emergence of new benchmarks suggests a gradual shift toward more meaningful evaluation. Healthy skepticism and ongoing innovation in testing methods are essential as AI continues to evolve.