
Chinese Firm Unveils Dynamic AI Benchmark Set to Revolutionize Model Testing

Chinese venture capital firm Hongshan Capital Global has launched Xbench, a constantly evolving AI benchmark evaluating models on both academic tests and real-world tasks, with ChatGPT o3 leading the rankings.

Introducing Xbench: A New Era in AI Benchmarking

Hongshan Capital Global, a Chinese venture capital firm, has developed Xbench, an AI benchmark designed to evolve continuously. Unlike traditional benchmarks, which rely on static test sets, Xbench evaluates both a model's ability to pass academic-style tests and its ability to execute real-world tasks. This dual approach aims to better distinguish genuine reasoning from mere recall of training data.

Open-Source Question Set and Leaderboard

Part of Xbench's question set is open source and free for public use. The company has also released a leaderboard showing how mainstream AI models perform on the benchmark. ChatGPT o3 leads across all categories, with ByteDance's Doubao, Gemini 2.5 Pro, Grok, and Claude Sonnet following closely.

Development History and Methodology

Hongshan Capital began building the benchmark in 2022, after the success of ChatGPT, as an internal tool for evaluating AI models as potential investments. Led by partner Gong Yuan, the development team gradually expanded the system with input from external researchers and professionals before releasing it publicly.

Xbench uses two complementary systems:

  • A traditional academic-style test assessing aptitude across subjects.
  • A real-world task-based evaluation resembling a technical job interview, measuring economic value delivery.

Components of Xbench Intelligence Assessment

Xbench's intelligence evaluation consists of two main parts:

  • Xbench-ScienceQA: This component covers STEM questions at a postgraduate level, spanning biochemistry to orbital mechanics. Questions are prepared by graduate students and verified by professors. Scoring rewards both correct answers and the reasoning process.

  • Xbench-DeepResearch: This section assesses a model’s ability to research using Chinese-language web sources. Developed by ten experts, it includes 100 questions in fields like music, history, finance, and literature. Answers require in-depth research rather than simple searches. Scoring values source diversity, factual accuracy, and honesty about insufficient data. For example, one question asks how many Chinese cities in three northwestern provinces border foreign countries; only 33% of tested models answered correctly.
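Xbench has not published its scoring implementation, but the rubric described above, which rewards factual accuracy, source diversity, and honest admission of insufficient data, could be aggregated roughly as follows. This is an illustrative sketch only; every weight, threshold, and parameter name here is a hypothetical assumption, not the actual Xbench method:

```python
# Hypothetical rubric aggregator illustrating the kind of scoring
# Xbench-DeepResearch is described as using. All weights and criteria
# are assumptions for illustration, not the real implementation.

def score_answer(factual_accuracy: float,
                 source_diversity: float,
                 admits_uncertainty: bool,
                 evidence_found: bool) -> float:
    """Combine rubric criteria (each in [0, 1]) into a single score.

    A model that honestly reports insufficient data when no evidence
    exists is rewarded rather than penalized for declining to answer.
    """
    if not evidence_found:
        # Honest abstention scores well; a confident guess scores zero.
        return 0.8 if admits_uncertainty else 0.0
    # Weighted blend: accuracy dominates, source diversity adds credit.
    return 0.7 * factual_accuracy + 0.3 * source_diversity

# A well-sourced, largely accurate answer:
print(round(score_answer(0.9, 0.8, False, True), 2))   # 0.87
# A model that guesses despite finding no evidence:
print(score_answer(0.0, 0.0, False, False))            # 0.0
```

The key design point such a rubric captures is that abstention is scored explicitly, so a model cannot improve its ranking by fabricating answers to questions it cannot research.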

Future Plans and Updates

The team plans to add further testing dimensions, such as creativity, collaboration among models, and reliability. It commits to refreshing the test questions quarterly and to maintaining a dataset split into public and private portions.

Real-World Task Evaluations

To measure practical readiness, Xbench includes tasks based on real workflows, initially focusing on recruitment and marketing. Examples include sourcing qualified battery engineer candidates and matching advertisers with suitable short-video creators from a large influencer pool.

Upcoming categories like finance, legal, accounting, and design are in development but not yet open-source.

Performance Highlights

ChatGPT o3 also ranks highest in the professional task categories. In recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place; in marketing, Claude, Grok, and Gemini all perform well.

Expert Opinions

Zihan Zheng, lead researcher on a new benchmark called LiveCodeBench Pro, acknowledges the difficulty of quantifying some aspects of AI but regards Xbench as a promising development in the field.
