GDPval: OpenAI's New Test for AI on Real, Billable Work

What GDPval is measuring

OpenAI launched GDPval, an evaluation suite that measures how AI models perform on real-world, economically valuable tasks. The suite covers 1,320 tasks across 44 occupations drawn from the nine US sectors that contribute most to GDP. Instead of relying on narrow academic problems, GDPval focuses on authentic deliverables such as presentations, spreadsheets, briefs, CAD artifacts, and audio or video files.

Task design and the gold subset

Tasks were sourced from industry professionals averaging 14 years of experience and mapped to O*NET work activities. Inputs and outputs are multi-modal and file-centric, often involving many reference files per task. OpenAI published a 220-task gold subset with public prompts and reference files, but primary scoring still relies on blinded pairwise comparisons by occupational experts, since the deliverables are subjective and span diverse formats.
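To make the file-centric shape of these tasks concrete, here is a minimal sketch of how a task record might be represented in code. The field names, JSONL layout, and loader are illustrative assumptions, not OpenAI's published schema.

```python
import json
from dataclasses import dataclass, field

@dataclass
class GDPvalTask:
    """Hypothetical task record; fields are assumptions for illustration."""
    task_id: str
    occupation: str                       # e.g. an O*NET occupation title
    sector: str                           # one of the nine covered US sectors
    prompt: str                           # the request as written by an expert
    reference_files: list[str] = field(default_factory=list)  # input file paths
    deliverable_type: str = "document"    # e.g. spreadsheet, slides, CAD, audio

def load_tasks(path: str) -> list[GDPvalTask]:
    """Load tasks from a JSONL file (format assumed for illustration)."""
    with open(path) as f:
        return [GDPvalTask(**json.loads(line)) for line in f]
```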

How models stack up against experts

On the gold subset, frontier models approach expert-level quality on a substantial fraction of tasks under blind expert review. Reported model-versus-human win and tie rates are near parity for top models, and progress appears roughly linear across model releases. Common error patterns include instruction-following mistakes, formatting issues, improper data usage, and hallucinations. Stronger agent scaffolding and increased reasoning effort yield predictable performance gains.
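Blinded pairwise review reduces to a simple tally over expert verdicts. The sketch below aggregates win and tie rates from a list of judgments; the verdict labels and data layout are assumptions for illustration.

```python
from collections import Counter

def win_tie_rates(judgments: list[str]) -> dict[str, float]:
    """Aggregate blinded pairwise verdicts.

    Each verdict is 'model', 'human', or 'tie', indicating which
    deliverable the expert preferred (labels assumed for illustration).
    """
    counts = Counter(judgments)
    n = len(judgments)
    return {
        "model_win_rate": counts["model"] / n,
        "tie_rate": counts["tie"] / n,
        "win_or_tie_rate": (counts["model"] + counts["tie"]) / n,
    }

# Toy example of a near-parity outcome under blind review.
print(win_tie_rates(["model", "tie", "human", "model", "human", "tie"]))
```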

Time and cost implications

GDPval includes scenario analyses that compare human-only workflows to model-assisted workflows with expert review. The evaluation quantifies human completion time and wage-based cost, reviewer time and cost, model latency and API cost, and empirically observed win rates. Results show potential time and cost reductions for many task classes once reviewer overhead is accounted for.
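The workflow comparison reduces to bookkeeping over a few quantities. Below is a minimal sketch of that arithmetic, assuming a simple accept-or-redo policy in which the expert reviews the model's draft and redoes the task from scratch on rejection; the parameter names and fallback rule are assumptions, not OpenAI's published methodology.

```python
def human_only_cost(hours: float, wage: float) -> float:
    """Cost of an expert completing the task from scratch."""
    return hours * wage

def model_assisted_cost(api_cost: float, review_hours: float,
                        redo_hours: float, wage: float,
                        accept_rate: float) -> float:
    """Expected cost when a model drafts and an expert reviews.

    Rejected drafts are redone from scratch by the expert, an
    illustrative simplification of reviewer overhead.
    """
    review_cost = review_hours * wage
    expected_redo_cost = (1 - accept_rate) * redo_hours * wage
    return api_cost + review_cost + expected_redo_cost

# Illustrative numbers only.
print(human_only_cost(hours=6, wage=80))                      # 480.0
print(model_assisted_cost(api_cost=2.0, review_hours=1.0,
                          redo_hours=6.0, wage=80, accept_rate=0.45))
# 2.0 + 80.0 + 0.55 * 480.0 = 346.0
```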

Automated judging: strengths and limits

OpenAI provides an experimental automated pairwise grader on the gold subset that agrees with human experts about 66% of the time, roughly 5 percentage points below human–human agreement. The automated grader is intended as an accessible proxy for rapid iteration, not a replacement for expert review.
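Grader–expert agreement is just the fraction of pairwise comparisons where the two verdicts match. A minimal sketch, reusing the hypothetical verdict labels from above:

```python
def agreement_rate(grader: list[str], experts: list[str]) -> float:
    """Fraction of comparisons where the automated grader's verdict
    ('model', 'human', or 'tie') matches the expert's."""
    assert len(grader) == len(experts), "one verdict pair per comparison"
    matches = sum(g == e for g, e in zip(grader, experts))
    return matches / len(grader)

# Toy illustration: 2 of 3 verdicts match, i.e. ~0.67 agreement,
# in the ballpark of the reported ~66% grader-expert figure.
print(agreement_rate(["model", "tie", "human"], ["model", "tie", "model"]))
```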

Scope, limitations, and future directions

GDPval-v0 targets computer-mediated knowledge work. It excludes physical labor, long-horizon interactivity, and organization-specific tooling. Tasks are one-shot and precisely specified, and performance degrades when task context is pared back. Building and grading tasks is resource-intensive, which motivated the automated grader; future versions are expected to broaden coverage and realism.

How GDPval complements other evaluations

GDPval complements existing OpenAI evals with occupational breadth; multi-modal, file-centric tasks; and human-preference outcomes. It also reports time and cost trade-offs, along with ablations on reasoning effort and agent scaffolding. The v0 release is versioned so it can serve as a reproducible baseline for tracking real-world capability gains across occupations.