OpenAI Says AI Models Match Humans in Nearly Half of Work Tasks

Image Credits: sompong_tom / Getty Images

OpenAI on Thursday unveiled a new benchmark indicating that artificial intelligence models are rapidly approaching human-level performance on professional work across major industries. The company’s GDPval evaluation shows AI models matching or exceeding human experts on nearly half of the tested tasks, the closest machines have yet come to matching expert output on economically significant work.

Dramatic Performance Leap Signals AI’s Economic Impact

The results represent a striking acceleration in AI capabilities. OpenAI’s earlier GPT-4o model, released in spring 2024, managed only a 13.7% success rate on similar tasks, meaning current models have roughly tripled that performance in just 15 months. “The rate of progress is really encouraging,” OpenAI evaluations lead Tejal Patwardhan told reporters, highlighting the rapid trajectory toward human-level artificial general intelligence.

The GDPval benchmark tested leading AI systems against seasoned professionals across 44 occupations spanning nine industries that contribute most heavily to U.S. gross domestic product, including healthcare, finance, manufacturing, and government. Anthropic’s Claude Opus 4.1 emerged as the top performer, achieving a 47.6% win or tie rate against human experts, while OpenAI’s own GPT-5 scored 40.6%.

Unlike traditional AI benchmarks built around academic tests, GDPval evaluates authentic workplace deliverables. Professional evaluators with an average of 14 years of experience compared AI-generated reports, legal briefs, engineering plans, and nursing care strategies against human-produced work without knowing which was created by machines. Tasks were designed to reflect realistic workplace outputs rather than theoretical problems.
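The headline figures come from pairwise comparisons of this kind: for each task, a blinded grader judges whether the model’s deliverable beats, ties, or loses to the human expert’s, and the win-or-tie rate is the share of favorable verdicts. The following minimal sketch shows that aggregation using made-up verdicts; it is an illustration of the metric, not OpenAI’s actual grading pipeline.

```python
from collections import Counter

# Hypothetical grader verdicts for one model: for each task, a blinded expert
# marks whether the AI deliverable "wins", "ties", or "loses" against the
# human expert's deliverable. These labels are illustrative, not GDPval data.
verdicts = ["win", "loss", "tie", "win", "loss", "loss", "win", "tie", "loss", "loss"]

counts = Counter(verdicts)
win_or_tie_rate = (counts["win"] + counts["tie"]) / len(verdicts)

print(f"win-or-tie rate: {win_or_tie_rate:.1%}")  # 50.0% for this toy sample
```

Under this scoring, a model can post a high win-or-tie rate without ever clearly beating the expert, which is why ties are reported alongside outright wins.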

The evaluation revealed distinct strengths between the leading models. Claude Opus 4.1 excelled “particularly on aesthetics” such as document formatting and slide layout, while GPT-5 demonstrated superior “accuracy” in finding and applying domain-specific knowledge. This specialization suggests different AI systems may serve complementary roles in professional environments.

Beyond matching quality, AI models demonstrated remarkable efficiency advantages. OpenAI found that frontier models complete GDPval tasks approximately 100 times faster and 100 times cheaper than industry experts. “However, these figures reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required in real workplace settings,” the company acknowledged.
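Those efficiency multiples are simple ratios of expert time and cost to model inference time and API cost. The toy calculation below shows the form such a comparison takes; every number in it is a placeholder assumption, not a figure from OpenAI’s report, and it ignores the human review and integration steps the company flags.

```python
# Back-of-the-envelope ratios behind claims like "100x faster and 100x cheaper".
# All numbers below are hypothetical placeholders, not figures from OpenAI's report.
expert_hours_per_task = 7.0        # assumed time for a professional to finish one task
expert_hourly_rate = 100.0         # assumed fully loaded hourly cost, in USD

model_minutes_per_task = 4.0       # assumed wall-clock inference time per task
model_api_cost_per_task = 7.0      # assumed API billing for the task, in USD

speedup = (expert_hours_per_task * 60) / model_minutes_per_task
cost_ratio = (expert_hours_per_task * expert_hourly_rate) / model_api_cost_per_task

print(f"speedup: ~{speedup:.0f}x, cost advantage: ~{cost_ratio:.0f}x")
# -> speedup: ~105x, cost advantage: ~100x for these placeholder inputs
```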

OpenAI chief economist Dr. Aaron Chatterji emphasized that the results point toward AI augmenting rather than replacing human workers. “People in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things,” he explained. The company positions GDPval results as evidence that AI can handle routine, well-specified tasks while freeing humans for creative and judgment-intensive work.

The benchmark currently tests only one-shot evaluations, meaning it cannot measure AI’s ability to handle iterative work such as revising documents based on feedback or building context over time. OpenAI plans to expand GDPval to include more occupations, industries, and interactive task types, with the long-term goal of better measuring progress on diverse knowledge work.

The results arrive amid growing investor confidence in AI’s commercial potential, even as questions remain about implementation timelines and how workforces across affected industries will adapt.
