AI Impact Research

Comprehensive research on AI platform usage from frontier labs (2025)




Real-World Task Performance

Research Question

How well do AI models perform on authentic professional tasks across occupations?

Hypothesis

AI models are approaching or exceeding expert-level performance on a significant portion of economically valuable professional tasks, with implications for workforce preparation and curriculum design.


Key Findings

1. GDPval Benchmark Overview

OpenAI’s GDPval (September 2025) provides the most comprehensive evaluation to date of AI on real-world professional tasks. The benchmark spans 44 occupations across nine of the top GDP-contributing sectors, with tasks constructed and graded by experienced industry professionals.

2. Cross-Lab Model Performance

On the 220-task GDPval gold set:

| Model | Strength | Expert Parity Rate |
| --- | --- | --- |
| Claude Opus 4.1 | Aesthetics (formatting, layout) | Leading |
| GPT-5 | Accuracy (domain knowledge) | Strong |
| GPT-5 Thinking | Balanced | Moderate |
| Gemini 3 | Varies | 33–50% |
| Grok | Varies | 20–33% |
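The "expert parity rate" in the table is GDPval's headline metric: the fraction of tasks where graders judge a model's deliverable to be as good as or better than the human expert's. The exact grading pipeline is not reproduced here; the function below is a minimal sketch assuming simple win/tie/loss judgments, with illustrative (not real) data.

```python
# Hypothetical sketch of an "expert parity rate" computation.
# Assumes each task yields a pairwise judgment of the model deliverable
# versus the expert deliverable; "win" and "tie" both count as parity.
# The sample outcomes below are illustrative, not actual GDPval data.

def expert_parity_rate(judgments: list[str]) -> float:
    """Fraction of tasks where the model output was judged as good as
    or better than the expert's ("win" or "tie")."""
    if not judgments:
        return 0.0
    at_parity = sum(1 for j in judgments if j in ("win", "tie"))
    return at_parity / len(judgments)

# Illustrative example: 10 graded tasks
sample = ["win", "tie", "loss", "win", "loss",
          "tie", "win", "loss", "loss", "win"]
print(f"Expert parity rate: {expert_parity_rate(sample):.0%}")  # 60%
```

Counting ties toward parity matters when interpreting the table: a model can reach a high parity rate by producing work that is merely indistinguishable from expert output, not strictly better.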

3. Speed and Cost Advantage

4. Task Categories Where AI Excels

5. Task Categories Where Humans Lead


Implications for Universities

Curriculum Design

Workforce Preparation

Assessment


Data Quality Notes



Explore This Research


Next: Capability Trajectory →