GDPval Benchmark Design
Task Construction Methodology
Source Professionals:
- Average 14 years of industry experience
- Drawn from 44 occupations across 9 sectors
- Created tasks based on representative work products
Sectors Covered:
- Professional, Scientific, and Technical Services
- Finance and Insurance
- Information Technology
- Healthcare and Social Assistance
- Educational Services
- Manufacturing
- Retail Trade
- Administrative Services
- Arts, Entertainment, and Recreation
Task Types:
- Written deliverables (reports, briefs, emails)
- Visual deliverables (presentations, charts)
- Technical deliverables (spreadsheets, code)
- Creative deliverables (designs, content)
- Audio/video deliverables
Evaluation Methodology
Blinded Pairwise Comparison:
- Expert graders drawn from the same occupation as the task author
- Graders cannot tell which output is AI-generated and which is human
- Outputs rated on quality dimensions:
  - Accuracy
  - Completeness
  - Formatting/aesthetics
  - Appropriateness for context
  - Overall quality
Scoring:
- Win: AI output rated better than the human expert's
- Tie: AI output rated equivalent to the human expert's
- Loss: AI output rated worse than the human expert's
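A minimal sketch of how a win+tie rate could be computed from these pairwise grades. The label strings and helper function are hypothetical illustrations, not GDPval's actual grading schema:

```python
from collections import Counter

def win_tie_rate(grades):
    """Share of pairwise comparisons where the AI output was rated
    as good as or better than the human expert's.

    `grades` is a list of "win", "tie", or "loss" labels
    (hypothetical labels; the real schema may differ).
    """
    counts = Counter(grades)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total if total else 0.0

# Illustrative example: 40 wins, 60 ties, 120 losses across 220 tasks
rate = win_tie_rate(["win"] * 40 + ["tie"] * 60 + ["loss"] * 120)
print(f"{rate:.1%}")  # 45.5%
```

Under this scoring, "expert parity or better" is simply wins plus ties over all graded comparisons, which is the quantity reported in the tables below.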
GDPval Gold Set Results (220 tasks)
Win + Tie Rates (Expert Parity or Better):
| Model | Win+Tie Rate | Primary Strength |
|---|---|---|
| Claude Opus 4.1 | ~50%+ | Aesthetics, formatting |
| GPT-5 | ~45-50% | Accuracy, domain knowledge |
| GPT-5 Thinking | ~40-45% | Balanced |
| Gemini 3 Pro | ~33-40% | Varies by task |
| Grok | ~20-33% | Varies by task |
Performance by Dimension:
| Dimension | Best Model | Notes |
|---|---|---|
| Accuracy | GPT-5 | Domain-specific knowledge |
| Formatting | Claude Opus 4.1 | Document layout, slide design |
| Completeness | Claude Opus 4.1 | Covers all requirements |
| Prompt Following | GPT-5 | Instruction adherence |
Tasks Where AI Performs Best:
- Structured document creation
- Data transformation and analysis
- Research synthesis
- Routine correspondence
- Technical documentation
Tasks Where AI Performs Worst:
- Highly ambiguous requirements
- Tasks requiring external knowledge
- Client relationship judgment
- Novel strategic decisions
- Multi-stakeholder negotiations
Speed and Cost Analysis
Time Comparison
| Task Type | Human Expert | AI | Ratio |
|---|---|---|---|
| Report Writing | 4-8 hours | 2-5 minutes | ~100x |
| Spreadsheet Analysis | 2-4 hours | 1-3 minutes | ~80x |
| Presentation Creation | 3-6 hours | 3-8 minutes | ~60x |
| Email Drafting | 15-30 minutes | 10-30 seconds | ~60x |
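The ratios above follow directly from unit conversion; a quick sketch using midpoint values from the table (illustrative numbers, not measured data):

```python
def speedup(human_hours: float, ai_minutes: float) -> float:
    """Human-to-AI speed ratio for the same task,
    using minutes as the common unit."""
    return (human_hours * 60) / ai_minutes

# Report writing: midpoints of 6 hours vs. 3.5 minutes
print(round(speedup(6, 3.5)))  # 103, consistent with the ~100x figure
```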
Cost Comparison
| Factor | Human Expert | AI (API) |
|---|---|---|
| Hourly rate equivalent | $50-200/hr | $0.50-5/hr |
| Per-task cost (complex) | $100-500 | $1-5 |
| Per-task cost (simple) | $25-100 | $0.10-0.50 |
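The cost advantage can be sketched the same way, again using midpoints from the table (illustrative, not measured values):

```python
def cost_ratio(human_cost: float, ai_cost: float) -> float:
    """How many times cheaper the AI run is for the same task."""
    return human_cost / ai_cost

# Midpoints from the table above
print(round(cost_ratio(300, 3)))     # complex task: 100x cheaper
print(round(cost_ratio(62.5, 0.3)))  # simple task: 208x cheaper
```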
Note: AI costs assume API pricing; consumer subscriptions have different economics.
Trajectory Analysis
| Model | Release | Win+Tie Rate | Delta from Previous |
|---|---|---|---|
| GPT-4o | Spring 2024 | ~25% | Baseline |
| GPT-4o (updated) | Fall 2024 | ~30% | +5 pts |
| GPT-5 | Summer 2025 | ~50% | +20 pts |
Implication: Performance more than doubled in 14 months.
Projected Trajectory
If the current trend continues:
- Late 2026: 70-80% expert parity
- 2027: Majority of professional tasks at expert level
Caveats:
- Improvement may not be linear
- Remaining tasks may be harder to crack
- New task types may emerge
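The projection above is a straight-line extrapolation of the trajectory table; a minimal sketch of that arithmetic (the function is a hypothetical illustration, and, as the caveats note, real progress is unlikely to stay linear):

```python
def linear_projection(rate_then: float, rate_now: float,
                      months_elapsed: int, months_ahead: int) -> float:
    """Naive linear extrapolation of the win+tie rate, capped at 100%."""
    slope = (rate_now - rate_then) / months_elapsed  # rate points per month
    return min(rate_now + slope * months_ahead, 1.0)

# ~25% (Spring 2024) to ~50% (Summer 2025) over ~14 months,
# projected ~15 months further out (late 2026)
print(round(linear_projection(0.25, 0.50, 14, 15), 2))  # 0.77
```

That ~77% figure is how the "70-80% by late 2026" range arises under the linearity assumption.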
Occupation-Specific Findings
Strongest AI performance:
- Technical Writer: High accuracy, formatting strength
- Data Analyst: Spreadsheet and analysis excellence
- Marketing Coordinator: Content generation, campaign materials
- Administrative Assistant: Correspondence, scheduling, documentation
- Junior Software Developer: Code generation, debugging
Weakest AI performance (human judgment still central):
- Executive/Senior Manager: Strategic judgment required
- Sales Professional: Relationship and negotiation focus
- Healthcare Provider: Physical examination, patient interaction
- Legal Counsel: High-stakes judgment, liability concerns
- Creative Director: Vision and direction (vs. execution)
Implications by Sector
Professional Services
- Junior-level task automation potential high
- Senior judgment and client relationships remain human
- Training pathway disruption likely
Finance and Insurance
- Analysis and reporting highly automatable
- Regulatory judgment requires human oversight
- Risk assessment increasingly AI-augmented
Information Technology
- Development productivity gains significant
- Architecture and design judgment human-led
- Code review and debugging automated
Healthcare
- Documentation and administrative tasks automatable
- Diagnosis support (not replacement) emerging
- Patient interaction remains human-centric
Education
- Content creation highly automatable
- Assessment design disrupted
- Student interaction and mentorship remain human
Data Gaps
- Non-US occupations: GDPval focused on US GDP sectors
- Non-English tasks: Limited multilingual evaluation
- Physical tasks: Not covered by current benchmarks
- Long-horizon projects: Tasks limited to single-session completion
- Team collaboration: Individual task focus only
Key Takeaways for Universities
- ~50% of professional tasks approaching AI parity in quality
- 100x cost/speed advantage makes AI economically compelling
- Quality improving rapidly (~2x in 14 months)
- Judgment and relationships remain human advantages
- Workforce preparation must include AI collaboration skills