Academic Task Benchmarks
Research Question
How do AI models perform on academic and intellectual tasks relevant to universities?
Hypothesis
AI models are achieving expert-level performance on many academic tasks, requiring universities to reconsider assessment methods, learning objectives, and the nature of expertise development.
Key Findings
1. Graduate-Level Science (GPQA Diamond)
Benchmark: PhD-level multiple-choice questions in physics, biology, and chemistry (GPQA: Graduate-Level Google-Proof Q&A; scores as reported by Google)
- Gemini 3 Deep Think: 93.8%
- Human PhD students: ~65-70% (in their domain)
- Significance: AI exceeds the in-domain accuracy of PhD students, and does so across all three scientific domains at once
2. Mathematics (AIME 2025)
Benchmark: American Invitational Mathematics Examination, 2025 problems (scores as reported by Google)
- Gemini 3: 95% raw, 100% with code execution
- Human qualification threshold: roughly 50% of problems for olympiad qualification
- Significance: Near-perfect on competition-level math
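The "with code execution" setting means the model can write and run short programs rather than reasoning purely in text. As an illustration (this toy counting problem and the code are our own, not drawn from AIME 2025), enumeration-style competition questions become trivial once brute force is available:

```python
# Toy AIME-style problem (illustrative only): how many positive
# integers less than 1000 are divisible by neither 5 nor 7?
# Brute-force enumeration replaces inclusion-exclusion by hand.
count = sum(1 for n in range(1, 1000) if n % 5 != 0 and n % 7 != 0)
print(count)  # 686
```

This is why code execution closes the gap from 95% to 100%: a class of error-prone manual casework reduces to a few lines of checking.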
3. Frontier Knowledge (Humanity’s Last Exam)
Benchmark: Questions at the edge of human knowledge (score as reported by Google)
- Gemini 2.5 Pro: 18.8% (without tools)
- Expert humans: ~40-50%
- Significance: Without tools, AI reaches roughly 40% of estimated expert performance on frontier-knowledge questions
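The comparison above is simple arithmetic; taking the midpoint of the estimated expert range as the reference point (the 45% midpoint is our assumption, since only a 40-50% range is given):

```python
ai_score = 18.8             # Gemini 2.5 Pro on Humanity's Last Exam, no tools
expert_mid = (40 + 50) / 2  # assumed midpoint of the estimated expert range
ratio = ai_score / expert_mid
print(f"{ratio:.0%}")       # fraction of estimated expert performance: 42%
```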
4. Undergraduate Coursework
Estimated AI Performance by Subject:
- STEM problem sets: 80-95%
- Essay writing: 70-85%
- Research synthesis: 75-90%
- Foreign language: 70-85%
- Creative work: 50-70%
Implications for Universities
Assessment Redesign
- Take-home exams fundamentally compromised
- Proctored exams gain importance
- Process-based assessment essential
- Oral examinations more reliable
Learning Objectives
- Knowledge recall less valuable
- Critical thinking and synthesis emphasized
- AI collaboration as new skill
- Meta-cognitive skills essential
Course Design
- AI-augmented assignments normalized
- Explicit AI policy required for every course
- Focus on what AI cannot (yet) do