Academic Task Benchmarks
Research Question
How do AI models perform on academic and intellectual tasks relevant to universities?
Hypothesis
AI models are achieving expert-level performance on many academic tasks, requiring universities to reconsider assessment methods, learning objectives, and the nature of expertise development.
Key Findings
1. Graduate-Level Science (GPQA Diamond)
Benchmark: PhD-level multiple-choice questions in physics, biology, and chemistry (GPQA: Graduate-Level Google-Proof Q&A; scores as reported by Google)
- Gemini 3 Deep Think: 93.8%
- Human PhD students: ~65-70% (in their domain)
- Significance: AI exceeds the in-domain accuracy of PhD students, and does so across all three scientific domains at once
2. Mathematics (AIME 2025)
Benchmark: American Invitational Mathematics Examination, 2025 problems (scores as reported by Google)
- Gemini 3: 95% raw, 100% with code execution
- Human qualification threshold: roughly 50% of problems for olympiad qualification
- Significance: Near-perfect on competition-level math
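The "with code execution" setting means the model can write and run short programs rather than reasoning purely in text. As an illustration (this toy counting problem and the code are our own, not drawn from AIME 2025), enumeration-style competition questions become trivial once brute force is available:

```python
# Toy AIME-style problem (illustrative only): how many positive
# integers less than 1000 are divisible by neither 5 nor 7?
# Brute-force enumeration replaces inclusion-exclusion by hand.
count = sum(1 for n in range(1, 1000) if n % 5 != 0 and n % 7 != 0)
print(count)  # 686
```

This is why code execution closes the gap from 95% to 100%: a class of error-prone manual casework reduces to a few lines of checking.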
3. Frontier Knowledge (Humanity’s Last Exam)
Benchmark: Questions at the edge of human knowledge (score as reported by Google)
- Gemini 2.5 Pro: 18.8% (without tools)
- Expert humans: ~40-50%
- Significance: Without tools, AI reaches roughly 40% of estimated expert performance on frontier-knowledge questions
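The comparison above is simple arithmetic; taking the midpoint of the estimated expert range as the reference point (the 45% midpoint is our assumption, since only a 40-50% range is given):

```python
ai_score = 18.8             # Gemini 2.5 Pro on Humanity's Last Exam, no tools
expert_mid = (40 + 50) / 2  # assumed midpoint of the estimated expert range
ratio = ai_score / expert_mid
print(f"{ratio:.0%}")       # fraction of estimated expert performance: 42%
```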
4. Undergraduate Coursework
Estimated AI Performance by Subject:
- STEM problem sets: 80-95%
- Essay writing: 70-85%
- Research synthesis: 75-90%
- Foreign language: 70-85%
- Creative work: 50-70%
Implications for Universities
Assessment Redesign
- Take-home exams fundamentally compromised
- Proctored exams gain importance
- Process-based assessment essential
- Oral examinations more reliable
Learning Objectives
- Knowledge recall less valuable
- Critical thinking and synthesis emphasized
- AI collaboration as new skill
- Meta-cognitive skills essential
Course Design
- AI-augmented assignments normalized
- Explicit AI policy required for every course
- Focus on what AI cannot (yet) do