# Academic Task Benchmarks: Detailed Data

## GPQA (Graduate-Level Science)

### Benchmark Overview

GPQA Diamond tests PhD-level reasoning in:

- Physics
- Chemistry
- Biology
- Cross-domain science

Question characteristics:

- Require deep domain expertise
- Demand multi-step reasoning
- Cannot be answered by simple search
- Verified by domain experts
| Model | GPQA Diamond | Date |
| --- | --- | --- |
| GPT-4 | ~55% | 2023 |
| Claude 3 Opus | ~60% | Early 2024 |
| Gemini 2.5 Pro | ~80% | March 2025 |
| Gemini 3 Deep Think | 93.8% | November 2025 |
### Human Baselines

| Group | Estimated Score |
| --- | --- |
| PhD students (own domain) | 65-70% |
| PhD students (cross-domain) | 40-50% |
| Undergraduates (advanced) | 30-40% |
| General population | 20-25% |
### Implications

- AI now exceeds the average PhD student on cross-domain science questions
- Top scores also exceed the in-domain PhD-student baseline, approaching expert level
- Science education assessment is fundamentally affected
## AIME (Mathematics Competition)

### Benchmark Overview

The American Invitational Mathematics Examination:

- 15 questions in 3 hours
- Integer answers (0-999), which makes computational checking cheap (see the sketch below)
- Invitational: only roughly the top 5% of AMC participants qualify
- Tests algebra, geometry, number theory, and combinatorics
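The results table that follows includes a "With Code" column for tool-augmented runs in which the model can execute programs. Because every AIME answer is an integer from 0 to 999, many problems admit an exhaustive computational check. The snippet below is a minimal, hypothetical illustration of that pattern; the counting problem in it is invented for demonstration and is not from an actual AIME paper.

```python
# Minimal sketch of tool-augmented AIME-style solving.
# Hypothetical problem: how many positive integers n < 1000 are
# divisible by neither 2 nor 5?

def solve() -> int:
    # AIME answers are integers in [0, 999], so brute-force
    # enumeration of candidates is cheap and reliable.
    return sum(1 for n in range(1, 1000) if n % 2 != 0 and n % 5 != 0)

answer = solve()
assert 0 <= answer <= 999  # sanity check against the AIME answer format
print(answer)  # 400
```

A model that can run code like this can verify or even replace a symbolic derivation, which is one plausible reason the "With Code" scores run higher than the raw scores.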
| Model | AIME 2025 Raw | With Code | Date |
| --- | --- | --- | --- |
| GPT-4 | ~30% | ~40% | 2023 |
| Claude 3.5 | ~50% | ~60% | 2024 |
| Gemini 2.5 Pro | ~70% | ~80% | March 2025 |
| Gemini 3 | 95% | 100% | November 2025 |
### Human Baselines

| Group | Typical Score |
| --- | --- |
| AIME qualifier (average) | 6-7/15 (40-47%) |
| USAMO qualifier | 10+/15 (67%+) |
| IMO-level student | 12+/15 (80%+) |
### Implications

- AI exceeds most human math competitors
- Near-perfect performance with computational tools
- Math education's emphasis shifts from computation to intuition
## Humanity’s Last Exam

### Benchmark Overview

Design principle: questions at the frontier of human knowledge.

- Crowdsourced from domain experts
- Not searchable online
- Require novel reasoning
- Intended to be very difficult for AI
| Model | Score (No Tools) | Date |
| --- | --- | --- |
| GPT-4 | ~8% | 2024 |
| Claude 3.5 Sonnet | ~12% | 2024 |
| Gemini 2.5 Pro | 18.8% | March 2025 |
### Human Baselines

| Group | Estimated Score |
| --- | --- |
| Domain experts | 70-90% (in domain) |
| PhD researchers | 40-50% (cross-domain) |
| Advanced undergraduates | 15-25% |
### Implications

- AI is approaching half of expert cross-domain performance
- Frontier knowledge remains largely a human domain
- The gap is closing faster than expected
## Writing and Analysis Benchmarks

### Essay Quality (Various Benchmarks)

| Task Type | AI Performance | Human Parity |
| --- | --- | --- |
| Argumentative essay | 75-85% | Approaching |
| Research synthesis | 80-90% | Achieved |
| Creative writing | 60-75% | Below |
| Technical writing | 85-95% | Exceeded |
### Analysis Tasks

| Task Type | AI Performance |
| --- | --- |
| Literature review | High (with sources) |
| Data interpretation | Very high |
| Critical analysis | Moderate-high |
| Original argumentation | Moderate |
## STEM Fields

| Subject | AI Capability Level |
| --- | --- |
| Mathematics (computational) | Expert |
| Mathematics (proof-based) | Advanced |
| Physics (problem-solving) | Expert |
| Chemistry | Advanced-Expert |
| Biology (factual) | Expert |
| Computer Science | Expert |
| Engineering (analysis) | Advanced |
## Humanities

| Subject | AI Capability Level |
| --- | --- |
| History (factual) | Expert |
| History (interpretation) | Advanced |
| Philosophy (analysis) | Advanced |
| Literature (analysis) | Moderate-Advanced |
| Languages (translation) | Expert |
| Languages (nuance) | Advanced |
## Social Sciences

| Subject | AI Capability Level |
| --- | --- |
| Economics (quantitative) | Expert |
| Psychology (research) | Advanced |
| Sociology | Advanced |
| Political Science | Advanced |
## Professional Fields

| Subject | AI Capability Level |
| --- | --- |
| Law (research) | Expert |
| Law (judgment) | Moderate |
| Medicine (diagnosis support) | Advanced |
| Business (analysis) | Expert |
| Education (content) | Expert |
## Assessment Vulnerability Analysis

### High Vulnerability (AI can likely complete)

- Take-home exams: all subjects
- Research papers: with proper prompting
- Problem sets: especially STEM
- Short-answer questions: most types
- Code assignments: most levels

### Moderate Vulnerability

- In-class essays: depends on proctoring
- Lab reports: without actual lab work
- Project proposals: conceptual portions
- Presentations: slide content (not delivery)

### Lower Vulnerability (Currently)

- Oral examinations: real-time interaction
- Lab practicals: physical presence required
- Live demonstrations: real-time execution
- Portfolio defense: process-based
- Collaborative projects: interpersonal dynamics
## Trajectory Implications

### Current State (2025)

- Graduate-level performance on standardized tests
- Expert-level writing and analysis
- Competition-level mathematics

### Near-Term (2026-2027)

- Frontier knowledge gap closing
- Original research assistance
- Multi-modal academic tasks

### Medium-Term (2028-2030)

- Research contribution at PhD level
- Novel discovery assistance
- Human advantage: judgment, ethics, relationships
## Key Takeaways for Universities

- Academic integrity crisis is real: AI can complete most traditional assignments
- Assessment must evolve: process-based, oral, and proctored methods are essential
- Learning objectives shift: from knowledge recall to judgment and application
- AI literacy is required: students need to learn to work with AI effectively
- Subject-specific strategies are needed: vulnerability varies by discipline