AI Impact Research

Comprehensive research on AI platform usage from frontier labs (2025)

Academic Task Benchmarks: Detailed Data

GPQA (Graduate-Level Science)

Benchmark Overview

GPQA Diamond tests PhD-level reasoning in biology, physics, and chemistry.

Question Characteristics: questions are written and validated by domain PhD holders and designed to be "Google-proof," so they cannot be answered by straightforward lookup.

Performance Data

| Model | GPQA Diamond | Date |
|---|---|---|
| GPT-4 | ~55% | 2023 |
| Claude 3 Opus | ~60% | Early 2024 |
| Gemini 2.5 Pro | ~80% | March 2025 |
| Gemini 3 Deep Think | 93.8% | November 2025 |

Human Baselines

| Group | Estimated Score |
|---|---|
| PhD students (own domain) | 65-70% |
| PhD students (cross-domain) | 40-50% |
| Undergraduate (advanced) | 30-40% |
| General population | 20-25% |
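
GPQA Diamond items are four-option multiple-choice questions, so random guessing sits at 25%, close to the general-population baseline above. For readers unfamiliar with how such benchmarks are reported, the sketch below shows the scoring loop in outline; the `ask_model` stub and the sample item are placeholders for illustration, not part of the actual dataset or any particular API.

```python
# Sketch of how a GPQA-style multiple-choice score is computed.
# `ask_model` is a placeholder for a real model API call, and the sample
# item is invented; the published dataset has its own format.

def ask_model(question: str, choices: dict[str, str]) -> str:
    """Placeholder: return the model's chosen option letter (A-D)."""
    return "A"

def accuracy(items: list[dict]) -> float:
    """Fraction of items where the model's letter matches the gold answer."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

# Invented example item, for illustration only.
items = [
    {
        "question": "Which particle mediates the electromagnetic force?",
        "choices": {"A": "photon", "B": "gluon", "C": "W boson", "D": "graviton"},
        "answer": "A",
    },
]
print(f"accuracy = {accuracy(items):.1%}")
```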

Implications

On the figures above, frontier models now score well above PhD students working inside their own domains (65-70%), a threshold crossed during 2025.

AIME (Mathematics Competition)

Benchmark Overview

American Invitational Mathematics Examination: a 15-question, three-hour competition taken by top scorers on the AMC; each answer is an integer from 0 to 999.

Performance Data

| Model | AIME 2025 (Raw) | With Code | Date |
|---|---|---|---|
| GPT-4 | ~30% | ~40% | 2023 |
| Claude 3.5 | ~50% | ~60% | 2024 |
| Gemini 2.5 Pro | ~70% | ~80% | March 2025 |
| Gemini 3 | 95% | 100% | November 2025 |
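
The "With Code" column refers to runs where the model can write and execute code rather than answer purely in text. Because every AIME answer is an integer from 0 to 999, many counting problems can be settled by direct enumeration, which is part of why tool use lifts scores. The sketch below works through a made-up problem in the AIME style; it is an illustration, not an actual exam item.

```python
# Why code execution helps on AIME-style problems: answers are integers in
# 0-999, so a model that can run code can often settle a counting question
# by brute force instead of a delicate combinatorial argument.
#
# Made-up problem in the AIME style: how many integers n with 1 <= n <= 999
# are divisible by 7 or by 11, but not by both?

count = 0
for n in range(1, 1000):
    by7, by11 = (n % 7 == 0), (n % 11 == 0)
    if by7 != by11:  # exactly one of the two divisibility conditions holds
        count += 1

print(count)  # 208, matching inclusion-exclusion: 142 + 90 - 2*12
```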

Human Baselines

| Group | Typical Score |
|---|---|
| AIME qualifier (avg) | 6-7/15 (40-47%) |
| USAMO qualifier | 10+/15 (67%+) |
| IMO-level student | 12+/15 (80%+) |

Implications

With code execution, the strongest model reaches a perfect score, above the 80%+ typical of IMO-level students; raw scores are not far behind.

Humanity’s Last Exam

Benchmark Overview

Design Principle: questions at the frontier of human knowledge, contributed by subject-matter experts across a wide range of fields.

Performance Data

| Model | Score (No Tools) | Date |
|---|---|---|
| GPT-4 | ~8% | 2024 |
| Claude 3.5 Sonnet | ~12% | 2024 |
| Gemini 2.5 Pro | 18.8% | March 2025 |

Human Baselines

| Group | Estimated Score |
|---|---|
| Domain experts | 70-90% (in domain) |
| PhD researchers | 40-50% (cross-domain) |
| Advanced undergrad | 15-25% |

Implications

Unlike GPQA and AIME, this benchmark still leaves substantial headroom: the best score shown (18.8%) remains far below in-domain experts, though it is rising quickly.

Writing and Analysis Benchmarks

Essay Quality (Various Benchmarks)

| Task Type | AI Performance | Human Parity |
|---|---|---|
| Argumentative essay | 75-85% | Approaching |
| Research synthesis | 80-90% | Achieved |
| Creative writing | 60-75% | Below |
| Technical writing | 85-95% | Exceeded |

Analysis Tasks

| Task Type | AI Performance |
|---|---|
| Literature review | High (with sources) |
| Data interpretation | Very high |
| Critical analysis | Moderate-high |
| Original argumentation | Moderate |

Subject-Specific Academic Performance

STEM Fields

| Subject | AI Capability Level |
|---|---|
| Mathematics (computational) | Expert |
| Mathematics (proof-based) | Advanced |
| Physics (problem-solving) | Expert |
| Chemistry | Advanced-Expert |
| Biology (factual) | Expert |
| Computer Science | Expert |
| Engineering (analysis) | Advanced |

Humanities

| Subject | AI Capability Level |
|---|---|
| History (factual) | Expert |
| History (interpretation) | Advanced |
| Philosophy (analysis) | Advanced |
| Literature (analysis) | Moderate-Advanced |
| Languages (translation) | Expert |
| Languages (nuance) | Advanced |

Social Sciences

| Subject | AI Capability Level |
|---|---|
| Economics (quantitative) | Expert |
| Psychology (research) | Advanced |
| Sociology | Advanced |
| Political Science | Advanced |

Professional Fields

| Subject | AI Capability Level |
|---|---|
| Law (research) | Expert |
| Law (judgment) | Moderate |
| Medicine (diagnosis support) | Advanced |
| Business (analysis) | Expert |
| Education (content) | Expert |

Assessment Vulnerability Analysis

High Vulnerability (AI can likely complete)

  1. Take-home exams - all subjects
  2. Research papers - with proper prompting
  3. Problem sets - especially STEM
  4. Short-answer questions - most types
  5. Code assignments - most levels

Moderate Vulnerability

  1. In-class essays - depends on proctoring
  2. Lab reports - without actual lab work
  3. Project proposals - conceptual portions
  4. Presentations - slide content (not delivery)

Lower Vulnerability (Currently)

  1. Oral examinations - real-time interaction
  2. Lab practicals - physical presence required
  3. Live demonstrations - real-time execution
  4. Portfolio defense - process-based
  5. Collaborative projects - interpersonal dynamics

Trajectory Implications

Current State (2025)

Near-Term (2026-2027)

Medium-Term (2028-2030)


Key Takeaways for Universities

  1. Academic integrity crisis is real: AI can complete most traditional assignments
  2. Assessment must evolve: Process-based, oral, and proctored methods essential
  3. Learning objectives shift: From knowledge to judgment and application
  4. AI literacy required: Students need to work with AI effectively
  5. Subject-specific strategies needed: Vulnerability varies by discipline