AI Impact Research

Comprehensive research on AI platform usage from frontier labs (2025)

View the Project on GitHub vishalsachdev/ai-impact

Back to Overview Sources →

Agentic Capabilities: Detailed Data

OSWorld Benchmark

Overview

Design: Real desktop/web tasks on virtual machines

Task Categories:

  1. Web navigation and form filling
  2. File management operations
  3. Application usage (spreadsheets, documents)
  4. Multi-application workflows
  5. System configuration tasks

Performance Data

Model OSWorld Score Date
Claude Opus 4.5 61.4% Nov 2025
Claude Sonnet 4.5 ~50% Nov 2025
GPT-5 ~45% 2025
Gemini 3 ~40% Nov 2025
Earlier models <30% 2024

Task Difficulty Breakdown

Task Type Claude 4.5 Score
Simple navigation 85%+
Form filling 75-85%
File operations 70-80%
Multi-app workflows 50-65%
Complex configurations 40-55%

Autonomous Agent Behaviors

Self-Refinement Capability

Anthropic Research Finding:

Mechanism:

  1. Initial output generation
  2. Self-critique and error identification
  3. Revision based on self-feedback
  4. Repeat until convergence

Multi-Step Task Patterns

Successful Agentic Workflows:

  1. Research Agent
    • Query formulation
    • Source identification
    • Information extraction
    • Synthesis and summary
    • Citation management
  2. Coding Agent
    • Requirement analysis
    • Code generation
    • Test execution
    • Error diagnosis
    • Iterative debugging
  3. Document Agent
    • Outline generation
    • Section drafting
    • Internal consistency checking
    • Revision based on criteria
    • Formatting and finalization
  4. Data Agent
    • Data source identification
    • Extraction and cleaning
    • Transformation
    • Analysis execution
    • Visualization generation

Tool Use Capabilities

Available Tool Categories

Tool Type Current Capability
Web browsing Advanced
Code execution Expert
File read/write Expert
Image analysis Advanced
API calls Expert
Database queries Advanced
Shell commands Advanced
GUI interaction Moderate-Advanced

Multi-Tool Orchestration

AI agents can now coordinate:


Reliability Analysis

Success Rate by Task Length

Steps Success Rate Notes
1-3 85-95% Highly reliable
4-7 70-85% Generally reliable
8-12 50-70% Moderate reliability
13-20 35-55% Significant failure risk
20+ <40% Frequent failures

Failure Modes

  1. Cascading Errors
    • Early mistake propagates
    • No recovery mechanism
    • Task abandonment
  2. State Confusion
    • Lost track of progress
    • Repeated actions
    • Incomplete execution
  3. Novel Situations
    • Unexpected UI changes
    • Error messages not in training
    • Edge cases
  4. Resource Management
    • Context window exhaustion
    • Tool timeout handling
    • Memory limitations

Agent Safety Considerations

Risks

Risk Severity Mitigation
Unintended actions High Sandboxing, confirmations
Data exposure High Access controls
Resource consumption Medium Limits and monitoring
Infinite loops Medium Timeout mechanisms
Hallucinated actions Medium Verification steps

Best Practices

  1. Sandboxed Environments
    • Limited permissions
    • Reversible actions only
    • Isolated from production
  2. Human-in-the-Loop
    • Confirmation for high-stakes actions
    • Progress monitoring
    • Intervention capability
  3. Audit Logging
    • Complete action history
    • Decision rationale capture
    • Accountability trail

Application Domains

Currently Viable

  1. Software Development
    • Code completion and generation
    • Testing and debugging
    • Documentation
    • Code review assistance
  2. Research Support
    • Literature search
    • Data extraction
    • Analysis execution
    • Draft generation
  3. Administrative Tasks
    • Scheduling and coordination
    • Report generation
    • Data entry and processing
    • Communication drafting
  4. Customer Service
    • Inquiry handling
    • Information retrieval
    • Ticket routing
    • Basic troubleshooting

Emerging Capabilities

  1. Complex Project Management
    • Multi-stakeholder coordination
    • Timeline management
    • Resource allocation
  2. Creative Workflows
    • Multi-modal content creation
    • Iterative design processes
    • Brand consistency management
  3. Scientific Research
    • Hypothesis generation
    • Experiment design
    • Result interpretation

Trajectory Analysis

Current State (Late 2025)

Near-Term (2026)

Medium-Term (2027-2028)


University Implications

Administrative Automation

Research Support

Student Services

Teaching


Key Takeaways

  1. Computer use at 61.4%: Claude leads on real-world desktop tasks
  2. Self-refinement works: 4 iterations to peak quality
  3. Reliability drops with length: <10 steps most reliable
  4. Human oversight still needed: High-stakes tasks require verification
  5. Trajectory is clear: Autonomous capability expanding rapidly