Agentic Capabilities

Research Question

What autonomous, multi-step work can AI perform?

Hypothesis

AI is transitioning from a tool (requiring human prompting at each step) to an agent (capable of autonomous multi-step workflows), with significant implications for task delegation and workforce augmentation.

Key Findings

1. OSWorld (Computer Use)

Benchmark: Real desktop and web tasks on virtual machines (Anthropic)

Model	Score	Date
Claude Opus 4.5	61.4%	Nov 2025
Other frontier models	30-50%	2025

Task Examples:

Navigate web applications
Fill out forms
Manage files and folders
Execute multi-step workflows

2. Autonomous Self-Refinement

Anthropic Finding: Claude agents can autonomously refine their own outputs (Anthropic)

Peak performance was reached in 4 iterations
Other models could not match that quality even after 10 iterations
Self-correction happened without human intervention
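
The self-refinement behavior described above can be pictured as a loop that revises an output, scores the revision, and stops once further revisions no longer help. The sketch below is only an illustration under stated assumptions, not Anthropic's method: `refine`, `toy_score`, and `toy_revise` are hypothetical names, and in a real agent the scoring and revising steps would be model calls rather than the toy string functions used here.

```python
from typing import Callable, Tuple


def refine(
    draft: str,
    score: Callable[[str], float],
    revise: Callable[[str], str],
    max_iters: int = 10,
    min_gain: float = 1e-3,
) -> Tuple[str, float, int]:
    """Revise a draft repeatedly, keeping the best version seen so far.

    Stops early when a revision improves the score by less than min_gain.
    Returns (best_text, best_score, number_of_accepted_revisions).
    """
    best, best_score = draft, score(draft)
    accepted = 0
    for i in range(1, max_iters + 1):
        candidate = revise(best)
        candidate_score = score(candidate)
        if candidate_score - best_score < min_gain:
            break  # no meaningful improvement: stop refining
        best, best_score, accepted = candidate, candidate_score, i
    return best, best_score, accepted


# Toy stand-ins (assumptions for illustration only):
# the "ideal" output is 8 characters long, and each revision appends one character.
def toy_score(text: str) -> float:
    return -abs(len(text) - 8)


def toy_revise(text: str) -> str:
    return text + "x"


result = refine("abcd", toy_score, toy_revise)
print(result)
```

The early-stop condition is the design choice worth noting: the loop needs no human to decide when the output is good enough, but its quality depends entirely on how trustworthy the scoring step is.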

3. Multi-Step Task Completion

AI can now autonomously:

Research topics across multiple sources
Write and edit documents iteratively
Execute code and debug based on errors
Manage project workflows
Coordinate multi-tool operations
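
As one concrete picture of "execute code and debug based on errors", the sketch below runs a snippet, captures the traceback on failure, hands it to a repair step, and retries. It is a minimal illustration under stated assumptions: `run_with_repair` and `toy_repair` are hypothetical names, the repair step would be a model call in a real agent, and untrusted code should be run in a sandbox rather than with a bare `exec`.

```python
import traceback
from typing import Callable, Tuple


def run_with_repair(
    source: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Run source code; on failure, pass the traceback to repair() and retry.

    Returns (working_source, attempts_used). Re-raises the last error if
    the code still fails on the final attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # NOTE: exec is used for brevity; a real agent should sandbox this.
            exec(compile(source, "<agent>", "exec"), {})
            return source, attempt
        except Exception:
            if attempt == max_attempts:
                raise
            source = repair(source, traceback.format_exc())
    raise AssertionError("unreachable")


# Toy repair step (assumption): fixes one known misspelled variable name.
def toy_repair(source: str, error: str) -> str:
    if "NameError" in error:
        return source.replace("totl", "total")
    return source


buggy = "total = sum([1, 2, 3])\nanswer = totl * 2\n"
fixed, attempts = run_with_repair(buggy, toy_repair)
print(attempts)
```

The attempt limit matters as much as the loop itself: without it, an agent that cannot fix the error would retry indefinitely.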

4. Current Limitations

Reliability degrades with task length
Novel situations cause failures
Error recovery still imperfect
Human oversight still necessary for high-stakes tasks
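
One simple way to see why reliability degrades with task length is to compound a per-step success rate over the number of steps. The sketch below assumes every step succeeds independently with the same probability, which is a simplification, and the 99% figure is an illustrative number, not a measurement from the sources above. `task_success_probability` is a hypothetical helper name.

```python
def task_success_probability(step_success: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


# Illustrative only: a 99% per-step success rate over longer and longer tasks.
for steps in (1, 10, 50):
    print(steps, round(task_success_probability(0.99, steps), 3))
# 1 0.99
# 10 0.904
# 50 0.605
```

Under these assumptions, even a highly reliable step compounds into a task that fails roughly two times in five at 50 steps, which is one reason human oversight remains important for long, high-stakes workflows.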

Implications for Universities

Task Delegation

Administrative workflows increasingly automatable
Research assistance at a higher level of autonomy
Student support services augmented

Teaching Agentic AI

New curriculum area: AI agent design and oversight
Ethics of delegation and accountability
Human-agent collaboration skills

Research Workflows

Literature review automation
Data collection and processing agents
Experiment monitoring and adjustment

Related Research Questions

RQ01: Real-World Task Performance - Task-level capabilities
RQ04: Coding and Research - Coding agent capabilities
RQ06: Safety and Alignment - Agent safety concerns

