Comprehensive research on AI platform usage from frontier labs (2025)
Design: Real desktop/web tasks on virtual machines
Task Categories:
| Model | OSWorld Score | Date |
|---|---|---|
| Claude Opus 4.5 | 61.4% | Nov 2025 |
| Claude Sonnet 4.5 | ~50% | Nov 2025 |
| GPT-5 | ~45% | 2025 |
| Gemini 3 | ~40% | Nov 2025 |
| Earlier models | <30% | 2024 |
| Task Type | Claude 4.5 Score |
|---|---|
| Simple navigation | 85%+ |
| Form filling | 75-85% |
| File operations | 70-80% |
| Multi-app workflows | 50-65% |
| Complex configurations | 40-55% |
Anthropic Research Finding:
Mechanism:
Successful Agentic Workflows:
| Tool Type | Current Capability |
|---|---|
| Web browsing | Advanced |
| Code execution | Expert |
| File read/write | Expert |
| Image analysis | Advanced |
| API calls | Expert |
| Database queries | Advanced |
| Shell commands | Advanced |
| GUI interaction | Moderate-Advanced |
AI agents can now coordinate:
| Steps | Success Rate | Notes |
|---|---|---|
| 1-3 | 85-95% | Highly reliable |
| 4-7 | 70-85% | Generally reliable |
| 8-12 | 50-70% | Moderate reliability |
| 13-20 | 35-55% | Significant failure risk |
| >20 | <40% | Frequent failures |
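
One simple way to read the downward trend in the table above is as compounding per-step error. The sketch below is illustrative only: it assumes every step succeeds independently with the same probability, which the source does not state, and the helper names `chain_success` and `max_steps_for` are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds.

    Assumption (not from the source): steps are independent and share
    the same success probability `per_step`.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def max_steps_for(per_step: float, target: float) -> int:
    """Largest step count whose end-to-end success stays at or above `target`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    prob = 1.0
    # Add steps one at a time until the next one would drop below the target.
    while prob * per_step >= target:
        prob *= per_step
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (3, 7, 12, 20):
        print(n, round(chain_success(0.95, n), 3))
```

Under this simplified model, a 95% per-step success rate gives roughly 60% end-to-end success at 10 steps and roughly 36% at 20 steps, which is broadly in line with the ranges in the table. Real agent runs are unlikely to satisfy the independence assumption, since agents can retry or recover from some errors and early mistakes can cascade.
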
| Risk | Severity | Mitigation |
|---|---|---|
| Unintended actions | High | Sandboxing, confirmations |
| Data exposure | High | Access controls |
| Resource consumption | Medium | Limits and monitoring |
| Infinite loops | Medium | Timeout mechanisms |
| Hallucinated actions | Medium | Verification steps |
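
Several of the mitigations in the table above (confirmations, limits, timeout mechanisms) can be combined in one guarded execution loop. The following is a minimal sketch under stated assumptions, not an implementation from any particular agent framework: `Action`, `RunLimits`, `LimitExceeded`, and `run_guarded` are hypothetical names, and the `execute` and `confirm` callbacks stand in for whatever the host application provides.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Action:
    name: str
    risky: bool = False  # hypothetical flag: action needs explicit confirmation


@dataclass
class RunLimits:
    max_steps: int = 20        # guards against infinite loops
    max_seconds: float = 60.0  # guards against runaway resource consumption


class LimitExceeded(RuntimeError):
    """Raised when a run exceeds its step or time budget."""


def run_guarded(
    actions: Iterable[Action],
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    limits: Optional[RunLimits] = None,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run actions in order, enforcing budgets and confirmation for risky ones.

    Returns the names of the actions that were actually executed.
    """
    limits = limits or RunLimits()
    start = clock()
    executed: List[str] = []
    for step, action in enumerate(actions, start=1):
        if step > limits.max_steps:
            raise LimitExceeded(f"step budget of {limits.max_steps} exceeded")
        if clock() - start > limits.max_seconds:
            raise LimitExceeded(f"time budget of {limits.max_seconds}s exceeded")
        # Risky actions run only if the confirmation callback approves them.
        if action.risky and not confirm(action):
            continue
        execute(action)
        executed.append(action.name)
    return executed
```

The time check here runs between actions, so it does not interrupt a single action that hangs; that case would need a per-action timeout or process-level isolation such as the sandboxing listed in the table.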