Comprehensive research on AI platform usage from frontier labs (2025)
Attempts to override model instructions through:
| Model | Resistance Level | Notes |
|---|---|---|
| Claude Opus 4.5 | Industry leading | Hardest to trick |
| GPT-5 | Strong | Improved from GPT-4 |
| Gemini 3 | Strong | Google’s investment |
| Open-source models | Variable | Often weaker |
Claude Opus 4.5 improvements:
| Technique | Claude Opus 4.5 | GPT-5 | Gemini 3 |
|---|---|---|---|
| Direct override | <1% | <1% | <1% |
| Roleplay | 2-5% | 3-7% | 3-8% |
| Escalation | 5-10% | 5-12% | 5-15% |
| Encoding | 3-8% | 5-10% | 5-12% |
| Context manipulation | 5-15% | 8-18% | 8-20% |
Note: These success rates are approximate; actual rates vary by prompt and model version.
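
Ranges like those in the table are easier to read alongside the number of attempts behind them, which the source does not give. As a rough sketch, assuming each percentage is a success rate measured over a fixed number of attack attempts, a Wilson score interval shows how much uncertainty a measured rate carries. The function name `wilson_interval` and the example counts are hypothetical and are not taken from any lab's evaluation.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 5 successful injections out of 100 attempts.
low, high = wilson_interval(5, 100)
print(f"observed 5.0%, interval {low:.1%} to {high:.1%}")
```

With only 100 attempts, an observed 5% rate is compatible with a fairly wide range, which is one reason rates of this kind are usually reported as ranges and not as point values.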
Hallucination: the model generates false information and presents it as fact.
| Domain | Estimated Rate | Notes |
|---|---|---|
| Common knowledge | 2-5% | Generally accurate |
| Technical details | 5-15% | Variable accuracy |
| Recent events | 15-30% | Knowledge cutoff issues |
| Obscure topics | 20-40% | Limited training data |
| Citations | 15-25% | Persistent problem |
| Numbers/statistics | 10-20% | Often approximate |
| Model Generation | Average Hallucination Rate |
|---|---|
| GPT-3 era | 30-40% |
| GPT-4 era | 15-25% |
| Current frontier | 5-15% |
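
To make the trend in the table concrete, the ranges can be reduced to midpoints and compared. This is only a sketch: taking the midpoint of each range is an assumption made for illustration, and the helper names (`parse_range`, `midpoint`, `relative_reduction`) are hypothetical. The input strings are copied from the table above.

```python
def parse_range(text: str) -> tuple[float, float]:
    """Parse a range such as '30-40%' into fractions (0.30, 0.40)."""
    low, high = text.strip().rstrip("%").split("-")
    return float(low) / 100, float(high) / 100


def midpoint(bounds: tuple[float, float]) -> float:
    """Midpoint of a (low, high) range; an assumed summary, not a measured value."""
    return (bounds[0] + bounds[1]) / 2


def relative_reduction(old: float, new: float) -> float:
    """Fractional drop from old to new, e.g. 0.5 means the rate halved."""
    return (old - new) / old


# Ranges as listed in the table above.
generations = {
    "GPT-3 era": "30-40%",
    "GPT-4 era": "15-25%",
    "Current frontier": "5-15%",
}

mids = {name: midpoint(parse_range(r)) for name, r in generations.items()}
drop = relative_reduction(mids["GPT-3 era"], mids["Current frontier"])
print(f"midpoints: {mids}")
print(f"relative reduction, GPT-3 era to current frontier: {drop:.0%}")
```

Under the midpoint assumption the rate falls from about 35% to about 10%, a relative reduction of roughly 70%. Because the ranges are wide, this figure is itself only approximate.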
Approach:
Strengths:
Approach:
Strengths:
| Configuration | Safety | Capability |
|---|---|---|
| Maximum safety | Very high | Reduced |
| Balanced | High | Good |
| Maximum capability | Moderate | Very high |
Observation: No model achieves both maximum safety and maximum capability simultaneously.
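
One way to read this observation is as a Pareto trade-off: no configuration in the table is at least as good as another on both axes. The sketch below checks that. The numeric ranks are an assumption made for illustration (the table gives only qualitative labels), and the names `Config`, `dominates`, and `pareto_front` are hypothetical.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    safety: int      # assumed rank: Moderate=1, High=2, Very high=3
    capability: int  # assumed rank: Reduced=1, Good=2, Very high=3


# Rows of the table above, with the qualitative labels mapped to assumed ranks.
CONFIGS = [
    Config("Maximum safety", safety=3, capability=1),
    Config("Balanced", safety=2, capability=2),
    Config("Maximum capability", safety=1, capability=3),
]


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.safety >= b.safety and a.capability >= b.capability
    strictly_better = a.safety > b.safety or a.capability > b.capability
    return at_least_as_good and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Return the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(other, c) for other in configs)]


print([c.name for c in pareto_front(CONFIGS)])
```

All three configurations sit on the Pareto front under these assumed ranks. A configuration ranked highest on both safety and capability would dominate all of them, and the table contains no such row.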
Strengths:
Characteristics:
Strengths:
Characteristics:
Strengths:
Characteristics:
Approach: Adversarial testing by humans
Benchmarks:
Methods: