Leaderboard values move quickly. Treat this page as a static April 25, 2026 web-search snapshot, not a live ranking API.
| Model | Provider | Arena Elo | Artificial Analysis Score | GPQA | SWE-bench | ARC-AGI-2 | Context (tokens) | Best Use |
|---|---|---|---|---|---|---|---|---|
| Claude Mythos | Anthropic | 1497 | 54 | 90.2% | 79.5% | 72.0% | ~1M | Creative reasoning, narrative synthesis, long-context workflows |
| Claude Opus 4.6 | Anthropic | 1503 | 53 | 87.4-91.3% | 80.8% | 68.8% | ~1M | Complex coding, long tasks, professional writing |
| Gemini 3.1 Pro Preview | Google DeepMind | 1492-1494 | 57 | 94.3% | ~77-80.6% | 77.1% | ~1M-2M | Science reasoning, multimodal, long context |
| GPT-5.4 | OpenAI | 1485 | 57 | 83.9-92.8% | ~79-80% | 73.3% | ~1M | General agents, balanced production workflows |
| Grok 4.1 / 4.20 Beta | xAI | 1471-1496 | 49 (family) | Inconsistent across sources | Inconsistent across sources | Inconsistent across sources | ~2M (reported, family) | Fast frontier chat, current-events workflows |
| Qwen 3.5 / GLM 5 | Alibaba / Zhipu | ~1449-1456 | 50 / 51 (family) | Varies by source | Varies by source | Varies by source | 200K-1M (family range) | Open or accessible alternatives near the frontier |
| DeepSeek V4 / R1 | DeepSeek AI | Pending independent V4 data | 42 (family, pre-V4) | Reported competitive | Reported competitive | Not verified here | 1M (V4, claimed) | Open-source / low-cost reasoning, pending validation |
Ranges appear where sources disagree or use different benchmark variants. This page does not name a single overall winner: April 2026 sources show the frontier converging across Claude, Gemini, and GPT.
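Where a cell shows a range, the underlying sources report different point values for the same benchmark. Below is a minimal sketch of one way to store and reconcile that, assuming hypothetical names (`ScoreRange`, `merge_reports`, `sources_disagree` are illustrative, not taken from any leaderboard's API):

```python
from dataclasses import dataclass

# Hypothetical container for one benchmark cell; the field names are
# illustrative, not from any cited source.
@dataclass(frozen=True)
class ScoreRange:
    low: float
    high: float

    def __str__(self) -> str:
        # Collapse to a single value when all sources agree.
        if self.low == self.high:
            return f"{self.low:g}"
        return f"{self.low:g}-{self.high:g}"

def merge_reports(values: list[float]) -> ScoreRange:
    """Fold several sources' point estimates into one displayed range."""
    return ScoreRange(min(values), max(values))

def sources_disagree(r: ScoreRange, tolerance: float = 0.5) -> bool:
    """Flag cells whose spread exceeds a chosen tolerance."""
    return (r.high - r.low) > tolerance

# Example: Gemini 3.1 Pro's Arena Elo, reported as 1492 and 1494.
elo = merge_reports([1492.0, 1494.0])
print(elo, sources_disagree(elo))  # -> 1492-1494 True
```

Keeping the range rather than averaging it preserves the disagreement for the reader, which is the point of a snapshot page like this one.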
Source notes, one per cached source:

- Cached Arena Elo data: Claude Opus 4.6 at 1503, Gemini 3.1 Pro around 1492-1494, GPT-5.4 at 1485.
- Reports an Artificial Analysis score tie between Gemini 3.1 Pro and GPT-5.4, plus benchmark details for Gemini and Claude.
- Summarizes Artificial Analysis family scores: GPT and Gemini at 57, Claude at 53, GLM and Qwen close behind.
- Compares Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 on SWE-bench, ARC-AGI-2, and pricing.
- Reports developer-oriented numbers for GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on SWE-bench, GPQA, ARC-AGI-2, and context.
- Reports the DeepSeek V4 release and notes that independent benchmarks are still needed to verify its performance claims.
Method: web-search snapshot compiled on April 25, 2026. Public model benchmarks are not directly comparable unless they use the same harness, date, settings, and model version. Use this page as a dashboard, not as procurement proof.
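That comparability caveat can be enforced mechanically: only rank scores whose run metadata matches. Here is a rough sketch under assumed field names (`harness`, `run_date`, `settings`, `model_version` are hypothetical; no public leaderboard exposes exactly this schema):

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical run metadata; real leaderboards expose different schemas.
@dataclass(frozen=True)
class BenchRun:
    model: str
    benchmark: str
    score: float
    harness: str        # evaluation scaffold used for the run
    run_date: str       # ISO date the run was executed
    settings: str       # sampling / tooling configuration label
    model_version: str  # exact model snapshot evaluated

def comparable_groups(runs: list[BenchRun]) -> list[list[BenchRun]]:
    """Group runs so scores are only ranked within an identical
    (benchmark, harness, date, settings) configuration."""
    key = lambda r: (r.benchmark, r.harness, r.run_date, r.settings)
    return [list(g) for _, g in groupby(sorted(runs, key=key), key=key)]
```

Two SWE-bench runs taken with different harnesses then land in separate groups and are never ranked against each other, which is exactly the restriction the method note describes.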