APRIL 25 2026 SNAPSHOT :: CLAUDE OPUS 4.6 LEADS ARENA ELO :: GEMINI 3.1 PRO AND GPT-5.4 TIE TOP ARTIFICIAL ANALYSIS FAMILY SCORES :: FRONTIER MODELS CONVERGING :: DEEPSEEK V4 RELEASE REPORTED, INDEPENDENT BENCHMARKS PENDING ::
DATA_DATE: 25 APR 2026 :: PUBLIC_WEB_SOURCES

AI MODEL BENCHMARK INTELLIGENCE GRID.

A cyber-styled static dashboard collecting current public benchmark signals for frontier AI models: human-preference Elo, composite intelligence scores, coding, reasoning, context, and value.

// ARENA LEADER
1503
CLAUDE OPUS 4.6 ELO
// AA TOP SCORE
57
GPT / GEMINI FAMILY TIE
// GPQA SIGNAL
94.3%
GEMINI 3.1 PRO REPORTED
// CODE SIGNAL
80.8%
CLAUDE OPUS 4.6 SWE-BENCH
OVERALL_LEADERBOARD
HUMAN PREFERENCE ELO
01
Claude Opus 4.6
Anthropic :: Arena Elo snapshot
1503
02
Gemini 3.1 Pro Preview
Google DeepMind :: near-frontier tie
1494
03
Gemini 3 Pro
Google DeepMind :: previous flagship
1486
04
GPT-5.4
OpenAI :: frontier general model
1485
05
GPT-5.2
OpenAI :: strong chat baseline
1481
06
Gemini 3 Flash
Google DeepMind :: speed class
1474
// BEST OVERALL SIGNAL
CLAUDE
Highest Arena Elo in the cached April leaderboard data.
// BEST REASONING SIGNAL
GEMINI
Reports show strong GPQA and ARC-AGI-2 numbers.
// BEST VALUE SIGNAL
OPEN / FAST
DeepSeek, Qwen, GLM, and Flash-class models compete on cost.

Leaderboard values move quickly. Treat this page as a static April 25, 2026 web-search snapshot, not a live ranking API.
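If you mirror this snapshot into your own tooling rather than treating it as live data, one possible row shape looks like the sketch below. The field names are hypothetical; no cited source publishes this schema.

```typescript
// Hypothetical shape for one cached leaderboard row. Field names are
// illustrative only; no cited source publishes this schema.
interface ArenaSnapshotEntry {
  rank: number;          // position in the cached leaderboard
  model: string;         // e.g. "Claude Opus 4.6"
  provider: string;      // e.g. "Anthropic"
  arenaElo: number;      // cached human-preference Elo
  note: string;          // short positioning note from the card
  snapshotDate: string;  // ISO date of the web-search snapshot
}

// The top row of the April 25, 2026 snapshot above.
const topEntry: ArenaSnapshotEntry = {
  rank: 1,
  model: "Claude Opus 4.6",
  provider: "Anthropic",
  arenaElo: 1503,
  note: "Arena Elo snapshot",
  snapshotDate: "2026-04-25",
};
```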

BENCHMARK_MATRIX
FRONTIER MODEL SIGNALS
Model | Provider | Arena Elo | AA / Family Score | GPQA | SWE-bench | ARC-AGI-2 | Context | Best Use
Claude Mythos | Anthropic | 1497 | 54 | 90.2% | 79.5% | 72.0% | ~1M | Creative reasoning, narrative synthesis, long-context workflows
Claude Opus 4.6 | Anthropic | 1503 | 53 | 87.4-91.3% | 80.8% | 68.8% | ~1M | Complex coding, long tasks, professional writing
Gemini 3.1 Pro Preview | Google DeepMind | 1492-1494 | 57 | 94.3% | ~77-80.6% | 77.1% | ~1M-2M | Science reasoning, multimodal, long context
GPT-5.4 | OpenAI | 1485 | 57 | 83.9-92.8% | ~79-80% | 73.3% | ~1M | General agents, balanced production workflows
Grok 4.1 / 4.20 Beta | xAI | 1471-1496 | 49 (family) | Not consistent | Not consistent | Not consistent | ~2M reported (family) | Fast frontier chat, current-events workflows
Qwen 3.5 / GLM 5 | Alibaba / Zhipu | ~1449-1456 | 50 / 51 (family) | Varies | Varies | Varies | 200K-1M (family range) | Open or accessible alternatives near frontier
DeepSeek V4 / R1 | DeepSeek AI | Independent V4 pending | 42 (family, pre-V4) | Reported competitive | Reported competitive | Not verified here | 1M claimed for V4 | Open-source / low-cost reasoning, pending validation

Ranges appear where sources disagree or use different benchmark variants. The page avoids claiming a single universal winner because April 2026 sources show frontier convergence across Claude, Gemini, and GPT.
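Where a cell reads like 87.4-91.3%, the underlying sources used different harness variants. A minimal sketch of carrying such cells as ranges instead of collapsing them to one number (the helper below is hypothetical, not from any cited source):

```typescript
// Represent a benchmark cell as a range so disagreeing sources
// are preserved instead of being silently averaged away.
type ScoreRange = { low: number; high: number };

// Parse cells like "87.4-91.3%" or "80.8%" into a range.
function parseScoreCell(cell: string): ScoreRange {
  const nums = cell.replace(/[%~]/g, "").split("-").map(Number);
  const low = nums[0];
  const high = nums.length > 1 ? nums[1] : nums[0];
  return { low, high };
}

console.log(parseScoreCell("87.4-91.3%")); // { low: 87.4, high: 91.3 }
console.log(parseScoreCell("80.8%"));      // { low: 80.8, high: 80.8 }
```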

TASK_ROUTER
PRACTICAL PICKS
// CODEBASE REPAIR
CLAUDE
Opus 4.6 has the strongest repeated SWE-bench signal in the collected sources.
// SCIENCE REASONING
GEMINI
Gemini 3.1 Pro is repeatedly reported as leading GPQA and ARC-AGI-2 style tasks.
// BALANCED AGENTS
GPT-5.4
Strong aggregate score and agentic execution signals, with competitive pricing reports.
// BUDGET / OPEN
DEEPSEEK
V4's release is newly reported; watch for independent benchmark confirmation before committing.
// LONG CONTEXT
GEMINI / CLAUDE
Both families show strong long-context positioning, depending on task and retrieval quality.
// HUMAN PREFERENCE
CLAUDE
Claude Opus 4.6 leads the Arena Elo snapshot used for this page.
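The picks above reduce to a lookup from task category to model family. A minimal routing sketch, using this page's own categories (the function and keys are hypothetical):

```typescript
// Map each task category from the picks above to its model family.
const taskRouter: Record<string, string> = {
  "codebase-repair":   "Claude Opus 4.6",
  "science-reasoning": "Gemini 3.1 Pro",
  "balanced-agents":   "GPT-5.4",
  "budget-open":       "DeepSeek V4 (pending independent validation)",
  "long-context":      "Gemini 3.1 Pro or Claude Opus 4.6",
  "human-preference":  "Claude Opus 4.6",
};

function pickModel(task: string): string {
  // Fall back to the Arena Elo leader when the task is unmapped.
  return taskRouter[task] ?? "Claude Opus 4.6";
}

console.log(pickModel("codebase-repair")); // "Claude Opus 4.6"
```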
SOURCE_AUDIT
WEB SEARCH 25 APR 2026
LMMCap Arena Elo Benchmark

Cached Arena Elo data: Claude Opus 4.6 at 1503, Gemini 3.1 Pro around 1492-1494, GPT-5.4 at 1485.

BuildFastWithAI April 2026 Model Ranking

Reports Artificial Analysis score tie between Gemini 3.1 Pro and GPT-5.4, plus benchmark details for Gemini and Claude.

VibeCarats Model Families

Summarizes Artificial Analysis family scores: GPT and Gemini at 57, Claude at 53, GLM and Qwen close behind.

LearnAIForge Coding Comparison

Compares Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 on SWE-bench, ARC-AGI-2, and pricing.

Developer Benchmark 2026

Reports GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro developer-oriented numbers for SWE-bench, GPQA, ARC-AGI-2, and context.

AP News: DeepSeek V4

Reports DeepSeek V4 release and notes that independent benchmarks are still needed to verify performance claims.

Method: web-search snapshot compiled on April 25, 2026. Public model benchmarks are not directly comparable unless they use the same harness, date, settings, and model version. Use this page as a dashboard, not as procurement proof.
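The comparability rule in the method note can be made mechanical. A sketch that refuses to rank two results unless harness, date, and settings all match and each result pins an exact model version (the types are hypothetical; the rule itself is the note's own):

```typescript
// One benchmark result as published by some source.
interface BenchResult {
  model: string;
  score: number;
  harness: string;       // e.g. a named SWE-bench variant
  runDate: string;       // when the benchmark was run
  settings: string;      // prompt/tooling configuration label
  modelVersion: string;  // exact model checkpoint identifier
}

// Only compare when the method note's conditions all hold: each
// result pins an exact model version, and both runs share the
// same harness, date, and settings.
function comparable(a: BenchResult, b: BenchResult): boolean {
  return (
    a.modelVersion !== "" &&
    b.modelVersion !== "" &&
    a.harness === b.harness &&
    a.runDate === b.runDate &&
    a.settings === b.settings
  );
}
```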