Leaderboard values move quickly. Treat this page as a static April 25, 2026 web-search snapshot, not a live ranking API.
| Model | Provider | Arena Elo | Artificial Analysis Score | GPQA | SWE-bench | ARC-AGI-2 | Context (tokens) | Best Use |
|---|---|---|---|---|---|---|---|---|
| Claude Mythos | Anthropic | 1497 | 54 | 90.2% | 79.5% | 72.0% | ~1M | Creative reasoning, narrative synthesis, long-context workflows |
| Claude Opus 4.6 | Anthropic | 1503 | 53 | 87.4-91.3% | 80.8% | 68.8% | ~1M | Complex coding, long tasks, professional writing |
| Gemini 3.1 Pro Preview | Google DeepMind | 1492-1494 | 57 | 94.3% | ~77-80.6% | 77.1% | ~1M-2M | Science reasoning, multimodal, long context |
| GPT-5.4 | OpenAI | 1485 | 57 | 83.9-92.8% | ~79-80% | 73.3% | ~1M | General agents, balanced production workflows |
| Grok 4.1 / 4.20 Beta | xAI | 1471-1496 | 49 (family) | Inconsistent across sources | Inconsistent across sources | Inconsistent across sources | ~2M (reported, family) | Fast frontier chat, current-events workflows |
| Qwen 3.5 / GLM 5 | Alibaba / Zhipu | ~1449-1456 | 50 / 51 (family) | Varies by source | Varies by source | Varies by source | 200K-1M (family range) | Open or accessible alternatives near the frontier |
| DeepSeek V4 / R1 | DeepSeek AI | Pending independent V4 data | 42 (family, pre-V4) | Reported competitive | Reported competitive | Not verified here | 1M (V4, claimed) | Open-source / low-cost reasoning, pending validation |
Ranges appear where sources disagree or use different benchmark variants. This page does not name a single overall winner: April 2026 sources show the frontier converging across Claude, Gemini, and GPT.
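Where a cell shows a range, the underlying sources report different point values for the same benchmark. Below is a minimal sketch of one way to store and reconcile that, assuming hypothetical names (`ScoreRange`, `merge_reports`, `sources_disagree` are illustrative, not taken from any leaderboard's API):

```python
from dataclasses import dataclass

# Hypothetical container for one benchmark cell; the field names are
# illustrative, not from any cited source.
@dataclass(frozen=True)
class ScoreRange:
    low: float
    high: float

    def __str__(self) -> str:
        # Collapse to a single value when all sources agree.
        if self.low == self.high:
            return f"{self.low:g}"
        return f"{self.low:g}-{self.high:g}"

def merge_reports(values: list[float]) -> ScoreRange:
    """Fold several sources' point estimates into one displayed range."""
    return ScoreRange(min(values), max(values))

def sources_disagree(r: ScoreRange, tolerance: float = 0.5) -> bool:
    """Flag cells whose spread exceeds a chosen tolerance."""
    return (r.high - r.low) > tolerance

# Example: Gemini 3.1 Pro's Arena Elo, reported as 1492 and 1494.
elo = merge_reports([1492.0, 1494.0])
print(elo, sources_disagree(elo))  # -> 1492-1494 True
```

Keeping the range rather than averaging it preserves the disagreement for the reader, which is the point of a snapshot page like this one.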
Source notes, one per cached source:

- Cached Arena Elo data: Claude Opus 4.6 at 1503, Gemini 3.1 Pro around 1492-1494, GPT-5.4 at 1485.
- Reports an Artificial Analysis score tie between Gemini 3.1 Pro and GPT-5.4, plus benchmark details for Gemini and Claude.
- Summarizes Artificial Analysis family scores: GPT and Gemini at 57, Claude at 53, GLM and Qwen close behind.
- Compares Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 on SWE-bench, ARC-AGI-2, and pricing.
- Reports developer-oriented numbers for GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on SWE-bench, GPQA, ARC-AGI-2, and context.
- Reports the DeepSeek V4 release and notes that independent benchmarks are still needed to verify its performance claims.
Method: web-search snapshot compiled on April 25, 2026. Public model benchmarks are not directly comparable unless they use the same harness, date, settings, and model version. Use this page as a dashboard, not as procurement proof.
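That comparability caveat can be enforced mechanically: only rank scores whose run metadata matches. Here is a rough sketch under assumed field names (`harness`, `run_date`, `settings`, `model_version` are hypothetical; no public leaderboard exposes exactly this schema):

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical run metadata; real leaderboards expose different schemas.
@dataclass(frozen=True)
class BenchRun:
    model: str
    benchmark: str
    score: float
    harness: str        # evaluation scaffold used for the run
    run_date: str       # ISO date the run was executed
    settings: str       # sampling / tooling configuration label
    model_version: str  # exact model snapshot evaluated

def comparable_groups(runs: list[BenchRun]) -> list[list[BenchRun]]:
    """Group runs so scores are only ranked within an identical
    (benchmark, harness, date, settings) configuration."""
    key = lambda r: (r.benchmark, r.harness, r.run_date, r.settings)
    return [list(g) for _, g in groupby(sorted(runs, key=key), key=key)]
```

Two SWE-bench runs taken with different harnesses then land in separate groups and are never ranked against each other, which is exactly the restriction the method note describes.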