Performance Data — Updated June 2026

Sakana Fugu Benchmarks: How It Stacks Up Against Frontier Models

Detailed Sakana Fugu benchmark results across engineering, coding, science, reasoning, and mathematics — compared head-to-head against GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro.

Full Sakana Fugu Benchmark Table

Sakana Fugu and Sakana Fugu Ultra scores compared against the three leading frontier models. Higher scores indicate better performance on each benchmark.

Benchmark              Category     Fugu  Fugu Ultra  GPT-5.5  Opus 4.8  Gemini 3.1
SWE-Bench Pro          Engineering  59.0  73.7        69.2     62.5      54.2
LiveCodeBench          Coding       92.9  93.2        88.5     85.3      86.1
GPQA-D                 Science      92.0  95.5        92.0     94.3      92.8
Humanity's Last Exam   Reasoning    47.2  50.0        49.8     41.4      43.6
MATH-500               Mathematics  98.6  99.0        97.8     96.4      97.2
AIME 2025              Mathematics  86.7  90.0        86.7     83.3      85.0
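
The margins implied by the table above can be checked with a short script. This is only a sketch for reading the numbers: the scores are copied from the table (vendor-reported), and the names `SCORES`, `SINGLE_MODELS`, and `margin_over_best_single` are made up for this illustration.

```python
# Scores copied from the benchmark table above (vendor-reported figures).
SCORES = {
    "SWE-Bench Pro": {"Fugu": 59.0, "Fugu Ultra": 73.7, "GPT-5.5": 69.2, "Opus 4.8": 62.5, "Gemini 3.1": 54.2},
    "LiveCodeBench": {"Fugu": 92.9, "Fugu Ultra": 93.2, "GPT-5.5": 88.5, "Opus 4.8": 85.3, "Gemini 3.1": 86.1},
    "GPQA-D": {"Fugu": 92.0, "Fugu Ultra": 95.5, "GPT-5.5": 92.0, "Opus 4.8": 94.3, "Gemini 3.1": 92.8},
    "Humanity's Last Exam": {"Fugu": 47.2, "Fugu Ultra": 50.0, "GPT-5.5": 49.8, "Opus 4.8": 41.4, "Gemini 3.1": 43.6},
    "MATH-500": {"Fugu": 98.6, "Fugu Ultra": 99.0, "GPT-5.5": 97.8, "Opus 4.8": 96.4, "Gemini 3.1": 97.2},
    "AIME 2025": {"Fugu": 86.7, "Fugu Ultra": 90.0, "GPT-5.5": 86.7, "Opus 4.8": 83.3, "Gemini 3.1": 85.0},
}

# The three single frontier models used as the comparison baseline.
SINGLE_MODELS = ("GPT-5.5", "Opus 4.8", "Gemini 3.1")


def margin_over_best_single(benchmark, system="Fugu Ultra"):
    """Return (best single model, system score minus that model's score)."""
    row = SCORES[benchmark]
    best = max(SINGLE_MODELS, key=lambda model: row[model])
    # Round to one decimal to avoid floating-point noise in the difference.
    return best, round(row[system] - row[best], 1)


if __name__ == "__main__":
    for name in SCORES:
        best, margin = margin_over_best_single(name)
        print(f"{name}: {margin:+.1f} points vs {best}")
```

Passing `system="Fugu"` runs the same comparison for the balanced tier instead of Fugu Ultra.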

Why Sakana Fugu Outperforms Single Models

The key idea behind Sakana Fugu's benchmark results is specialization through coordination. No single frontier model excels at everything: GPT-5.5 is strong at code generation, Claude Opus excels at long-context reasoning, and Gemini leads on multimodal tasks. Sakana Fugu's conductor model has learned which model to activate for which sub-task, aiming to match the strongest available model on each sub-task through delegation.

On SWE-Bench Pro, the gap is among the widest in the table: Sakana Fugu Ultra scores 73.7 versus 69.2 for GPT-5.5 alone, the strongest single model on that benchmark. This 4.5-point improvement is presented as a generational leap, the kind of gain that normally requires training an entirely new, larger model. Sakana Fugu achieves this by having different agents handle code understanding, solution generation, and verification as separate coordinated steps.
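
To make the three coordinated steps concrete, here is a minimal sketch of an understand, generate, verify loop. It is not Sakana Fugu's implementation, which this page does not describe at the code level. The `Pipeline` class, the `Agent` type, the retry limit, and the stub agents are all assumptions for illustration; in a real system each stub would be a call to a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical agent type: takes a text prompt and returns text.
Agent = Callable[[str], str]


@dataclass
class Pipeline:
    understand: Agent                   # reads the issue, produces an analysis
    generate: Agent                     # proposes a patch from the analysis
    verify: Callable[[str, str], bool]  # (issue, patch) -> does the patch pass?
    max_attempts: int = 3               # assumed retry budget, not from the source

    def solve(self, issue: str) -> Optional[str]:
        analysis = self.understand(issue)
        for attempt in range(1, self.max_attempts + 1):
            patch = self.generate(f"{analysis}\nattempt {attempt}")
            if self.verify(issue, patch):
                return patch            # first patch that passes verification
        return None                     # no attempt passed verification


# Stub agents standing in for model calls (purely illustrative).
def understand(issue: str) -> str:
    return f"analysis of: {issue}"


def generate(prompt: str) -> str:
    # Derive a fake patch name from the attempt number at the end of the prompt.
    return "patch-v" + prompt.rsplit(" ", 1)[-1]


def verify(issue: str, patch: str) -> bool:
    # Pretend only the second attempt passes the checks.
    return patch == "patch-v2"


pipeline = Pipeline(understand, generate, verify)
result = pipeline.solve("off-by-one in pagination")
print(result)
```

Keeping verification as its own step means a failed check triggers another generation attempt rather than returning an unverified answer, which is the general shape the paragraph above describes.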

The Sakana Fugu advantage is reported to be even more pronounced on agentic, multi-step tasks. In AutoResearch benchmarks, Sakana Fugu Ultra ran 123 experiments over 14 hours autonomously. In trading benchmarks, it achieved a +19.43% mean portfolio return versus less than 15% for any single model. These results suggest that Sakana Fugu's orchestration scales with task complexity: the harder the problem, the more the multi-agent approach pays off.

Sakana Fugu Real-World Performance

Beyond standard benchmarks, Sakana Fugu has demonstrated exceptional performance on practical, multi-step tasks that reflect real-world AI usage:

Code Review Depth

20+ issues found

vs 3 issues by competing models

Sakana Fugu Ultra catches bugs across multiple categories simultaneously by assigning different review agents to different concern areas.

Research Automation

123 experiments in 14 hours

Fully autonomous

Sakana Fugu Ultra optimized model training recipes by running and evaluating experiments without human intervention.

Rubik's Cube Solver

300/300 cubes solved

vs crashes from competitors

Sakana Fugu generated a functional solver that handled all test cases, while competing single models produced crashing code.

Trading Strategy

+19.43% mean return

vs <15% for single models

Sakana Fugu Ultra's multi-agent coordination produced superior portfolio optimization strategies.

Sakana Fugu Benchmark FAQ

Frequently Asked Questions

Which benchmarks does Sakana Fugu lead on?
Sakana Fugu Ultra leads on all six benchmarks in the table: SWE-Bench Pro (73.7 vs 69.2 for GPT-5.5), LiveCodeBench (93.2), GPQA-D (95.5), Humanity's Last Exam (50.0), MATH-500 (99.0), and AIME 2025 (90.0). In these results, Sakana Fugu Ultra's multi-agent orchestration outperforms each single frontier model across engineering, coding, science, reasoning, and mathematics benchmarks.
How does Sakana Fugu beat larger models with a 7B conductor?
Sakana Fugu's 7B conductor model does not compete on intelligence — it competes on coordination. The conductor learns which frontier models to activate for each sub-task, how agents should communicate, and when to verify answers. This learned orchestration inside Sakana Fugu delivers a generational improvement without training a new frontier model.
Are Sakana Fugu benchmarks independently verified?
Sakana Fugu's orchestration approach is built on two peer-reviewed ICLR 2026 papers (TRINITY and the Conductor), which passed independent academic review. The benchmark numbers on this page are vendor-reported by Sakana AI. Independent third-party evaluations of Sakana Fugu are expected as the product matures.
What is the difference between Sakana Fugu and Fugu Ultra benchmarks?
Sakana Fugu (balanced) prioritizes latency alongside performance, scoring lower than Fugu Ultra on every benchmark in the table but responding faster. Sakana Fugu Ultra maximizes quality by coordinating deeper agent pools and spending more compute per request. The gap is widest on SWE-Bench Pro (59.0 vs 73.7, a 14.7-point difference). On Humanity's Last Exam it is 2.8 points (47.2 vs 50.0), and on LiveCodeBench and MATH-500 it is under half a point.
Does Sakana Fugu perform well on real-world tasks, not just benchmarks?
Yes — Sakana Fugu Ultra demonstrated real-world performance including: running 123 experiments over 14 hours to optimize training recipes (AutoResearch), generating a functional Rubik's Cube solver (300/300 cubes solved), achieving +19.43% portfolio return in trading benchmarks, and finding 20+ code review issues where competitors found 3. Sakana Fugu excels on multi-step, agentic tasks.