Question 1

Which benchmarks does Sakana Fugu lead on?

Accepted Answer

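Per the table on this page, Sakana Fugu Ultra posts the top score on all six benchmarks listed: SWE-Bench Pro (73.7 vs 69.2 for GPT-5.5), LiveCodeBench (93.2), GPQA-D (95.5), Humanity's Last Exam (50.0), MATH-500 (99.0), and AIME 2025 (90.0). The balanced Sakana Fugu tier leads the single frontier models on LiveCodeBench (92.9) and MATH-500 (98.6) and ties GPT-5.5 on AIME 2025 (86.7); on SWE-Bench Pro, GPQA-D, and Humanity's Last Exam at least one single model scores higher. Sakana Fugu Ultra's multi-agent orchestration therefore outperforms each listed single frontier model across engineering, science, and mathematics benchmarks.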

Question 2

How does Sakana Fugu beat larger models with a 7B conductor?

Accepted Answer

Sakana Fugu's 7B conductor model does not compete on intelligence; it competes on coordination. The conductor learns which frontier models to activate for each sub-task, how agents should communicate, and when to verify answers. This learned orchestration lets Sakana Fugu improve on the frontier models it coordinates without training a new frontier model.
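The coordination idea described above can be illustrated with a toy sketch. Everything here is hypothetical: the worker functions, the `Step` type, and `conduct` are invented for illustration, and the routing is a fixed lookup table rather than the learned policy the answer describes. It is not Sakana Fugu's implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical workers: plain functions standing in for frontier-model calls.
def code_worker(task: str) -> str:
    return f"[code] {task}"


def math_worker(task: str) -> str:
    return f"[math] {task}"


WORKERS: Dict[str, Callable[[str], str]] = {
    "code": code_worker,
    "math": math_worker,
}


@dataclass
class Step:
    kind: str     # which worker should handle the sub-task
    task: str     # the sub-task text
    verify: bool  # whether the conductor requests a second opinion


def conduct(steps: List[Step]) -> List[str]:
    """Route each sub-task to a worker; optionally mark it as reviewed by another."""
    results = []
    for step in steps:
        answer = WORKERS[step.kind](step.task)
        if step.verify:
            # Toy verification: pick the first worker that did not produce the answer.
            reviewer = next(name for name in WORKERS if name != step.kind)
            answer = f"{answer} (checked by {reviewer})"
        results.append(answer)
    return results


plan = [Step("code", "fix bug", verify=True), Step("math", "2+2", verify=False)]
print(conduct(plan))
```

The point of the sketch is only that the conductor's output is a plan (who works on what, and when to verify), not an answer of its own.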

Question 3

Are Sakana Fugu benchmarks independently verified?

Accepted Answer

Not yet. The benchmark numbers on this page are vendor-reported by Sakana AI. Sakana Fugu's orchestration approach is built on two peer-reviewed ICLR 2026 papers (TRINITY and the Conductor), which passed independent academic review, but that review covers the method rather than the scores reported here. Independent third-party evaluations of Sakana Fugu are expected as the product matures.

Question 4

What is the difference between Sakana Fugu and Fugu Ultra benchmarks?

Accepted Answer

Sakana Fugu (balanced) prioritizes latency alongside performance, scoring slightly lower on benchmarks but responding faster. Sakana Fugu Ultra maximizes quality by coordinating deeper agent pools and spending more compute per request. The gap is largest by far on SWE-Bench Pro (59.0 vs 73.7, a 14.7-point difference). On the other benchmarks it ranges from 0.3 points on LiveCodeBench (92.9 vs 93.2) to 3.5 points on GPQA-D (92.0 vs 95.5), with Humanity's Last Exam in between (47.2 vs 50.0, 2.8 points).
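The size of the Fugu vs Fugu Ultra gap on each benchmark can be recomputed from the table on this page. The snippet below is a minimal sketch: the scores are copied from that table, while the names `SCORES` and `ultra_gaps` are made up for this example.

```python
# Scores copied from the table on this page: (Fugu, Fugu Ultra).
SCORES = {
    "SWE-Bench Pro": (59.0, 73.7),
    "LiveCodeBench": (92.9, 93.2),
    "GPQA-D": (92.0, 95.5),
    "Humanity's Last Exam": (47.2, 50.0),
    "MATH-500": (98.6, 99.0),
    "AIME 2025": (86.7, 90.0),
}


def ultra_gaps(scores):
    """Return (benchmark, gap) pairs sorted from largest to smallest gap."""
    gaps = [
        (name, round(ultra - fugu, 1))  # round away floating-point noise
        for name, (fugu, ultra) in scores.items()
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


for name, gap in ultra_gaps(SCORES):
    print(f"{name}: {gap:+.1f}")
```

Sorting by gap makes it easy to see which benchmarks benefit most from the extra compute Fugu Ultra spends per request.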

Question 5

Does Sakana Fugu perform well on real-world tasks, not just benchmarks?

Accepted Answer

Yes. Sakana Fugu Ultra's reported real-world results include running 123 experiments over 14 hours to optimize training recipes (AutoResearch), generating a functional Rubik's Cube solver (300/300 cubes solved), achieving a +19.43% portfolio return in trading benchmarks, and finding 20+ code review issues where competitors found 3. Sakana Fugu excels on multi-step, agentic tasks.

Benchmark	Category	Fugu	Fugu Ultra	GPT-5.5	Opus 4.8	Gemini 3.1
SWE-Bench Pro	Engineering	59.0	73.7	69.2	62.5	54.2
LiveCodeBench	Coding	92.9	93.2	88.5	85.3	86.1
GPQA-D	Science	92.0	95.5	92.0	94.3	92.8
Humanity's Last Exam	Reasoning	47.2	50.0	49.8	41.4	43.6
MATH-500	Mathematics	98.6	99.0	97.8	96.4	97.2
AIME 2025	Mathematics	86.7	90.0	86.7	83.3	85.0
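To see how far Fugu Ultra sits above the best single model in each row, the margins can be computed directly from the table. This is a minimal sketch: the numbers are copied from the table above, and the names `MODELS`, `ROWS`, and `margin_over_best_single` are invented for this example.

```python
# Column order matches the table above.
MODELS = ["Fugu", "Fugu Ultra", "GPT-5.5", "Opus 4.8", "Gemini 3.1"]

ROWS = {
    "SWE-Bench Pro": [59.0, 73.7, 69.2, 62.5, 54.2],
    "LiveCodeBench": [92.9, 93.2, 88.5, 85.3, 86.1],
    "GPQA-D": [92.0, 95.5, 92.0, 94.3, 92.8],
    "Humanity's Last Exam": [47.2, 50.0, 49.8, 41.4, 43.6],
    "MATH-500": [98.6, 99.0, 97.8, 96.4, 97.2],
    "AIME 2025": [86.7, 90.0, 86.7, 83.3, 85.0],
}

ORCHESTRATED = ("Fugu", "Fugu Ultra")


def margin_over_best_single(rows):
    """Map each benchmark to (best single model, Fugu Ultra's margin over it)."""
    result = {}
    for bench, scores in rows.items():
        by_model = dict(zip(MODELS, scores))
        singles = {m: s for m, s in by_model.items() if m not in ORCHESTRATED}
        best = max(singles, key=singles.get)
        margin = round(by_model["Fugu Ultra"] - singles[best], 1)
        result[bench] = (best, margin)
    return result


for bench, (best, margin) in margin_over_best_single(ROWS).items():
    print(f"{bench}: Fugu Ultra {margin:+.1f} over {best}")
```

A positive margin in every row means Fugu Ultra holds the top score on each listed benchmark, though the margin on Humanity's Last Exam is small.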

Sakana Fugu Benchmarks: How It Stacks Up Against Frontier Models
