Decoding Sakana Fugu Technical Report

Authors
  • Amit Shekhar

In this blog, we are going to learn about Sakana Fugu, a family of AI models that work like a conductor for a team of other AI models.

We will cover the following:

  • What is Sakana Fugu?
  • Why Fugu was needed
  • The big picture: what Fugu does
  • Collective Intelligence
  • The two Fugus - Fugu and Fugu-Ultra
  • How Fugu picks the right model - the lightweight selection head
  • Teaching Fugu who is best - supervised fine-tuning
  • Polishing Fugu on real tasks - evolutionary strategies
  • How Fugu-Ultra conducts an orchestra - the Conductor
  • Teaching Fugu-Ultra to conduct - GRPO
  • Stopping the agents from copying each other
  • How well does Fugu perform?
  • The clever strategies Fugu discovered on its own
  • Quick Summary

I am Amit Shekhar, Founder @ Outcome School. I have taught and mentored many developers whose efforts landed them high-paying tech jobs, helped many tech companies solve their unique problems, and created many open-source libraries that are used by top companies. I am passionate about sharing knowledge through open-source, blogs, and videos.

I teach AI and Machine Learning at Outcome School.

Let's get started.

What is Sakana Fugu?

Sakana Fugu is a family of AI models. A Fugu model does not answer your question by itself. Instead, it reads your question and decides which other powerful AI models should work on it, how they should work together, and how to combine their answers into one final reply.

In simple words, Fugu is a conductor. A conductor does not play any instrument. The conductor reads the music, decides which musician should play, when, and how loud, and turns many players into one beautiful performance. Fugu does the same thing with AI models.

Fugu comes from Sakana AI, and the team describes it in the "Sakana Fugu Technical Report". Sakana AI is a research lab in Tokyo. One of its founders, Llion Jones, is a co-author of the original "Attention Is All You Need" paper, the paper that introduced the Transformer.

The paper calls models like Fugu orchestrator models. An orchestrator is the one who arranges many parts so that they work together as one. So, the whole goal of Fugu is to take a team of strong AI models and make them act as a single, smarter system.

Why Fugu was needed

The best way to learn this is by taking an example.

Today we have many powerful AI models, often called frontier models. Frontier just means they are at the leading edge, the most powerful and latest models available. We have GPT-5.5 from OpenAI, Claude Opus 4.8 from Anthropic, and Gemini 3.1 Pro from Google. Each one is very strong. But, here is the catch. No single model is the best at everything.

The paper observed clear specializations:

  • GPT models are very strong at math and at planning and combining ideas.
  • Claude Opus models are very strong at software engineering and at debugging, including finding security bugs.
  • Gemini models are very strong at directly implementing known algorithms and at science like chemistry and biology.

So, if we use only one model, we get its strengths, but we are also stuck with its weaknesses.

Now, the question is: how do we get the strengths of all of them at the same time? Let's build up the answer step by step.

Approach 1: Use one model for everything. This is the simplest approach. The issue with this approach is that we are stuck with one model's weak areas. Let's see how the next approach tries to solve this.

Approach 2: Build a fixed team by hand. We could write a fixed plan ourselves, for example, "Model X always writes the code, then Model Y always reviews it." The issue with this approach is that it is rigid. We have to design and tune it ourselves, and the same fixed plan is used for every question, even when it does not fit the question. Let's see how the next approach tries to solve this.

Approach 3: A simple router. We could train a small router that looks at the question and sends it to one model. This is better, but it is still single-step. It picks one model once, it cannot change its mind partway through the task, and it cannot make two models work together. The issue with this approach is that it is too limited. Let's see how Fugu solves this.

So, here comes Sakana Fugu to the rescue. Fugu is a learned orchestrator. It reads the question and builds a plan for that specific question - which models to involve, how they should talk to each other, and when to combine their answers. The plan is not fixed. It changes based on the question.

The big picture: what Fugu does

Before going into the details, let's see the simple input-to-output view.

You give Fugu a question. Fugu reads it, picks the right model or the right team of models from its pool, lets them work, and gives you back one final answer. From the outside, it feels like you are talking to a single model.

                  +-----------------------+
   "Write me      |    Fugu Orchestrator  |
    binary    --> |  (reads the question, | --> picks the right worker(s)
    search"       |    builds a plan)     |
                  +-----------------------+
                       |       |       |
                       v       v       v
                   +------+ +------+ +------+
                   | GPT  | | Opus | |Gemini|   <- pool of frontier workers
                   +------+ +------+ +------+
                       |       |       |
                       +-------+-------+
                               |
                               v
                          Final Answer

Here, we can see that Fugu sits in the middle. The user talks only to Fugu. Behind the scenes, Fugu uses a pool of frontier models as workers and combines their work into one answer.

Now, let's decode each piece one by one. The paper itself says that the complexity does not come from any single piece being hard. It comes from how the pieces are stacked together. So, if we understand each piece on its own, the whole system becomes simple.

Collective Intelligence

Collective intelligence is when a group of limited individuals, by working together, produces behavior that is smarter than any single member could be alone.

We see this everywhere. A single ant is simple, but an ant colony builds complex nests. A single person knows a little, but a good team solves big problems. Fugu brings this idea to AI models. Each frontier model is one expert. Fugu turns them into a team.

The paper makes a bigger point here. It treats orchestration as a new scaling axis. Until now, we made AI better mostly by training one bigger and more expensive model. Fugu says there is another way to get better: make many existing models cooperate well.

This brings real benefits:

  • When a new model is released, we just add it to the pool. We do not retrain anything.
  • We can remove a model, or favor one provider, or block a model for privacy or policy reasons, without retraining.

This flexibility matters in the real world too. A model can become unavailable for many reasons, such as access restrictions, outages, or price changes. Because Fugu's pool of models is swappable, it can simply route around any model that becomes unavailable. The team keeps working even if one member leaves.
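
To make the "route around" idea concrete, here is a minimal Python sketch. It assumes the orchestrator gives one score per worker and that we know which workers can be called right now. The function name, the worker names, and the scores are hypothetical and are not taken from the paper.

```python
def pick_available_worker(scores, available):
    """Return the highest-scoring worker that can be called right now.

    scores: dict of worker name -> orchestrator score (hypothetical values).
    available: set of worker names that are currently reachable and allowed.
    """
    candidates = {name: s for name, s in scores.items() if name in available}
    if not candidates:
        raise RuntimeError("no worker in the pool is available")
    return max(candidates, key=candidates.get)


scores = {"gpt": 2.1, "opus": 1.4, "gemini": 0.3}

# Normal day: every worker is reachable, so the top score wins.
print(pick_available_worker(scores, {"gpt", "opus", "gemini"}))  # gpt

# "gpt" is blocked or down: the next best available worker is used instead.
print(pick_available_worker(scores, {"opus", "gemini"}))  # opus
```

Here, the scores themselves never change. We only filter the pool before picking, which is why no retraining is needed in this sketch.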

The two Fugus - Fugu and Fugu-Ultra

The paper releases two versions, built for two different needs. The trade-off is between quality and latency. Latency means the time you wait for an answer.

  • Fugu balances quality with speed. For each input, it picks a single best worker model. So it is about as fast as just calling one model. It is meant for everyday use, like coding and chat.
  • Fugu-Ultra aims for the highest quality on the hardest problems. It builds a plan that uses multiple models per input. This is slower, but it gives better answers. It is meant for the most complex tasks.

Let me tabulate the differences between Fugu and Fugu-Ultra so that you can decide which one to use for your use case.

Point            | Fugu                       | Fugu-Ultra
-----------------+----------------------------+----------------------------
Goal             | balance speed and quality  | maximum quality
Models per input | one worker                 | many workers in a plan
Speed            | fast, like one model       | slower
Built on         | Trinity                    | Conductor
Best for         | everyday coding and chat   | hardest multi-step problems

Here, Trinity and Conductor are two earlier research papers from Sakana AI. Fugu builds on Trinity, and Fugu-Ultra builds on Conductor. We will understand both as we decode each variant.

How Fugu picks the right model - the lightweight selection head

Let's first understand the fast variant, Fugu.

Fugu uses a normal language model as its backbone, which means its brain. A language model is an AI that reads text and predicts what comes next, word by word. It is pre-trained, which means it has already learned from a huge amount of text before we use it. When you type a question, the language model reads it and forms a hidden state, which is simply a list of numbers that captures the meaning of what the model has read. We call it h.

Normally, a language model has an LM head that turns the hidden state into the next word. Fugu adds a second small head right next to it, called the lightweight selection head. This head takes the hidden state h and outputs L numbers called logits, one score for each worker model in the pool. The model with the highest score is selected.

  Question --> [ Language Model backbone ] --> hidden state h
                                                  |
                            +---------------------+---------------------+
                            |                                           |
                       [ LM Head ]                        [ Lightweight Selection Head ]
                    (writes normal text)                              |
                                                          L scores, one per worker model
                                                                      |
                                                          pick the highest -> dispatch

Here, we can see two heads reading the same hidden state. The LM head is the normal one that writes text. The selection head is the new one that scores the worker models.

Let's define the symbols:

  • h is the hidden state, the list of numbers that captures the meaning of the question.
  • L is the number of worker models in the pool.
  • logits are the raw scores, one per model, before turning them into probabilities.

Now, here is the key trick that makes Fugu fast. Fugu uses the logits, not generated text. It does not need to write out a full answer to make its decision. It computes the hidden state at an early position, applies the selection head, reads the scores, picks the highest, and immediately sends your question to the chosen worker. Because it skips the slow word-by-word writing, the decision is very fast. This is why Fugu has low latency.
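
The selection step can be sketched in a few lines of Python. This is only an illustration: it assumes the selection head is a single linear layer, and the hidden state, weights, and worker names are made-up toy values, not the real ones from the paper.

```python
def selection_head(h, weights, bias):
    """Linear head: turns hidden state h into one logit per worker model."""
    return [
        sum(w_i * h_i for w_i, h_i in zip(row, h)) + b
        for row, b in zip(weights, bias)
    ]


def select_worker(h, weights, bias, workers):
    """Pick the worker with the highest logit. No text is generated."""
    logits = selection_head(h, weights, bias)
    best = max(range(len(workers)), key=lambda j: logits[j])
    return workers[best], logits


workers = ["gpt", "opus", "gemini"]   # L = 3 worker models
h = [0.5, -1.0, 2.0, 0.0]             # toy hidden state with 4 numbers
weights = [                           # one row of weights per worker
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0],
]
bias = [0.0, 0.0, 0.0]

chosen, logits = select_worker(h, weights, bias, workers)
print(logits)   # [1.5, -1.0, -0.25]
print(chosen)   # gpt
```

Here, we can see that the decision is just one small matrix multiplication followed by picking the highest score. There is no word-by-word writing anywhere.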

This is also where Fugu differs from Trinity, the earlier Sakana model it builds on. Trinity also assigned a role to the chosen model, like Thinker, Worker, or Verifier. Fugu drops the roles and always uses the chosen model as a worker. Fewer choices means a smaller and faster decision, which is exactly what we want for everyday use.

One more thing to notice. To adapt its brain cheaply, Fugu does not retrain the whole model. Think of the model as having millions of tiny dials. Instead of changing all of them, Fugu changes only a small set of adjustment dials and leaves the rest fixed. This keeps training small and fast. The math trick that picks which small set of dials to turn is built on SVD, which means Singular Value Decomposition, and this way of using it for fine-tuning comes from an earlier Sakana paper called Transformer-squared. We do not need to understand the trick itself. We just need to remember that it lets Fugu adjust a few dials instead of all of them. The selection head itself is also kept small and simple.

Teaching Fugu who is best - supervised fine-tuning

Now, Fugu must learn which model is best for which kind of question. The paper trains it in two stages. The first stage is supervised fine-tuning, often called SFT.

To train Fugu in this stage, the Fugu team collects a big set of questions where the correct answer is already known. For each question, they run every worker model several times and check how well each one did. They give each model a reward, a score for how well it solved the question.

The best way to learn this is by taking an example.

Suppose we have a pool of 3 worker models. Call them GPT, Opus, and Gemini. We give them a math question. We run each one a few times and average the scores. The reward is between 0 and 1, where 1 means it solved the question perfectly. Say we get:

  • GPT: average reward 0.9
  • Opus: average reward 0.6
  • Gemini: average reward 0.3

So our reward vector is r = (0.9, 0.6, 0.3).

Now, the simple approach would be to just say "GPT is the best, so the label is GPT." But this throws away useful information. It hides the fact that Opus was also quite good, and that Gemini was weak. So instead, the paper turns these scores into a soft target distribution using the softmax function. Softmax is a simple function that takes any list of scores and turns them into probabilities, which are all positive, all add up to 1, and keep the bigger scores bigger. Do not worry, we will compute it with real numbers.

The formula is:

p(j) = exp(r_j / T) / sum over all j' of exp(r_j' / T)

Where:

  • p(j) is the target probability for model j.
  • r_j is the average reward of model j.
  • T is the temperature, a knob that controls how sharp or how flat the result is.
  • exp(x) means e raised to the power x.
  • sum over all j' means we add up the term for every model.

Do not worry if this formula looks complex. We will break it down step by step with actual numbers. Let us use temperature T = 1, because dividing by 1 changes nothing.

Here, e is just a fixed number, about 2.718, that shows up everywhere in math. So exp(0.9) simply means 2.718 raised to the power 0.9. You do not need to compute this by hand. A calculator gives the value. What matters is the pattern: a bigger score gives a bigger exp value.

Step 1: Compute exp(x) for each score.

  • exp(0.9) = 2.460
  • exp(0.6) = 1.822
  • exp(0.3) = 1.350

Step 2: Add them up.

  • Sum = 2.460 + 1.822 + 1.350 = 5.632

Step 3: Divide each by the sum.

  • p(GPT) = 2.460 / 5.632 = 0.437
  • p(Opus) = 1.822 / 5.632 = 0.324
  • p(Gemini) = 1.350 / 5.632 = 0.240

So the soft target is (0.437, 0.324, 0.240). This is a probability distribution, which is just a set of numbers that say how likely each option is and that add up to 1, like all probabilities should.

Let's read this in plain words. The target puts about 44% of the weight on GPT, about 32% on Opus, and about 24% on Gemini. Notice that it did not say "always GPT." It kept the fact that Opus is also a strong backup. That is the richer signal that helps Fugu make robust choices when several models are similarly good.

What does the temperature T do? A smaller T makes the distribution sharper, putting more weight on the best model. A larger T makes it flatter, making the models more equal. It lets us control how strongly Fugu favors the top model.
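
Here is a small Python sketch of the soft target, so that we can play with the temperature ourselves. The function name soft_target is our own, and the rewards are the toy numbers from the example above.

```python
import math


def soft_target(rewards, temperature=1.0):
    """Softmax over average rewards: p(j) = exp(r_j / T) / sum of exp(r_j' / T)."""
    exps = [math.exp(r / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


rewards = [0.9, 0.6, 0.3]   # GPT, Opus, Gemini

# T = 1 reproduces the worked example: about (0.437, 0.324, 0.240).
print([round(p, 3) for p in soft_target(rewards, 1.0)])

# A small T sharpens the target toward GPT. A large T flattens it.
for t in (0.1, 10.0):
    print(t, [round(p, 3) for p in soft_target(rewards, t)])
```

Here, with T = 0.1 almost all of the weight goes to GPT, and with T = 10 the three models get nearly equal weight.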

Now Fugu is trained to predict this target. Just like we turned the rewards into probabilities, Fugu's own selection head scores are also passed through softmax, so Fugu also outputs a set of numbers that add up to 1, one for each model. We call this Fugu's predicted distribution. Fugu's predicted distribution is then compared to the soft target using KL divergence, which measures how far apart two probability distributions are.

The training loss is:

L_SFT = average over all questions of D_KL( p( . ) || pi( . | q ) )

Where:

  • pi( . | q ) is Fugu's own predicted distribution over the models for question q.
  • p( . ) is the soft target we just computed.
  • The dot inside p( . ) just means the whole set of numbers, one for every model.
  • D_KL is the KL divergence, the gap between the two distributions. The double bar || just separates the two distributions we are comparing, so D_KL( p || pi ) reads as the gap between distribution p and distribution pi.
  • We average over all questions in the training set.

Let's see KL divergence with numbers. Suppose right now Fugu predicts (0.5, 0.3, 0.2) for our question, while the target is (0.437, 0.324, 0.240). The KL divergence is computed as:

D_KL = sum over all j of p(j) * ln( p(j) / pi(j) )

Here, p(j) is the target probability for model j, and pi(j) is the single probability Fugu gives to model j. This pi is just a name for Fugu's prediction, not the number 3.14.

  • GPT term: 0.437 * ln(0.437 / 0.5) = 0.437 * ln(0.874) = 0.437 * (-0.135) = -0.059
  • Opus term: 0.324 * ln(0.324 / 0.3) = 0.324 * ln(1.080) = 0.324 * 0.077 = 0.025
  • Gemini term: 0.240 * ln(0.240 / 0.2) = 0.240 * ln(1.200) = 0.240 * 0.182 = 0.044

Add them: -0.059 + 0.025 + 0.044 = 0.010

So the KL divergence is about 0.010, a small positive number. It is small because Fugu's guess is already close to the target. Training nudges Fugu's prediction toward (0.437, 0.324, 0.240), which pushes the KL divergence toward 0. When the two distributions match exactly, the KL divergence is 0.

Here, ln is a standard math function on any calculator. You do not need to compute it by hand. We just plug the numbers in. The one thing to notice is that when the two numbers being compared are equal, ln of their ratio is 0, which is why a perfect match gives a KL divergence of 0. Remember, these are small numbers just for the sake of understanding. Real training uses a huge set of questions and many worker models.
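
The same KL calculation can be written as a short Python sketch. The function names are our own, and the batch at the end holds only our single toy question, just to show what "average over all questions" means.

```python
import math


def kl_divergence(p, pi):
    """D_KL(p || pi) = sum over j of p(j) * ln(p(j) / pi(j))."""
    return sum(p_j * math.log(p_j / pi_j) for p_j, pi_j in zip(p, pi) if p_j > 0)


def sft_loss(batch):
    """Average KL divergence over (target, prediction) pairs, one pair per question."""
    return sum(kl_divergence(p, pi) for p, pi in batch) / len(batch)


target = [0.437, 0.324, 0.240]   # soft target from the rewards
predicted = [0.5, 0.3, 0.2]      # Fugu's current guess

print(round(kl_divergence(target, predicted), 3))   # 0.01
print(kl_divergence(target, target))                # 0.0, a perfect match
print(round(sft_loss([(target, predicted)]), 3))    # 0.01
```

Here, we can see the two facts from above: the gap is small when the guess is close to the target, and it is exactly 0 when they match.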

Polishing Fugu on real tasks - evolutionary strategies

Stage 1 taught Fugu using single-step questions, where we can cleanly score each model. But real work is not single-step. Real coding happens over many turns - reading a code repository, editing files, running tools, reading the errors, and trying again. So Stage 2 trains Fugu on full, end-to-end tasks.

Here is the problem. Normally, AI models learn using gradient-based training. It needs a clear math signal at every small step that tells the model exactly which way to adjust itself. But in an end-to-end task, we only find out at the very end whether the task was completed. We get a single reward, 1 if done and 0 if not done. There is no clear step-by-step signal to follow through a long chain of tool calls and model handoffs. So normal gradient-based training does not fit well here.

So, here come evolutionary strategies to the rescue. The paper uses a method called sep-CMA-ES. The idea is simple, and it copies how nature improves through evolution: try many small variations, keep the ones that perform best, and move toward them. No gradients are needed. We only need to measure how well each variation does.

Let me explain the idea with a tiny example. Imagine Fugu's trained values are just a single number theta. In reality it is a long list of numbers, but one number is enough to see the idea.

Setup:

  • Current value, the parent: theta = 0.0
  • Step size: sigma = 1.0, which controls how far we explore
  • We will try 4 variations, called the population

Step 1: Create variations. We add small random nudges z to the parent.

  • z1 = 0.5, so candidate theta1 = 0.0 + 1.0 * 0.5 = 0.5
  • z2 = -0.3, so candidate theta2 = -0.3
  • z3 = 1.0, so candidate theta3 = 1.0
  • z4 = -0.8, so candidate theta4 = -0.8

Step 2: Measure fitness. We run each candidate on the end-to-end tasks and record the fraction of tasks it completed. This is the fitness.

  • theta1 = 0.5 gives fitness 0.8
  • theta2 = -0.3 gives fitness 0.5
  • theta3 = 1.0 gives fitness 0.9
  • theta4 = -0.8 gives fitness 0.2

Step 3: Keep the best. We rank them and keep the top 2 winners: theta3 with 0.9 and theta1 with 0.8. Their nudges were z3 = 1.0 and z1 = 0.5.

Step 4: Move the parent toward the winners. We combine the winning nudges using weights that favor the better one. Say the weights are 0.7 for the best and 0.3 for the second best.

  • New theta = 0.0 + 1.0 * (0.7 * 1.0 + 0.3 * 0.5) = 0.7 + 0.15 = 0.85

So the parent moved from 0.0 to 0.85, toward the region that completed more tasks. This loop repeats for many generations, and Fugu slowly gets better at choosing models for real, multi-turn work.
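
The four steps above fit in one small Python function. This is a toy sketch for a single number theta. The function name es_step is our own, and the nudges and fitness values are the ones from the example, passed in by hand instead of being sampled and measured.

```python
def es_step(theta, sigma, nudges, fitnesses, weights):
    """One toy evolution-strategy update for a single number theta.

    nudges: the nudges z_k that were tried.
    fitnesses: fitness measured for each candidate theta + sigma * z_k.
    weights: weights for combining the winners, best candidate first.
    """
    # Rank the nudges by the fitness of their candidates, best first.
    ranked = sorted(zip(fitnesses, nudges), key=lambda pair: pair[0], reverse=True)
    winners = [z for _, z in ranked[:len(weights)]]
    # Move the parent along the weighted combination of the winning nudges.
    return theta + sigma * sum(w * z for w, z in zip(weights, winners))


new_theta = es_step(
    theta=0.0,
    sigma=1.0,
    nudges=[0.5, -0.3, 1.0, -0.8],
    fitnesses=[0.8, 0.5, 0.9, 0.2],
    weights=[0.7, 0.3],
)
print(round(new_theta, 2))  # 0.85
```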

The actual formulas from the paper are:

theta_k   = theta_t + sigma_t * D_t * z_k,   with z_k drawn from a normal distribution
theta_next = theta_t + sigma_t * D_t * sum over j of ( w_j * z_j )

Where:

  • theta_t is the current parent at step t.
  • sigma_t is the step size.
  • D_t is a diagonal matrix that scales each direction on its own.
  • z_k is a random nudge. It is drawn from a normal distribution, which just means the nudges are usually small and only rarely large.
  • w_j are the weights for combining the best candidates.
  • z_j is the nudge of the j-th best candidate.

Do not worry about the diagonal matrix D_t. It is just a way of letting each dial be explored by a different amount. The core idea is exactly the tiny example above: try variations, keep the best, and move toward them.

What does "sep" mean in sep-CMA-ES? It means "separable." It tunes each value on its own, by keeping only the diagonal D_t, instead of tracking how every pair of values relates to each other. This makes it much cheaper and lets it handle a large number of values. This is the same evolutionary method used to train Trinity, the earlier model that Fugu's fast variant builds on.
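
To see the role of the diagonal D_t, here is a simplified Python sketch of one generation over a list of values. It is not the full sep-CMA-ES: as an assumption of this sketch, sigma and the diagonal stay fixed, while the real method also adapts them as it goes. The fitness function is a hypothetical stand-in for "fraction of tasks completed".

```python
import random


def sep_es_generation(theta, sigma, diag, fitness_fn, weights, population, rng):
    """One generation of a simplified separable evolution strategy.

    theta: list of current values (the parent).
    diag: one scale per value, playing the role of the diagonal of D_t.
    """
    scored = []
    for _ in range(population):
        z = [rng.gauss(0.0, 1.0) for _ in theta]          # nudge from a normal distribution
        candidate = [t + sigma * d * z_i for t, d, z_i in zip(theta, diag, z)]
        scored.append((fitness_fn(candidate), z))
    scored.sort(key=lambda item: item[0], reverse=True)   # best first
    winners = [z for _, z in scored[:len(weights)]]
    # theta_next = theta + sigma * D * sum over j of (w_j * z_j), one value at a time.
    return [
        t + sigma * d * sum(w * z[i] for w, z in zip(weights, winners))
        for i, (t, d) in enumerate(zip(theta, diag))
    ]


def toy_fitness(params):
    # Hypothetical fitness that is highest at (1.0, -2.0).
    return -((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2)


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(50):
    theta = sep_es_generation(theta, 0.3, [1.0, 2.0], toy_fitness,
                              [0.5, 0.3, 0.2], population=8, rng=rng)
print([round(t, 2) for t in theta])
```

Here, the second value gets a diagonal entry of 2.0, so it is explored twice as widely as the first one. Each value is scaled on its own, and no pair of values is tracked together.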

How Fugu-Ultra conducts an orchestra - the Conductor

Now, let's move to Fugu-Ultra. It is built on the Conductor framework, another research paper from Sakana AI.

While Fugu picks one model per step, Fugu-Ultra goes further. It writes a full plan, called an agentic workflow, that puts several models to work together on one question. When a model is set loose to take actions on its own, like reading files, running code, and reacting to the results, we call it an agent. Agentic just means it works in this active way.

What is an agentic workflow? It is a sequence of steps. Each step has three parts:

  • A subtask, which is a plain-language instruction, written by Fugu-Ultra, telling a worker what to do.
  • A worker id, which says which model should do that subtask.
  • An access list, which says which earlier results this worker is allowed to see.

So, Fugu-Ultra is not just choosing models. It is writing the instructions for each model, choosing who does what, and wiring up who can see whose work. Because it controls the access list, it can build many shapes of teamwork:

  • A simple chain, where one model works after another.
  • Best-of-N, where several models try the same thing and then one picks the best.
  • A tree, where several models work in parallel and then one combines their results.

Let's see a tree with an example.

        Subtask: "answer this hard question"
                   /              \
            (leaf)                  (leaf)
        Gemini attempt           GPT attempt
                   \              /
                    v            v
                Gemini aggregator
            (reads both through the access list,
             keeps the correct parts of each)
                        |
                        v
                   Final Answer

Here, we can see two models attempt the question on their own at the leaves, and then a third model, the aggregator, combines their work into one answer.
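
A workflow like this tree can be sketched as plain data plus a small loop. This is a minimal Python sketch under our own assumptions: the Step class, the use of step positions as access-list entries, and the stub workers are all hypothetical. The real workers are frontier models, and the paper's exact format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    subtask: str                                  # plain-language instruction
    worker_id: str                                # which model runs it
    access: list = field(default_factory=list)    # positions of earlier steps it may read


def run_workflow(steps, workers, question):
    """Run the steps in order. Each worker sees only the results on its access list."""
    results = []
    for step in steps:
        if any(i < 0 or i >= len(results) for i in step.access):
            raise ValueError("access list may only point to earlier steps")
        visible = [results[i] for i in step.access]
        results.append(workers[step.worker_id](question, step.subtask, visible))
    return results


def make_stub(name):
    # Stand-in for a real model call. It only reports what it was allowed to see.
    def worker(question, subtask, visible):
        return f"{name}: {subtask} (saw {len(visible)} earlier result(s))"
    return worker


workers = {"gemini": make_stub("gemini"), "gpt": make_stub("gpt")}

tree = [
    Step("attempt the question", "gemini"),                     # leaf
    Step("attempt the question", "gpt"),                        # leaf
    Step("combine the attempts", "gemini", access=[0, 1]),      # aggregator
]

results = run_workflow(tree, workers, "a hard trivia question")
print(results[-1])  # gemini: combine the attempts (saw 2 earlier result(s))
```

Here, changing only the access lists changes the shape of the teamwork. If step 1 gets access=[0] and step 2 gets access=[1], the same three steps become a simple chain.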

This is a real example from the paper. For a hard trivia question, Fugu-Ultra built exactly this tree. It asked Gemini and GPT to each attempt the question on their own, then asked Gemini again, this time as the aggregator, to read both attempts through the access list and combine them into one correct answer. Both leaf attempts were partly wrong, but the Gemini aggregator spotted the correct parts of each and produced a fully correct answer.

And here is the clever part. Fugu-Ultra chooses which model acts as the aggregator, based on the question. For the trivia question it used Gemini as the aggregator, because Gemini is strong at niche facts. For a hard math question it instead used GPT as the aggregator, because GPT is strong at math. Most older systems are forced to always use the same model as the final combiner, so they get stuck when that model is not the best one for the task. Fugu-Ultra avoids this by adapting on every question.

Teaching Fugu-Ultra to conduct - GRPO

How is Fugu-Ultra trained? With reinforcement learning. Reinforcement learning is a way of teaching by reward. The model tries something, gets a score for how well it did, and learns to do more of what earned a high score. The specific recipe used here is called GRPO, which means Group Relative Policy Optimization. It comes from the DeepSeekMath paper.

First, let's understand the reward. For each workflow Fugu-Ultra writes, it gets a reward r:

  • r = 0 if the workflow is malformed and cannot be read, which means the lists of subtasks, workers, and access lists are written in a broken format that the system cannot make sense of.
  • r = 1 if the workflow runs and the final answer is correct.
  • r = 0.5 if the workflow runs but the final answer is wrong.

So Fugu-Ultra is rewarded both for writing a valid plan and for getting the right answer.
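
The three cases can be written as a tiny Python function. The function name and the two boolean inputs are our own simplification. How the system checks the format and the answer is not shown here.

```python
def workflow_reward(is_well_formed, is_correct):
    """Reward for one workflow: 0 if malformed, 1 if correct, 0.5 if it runs but is wrong."""
    if not is_well_formed:
        return 0.0
    return 1.0 if is_correct else 0.5


print(workflow_reward(False, False))  # 0.0
print(workflow_reward(True, False))   # 0.5
print(workflow_reward(True, True))    # 1.0
```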

Now, the GRPO idea. For one question, Fugu-Ultra writes a group of G different workflows. Each one is scored, then judged relative to the group. A workflow that is better than the group average gets pushed up. A workflow that is worse than the group average gets pushed down. There is no need for a separate "judge" network, which makes training simpler and cheaper.

The key formula is the advantage:

A_i = ( r_i - mean(r_1, ..., r_G) ) / std(r_1, ..., r_G)

Where:

  • A_i is the advantage of workflow i, which is how much better or worse it is than the group.
  • r_i is the reward of workflow i.
  • mean is the average reward of the group.
  • std is the standard deviation, which measures how spread out the rewards are.

Let me explain with numbers. Suppose for one question Fugu-Ultra writes G = 4 workflows with these rewards:

  • Workflow 1: r1 = 1.0, valid and correct.
  • Workflow 2: r2 = 0.5, valid but wrong.
  • Workflow 3: r3 = 0.5, valid but wrong.
  • Workflow 4: r4 = 0.0, malformed.

Step 1: Compute the mean.

  • mean = (1.0 + 0.5 + 0.5 + 0.0) / 4 = 2.0 / 4 = 0.5

Step 2: Compute the standard deviation. Standard deviation is just a way to measure how spread out a set of numbers is. We find how far each reward is from the average, square those gaps so they are all positive, average them, and take the square root. A bigger result means the rewards are more spread out.

  • Differences from the mean: 0.5, 0, 0, -0.5
  • Square them: 0.25, 0, 0, 0.25
  • Average of the squares: (0.25 + 0 + 0 + 0.25) / 4 = 0.5 / 4 = 0.125
  • Standard deviation = sqrt(0.125) = 0.354

Step 3: Compute the advantage of each.

  • A1 = (1.0 - 0.5) / 0.354 = 0.5 / 0.354 = +1.41
  • A2 = (0.5 - 0.5) / 0.354 = 0
  • A3 = (0.5 - 0.5) / 0.354 = 0
  • A4 = (0.0 - 0.5) / 0.354 = -1.41

We divide by the standard deviation so that the advantages are on a fair scale, no matter how spread out this particular group of rewards happened to be.

So workflow 1, which was correct, gets a positive advantage and is reinforced. Workflow 4, which was malformed, gets a negative advantage and is discouraged. The two middling ones are neutral. Over many questions, Fugu-Ultra learns to write workflows that look like the ones that worked.
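
Here is the advantage calculation as a short Python sketch. The function name is our own. It uses the same standard deviation as the steps above, where we divide by the group size. Returning all zeros when every reward is equal is an assumption of this sketch, made only to avoid dividing by zero.

```python
import math


def group_advantages(rewards):
    """A_i = (r_i - mean) / std, computed within one group of workflows."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    if std == 0:
        # Every workflow scored the same, so none is better than the group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


rewards = [1.0, 0.5, 0.5, 0.0]   # the G = 4 workflows from the example
print([round(a, 2) for a in group_advantages(rewards)])  # [1.41, 0.0, 0.0, -1.41]
```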

Here, sqrt means square root. The full GRPO objective also has two more pieces: a clip that stops the update from changing the model too much in a single step, which is a safety limit, and a KL term that keeps the new model close to the original. The paper trained Fugu-Ultra without the KL term. These tiny numbers are just for the sake of understanding.

Stopping the agents from copying each other

Tools are outside helpers, like running code, searching the web, or reading a file. When several models act as agents and can call tools at any time, two problems appear. The paper solves both.

Problem 1: Orchestration collapse. If every agent can see the full history of what the first agent did, then the first agent's path becomes everyone's path. The later agents just follow along and repeat the same work, instead of bringing their own fresh ideas. The team collapses into one opinion.

Solution: Intra-workflow isolation. Inside one workflow, each agent only sees other agents' work through the access list that Fugu-Ultra set. Otherwise, an agent sees only its own past actions. This gives each agent the freedom to solve its subtask in its own way.

Problem 2: Forgetting everything. But complete isolation is also bad. Over a long, multi-turn conversation, agents should remember what tools were already called, so they do not repeat the same calls and rediscover the same facts again and again.

Solution: Persistent shared memory. Across workflows, agents share a memory of past tool calls. So, within a single workflow they stay independent, but across the whole conversation they share useful background.

   Workflow (one question)
   +-------------------------------------------+
   |  Agent A lane   Agent B lane   Agent C lane |
   |     |              |              |          |
   |  isolated      isolated      isolated        |   <- inside a workflow:
   |  (see others only via the access list)       |      stay independent
   +-------------------------------------------+
                        |
                        v
            +---------------------------+
            |  Persistent shared memory  |   <- across the conversation:
            |  (past tool calls, facts)  |      share useful background
            +---------------------------+

Here, we can see the balance. Inside a single workflow, the agents work in separate lanes so they do not copy each other. Across the conversation, they share one memory so they do not redo work that was already done. This balance is what keeps the team both diverse and efficient.
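
Both ideas can be sketched in a few lines of Python. Everything here is hypothetical: the class, the function, and the stand-in tool are our own names, and storing tool results in a dictionary is just one simple way to imagine a shared memory of past tool calls.

```python
class SharedToolMemory:
    """Conversation-wide record of tool calls, shared across workflows."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def call(self, tool_name, tool_fn, argument):
        key = (tool_name, argument)
        if key not in self._results:       # run the tool only the first time
            self._results[key] = tool_fn(argument)
            self.executions += 1
        return self._results[key]


def agent_view(agent_id, histories, outputs, access_list):
    """What one agent sees inside a workflow: its own past actions,
    plus the outputs of only the agents on its access list."""
    return {
        "own_actions": list(histories.get(agent_id, [])),
        "granted_outputs": {other: outputs[other] for other in access_list},
    }


def read_file(path):
    # Stand-in for a real tool.
    return f"<contents of {path}>"


memory = SharedToolMemory()
memory.call("read_file", read_file, "setup.py")   # first workflow reads the file
memory.call("read_file", read_file, "setup.py")   # a later workflow reuses the result
print(memory.executions)  # 1

histories = {"A": ["ran tests"], "B": ["read docs"], "C": []}
outputs = {"A": "patch v1", "B": "patch v2"}
print(agent_view("B", histories, outputs, access_list=[]))          # isolated
print(agent_view("C", histories, outputs, access_list=["A", "B"]))  # sees A and B
```

Here, agent B sees only its own actions, while agent C sees the outputs of A and B because its access list allows it. The file is read only once, even though it was requested twice.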

How well does Fugu perform?

Now, let's see the results. Both Fugu and Fugu-Ultra reach state-of-the-art performance, which means the best scores anyone has achieved so far, across a wide range of hard tests. These standard tests used to compare AI models are called benchmarks. In many cases, the Fugu models beat the very models that sit inside their own pool.

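Here are a few rows from the paper's results, with the best score in each row marked with an asterisk (*).

Benchmark                                  | Fugu-Ultra | Fugu  | Claude Opus 4.8 | Gemini 3.1 | GPT-5.5
-------------------------------------------+------------+-------+-----------------+------------+--------
SWE Bench Pro (fixing real software bugs)  | 73.7*      | 59.0  | 69.2            | 54.2       | 58.6
Terminal Bench 2.1 (using the terminal)    | 82.1*      | 80.2  | 74.6            | 70.3       | 78.2
GPQA Diamond (hard science)                | 95.5*      | 95.5* | 92.0            | 94.3       | 93.6
Humanity's Last Exam (very hard reasoning) | 50.0*      | 47.2  | 49.8            | 44.4       | 41.4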

Here, we can notice the most striking part. On many of these hard tests, Fugu-Ultra beats every model it uses. By combining them well, the team can be stronger than any single member. For example, on SWE Bench Pro, which is about fixing real software bugs, Fugu-Ultra scores 73.7, higher than Claude Opus 4.8 at 69.2, GPT-5.5 at 58.6, and Gemini 3.1 at 54.2.
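
To see the gap on every row at once, here is a small Python sketch that compares Fugu-Ultra with the best single model in each row. The numbers are the ones quoted from the paper's results above. Treating the three single-model columns as the pool is our own reading of the table.

```python
results = {
    "SWE Bench Pro": {"Fugu-Ultra": 73.7, "Fugu": 59.0,
                      "Claude Opus 4.8": 69.2, "Gemini 3.1": 54.2, "GPT-5.5": 58.6},
    "Terminal Bench 2.1": {"Fugu-Ultra": 82.1, "Fugu": 80.2,
                           "Claude Opus 4.8": 74.6, "Gemini 3.1": 70.3, "GPT-5.5": 78.2},
    "GPQA Diamond": {"Fugu-Ultra": 95.5, "Fugu": 95.5,
                     "Claude Opus 4.8": 92.0, "Gemini 3.1": 94.3, "GPT-5.5": 93.6},
    "Humanity's Last Exam": {"Fugu-Ultra": 50.0, "Fugu": 47.2,
                             "Claude Opus 4.8": 49.8, "Gemini 3.1": 44.4, "GPT-5.5": 41.4},
}
single_models = ["Claude Opus 4.8", "Gemini 3.1", "GPT-5.5"]

margins = {}
for benchmark, row in results.items():
    best_single = max(single_models, key=row.get)   # strongest single model on this row
    margins[benchmark] = round(row["Fugu-Ultra"] - row[best_single], 1)
    print(f"{benchmark}: best single = {best_single}, margin = {margins[benchmark]:+.1f}")
```

Here, the margin is positive on all four rows, from +0.2 points on Humanity's Last Exam to +4.5 points on SWE Bench Pro.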

Fugu-Ultra even beats some models that are not in its pool and are not publicly available. This supports the paper's main claim: orchestration is a real way to gain capability, in addition to training bigger models.

The paper also tested Fugu on fun, hard tasks beyond standard benchmarks:

  • Rubik's cube solver. Each model had to write code that solves a cube. Fugu models wrote code that solved all 300 test cubes. Fugu-Ultra found the shortest solutions, with a mean of 19.72 moves. For reference, 20 moves is proven to be enough to solve any cube position (this bound is often called God's number). Two of the three frontier models wrote code that crashed before solving even one cube.
  • Blindfold chess. With no board shown, only moves given in text, the model had to track the whole game in its head. Fugu won all four games, including one against a strong chess engine, and made no blunders.
  • Online stock trading. Over a 50-week simulation starting at 10,000 dollars, Fugu-Ultra grew it to 11,943 dollars, a gain of 19.43%, better than each individual model.

Fugu also did well on autonomous AI research, on reading the order of classical Japanese handwritten letters, and on generating a working mechanical part in CAD. In each case, the orchestrated team did better than any single model alone.

The clever strategies Fugu discovered on its own

The interesting part is that these strategies were not programmed. They emerged on their own during training. The paper observed three of them.

Debate and aggregation. Fugu-Ultra builds a tree, has models attempt the problem independently, then has a chosen model combine the best parts. We saw this earlier when we decoded the Conductor.
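
The shape of this strategy can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Fugu-Ultra implements it, because Fugu-Ultra writes such workflows itself at run time. The function `debate_and_aggregate` and the stub workers are hypothetical, and each worker is a plain function that takes a prompt and returns text.

```python
def debate_and_aggregate(question, solvers, aggregator):
    """Collect independent answers, then let one worker combine them."""
    # Each solver gets only the question, never another solver's answer.
    drafts = {name: solve(question) for name, solve in solvers.items()}
    listing = "\n".join(f"[{name}] {draft}" for name, draft in drafts.items())
    prompt = (
        f"Question: {question}\n"
        f"Candidate answers:\n{listing}\n"
        "Combine the best parts into one final answer."
    )
    # Only the aggregator sees every draft.
    return aggregator(prompt)


# Stub workers standing in for real model calls.
solvers = {
    "model_a": lambda question: "answer A",
    "model_b": lambda question: "answer B",
}
seen_prompts = []


def stub_aggregator(prompt):
    seen_prompts.append(prompt)  # record what the aggregator was shown
    return "merged answer"


final = debate_and_aggregate("What is 2 + 2?", solvers, stub_aggregator)
print(final)  # merged answer
```

Here, the solvers are the leaves of the tree and the aggregator is the root. The solvers stay independent, and only the root reads all the drafts.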

Build and debug. In coding tasks, Fugu-Ultra often uses GPT as the builder, and then brings in Opus at key moments to find bugs and security holes, because Opus is strong at debugging. Then it hands the findings back. For example, while building a small package server, Opus caught several concrete risks that GPT had missed, and relaying those findings back let GPT finish the job correctly.
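
The same pattern can be written as a small loop. Again, this is a hedged sketch and not the paper's code. The names `build_and_debug`, `stub_builder`, and `stub_reviewer` are hypothetical, the round limit is our own choice, and the finding "unchecked file path" is a made-up placeholder, since the concrete risks are not listed here.

```python
def build_and_debug(task, builder, reviewer, max_rounds=3):
    """Alternate between a builder and a reviewer until the reviewer finds nothing."""
    artifact = builder(task, [])            # first attempt, no findings yet
    for _ in range(max_rounds):
        findings = reviewer(artifact)       # reviewer looks for bugs and risks
        if not findings:
            break                           # nothing left to fix
        artifact = builder(task, findings)  # relay the findings back to the builder
    return artifact


def stub_builder(task, findings):
    """Stand-in for the builder model."""
    if not findings:
        return f"{task} v1"
    return f"{task} v2 (fixed: {', '.join(findings)})"


def stub_reviewer(artifact):
    """Stand-in for the reviewer model; the finding is an invented placeholder."""
    return ["unchecked file path"] if artifact.endswith("v1") else []


final = build_and_debug("package server", stub_builder, stub_reviewer)
print(final)  # package server v2 (fixed: unchecked file path)
```

Here, the builder keeps ownership of the work, and the reviewer only reports problems. The loop stops as soon as the reviewer has nothing more to report.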

Bringing in a specialist. Fugu-Ultra pulls in a specific model only when a special skill is needed. For example, in a cryptanalysis task, it used Opus, which is strong at security, to start the attack, and then used GPT, which is strong at math, to re-derive the exact math the attack needed.

The lesson here is simple. Fugu learned the fine-grained strengths of each model and how to combine them, like a good manager who knows exactly which team member to call for which part of a job.

Quick Summary

Let's recap what we have decoded, piece by piece:

  • Sakana Fugu is a family of orchestrator models that direct a team of frontier AI models instead of answering alone.
  • Fugu was needed because no single model is best at everything, and older fixed teams and simple routers were too rigid.
  • Collective intelligence means a well-coordinated team beats any single member, and orchestration is a new way to scale AI.
  • There are two variants: Fugu, which is fast and picks one worker per step, and Fugu-Ultra, which is high quality and uses many workers per question.
  • Fugu uses a lightweight selection head that reads a hidden state and scores each model, deciding without writing text, which makes it fast.
  • Fugu first learns who is best with supervised fine-tuning on a soft, softmax-based target, trained with KL divergence.
  • Fugu is then polished on real end-to-end tasks with evolutionary strategies, sep-CMA-ES, which need no gradients.
  • Fugu-Ultra writes full agentic workflows, with subtasks, worker ids, and access lists, built on the Conductor framework.
  • Fugu-Ultra is trained with GRPO, judging a group of workflows relative to each other, with a reward for valid and correct plans.
  • It keeps agents independent inside a workflow through isolation, while sharing memory across the whole conversation.
  • On many hard benchmarks, Fugu-Ultra beats every model in its own pool, and even some models outside it.
  • It discovered its own strategies: debate and aggregation, build and debug, and bringing in a specialist.

Now, we have decoded Sakana Fugu piece by piece and understood how a model can orchestrate a team of other models to achieve more than any single one of them alone.

That's it for now.

Thanks

Amit Shekhar
Founder @ Outcome School
