There’s a meeting that keeps happening. Engineering says the agent passed all nine suites. Product reads the scorecard. Someone (usually the CISO, sometimes the GC) asks, "So it’s shipped?"

The answer is no. Not yet.

What the suite actually tells you

The eval suite measures the agent against the failure modes you anticipated. Nine dimensions: tool-use, planning depth, hallucination, prompt injection, bias, drift, cost, and the two more your domain demands. Pass at the threshold and the agent is eligible: cleared to enter the next set of gates.

Eligibility is not deployment. We have to say this in writing because we keep getting asked.
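
For teams that want the distinction in code as well as in writing, here is a minimal sketch of an eligibility check. The SuiteResult type, the suite names, and the scores and thresholds are illustrative assumptions, not values from any real scorecard. The point is only that the function answers "eligible?" and nothing more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    name: str
    score: float      # assumption: higher is better
    threshold: float  # assumption: minimum passing score


def is_eligible(results: list[SuiteResult]) -> bool:
    """Eligible only if there are results and every suite meets its threshold."""
    return bool(results) and all(r.score >= r.threshold for r in results)


def failing_suites(results: list[SuiteResult]) -> list[str]:
    """Names of suites that scored below their threshold."""
    return [r.name for r in results if r.score < r.threshold]


# Hypothetical scorecard: one suite sits below threshold.
results = [
    SuiteResult("tool-use", 0.94, 0.90),
    SuiteResult("prompt-injection", 0.88, 0.90),
]
print(is_eligible(results))
print(failing_suites(results))
```

Note what the function does not return: a deploy decision. An eligible agent still has the shadow window, the regression budget, and the roll-back path ahead of it.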

What stands between eligibility and shipping

Three things, in this order:

  1. A shadow window. Two weeks of production traffic, scored in parallel, no user-visible output from the agent. You watch for the deltas the eval suite couldn’t predict because it was built before the agent saw real input.
  2. A regression budget. The first time you ship into production, you accept a budget for how much score you can lose in the first 30 days. Past that budget, the gate closes — automatically, contractually. Nobody negotiates with it.
  3. A roll-back path that a junior engineer can find at 3am. If your roll-back requires a Slack thread to identify, you don’t have one.
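
The regression budget in item 2 is the easiest of the three to automate. A minimal sketch, assuming a single aggregate score where higher is better; the function name, the score scale, and the budget value are hypothetical, and tracking the 30-day window is left to the caller.

```python
def gate_open(baseline: float, current: float, budget: float) -> bool:
    """Return True while the score lost since launch stays within the budget.

    baseline: score recorded at launch (assumption: higher is better)
    current:  latest production score
    budget:   maximum score loss agreed before launch
    """
    loss = baseline - current
    return loss <= budget


# Hypothetical numbers: launched at 0.92 with a 0.03 budget.
print(gate_open(baseline=0.92, current=0.90, budget=0.03))
print(gate_open(baseline=0.92, current=0.88, budget=0.03))
```

The function takes no override flag. That is deliberate: if the budget can be argued with at call time, it is not a budget.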

None of those are in the eval suite. The suite tells you what the agent does on canonical and adversarial inputs. The shadow window tells you what your actual users do. They are not the same thing. They are never the same thing.
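
One way to make that gap visible is to compare per-dimension scores from the eval suite with scores from the shadow window. This is a sketch under assumptions: both sets of scores use the same scale, higher is better, and the dimension names, scores, and tolerance are invented for illustration.

```python
def shadow_deltas(
    eval_scores: dict[str, float], shadow_scores: dict[str, float]
) -> dict[str, float]:
    """Shadow score minus eval score, for dimensions present in both."""
    return {
        name: shadow_scores[name] - eval_scores[name]
        for name in eval_scores
        if name in shadow_scores
    }


def flagged(deltas: dict[str, float], tolerance: float) -> list[str]:
    """Dimensions where the shadow score fell more than `tolerance` below eval."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Hypothetical scores from the suite and from two weeks of shadow traffic.
eval_scores = {"hallucination": 0.95, "tool-use": 0.93}
shadow_scores = {"hallucination": 0.85, "tool-use": 0.92}

deltas = shadow_deltas(eval_scores, shadow_scores)
print(flagged(deltas, tolerance=0.05))
```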

The pattern we see most

A team passes 8 of 9 suites. The 9th is borderline — usually prompt injection at L4, or hallucination on the long tail of grounded QA. They ship anyway, because the deadline was last week and 8 of 9 is "basically passing."

Three weeks in, the borderline suite degrades past threshold under real traffic distribution. The team now has two options: close the gate, which means rolling back an agent users have started to depend on, or rewrite the threshold, which means choosing what truth to redefine.

Both are losing moves. The winning move is to not ship a borderline-eligible agent. The borderline is information. Acting on it is the discipline.

What we actually do

When a client agent passes 8 of 9 and the 9th is borderline, our recommendation is: hold. Run the borderline suite again with a doubled case-count. Patch the prompt or the harness. Re-score. Ship when the 9th sits clearly inside threshold for three consecutive runs, on three different days, against three different SHAs.
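
The hold rule is mechanical enough to write down. A minimal sketch, assuming runs are recorded oldest to newest and that "clearly inside threshold" means clearing it by some agreed margin; the Run type, the margin, and the sample data are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Run:
    day: date
    sha: str
    score: float  # assumption: higher is better


def ready_to_ship(
    runs: list[Run], threshold: float, margin: float, required: int = 3
) -> bool:
    """Check the most recent `required` runs (list ordered oldest to newest).

    All of them must clear threshold + margin, fall on distinct days,
    and come from distinct SHAs.
    """
    if len(runs) < required:
        return False
    recent = runs[-required:]
    clearly_inside = all(r.score >= threshold + margin for r in recent)
    distinct_days = len({r.day for r in recent}) == required
    distinct_shas = len({r.sha for r in recent}) == required
    return clearly_inside and distinct_days and distinct_shas


# Hypothetical history for the borderline suite.
runs = [
    Run(date(2024, 5, 6), "a1b2c3", 0.93),
    Run(date(2024, 5, 7), "d4e5f6", 0.94),
    Run(date(2024, 5, 8), "0a9b8c", 0.95),
]
print(ready_to_ship(runs, threshold=0.90, margin=0.02))
```

Because only the latest three runs count, a single run that dips back toward the threshold resets the clock.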

That looks like dragging your feet. It is dragging your feet. The cost of dragging your feet on a borderline pass is a week. The cost of shipping a borderline pass and watching it degrade is the trust you spent six months building.

The rule

Passing the eval is the gate. The day you treat passing as shipping is the day you lose the trust you built getting there. The gate exists so that when you do ship, you can sign your name to it.

We sign our names to ours. That’s what the scorecard is for. That’s what the SHA is for. That’s the whole product.