Why we design the suite first

The suite is the contract. Writing the eval before the agent forces us to name what we’ll accept and what we won’t, in a document signed by the same people who later have to ship the agent. By the time the agent exists, the threshold is already on paper.

Designing the suite first also reveals which problems are well-posed and which aren’t. A surprising number of agent projects we audit fail at this step — the team can describe what they want the agent to do, but can’t write the test that would tell them if it did. We won’t build past that point.

What "in shadow" means

Shadow means the agent reads real production traffic and produces real outputs, but the user never sees them. We score the agent’s outputs against ground truth and against your existing system’s outputs, in parallel. Two weeks of this catches the distribution shifts that synthetic evals can’t.