The 9 suites every agent must survive.
We design the eval suite before anyone writes a line of agent code. Then we score and stress-test LLMs and agents across nine dimensions — in shadow against your real production traffic. The scorecard is public, signed, and immutable.
Methodology
/ 4 phases
Suite design
We write the eval suite against your domain before any agent code lands. Nine canonical dimensions, plus the two or three you tell us to add. Adversarial cases get drafted in parallel.
Baseline + shadow build
Baseline the candidate agent against the suite. Wire it into your prod-shadow pipeline. Score the agent's output on every input it would see in production, against ground truth, in parallel with your live system.
Adversarial + cost
Run the adversarial suites. Score cost-per-decision and drift relative to the canonical run. Report deltas and surface every regression in writing.
Gate review + scorecard
Sign and publish the scorecard. Public URL, signed footer, immutable SHA. The failing suites stay on the page alongside the passes.
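As a rough illustration of what an "immutable SHA" can mean in practice, here is a minimal sketch that hashes a scorecard over a canonical JSON encoding, so any later change to a score changes the digest. The scorecard fields, the suite names, and the `scorecard_digest` helper are assumptions made up for this example, not the published scorecard format, and the signing step is left out because it depends on key management.

```python
import hashlib
import json


def scorecard_digest(scorecard: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding.

    Sorting keys and fixing separators makes the encoding independent of
    dict insertion order, so the same content always yields the same hash.
    """
    canonical = json.dumps(scorecard, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical scorecard: failing suites stay in the document with the passes.
scorecard = {
    "suites": [
        {"name": "suite-a", "score": 0.97, "threshold": 0.95, "passed": True},
        {"name": "suite-b", "score": 0.81, "threshold": 0.90, "passed": False},
    ],
}

digest = scorecard_digest(scorecard)
print(digest)
```

Publishing the digest next to the scorecard lets a reader recompute it and confirm that nothing, including a failing suite, was edited after the fact.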
Engagement shape
/ what you sign up for
- 01 Nine eval suites: canonical + adversarial, scoped to your domain
- 02 14-day shadow regression report against real production traffic
- 03 Public scorecard: signed, hashed, immutable URL on lattice.ai/evals
- 04 Continuous monitoring runbook with SLA 99.9% and severity 1–3 escalation
Why we design the suite first
The suite is the contract. Writing the eval before the agent forces us to name what we’ll accept and what we won’t, in a document signed by the same people who later have to ship the agent. By the time the agent exists, the threshold is already on paper.
Designing the suite first also reveals which problems are well-posed and which aren’t. A surprising number of agent projects we audit fail at this step — the team can describe what they want the agent to do, but can’t write the test that would tell them if it did. We won’t build past that point.
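One way to picture "the threshold is already on paper" is a gate whose thresholds are fixed data, written before any agent scores exist. The sketch below is illustrative only: the suite names, the threshold values, and the `gate` function are hypothetical, and treating a missing score as a failure is an assumption about how such a gate might behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteThreshold:
    """An acceptance threshold agreed before the agent is built."""
    name: str
    min_score: float


def gate(thresholds: list[SuiteThreshold], scores: dict[str, float]) -> list[str]:
    """Return the names of suites that fail the gate.

    A suite with no reported score counts as a failure: if the test could
    not be run, the agent has not shown that it meets the contract.
    """
    failures = []
    for threshold in thresholds:
        score = scores.get(threshold.name)
        if score is None or score < threshold.min_score:
            failures.append(threshold.name)
    return failures


# Hypothetical thresholds, written down first.
thresholds = [
    SuiteThreshold("accuracy", 0.90),
    SuiteThreshold("safety", 0.99),
]

# Scores arrive later, once a candidate agent exists.
failing = gate(thresholds, {"accuracy": 0.93})
print(failing)  # ['safety']
```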
What "in shadow" means
Shadow means the agent reads real production traffic and produces real outputs, but the user never sees them. We score the agent’s outputs against ground truth and against your existing system’s outputs, in parallel. Two weeks of this catches the distribution shifts that synthetic evals can’t.
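The shadow setup described above can be sketched as a loop in which only the live system's output is served, while the agent's output is recorded and scored on the side. This is a simplified sketch under stated assumptions: `shadow_run`, the callables, and the exact-match scoring are hypothetical stand-ins, and a real pipeline would likely run the agent asynchronously and use domain-specific scoring.

```python
from typing import Callable, Optional


def shadow_run(
    requests: list[str],
    live_system: Callable[[str], str],
    agent: Callable[[str], str],
    ground_truth: dict[str, str],
) -> tuple[list[str], list[dict]]:
    """Serve live outputs; score agent outputs without exposing them."""
    served = []
    records = []
    for request in requests:
        live_out = live_system(request)
        served.append(live_out)  # only the live output reaches the user

        agent_out: Optional[str]
        try:
            agent_out = agent(request)
        except Exception:
            agent_out = None  # a shadow failure must never affect the user

        truth = ground_truth[request]
        records.append({
            "request": request,
            "agent_correct": agent_out == truth,      # versus ground truth
            "live_correct": live_out == truth,
            "agrees_with_live": agent_out == live_out,  # versus existing system
        })
    return served, records


# Toy stand-ins for the live system, the agent, and the labels.
truth = {"a": "A", "b": "B", "c": "C"}
served, records = shadow_run(
    ["a", "b", "c"],
    live_system=lambda r: r.upper(),
    agent=lambda r: "x" if r == "b" else r.upper(),
    ground_truth=truth,
)
agent_accuracy = sum(r["agent_correct"] for r in records) / len(records)
print(agent_accuracy)
```

Recording agreement with the live system alongside correctness separates two cases: the agent being wrong, and the agent disagreeing with an existing system that is itself wrong.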