/ practice · evaluate

The 9 suites every agent must survive.

We design the eval suite before anyone writes a line of agent code. Then we score and stress-test LLMs and agents across nine dimensions — in shadow against your real production traffic. The scorecard is public, signed, and immutable.

Methodology

/ 4 phases

Week 1–2

Suite design

We write the eval suite against your domain before any agent code lands. Nine canonical dimensions, plus the two or three you tell us to add. Adversarial cases get drafted in parallel.

Week 3–5

Baseline + shadow build

Baseline the candidate agent against the suite. Wire it into your prod-shadow pipeline. Score the agent's output on every input it would see in production, against ground truth, in parallel with your live system.

Week 6–7

Adversarial + cost

Run the adversarial suites. Score cost-per-decision and drift relative to the canonical run. Report deltas and surface every regression in writing.
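As a rough illustration of what "cost-per-decision" and "deltas relative to the canonical run" could look like in code, here is a minimal Python sketch. The `RunSummary` schema, the suite names, and the 0.01 tolerance are assumptions made for the example, not the format the scorecard actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate numbers for one eval run (hypothetical schema)."""
    decisions: int                  # number of decisions the agent made
    total_cost_usd: float           # total spend for the run
    suite_scores: dict[str, float]  # suite name -> score in [0, 1]


def cost_per_decision(run: RunSummary) -> float:
    """Average cost of a single decision in this run."""
    if run.decisions <= 0:
        raise ValueError("run has no decisions")
    return run.total_cost_usd / run.decisions


def score_deltas(canonical: RunSummary, candidate: RunSummary) -> dict[str, float]:
    """Per-suite score change, candidate minus canonical."""
    return {
        name: candidate.suite_scores[name] - base
        for name, base in canonical.suite_scores.items()
        if name in candidate.suite_scores
    }


def regressions(deltas: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """Suites whose score dropped by more than `tolerance` (assumed value)."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Illustrative numbers only; the suite names are placeholders.
canonical = RunSummary(1000, 12.0, {"accuracy": 0.90, "safety": 0.95})
candidate = RunSummary(1000, 15.0, {"accuracy": 0.92, "safety": 0.91})
regressed = regressions(score_deltas(canonical, candidate))
```

With these placeholder numbers, `regressed` contains only `"safety"`: the accuracy gain does not offset or hide the safety drop, which is the point of reporting every regression separately.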

Week 8

Gate review + scorecard

Sign and publish the scorecard. Public URL, signed footer, immutable SHA. The failing suites stay on the page alongside the passes.
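One way to picture "signed" and "immutable SHA" is a content hash over a canonical serialization of the scorecard, plus a signature over the same bytes. The Python sketch below is an assumption-laden illustration: the scorecard fields and suite names are hypothetical, and HMAC-SHA256 with a shared key stands in for whatever signing scheme is actually used (a public scorecard would more plausibly use an asymmetric signature so that anyone can verify it).

```python
import hashlib
import hmac
import json


def canonical_bytes(scorecard: dict) -> bytes:
    """Serialize deterministically so the same content always hashes the same."""
    return json.dumps(scorecard, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(scorecard: dict) -> str:
    """SHA-256 hex digest of the canonical form; changes if any result changes."""
    return hashlib.sha256(canonical_bytes(scorecard)).hexdigest()


def sign(scorecard: dict, key: bytes) -> str:
    """HMAC-SHA256 over the canonical form (stand-in for a real signature)."""
    return hmac.new(key, canonical_bytes(scorecard), hashlib.sha256).hexdigest()


def verify(scorecard: dict, key: bytes, signature: str) -> bool:
    """Constant-time check that the signature matches this exact content."""
    return hmac.compare_digest(sign(scorecard, key), signature)


# Hypothetical scorecard: failing suites are recorded alongside the passes.
scorecard = {
    "agent": "helios-planner-7B",
    "suites": {"suite_a": "pass", "suite_b": "fail"},
}
key = b"example-key-not-for-production"
signature = sign(scorecard, key)
```

Because the digest covers the failing entries too, quietly dropping or flipping a result after publication would change the hash and invalidate the signature.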

Engagement shape

/ what you sign up for

/ typical cycle

8–10 weeks

/ our team on the engagement

1 eval lead, 1 evaluation engineer, 1 reviewer

/ what we need from you

A named technical counterpart, prod-shadow read access, decision rights on threshold setting, and a willingness to publish the result

/ revenue cap

No cap. This is core practice.

/ deliverables

01 9 eval suites: canonical + adversarial, scoped to your domain
02 14-day shadow regression report against real production traffic
03 Public scorecard: signed, hashed, immutable URL on lattice.ai/evals
04 Continuous monitoring runbook with SLA 99.9% and severity 1–3 escalation

Mappings

/ 5 frameworks

EU AI Act

High-risk and general-purpose schedules; suite design maps to Article 9 risk-management requirements

ISO 42001

Clauses 6.1 (risk assessment), 8.2 (operational planning), and Annex A controls A.5–A.8

NIST AI RMF

Measure 2.1–2.10; Manage 4.1–4.3

UAE AI Charter

Articles on transparency, safety, and human oversight

Lattice/AI internal

Suite v3.x — see lattice.ai/evals for the canonical run history

Evidence

/ 1 eval · 0 work

/ eval scorecard

helios-planner-7B passed 7 of 9 suites.

READ THE SCORECARD →

Why we design the suite first

The suite is the contract. Writing the eval before the agent forces us to name what we’ll accept and what we won’t, in a document signed by the same people who later have to ship the agent. By the time the agent exists, the threshold is already on paper.

Designing the suite first also reveals which problems are well-posed and which aren’t. A surprising number of agent projects we audit fail at this step — the team can describe what they want the agent to do, but can’t write the test that would tell them if it did. We won’t build past that point.
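The idea that the threshold is on paper before the agent exists can be sketched as a frozen set of per-suite minimums that the later gate review only reads and never edits. This Python sketch is illustrative: the suite names, the threshold values, and the rule that a missing score counts as a fail are all assumptions.

```python
from collections.abc import Mapping
from types import MappingProxyType


def freeze_thresholds(thresholds: dict[str, float]) -> Mapping[str, float]:
    """Return a read-only copy, so thresholds cannot be adjusted after scoring."""
    return MappingProxyType(dict(thresholds))


def gate(thresholds: Mapping[str, float], scores: Mapping[str, float]) -> dict[str, bool]:
    """Pass/fail per suite. A suite with no recorded score fails (assumed rule)."""
    return {
        name: scores.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    }


# Written down first (placeholder suites and values).
THRESHOLDS = freeze_thresholds({"accuracy": 0.90, "safety": 0.95, "latency": 0.80})

# Measured later, once the agent exists.
results = gate(THRESHOLDS, {"accuracy": 0.93, "safety": 0.91})
passed = sum(results.values())
```

Here `results` marks `accuracy` as a pass and both `safety` (below its minimum) and `latency` (never measured) as fails, so `passed` is 1 of 3. A suite that nobody can write a score for shows up as a failure rather than disappearing.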

What "in shadow" means

Shadow means the agent reads real production traffic and produces real outputs, but the user never sees them. We score the agent’s outputs against ground truth and against your existing system’s outputs, in parallel. Two weeks of this catches the distribution shifts that synthetic evals can’t.
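A minimal Python sketch of the shadow pattern described above: the live system answers the user, the candidate agent sees the same request, and only the live output is returned. The class and function names are hypothetical, exact string match is an assumed scoring rule, and the shadow call is made synchronously for brevity (a production pipeline would more likely run it off the request path so it cannot add latency).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ShadowRecord:
    request: str
    live_output: str
    shadow_output: str


class ShadowRunner:
    """Serve the live system; record what the shadow agent would have said."""

    def __init__(self, live: Callable[[str], str], shadow: Callable[[str], str]):
        self.live = live
        self.shadow = shadow
        self.log: list[ShadowRecord] = []

    def handle(self, request: str) -> str:
        live_output = self.live(request)
        try:
            shadow_output = self.shadow(request)
        except Exception as exc:  # a shadow failure must never reach the user
            shadow_output = f"<shadow error: {type(exc).__name__}>"
        self.log.append(ShadowRecord(request, live_output, shadow_output))
        return live_output  # the user only ever sees the live system


def score(log: list[ShadowRecord], ground_truth: dict[str, str]) -> dict[str, float]:
    """Exact-match rates against ground truth and against the live system."""
    labelled = [r for r in log if r.request in ground_truth]
    if not labelled:
        raise ValueError("no logged requests have ground truth")
    n = len(labelled)
    return {
        "shadow_accuracy": sum(r.shadow_output == ground_truth[r.request] for r in labelled) / n,
        "live_accuracy": sum(r.live_output == ground_truth[r.request] for r in labelled) / n,
        "agreement": sum(r.shadow_output == r.live_output for r in labelled) / n,
    }
```

Scoring against both ground truth and the live system separates two questions: whether the agent is right, and whether it would change what users currently get.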

/ ready to talk shape

Tell us what’s under contract.

START A BRIEF →

READ THE FIELD NOTES

LATTICE/AI · EST. 2026 · UAE
