Shipping AI shouldn’t be a leap of faith.

We don’t build the models or the agents.

We grade what other teams have built.

Lattice/AI provides independent evaluation, red-teaming, and assurance for everyone shipping LLMs, agents, and AI systems, from solo builders to large enterprises.

START A BRIEF →

READ THE EVALS

“LLM-as-judge is broken in five specific ways. Here is how we use it anyway.”

Lattice/AI team · Field note 5 · 31 May 2026

Vendors who build AI cannot independently grade it, and internal teams cannot either. We exist to be the third signature on whether an AI system is ready to ship.

/ 01 · practice · THREE PRACTICE AREAS · ONE STANDARD

We don’t build AI;
we grade what other teams build.

/ 01 · evaluate · THE PRODUCT

The 9 suites every
agent must survive.

We build evaluation suites the way production teams actually build them, using the tooling the field has already standardized on. That includes Promptfoo for prompt regression, Ragas for RAG faithfulness and grounding, Inspect for general LLM evaluation, PyRIT and Garak for the adversarial work, and LangSmith for trajectory inspection when the system is agentic. On top of that base we add the custom test sets your domain needs, then publish the result as a signed scorecard with confidence intervals on every score and a baseline comparison against your current production system.

Production tooling stack: Promptfoo · Ragas · Inspect · PyRIT
9-dimension scorecard: canonical + adversarial
Shadow regression: 14 days · 95% CI
Three-way baseline: current · naive · prior
Reproducibility pack: SHA · seed · pinned
Public scorecard: signed · dated
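
To make "confidence intervals on every score" concrete, here is a minimal sketch of one standard way to put a 95% interval on a suite's pass rate: the Wilson score interval. This illustrates the general technique and is not necessarily the method behind Lattice/AI scorecards. The function name `wilson_interval` and the case counts are assumptions chosen for the example.

```python
import math


def wilson_interval(passed: int, cases: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is roughly 95%)."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    if not 0 <= passed <= cases:
        raise ValueError("passed must be between 0 and cases")
    p = passed / cases
    denom = 1 + z**2 / cases
    centre = (p + z**2 / (2 * cases)) / denom
    half = z * math.sqrt(p * (1 - p) / cases + z**2 / (4 * cases**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only: 142 of 160 cases passed.
low, high = wilson_interval(142, 160)
print(f"pass rate {142 / 160:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

The Wilson interval behaves sensibly near 0% and 100%, where the simpler normal approximation collapses to zero width. A suite that passes every case still reports a lower bound below 100%.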

READ THE FULL PRACTICE →

/ 02 · red-team · INDEPENDENT · ADVERSARIAL · SIGNED

We stress-test what
other teams have built.

We run independent expert red-team campaigns on LLMs, agents, and AI systems, mapped to the OWASP LLM Top 10, MITRE ATLAS, and the NIST AI 100-2 E2025 adversarial machine learning taxonomy. Coverage spans direct and indirect prompt injection, multi-turn agentic attacks, tool-chain compromise, data exfiltration (training-data extraction, system-prompt leakage, and, where the system handles PII, membership inference), and over-refusal calibration. We build on the open red-team tooling the field already uses: Garak from NVIDIA and PyRIT from Microsoft for heavy adversarial automation, Promptfoo for adversarial replay, and TextAttack for NLP-specific attacks, with custom Lattice/AI attacks layered on where the engagement calls for it. Every finding gets a CVSS-aligned severity score, and every report we deliver is signed and dated.

Attack surface coverage: OWASP · ATLAS · NIST
Open + custom tooling: Garak · PyRIT · Promptfoo
Independence attestation: contractual
Red-team engagement: 2 to 12 wk
Findings report with CVSS scoring: signed
Patch playbook with rerun criteria: actionable

READ THE FULL PRACTICE →

/ 03 · govern · FIXED FEE · ANNUAL REPEAT

Audit-readiness, before
the regulator asks.

A one-to-two-week independent assessment of your AI deployment against the regulatory framework you must comply with, whether that is CBUAE 2/2026 (with the 16 September 2026 deadline), the EU AI Act, ISO 42001, SOC 2, or the UAE AI Charter. We read your AI policy, model inventory, vendor contracts, decision logs, and incident records, interview the people who actually run the AI systems day to day, and produce a gap-analysis report scored against every control in the target framework. The engagement is fixed-fee and repeats annually, because both the framework and your AI deployment keep changing.

Gap-analysis report: signed · dated
Framework coverage: CBUAE · EU AI Act · ISO 42001
Remediation roadmap: 16-week sequenced
Evidence pack template: framework-specific
Vendor recommendations: for runtime tooling
Annual reassessment: built in

READ THE FULL PRACTICE →

/ who we serve

Solo developers

Startups & scale-ups

AI builders

Mid-to-large enterprises

Fintech · Healthcare · Regulated

Government & sovereign AI

/ 02 · sample scorecard · FORMAT DEMONSTRATION

How your scorecard will look.

Every engagement ends with a public, signed scorecard. The failing suites stay on the page alongside the passes because publishing only the wins would defeat the purpose. The card below is a sample of the format every Lattice/AI engagement ships, not a record of a real audit.

helios-planner-7B · eval report

22 May 2026 · public report

Suite	Cases	Pass	Score	Δ	Status
Tool-use · canonical	120	120	98.4	+ 2.1	PASS
Tool-use · adversarial	240	228	94.1	+ 6.3	PASS
Planning depth · 5 hops	80	76	95.0	+ 1.4	PASS
Hallucination · grounded QA	300	281	93.7	− 0.8	WARN
Prompt injection · L4	160	142	88.8	− 4.2	FAIL
Bias · gender · occupation	200	200	99.5	+ 0.3	PASS
Cost · tokens / decision	n/a	n/a	412	− 18%	PASS

SIGNED · LATTICE/AI · SAMPLE

/ this run

7 / 9 PASSED

1 warn · 1 fail · documented in the full report.

/ cost · tokens per decision

412 − 18%

The cost suite tracks tokens consumed per decision, surfaced alongside accuracy so the trade-off is visible.

/ open · take a closer look

READ THE FULL REPORT →

See all signed reports →

/ 03 · what an engagement looks like · EXAMPLE SHAPES · NOT REAL ENGAGEMENTS

The shape of the work.

All engagements →

/ FINTECH · REGULATED

EXAMPLE

A planning agent at the door of a regulated trading desk.

Shape of the engagement: nine-week build with the planner, nine eval suites scored in parallel, and a shadow gate that closes at the first regression before any user-visible traffic is routed through.

A pre-launch model evaluation, made public before the press release.

Shape of the engagement: a six-week eval before model launch, covering capability range, refusal behaviour, prompt-injection resistance, and benchmark contamination. The public scorecard ships ahead of the model card so the independent read lands first.

/ BANKING · REGULATED

EXAMPLE

A RAG pipeline at the door of a multi-year compliance corpus.

Shape of the engagement: retrieval evaluation for a knowledge base spanning years of compliance memos, scoring grounding, citation traceability, and drift across the full source corpus. Every cited answer ships signed.

/ INDIE BUILDER · SOLO DEV

EXAMPLE

An eval framework for a one-person shop.

Shape of the engagement: a short build around a single customer-support agent for a solo founder. Same scorecard format we use for enterprise work, sized to the team and priced for an indie budget.

/ 04 · field notes · LATEST · 5

What we have learned
the expensive way.

All field notes →

5 · MAY 31 · 7 MIN

LLM-as-judge is broken in five specific ways. Here is how we use it anyway.

“Production ML teams have adopted LLM-as-judge as the default scoring mechanism, and the bias literature is now extensive enough to be uncomfortable. Position bias around 40%, verbosity bias around 15%, self-enhancement bias, authority bias, and silent judge drift. Here are the five problems with names and numbers, and the four methodological controls we apply to get a defensible result anyway.”

READ THE FIELD NOTE →
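
The teaser above does not spell out the four controls, so the following is only a sketch of one widely used check for position bias: ask the judge twice with the two answers in swapped order, and accept a verdict only when both orders agree. The `Judge` signature, the function names, and the toy judges are all hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a" or "b" if the judge agrees with itself in both orders, else "inconsistent"."""
    verdicts = (judge(prompt, a, b), judge(prompt, b, a))
    for verdict in verdicts:
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    winner_ab = "a" if verdicts[0] == "first" else "b"  # a was shown first
    winner_ba = "b" if verdicts[1] == "first" else "a"  # b was shown first
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def flip_rate(judge: Judge, cases: list[tuple[str, str, str]]) -> float:
    """Fraction of (prompt, a, b) cases where swapping the order changes the winner."""
    if not cases:
        raise ValueError("cases must not be empty")
    flips = sum(
        swap_consistent_verdict(judge, prompt, a, b) == "inconsistent"
        for prompt, a, b in cases
    )
    return flips / len(cases)


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"


cases = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
print(flip_rate(always_first, cases))    # 1.0
print(flip_rate(prefers_longer, cases))  # 0.0
```

The second toy judge shows the limit of this check: a verbosity-biased judge is perfectly consistent under swapping, so the other biases need their own controls.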

4 · MAY 30 · We changed our mind on eval-driven development. →
3 · MAY 27 · How to read an AI eval scorecard. →
2 · MAY 22 · An agent that passes an eval suite is eligible, not shipped. →
1 · MAY 8 · Three failure modes the 2026 benchmark wave still misses. →
/ archive · All field notes →

/ 05 · about · UAE-BASED · GLOBAL DELIVERY

Built by the people who
built the evals.

Full about →

Lattice/AI is founder-led. We started by writing the evaluation suites that LLM labs, agent teams, and enterprise platforms had been using internally, and we now ship those same suites as part of every engagement, public and signed. We also turn down work we cannot put our name on, which is what keeps the practice small enough to mean something.

MEET THE TEAM →

READ THE FIELD NOTES

/ 06 · start a brief · FOUNDER-READ

Tell us what’s
under contract.

Three sentences is usually enough. Tell us what you are trying to ship, when it has to land, and the thing that scares you about it. We will come back with a yes, a no, or a counter-shape.

/ EMAIL

briefs@latticeevals.com

/ BASED IN

UAE · WORKING GLOBALLY
