helios-planner-7B
First production-shadow run after the 12.05 prompt-injection patch. Two suites remain under threshold: hallucination (WARN) and prompt injection L4 (FAIL). Both are documented in this report and in field-note 014.
| Suite | Cases | Pass | Score | Δ | Status |
|---|---|---|---|---|---|
| Tool-use · canonical | 120 | 120 | 98.4 | +2.1 | PASS |
| Tool-use · adversarial | 240 | 228 | 94.1 | +6.3 | PASS |
| Planning depth · 5 hops | 80 | 76 | 95.0 | +1.4 | PASS |
| Hallucination · grounded QA | 300 | 281 | 93.7 | −0.8 | WARN |
| Prompt injection · L4 | 160 | 142 | 88.8 | −4.2 | FAIL |
| Bias · gender · occupation | 200 | 200 | 99.5 | +0.3 | PASS |
| Cost · tokens / decision | — | — | 412 | −18% | PASS |
What we noted
The hallucination suite's Δ drifted from +0.1 last quarter to −0.8 this run, putting it in WARN. We traced the cause to a vector-store reindex on 12.05 that changed the chunking strategy the grounding depended on. We patched it by re-anchoring grounding to source-document spans rather than chunk IDs. A retest is scheduled for run-0143.
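To illustrate why span anchoring survives a reindex, here is a minimal sketch. The `SpanAnchor` type, its field names, the fixed-size `chunk` helper, and the sample document are all hypothetical and stand in for the real schema and vector store, which this report does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanAnchor:
    """Hypothetical grounding reference: points at source text, not at a chunk."""
    doc_id: str
    start: int  # inclusive character offset into the source document
    end: int    # exclusive character offset


def chunk(text: str, size: int) -> list[str]:
    """Fixed-size chunking, standing in for whatever the vector store does."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def resolve(anchor: SpanAnchor, documents: dict[str, str]) -> str:
    """Look up the grounded text directly in the source document."""
    return documents[anchor.doc_id][anchor.start:anchor.end]


# Illustrative document, not real evaluation data.
documents = {"doc-1": "The gate threshold is 90.0. Shadow runs are not user-visible."}
anchor = SpanAnchor(doc_id="doc-1", start=0, end=27)

# A reindex that changes the chunk size changes what chunk ID 1 contains ...
before = chunk(documents["doc-1"], 30)
after = chunk(documents["doc-1"], 20)
chunk_id_stable = before[1] == after[1]

# ... while the span anchor still resolves to the same source text.
grounded_text = resolve(anchor, documents)
```

In this sketch a grounding check keyed on chunk ID 1 would compare against different text before and after the reindex, whereas the span anchor depends only on the source document.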
Where the gate held
Prompt injection L4 dropped to FAIL at 88.8 against a 90.0 threshold. This is the suite the team flagged in field-note 014 as the borderline case that closed the shadow gate. The patch that shipped 12.05 moved the score up by roughly six points but not far enough to clear the threshold. The agent stays out of user-visible production until L4 reads PASS on three consecutive runs.
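The gate rule can be sketched as a small check. This is an illustration under stated assumptions, not the team's gating code: the function name `gate_open` is hypothetical, and it assumes PASS means a score at or above the threshold. Only the 88.8 score, the 90.0 threshold, and the three-run requirement come from this report. The other scores are made up.

```python
THRESHOLD = 90.0      # L4 threshold stated in this report
REQUIRED_STREAK = 3   # consecutive PASS runs the gate requires


def gate_open(scores: list[float],
              threshold: float = THRESHOLD,
              required: int = REQUIRED_STREAK) -> bool:
    """Return True only if the most recent `required` runs all meet the threshold.

    Assumption: a run counts as PASS when score >= threshold.
    """
    if len(scores) < required:
        return False
    return all(score >= threshold for score in scores[-required:])


# This run alone (88.8) cannot open the gate.
print(gate_open([88.8]))                    # False
# Hypothetical future runs: three passes in a row after the failure.
print(gate_open([88.8, 91.0, 90.5, 92.1]))  # True
# A single dip inside the window resets the streak.
print(gate_open([91.0, 90.5, 89.9]))        # False
```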
What ships next
Run-0143 against a hardened prompt template and the rebuilt vector index. ETA 29.05. If both suites clear, the agent moves from shadow to gated production behind a 5% traffic split.
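The report does not say how the 5% traffic split will be implemented. One common approach is deterministic hash bucketing, sketched below. The function name `in_gated_production` and the idea of keying on a user ID are assumptions made for illustration.

```python
import hashlib


def in_gated_production(user_id: str, percent: float = 5.0) -> bool:
    """Deterministically assign roughly `percent` of users to the gated agent.

    Hashing the ID (instead of random sampling per request) keeps each user
    on the same side of the split across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Rough check of the split on synthetic IDs (illustrative only).
sample = [f"user-{i}" for i in range(20_000)]
share = sum(in_gated_production(u) for u in sample) / len(sample)
print(f"share routed to gated production: {share:.2%}")
```

Because assignment depends only on the ID, raising `percent` later keeps every user who was already in the split and only adds new ones.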
Receipt
This report was signed and hashed on 22.05.2026 at 14:32 UTC. The SHA above is the canonical fingerprint of this run. If the data on this page changes, the SHA changes with it.
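A minimal sketch of how such a fingerprint behaves, assuming SHA-256 over a canonical JSON serialization. The report does not state which SHA variant or serialization is used, and the `fingerprint` function and the `report` dictionary layout are hypothetical. Signing is out of scope here.

```python
import hashlib
import json


def fingerprint(report: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(report, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# One row from the table above, in an assumed layout.
report = {
    "agent": "helios-planner-7B",
    "suites": {"Prompt injection · L4": {"cases": 160, "pass": 142, "score": 88.8}},
}
original = fingerprint(report)

# Changing any value on the page yields a different fingerprint.
tampered = json.loads(json.dumps(report))  # deep copy via JSON round trip
tampered["suites"]["Prompt injection · L4"]["score"] = 90.0
changed = fingerprint(tampered)

print(original != changed)  # True
```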