Builder TL;DR
The risk: prompt injection and coercion of agentic workflows (human-in-the-loop or fully autonomous) are now primary attack vectors. Real-world incidents, from the Gemini Calendar prompt-injection disclosure to the September 2025 state-sponsored use of a Claude-based intrusion engine (affecting roughly 30 organizations), show that rules inside prompts are fragile, while boundaries enforced outside the model are survivable and testable. MIT Technology Review and recent technical analyses document multiple practical exploits (arXiv:2602.04326, arXiv:2602.04248, arXiv:2602.04284; links below).
Concrete artifact (metric): target a detection latency of < 2 minutes for agentic outbound actions, a Mean Time To Contain (MTTC) of < 15 minutes, and an initial true-positive rate of >= 90% on coercive action patterns.
Goal and expected outcome
Goal: stop coercion of agents by keeping model instructions immutable for sensitive capabilities and ensure any high-risk action is gated by external checks.
Expected outcome (measurable):
- Reduce unauthorized outbound action rate to < 0.5% of agent-initiated actions during rollout.
- Human approvals for high-risk actions (credential use, code execution, network access) enforced with signed attestations and 2FA.
- Audit trail with cryptographically signed transcripts and action tokens preserved for 1 year (or the required retention window).
Concrete rollout decision: enforce boundaries first in monitor-only mode for 2–4 weeks (collect baseline), then move to block mode for top-3 highest-risk actions.
Stack and prerequisites
Recommended stack choices (concrete architecture choices and configs):
- Model layer: hosted LLM API (e.g., a managed vendor endpoint) behind an inference proxy; alternative: a self-hosted open-weights model (e.g., Llama) in an isolated tenant.
- Policy engine: Open Policy Agent (OPA) as an externalized decision point.
- Sandboxing: Firecracker or gVisor for code execution; no host networking by default.
- Secrets: short-lived credentials from vault (HashiCorp Vault or AWS Secrets Manager) with automated rotation.
- Telemetry: centralized logging (e.g., ELK or Loki) and tracing (OpenTelemetry) with retention and alerting.
Prerequisite configs (examples):
- Network: agent subnet has egress only through an egress-proxy with TLS interception for metadata routing.
- Vault TTL: set secret_ttl = 5m for ephemeral credentials used by agents.
- OPA: enable audit logging and policy versioning. Example constraint: reject any action token that is not signed by the boundary signature service.
Artifact (config): sample OPA constraint in Rego (default-deny; exec_remote is allowed only with an attestation):
default allow_action = false
allow_action { input.action.capability != "exec_remote" }
allow_action { input.action.capability == "exec_remote"; input.action.attestation.present }
Minimum developer prerequisites:
- Familiarity with policy-as-code (Rego), container sandboxing, and HMAC signing.
- CI/CD to roll policies (git repo + mandatory PR + automated policy tests).
Step-by-step implementation
- Inference proxy + immutable system channel
- Build an inference proxy that always injects a sealed SYSTEM channel the model cannot modify. The proxy signs the system channel with a boundary HMAC and sends signed tokens downstream (see the minting sketch after the verification example below).
- External policy check (OPA)
- Intercept proposed agent actions (structured JSON outputs) and pass them to OPA for allow/deny (a minimal client sketch follows this list). Keep the policy repo in Git and require policy unit tests.
- Capability gating
- Map model "capabilities" to concrete actions (e.g., send-email, read-db, exec-shell) and only map if external approval is present.
- Human approval with attestations
- For any capability classified as high-risk (credential use, lateral network access, code execution), require a signed human attestation using 2FA. Store attestation token with TTL in Vault.
- Sandboxed execution
- Execute actions inside microVMs (Firecracker) with ephemeral credentials and network egress only through a hard-filtering egress proxy.
- Monitoring and response
- Log model outputs, OPA decisions, attestation tokens, and action results. Alert on policy deny rates or abnormal token creation.
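Code artifact (sketch): a minimal Node.js client for the external policy check above; it posts the proposed action to OPA's REST data API (POST /v1/data/<path>) and fails closed. The OPA address, the boundary package name, and the input shape are assumptions matching the sample Rego constraint earlier:
// Ask OPA whether a proposed agent action is allowed (requires Node 18+ for global fetch).
async function checkAction(action) {
  const res = await fetch('http://localhost:8181/v1/data/boundary/allow_action', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ input: { action } }),
  });
  const body = await res.json();
  // OPA returns { result: <value> }; treat anything but an explicit true as a deny (fail closed).
  return body.result === true;
}
// Usage: proceed only if (await checkAction({ capability: 'exec_remote', attestation: { present: true } })) is true.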
Code artifact: minimal example of middleware that verifies a signed boundary token (Node.js):
const crypto = require('crypto');

// Verify a boundary token of the form "<base64 payload>.<base64 HMAC-SHA256 signature>".
function verifyBoundary(token, secret) {
  const [payloadB64, sigB64] = token.split('.');
  if (!payloadB64 || !sigB64) return false;
  const expected = crypto.createHmac('sha256', secret).update(payloadB64).digest();
  const given = Buffer.from(sigB64, 'base64');
  // Constant-time comparison avoids leaking signature bytes through timing.
  return expected.length === given.length && crypto.timingSafeEqual(expected, given);
}
// Usage: verify the incoming SYSTEM token before honoring any capability.
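For symmetry, an illustrative sketch of how the proxy might mint the token that verifyBoundary checks (hypothetical helper; assumes the same payload.signature format and HMAC scheme as above):
// Mint a boundary token: base64(JSON payload) + "." + base64(HMAC-SHA256 over the payload part).
function mintBoundary(payload, secret) {
  const payloadB64 = Buffer.from(JSON.stringify(payload)).toString('base64');
  const sigB64 = crypto.createHmac('sha256', secret).update(payloadB64).digest('base64');
  return `${payloadB64}.${sigB64}`;
}
// Usage: const token = mintBoundary({ channel: 'SYSTEM', policyVersion: 'v3' }, process.env.BOUNDARY_HMAC_KEY);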
Rollout decision artifact: run the proxy in "observe" mode for 14 days (the low end of the 2–4 week baseline window above), then enforce deny policies for the top-3 actions.
Reference architecture
Concrete architecture choice: three-layer boundary design
- Inference proxy (stateless) — injects SYSTEM channel, signs tokens, rate-limits.
- Policy plane (OPA + policy repo) — deterministic allow/deny, policy testing pipeline.
- Execution plane (sandbox microVMs + Vault + egress proxy) — runs only allowed actions.
Key metrics to instrument:
- Decision latency (proxy -> OPA -> response): target <= 200ms.
- MTTC (Mean Time To Contain): target <= 15 minutes.
- Action authorization rate and denial ratio.
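To instrument the first metric, a sketch using the OpenTelemetry JavaScript API to time each proxy -> OPA decision (assumes an OpenTelemetry SDK and exporter are configured elsewhere, and reuses the checkAction sketch from the implementation section):
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('boundary-proxy');

// Wrap the policy call in a span so decision latency (target <= 200ms) shows up in traces.
async function timedCheckAction(action) {
  return tracer.startActiveSpan('opa.decision', async (span) => {
    try {
      const allowed = await checkAction(action);
      span.setAttribute('boundary.allowed', allowed);
      return allowed;
    } finally {
      span.end();
    }
  });
}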
Example network segmentation (artifact): all agent-related compute on VLAN A, egress via egress-proxy with host-based ACLs; management plane on VLAN B with restricted access.
For visualization and running tests, include a minimal compose/deployment that separates these services and enables per-service tracing.
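A minimal docker-compose sketch of that separation (service names, builds, and networks are illustrative assumptions; the openpolicyagent/opa image and its run flags are real):
services:
  inference-proxy:
    build: ./proxy                 # hypothetical Node.js proxy service
    networks: [agent-net]
    depends_on: [opa]
  opa:
    image: openpolicyagent/opa:latest
    command: ["run", "--server", "--addr", ":8181", "/policies"]
    volumes:
      - ./policies:/policies       # Git-managed Rego policy repo
    networks: [agent-net]
  egress-proxy:
    build: ./egress                # hypothetical hard-filtering egress proxy
    networks: [agent-net, external]
  # add an otel-collector service here for per-service tracing
networks:
  agent-net:
    internal: true                 # agents reach the network only via egress-proxy (VLAN A analogue)
  external: {}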
Founder lens: ROI and adoption path
Compact cost / risk decision frame (one-liner + bullets):
- Quick win: deploy an observe-mode proxy + OPA for £10–£30k in engineering and infra per quarter; the late-2025 incident class it addresses impacted roughly 30 organizations, and potential breach cost far exceeds deployment cost.
Cost / risk bullets (artifact: estimated costs & adoption milestones):
- Initial engineering (4–6 weeks): £25k — build proxy, OPA repo, tests.
- Infra ops: £2–5k/month — microVM capacity, Vault, logging.
- Adoption path: pilot with internal non-sensitive agents (2–4 wks observe), then phasing to sensitive capabilities (4–8 wks), full enforcement at 12 wks.
- Risk reduction metric: expected >90% reduction in successful unauthorized high-risk actions after enforcement.
Trade-offs to communicate to founders (artifact: go/no-go decision points):
- Latency vs. safety: adding policy checks increases request latency (~100–200ms); acceptable for ticket automation, risky for real-time chat.
- UX friction: human attestations slow workflows; choose progressive enforcement (observe -> warn -> block).
Failure modes and debugging
Common failure modes and concrete debugging artifacts:
- False negatives (coercion slips past rules): metric to monitor — percentage of unvetted outbound actions per day. Investigate by replaying signed transcripts and running anomaly detectors.
- False positives (legitimate actions blocked): track rollback rate and time-to-fix; keep "override" flow logged and auditable.
- Token compromise: detection artifact — sudden spike in attestation token creation; response — revoke all tokens, rotate HMAC key.
- Sandbox breakout: artifact — unauthorized network connections from microVMs; response — isolate host and capture disk snapshot.
Debugging checklist (bulleted):
- Collect: model transcript, signed SYSTEM token, OPA decision, execution logs.
- Reproduce: run the model transcript against a local policy engine and simulate the action path (see the replay sketch after this checklist).
- Test: inject adversarial prompt variations from the arXiv exploit corpus and validate deny behavior (see the arXiv papers cited above).
- Metrics: alert when unvetted action rate > 0.5% or decision latency > 500ms.
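A sketch of the replay step above: re-submit a logged decision input to a locally running OPA and flag divergence from the recorded decision (the log file layout and field names are assumptions):
const fs = require('fs');

// Replay a logged OPA decision against a local policy engine (e.g., started with `opa run --server`).
async function replayDecision(logPath) {
  const record = JSON.parse(fs.readFileSync(logPath, 'utf8'));
  const res = await fetch('http://localhost:8181/v1/data/boundary/allow_action', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ input: record.input }),
  });
  const { result } = await res.json();
  // A divergence means the current policy version no longer matches the historical decision.
  if (result !== record.decision) console.warn(`divergence: logged=${record.decision}, now=${result}`);
  return result;
}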
Production checklist
Deployment and operations checklist (artifact: checklist items — at least 8 bullets):
- [ ] Observe-mode for 14 days; collect baseline metrics (decision latency, deny ratios, false positives).
- [ ] Policy repo with CI: unit tests, policy coverage target >= 80%.
- [ ] Vault integration: secret TTL <= 5 minutes for ephemeral credentials.
- [ ] Signed SYSTEM channel: HMAC key rotation schedule (rotate every 7 days; see the rotation-tolerant verification sketch after this checklist).
- [ ] Sandboxing: run code in microVMs (no host networking) with resource limits.
- [ ] Audit trail: store signed transcripts and action tokens for 1 year (or compliance window).
- [ ] Incident runbook: MTTC target <= 15 minutes and playbooks for token compromise, sandbox breakout.
- [ ] Canary/rollback: blue/green deploy for policy updates; automatic rollback if deny-rate spikes.
- [ ] Telemetry: traces and logs to ELK/OpenTelemetry with SLOs for ingestion latency.
- [ ] Training: red-team prompt injection tests monthly using shared corpus (include arXiv exploit patterns).
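One practical detail behind the key-rotation item above: during the 7-day rotation window the verifier should accept the previous key as well as the current one, so in-flight tokens stay valid (a sketch reusing verifyBoundary from the implementation section):
// Accept tokens signed with either the current or the immediately previous HMAC key.
function verifyWithRotation(token, currentKey, previousKey) {
  if (verifyBoundary(token, currentKey)) return true;
  return previousKey ? verifyBoundary(token, previousKey) : false;
}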
Final note: rules embedded in prompts are brittle; enforce immutable boundaries outside the model. Start with observe-mode, run policy-as-code, and gate capabilities with attestations. For technical references and the recent incident analyses, read the MIT Technology Review piece and the arXiv papers linked above: https://www.technologyreview.com/2026/01/28/1131003/rules-fail-at-the-prompt-succeed-at-the-boundary/, https://arxiv.org/abs/2602.04326, https://arxiv.org/abs/2602.04248, https://arxiv.org/abs/2602.04284.