TL;DR in plain English
- DoneSpec is an open-source pattern and reference implementation for deterministic validation of AI agent task completion; see the repository at https://github.com/xryv/DoneSpec.
- Pattern: the agent writes its output to disk (JSON or text), then a small machine-checkable validator reads that artifact and returns PASS (exit code 0) or FAIL (non-zero exit code).
- Why use it: turn ambiguous natural-language "I'm done" signals into boolean, auditable outcomes; this reduces human review time and enables CI gating and automated retry logic.
Quick facts and thresholds (summary):
- Pilot: 1–3 hours to ship a single simple check; 4–12 hours for a robust spec.
- Canary: run an optional 7-day (168-hour) canary targeting ~100 runs before enforcement.
- Enforcement thresholds: require ≥95% pass rate over ~100 runs to flip checks to required; allow rollback if pass rate drops below 90%.
Methodology note: this guide adapts the deterministic-validation pattern described in the DoneSpec repository.
What you will build and why it helps
You will add a deterministic post-run validator to your agent loop that reads the agent's output artifact and emits a clear PASS/FAIL. The DoneSpec repo contains examples of this approach.
Concrete benefits (measurable):
- Clear gate: a boolean PASS/FAIL replaces fuzzy status text, reducing time-to-merge and human review load (target: ≥50% fewer ambiguous reviews in pilot data).
- CI automation: use the boolean exit code to gate merges or trigger automated retries (for example, up to 3 attempts; see the sketch after this list).
- Observability: track pass-rate (%) and validation latency (ms) to evaluate stability.
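To make the retry idea concrete, here is a minimal sketch in Python; the wrapper script name (./scripts/run-and-validate.sh) and the 3-attempt limit are illustrative assumptions, not part of DoneSpec:

```python
import subprocess
import sys

MAX_ATTEMPTS = 3  # example retry policy from the bullet above

def run_with_retries(cmd: list[str], max_attempts: int = MAX_ATTEMPTS) -> int:
    """Re-run the agent-plus-validator pipeline until it passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            print(f"PASS on attempt {attempt}")
            return 0
        print(f"FAIL on attempt {attempt} (exit code {result.returncode})")
    return 1  # still failing; surface the non-zero code so CI can gate the merge

if __name__ == "__main__":
    # Hypothetical wrapper that runs the agent and then the validator.
    sys.exit(run_with_retries(["./scripts/run-and-validate.sh"]))
```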
What you will produce:
- One spec file (YAML or JSON) and 2+ fixtures (good/bad). Aim for 10+ fixtures over time.
- A validator CLI or library that runs in <2000 ms for typical checks (target 200–2000 ms; heavy checks should be capped at 5 minutes / 300000 ms).
Before you start (time, cost, prerequisites)
Estimated effort and cost (numbers):
- Quick path: ~1 hour (single check). Conservative: ~3 hours.
- Robust spec and fixtures: 2–12 hours (typically 4–12). Maintain 10+ fixtures over the following months.
- CI cost: expect roughly $0.10–$5.00 for small pilots depending on runtime and number of runs; monitor minutes used.
- Retention: keep artifacts for at least 30 days to allow replay and audits.
People and permissions:
- Team size: 1–3 people for a small pilot; add owners as adoption grows.
- Required access: git and permission to add or edit CI workflows (example: GitHub Actions).
Technical prerequisites:
- An agent or test harness that writes output to disk (JSON, NDJSON, or plain text).
- A CI system; GitHub Actions is a common choice and is used in the examples below.
- Basic shell, git, and small scripting skills (bash/python).
Quick pre-flight checklist
- [ ] Clone the DoneSpec repository: https://github.com/xryv/DoneSpec
- [ ] Identify the agent output file to validate (JSON preferred)
- [ ] Create at least two fixtures: one known-good and one known-bad (10+ recommended over time; an example pair follows this checklist)
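For orientation, a minimal fixture pair might look like the following, assuming the agent emits a JSON artifact with a tests array and a linter exit code (the exact shape depends on your agent):

fixtures/good.json:
```json
{"status": "done", "tests": ["tests/test_parser.py"], "lint_exit_code": 0}
```

fixtures/bad.json:
```json
{"status": "done", "tests": [], "lint_exit_code": 1}
```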
Step-by-step setup and implementation
Follow these condensed steps. The DoneSpec repo contains example specs and tools.
- Clone and inspect the repo
```bash
git clone https://github.com/xryv/DoneSpec.git
cd DoneSpec
ls -la
```
- Create a minimal spec and fixtures
- Choose one high-confidence invariant (required key, non-empty tests array, or checksum match).
- Store the spec at specs/my_spec.yaml and the fixtures at fixtures/good.json and fixtures/bad.json; a hypothetical spec sketch follows.
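A hypothetical spec sketch, assuming the artifact shape from the fixtures above; check the repo's own examples for the real schema before adopting this:

```yaml
# specs/my_spec.yaml -- illustrative shape only, not DoneSpec's actual schema
checks:
  - name: has-tests
    path: tests            # required key in the artifact
    assert: non_empty_array
  - name: lint-clean
    path: lint_exit_code
    assert: equals
    value: 0
```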
- Wire the validator into the agent
- Pattern A: in-process library call that returns True/False or exits 0/1.
- Pattern B: external CLI that reads artifact.json and exits 0/1. A sketch covering both patterns follows.
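A minimal standalone sketch covering both patterns (this is not the repo's validator, and the required-keys check is a simplification):

```python
#!/usr/bin/env python3
"""Import validate() for Pattern A; run as a CLI exiting 0/1 for Pattern B."""
import argparse
import json
import sys

def validate(artifact: dict, required_keys: list[str]) -> bool:
    """Pattern A: in-process call that returns True/False."""
    return all(artifact.get(k) not in (None, "", []) for k in required_keys)

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="agent artifact (JSON)")
    parser.add_argument("--require", nargs="+", default=["tests"],
                        help="keys that must be present and non-empty")
    args = parser.parse_args()
    with open(args.input) as f:
        artifact = json.load(f)
    ok = validate(artifact, args.require)
    print("PASS" if ok else "FAIL")
    return 0 if ok else 1  # Pattern B: boolean exit code for CI gating

if __name__ == "__main__":
    sys.exit(main())
```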
- Local validation and timing targets
- Validate good and bad fixtures locally. Expect deterministic outcomes: good -> exit 0, bad -> non-zero.
- Measure validation time; target 200–2000 ms per run for lightweight checks. Cap heavy checks at 5 minutes / 300000 ms.
- Add CI job and canary (optional for first 7 days)
- Keep the check optional for the first 168 hours (7 days), collect ~100 runs, and log pass/fail counts and mean latency (ms).
Example GitHub Actions snippet:
```yaml
name: donespec-check
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent in test mode
        run: ./scripts/run-agent.sh --output artifact.json --mode test
      - name: Run DoneSpec validator
        run: |
          python3 tools/donespec_validator.py --spec specs/my_spec.yaml --input artifact.json
```
- Iterate with regression fixtures
- Maintain ≥10 fixtures over time (e.g., 5 good, 5 bad) to measure false positive and false negative rates; a measurement sketch follows the command example below.
Command example (local validator):
```bash
python3 tools/donespec_validator.py --spec specs/my_spec.yaml --input fixtures/good.json
# expect exit code 0
```
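To turn the fixture runs into false positive/negative counts, a small harness like this works, assuming filenames encode the ground truth (good*.json should pass, bad*.json should fail):

```python
#!/usr/bin/env python3
import pathlib
import subprocess

false_neg = false_pos = 0
for f in sorted(pathlib.Path("fixtures").glob("*.json")):
    rc = subprocess.run(
        ["python3", "tools/donespec_validator.py",
         "--spec", "specs/my_spec.yaml", "--input", str(f)]).returncode
    if f.name.startswith("good") and rc != 0:
        false_neg += 1  # known-good fixture rejected: spec too strict
    if f.name.startswith("bad") and rc == 0:
        false_pos += 1  # known-bad fixture accepted: spec too loose
print(f"false negatives: {false_neg}, false positives: {false_pos}")
```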
Common problems and quick fixes
Common symptoms, likely causes, and quick actions:
| Symptom | Likely cause | Quick fix (example thresholds) |
|---|---|---|
| False negatives | Spec too strict | Relax non-essential asserts; canonicalize whitespace/timestamps; allow IDs to vary (target: ≥50% reduction in false negatives) |
| False positives | Spec too loose | Add 1–3 high-confidence invariants (required keys, checksum); require presence of a tests file |
| CI timeouts / high cost | Heavy checks in PR path | Move heavy checks to nightly; set per-check timeout = 300000 ms (5 minutes) |
| Environment flakiness | Unpinned runtimes or missing caches | Pin runtimes (node=18.x, python=3.11) and use a Docker image; target <1% environment-induced failures |
Quick fixes
- Too many false negatives: relax optional asserts and add normalization steps (see the canonicalization sketch below).
- Too many false positives: add required-key checks and a checksum.
- Flaky CI: pin runtimes or run inside Docker; cache dependencies.
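A canonicalization sketch for the false-negative case; the set of volatile keys is an assumption about your artifact, so adjust it to match what actually varies between runs:

```python
import re

VOLATILE_KEYS = {"timestamp", "run_id", "duration_ms"}  # assumed to vary per run

def canonicalize(artifact: dict) -> dict:
    """Strip run-to-run noise before asserting, so outcomes stay deterministic."""
    out = {k: v for k, v in artifact.items() if k not in VOLATILE_KEYS}
    for k, v in out.items():
        if isinstance(v, str):
            out[k] = re.sub(r"\s+", " ", v).strip()  # collapse whitespace
    return out
```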
First use case for a small team
Target: 1–3 people or a solo maintainer; fast iteration and low overhead.
Concrete plan (hours / days):
- Ship one high-value invariant in 1–3 hours
- Example invariant: every generated PR must include at least one tests file and the linter must exit 0.
- Implement the spec and two fixtures (good + bad). Keep the spec concise (aim <50 lines).
- Run a 7-day (168-hour) optional CI canary
- Collect ~100 runs if available. Track pass-rate (%) and average validation time (ms).
- Enforce only after stability
- Flip to required checks only when pass-rate ≥95% over the last ~100 runs.
- Roll back if pass-rate falls below 90% during enforcement; a decision sketch follows this list.
- Low-effort observability
- Log PASS/FAIL with timestamp and a short reason (under 200 tokens). Retain artifacts for 30 days.
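The enforce/rollback decision above reduces to a few lines; this sketch assumes you can recover a chronological list of PASS/FAIL booleans from your logs:

```python
WINDOW = 100        # runs to consider
ENFORCE_AT = 0.95   # flip to required at >=95% pass rate
ROLLBACK_AT = 0.90  # revert to optional below 90%

def decide(results: list[bool], currently_enforced: bool) -> str:
    recent = results[-WINDOW:]
    if not recent:
        return "keep current state"
    rate = sum(recent) / len(recent)
    if not currently_enforced and len(recent) >= WINDOW and rate >= ENFORCE_AT:
        return "flip to required"
    if currently_enforced and rate < ROLLBACK_AT:
        return "roll back to optional"
    return "keep current state"
```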
Practical checklist
- [ ] Create specs/my_spec.yaml and fixtures/good.json and fixtures/bad.json
- [ ] Add a CI workflow that runs DoneSpec as an optional check for 7 days
- [ ] Monitor pass-rate (%) and validation latency (ms)
- [ ] Flip to required when pass-rate ≥95% over 100 runs
Examples and fallback
- If you need a tiny validator, adapt scripts from https://github.com/xryv/DoneSpec or write a small Python CLI to check JSON keys and exit 0/1.
Technical notes (optional)
- Determinism: the pattern requires saving agent outputs to disk so validations are replayable and deterministic.
- Integration patterns: in-process library call or out-of-process CLI that reads artifact.json; both should produce a boolean outcome.
- Test harness advice: keep at least 10 fixtures (e.g., 5 good, 5 bad) and measure false positive/negative rates as counts.
Example test loop (bash):
```bash
for f in fixtures/*.json; do
  python3 tools/donespec_validator.py --spec specs/my_spec.yaml --input "$f" || echo "FAILED: $f"
done
```
- Logging and telemetry: emit minimal structured logs (timestamp, pass/fail, reason code). Keep message sizes under ~200 tokens and store for 30 days.
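One way to emit such a log line, with a rough character cap standing in for the ~200-token guidance (the field names are illustrative):

```python
import json
import time

def log_result(passed: bool, reason: str) -> None:
    """Print one compact JSON line per validation run."""
    print(json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "result": "PASS" if passed else "FAIL",
        "reason": reason[:200],  # crude cap in the spirit of ~200 tokens
    }))
```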
What to do next (production checklist)
Assumptions / Hypotheses
- The DoneSpec repository at https://github.com/xryv/DoneSpec provides examples and a deterministic validation approach you can adapt. This guide assumes you can run a validator as a CLI or library and save agent outputs as JSON or text artifacts.
- Assumes permission to modify CI workflows (for example GitHub Actions) and to add optional status checks during a canary.
Risks / Mitigations
- Risk: false enforcement causing developer friction. Mitigation: keep the check optional for 7 days (168 hours), collect ~100 runs, and require ≥95% pass rate before enforcing.
- Risk: spec drift as the agent evolves. Mitigation: require review for spec changes, keep a changelog, and require owner signoff for edits.
- Risk: increased CI cost or latency. Mitigation: cap per-check runtime (for example 5 minutes / 300000 ms; see the snippet below), move heavy checks to nightly runs, and measure CI cost during the pilot (target: $0.10–$5.00).
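In GitHub Actions, the runtime cap can be expressed directly on the job via the standard timeout-minutes setting:

```yaml
jobs:
  validate:
    runs-on: ubuntu-latest
    timeout-minutes: 5  # hard cap matching the 300000 ms budget above
```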
Next steps
Short-term (1–7 days)
- Clone https://github.com/xryv/DoneSpec, write a minimal spec and two fixtures, and add an optional CI job. Run a 7-day canary targeting ~100 runs and record pass-rate (%) and mean latency (ms).
Medium-term (2–4 weeks)
- Expand fixtures to 10+ examples, tune the spec, add dashboards for pass-rate (%), average validation time (ms), and failure counts.
Long-term (1–3 months)
- Make DoneSpec checks a required merge gate once thresholds are stable (≥95% over ~100 runs). Maintain a spec-change log and automate rollback if pass-rate drops below the recovery threshold (90%).
Final reference: https://github.com/xryv/DoneSpec