How Viktor uses prompt caching and byte-stable prefixes to cut agent-thread costs

TL;DR in plain English

Prompt caching converts repeated thread history into cheap cache reads (≈0.1x read cost) by enforcing a byte-stable prefix and an SDK-first tool model. Viktor reports an example 40-step thread dropping from $11.35 to $2.07 (81.8% savings) on Claude Opus 4.8. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
Why it matters: LLM APIs are mostly stateless, so agents re-send the conversation each call. Viktor measured ~2.17M input tokens for a 40-step thread even though the transcript is ~85K tokens; caching turns that repeated prefix into cheap reads. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
Quick start checklist (30–60 minutes):
- [ ] Record baseline tokens-per-thread and cost-per-thread
- [ ] Stand up a Redis/TLS cache and instrument hit-rate
- [ ] Add a byte-stable serializer and an SDK wrapper for tools

Methodology note: numbers and the running example come from Viktor's production write-up and are used here as a reproducible reference. See: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

What you will build and why it helps

You will build a thread engine organized around a prompt cache so repeated tokens become cache reads at ≈0.1x cost. Key structural choices (all described in Viktor's post): SDK-exposed tools, append-only thread logs, byte-stable prefixes, and in-cache summarization/compaction. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Why these choices help:

SDK-first tools remove variable schema text from prompts and stabilize prefixes.
Append-only logs make canonical prefixes deterministic and safe to compact.
Byte-stability (deterministic serialization) yields high cache hit-rates.
In-cache summarization turns a separate full-priced summarization call into a cheap cache read.
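
The effect of the ≈0.1x read rate can be sketched with a toy cost model. The function name, the per-token price, and the assumption that cache writes are billed like ordinary input are illustrative and not taken from Viktor's post; real providers may charge a write premium.

```python
def thread_input_cost(prefix_sizes, new_sizes, price_per_token, read_multiplier=0.1):
    """Estimate a thread's input cost without and with caching.

    prefix_sizes[i]: tokens already sent in earlier calls (the cacheable prefix).
    new_sizes[i]:    tokens appended since the previous call.
    Assumption: every prefix is a cache hit and cache writes cost the same
    as ordinary input tokens.
    """
    pairs = list(zip(prefix_sizes, new_sizes))
    uncached = sum((p + n) * price_per_token for p, n in pairs)
    cached = sum((p * read_multiplier + n) * price_per_token for p, n in pairs)
    return uncached, cached


# Toy thread: 4 calls, each appending 1,000 tokens to the history.
new = [1000] * 4
prefix = [0, 1000, 2000, 3000]
uncached, cached = thread_input_cost(prefix, new, price_per_token=0.00001)
print(f"uncached=${uncached:.3f} cached=${cached:.3f}")
```

Even in this four-call toy thread the cached cost is under half the uncached cost, and the gap widens as the prefix grows relative to the new tokens per call.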

Comparison (Viktor example, 40-step thread):

Without caching: $11.35
With caching: $2.07
Savings: $9.28 (81.8%)

Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Before you start (time, cost, prerequisites)

Time estimates

Prototype to capture tokens and test cache: ~3 hours.
SDK conversion and safe compaction rollout: 2–5 days.

Cost illustration (Viktor example)

40-step thread: $11.35 without caching, $2.07 with caching → $9.28 saved (81.8%). Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
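
The savings figure follows directly from the two reported costs. A minimal check (the helper name is hypothetical):

```python
def savings(before: float, after: float):
    """Return (dollars saved, percent saved) rounded for reporting."""
    saved = before - after
    return round(saved, 2), round(100 * saved / before, 1)


saved_usd, saved_pct = savings(11.35, 2.07)
print(saved_usd, saved_pct)
```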

Prerequisites

Model API access and token-counting or billing traces.
A fast cache (example: Redis with TLS and TTL support).
Ability to move tool invocations into an SDK layer.
Instrumentation for tokens-per-call, cache-hit-rate, latency, and cost-per-thread.

Pre-launch checklist

- [ ] API keys and billing access
- [ ] Cache instance (example: Redis) with TLS
- [ ] Tokenization library in your language
- [ ] Logging for tokens-per-call and thread-id
- [ ] A representative test thread (target: ~40 calls)

Security note: treat cached summaries and thread logs as sensitive; enable encryption-at-rest and strict ACLs. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Step-by-step setup and implementation

Instrument and measure baseline

Log tokens-per-call, cumulative tokens-per-thread, latency (ms), and cost-per-thread ($). Run a 40-step representative thread to confirm the baseline (~2.17M input tokens vs ~85K transcript tokens in Viktor's example). Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
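
A minimal sketch of baseline instrumentation, assuming a hypothetical `ThreadMeter` helper and a simplified thread in which history grows by a fixed 2,125 tokens per step (85,000 / 40). This linear model only illustrates why cumulative input dwarfs the transcript; it does not reproduce Viktor's exact 2.17M figure.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadMeter:
    """Accumulates per-call input tokens and latency for one thread."""
    thread_id: str
    calls: list = field(default_factory=list)

    def record(self, input_tokens: int, latency_ms: float) -> None:
        self.calls.append({"input_tokens": input_tokens, "latency_ms": latency_ms})

    @property
    def total_input_tokens(self) -> int:
        return sum(c["input_tokens"] for c in self.calls)


# Simulate 40 calls where each call re-sends the whole history so far.
meter = ThreadMeter("sample-40")
history = 0
for step in range(40):
    history += 2_125  # assumed tokens added per step
    meter.record(input_tokens=history, latency_ms=0.0)

print(history, meter.total_input_tokens)
```

In a real run, `input_tokens` would come from the provider's usage fields or a tokenizer rather than a simulation.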

Add a byte-stable serializer and canonical prefix

Use deterministic JSON key ordering, strip ephemeral timestamps, and normalize floats and IDs so that identical inputs always produce identical bytes.
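
One way to get byte-stable output is sketched below using Python's standard `json` module. The set of ephemeral field names and the float precision are assumptions for illustration; adapt them to your own event schema.

```python
import json

EPHEMERAL_FIELDS = {"ts", "request_id"}  # assumed names of volatile fields


def canonical_bytes(event: dict) -> bytes:
    """Serialize an event so that equal content always yields equal bytes."""

    def clean(value):
        if isinstance(value, dict):
            return {k: clean(v) for k, v in value.items() if k not in EPHEMERAL_FIELDS}
        if isinstance(value, list):
            return [clean(v) for v in value]
        if isinstance(value, float):
            return round(value, 6)  # assumed precision for float normalization
        return value

    text = json.dumps(clean(event), sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


# Same content, different key order and timestamps: the bytes match.
a = canonical_bytes({"type": "tool", "seq": 3, "ts": 1700000000.1,
                     "payload": {"b": 1, "a": 2}})
b = canonical_bytes({"payload": {"a": 2, "b": 1}, "ts": 1700000999.9,
                     "seq": 3, "type": "tool"})
print(a == b)
```

Sorting keys and fixing the separators removes the two most common sources of drift: dictionary ordering and incidental whitespace.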

Implement a prompt cache

Key pattern: thread:{thread_id}:prefix:v1
TTL tuning examples: hot window 30m; default TTL 60m (3600s). Tune per provider. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
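
The key pattern and TTL behavior can be sketched with an in-memory stand-in for Redis. The `PrefixCache` class and its injectable clock are hypothetical and exist only to make the expiry logic easy to test; a production version would delegate storage and expiry to the cache backend.

```python
import time


class PrefixCache:
    """In-memory stand-in for a TTL cache such as Redis (illustrative only)."""

    def __init__(self, ttl_seconds: int = 3600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(thread_id: str) -> str:
        return f"thread:{thread_id}:prefix:v1"

    def set(self, thread_id: str, prefix: bytes) -> None:
        expires_at = self._clock() + self.ttl_seconds
        self._store[self.key(thread_id)] = (prefix, expires_at)

    def get(self, thread_id: str):
        entry = self._store.get(self.key(thread_id))
        if entry is None or entry[1] <= self._clock():
            self.misses += 1
            return None
        self.hits += 1
        return entry[0]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Fake clock so expiry can be demonstrated without sleeping.
now = [0.0]
cache = PrefixCache(ttl_seconds=3600, clock=lambda: now[0])
cache.set("sample-40", b"prefix-bytes")
fresh = cache.get("sample-40")    # within TTL: hit
now[0] = 3601.0
expired = cache.get("sample-40")  # past TTL: miss
```

Tracking hits and misses inside the cache wrapper gives the hit-rate metric used by the rollout gates for free.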

Make tools SDK-first

Move tool logic out of prompt text; call SDK functions and append tool outputs to the append-only thread log.
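
A minimal sketch of the SDK-first idea, assuming a hypothetical `tool` decorator and a placeholder `lookup_order` tool. The point is that the prompt never carries tool schema text; the agent calls plain functions and only the outputs enter the log.

```python
TOOLS = {}


def tool(fn):
    """Register a plain function as a tool (hypothetical decorator)."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def lookup_order(order_id: str) -> dict:
    # Placeholder logic; a real tool would call an internal service.
    return {"order_id": order_id, "status": "shipped"}


def run_tool(log: list, name: str, **kwargs) -> dict:
    """Run a registered tool and append its output to the thread log."""
    result = TOOLS[name](**kwargs)
    log.append({"seq": len(log), "type": "tool",
                "payload": {"name": name, "result": result}})
    return result


log = []
run_tool(log, "lookup_order", order_id="A-17")
print(log)
```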

Convert threads to append-only logs

Event schema example: {seq:int, type:user|model|tool, ts, payload}
Append-only invariant preserves byte-stability and simplifies compaction.
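
The event schema above might be modeled as follows. The `Event` and `ThreadLog` names are hypothetical; the sketch enforces the append-only invariant by assigning `seq` internally, freezing events, and exposing only a read-only view.

```python
import time
from dataclasses import dataclass
from typing import Any

ALLOWED_TYPES = ("user", "model", "tool")


@dataclass(frozen=True)
class Event:
    seq: int
    type: str
    ts: float
    payload: Any


class ThreadLog:
    """Append-only event log: events are never edited or reordered."""

    def __init__(self):
        self._events = []

    def append(self, type_: str, payload: Any, ts: float = None) -> Event:
        if type_ not in ALLOWED_TYPES:
            raise ValueError(f"unknown event type: {type_!r}")
        stamp = time.time() if ts is None else ts
        event = Event(len(self._events), type_, stamp, payload)
        self._events.append(event)
        return event

    def events(self) -> tuple:
        return tuple(self._events)  # read-only snapshot


thread_log = ThreadLog()
e0 = thread_log.append("user", {"text": "Where is my order?"}, ts=1.0)
e1 = thread_log.append("tool", {"status": "shipped"}, ts=2.0)
```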

Summarize and compact inside the cache

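Trigger compaction when a threshold is crossed (example triggers: more than 40 events or more than 85,000 tokens). Because summarization runs inside the thread's cache, the history it reads is billed at the cheap read rate (~0.1x) rather than as a separate full-priced model call.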
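
The example triggers can be expressed as a small predicate. The function name and defaults are illustrative; the thresholds are the example values from the text.

```python
def should_summarize(event_count: int, token_count: int,
                     max_events: int = 40, max_tokens: int = 85_000) -> bool:
    """True when either example threshold is exceeded."""
    return event_count > max_events or token_count > max_tokens


print(should_summarize(41, 10_000))   # over the event threshold
print(should_summarize(40, 85_000))   # exactly at both limits: not yet
```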

Provider adapter and compaction timing

Never compact a hot thread. Example gate: compact only if cache-age > 30 minutes and cache-hit-rate > 80%.
Build provider adapters because providers differ (explicit breakpoints, TTLs, routing). Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
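
A sketch of the example gate plus a minimal adapter record. `ProviderAdapter` and `may_compact` are hypothetical names, and the `opus` values mirror the example config rather than any documented provider setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderAdapter:
    """Per-provider cache settings (fields are illustrative)."""
    name: str
    explicit_breakpoints: bool
    recommended_ttl: int  # seconds


OPUS = ProviderAdapter(name="opus", explicit_breakpoints=True, recommended_ttl=1800)


def may_compact(cache_age_minutes: float, cache_hit_rate: float,
                min_age_minutes: float = 30, min_hit_rate: float = 0.80) -> bool:
    """Example gate: compact only when the thread is no longer hot."""
    return cache_age_minutes > min_age_minutes and cache_hit_rate > min_hit_rate


print(may_compact(cache_age_minutes=45, cache_hit_rate=0.9))
print(may_compact(cache_age_minutes=10, cache_hit_rate=0.9))
```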

Validate with the running example

Re-run the 40-step example and compare cost-per-thread against the baseline (target: ≥50% reduction; Viktor's example: 81.8%).

Smoke-test commands:

# run a single-thread smoke test
python run_thread_smoke.py --thread-id sample-40 --capture-tokens --save=./out --retries=3
# inspect token and cost summary
cat ./out/sample-40-summary.json | jq '.tokens_per_call | {total, calls, cost_estimate}'

Example config (cache and compaction policy):

cache:
  backend: redis
  prefix: "thread"
  ttl_seconds: 3600  # 60m default
compaction:
  hot_window_minutes: 30
  compact_before_cold_minutes: 5
  summary_trigger_calls: 40
  summary_trigger_tokens: 85000
provider_adapters:
  opus:
    explicit_breakpoints: true
    recommended_ttl: 1800

Rollout plan and gates

Canary: 10% traffic for 48h. Gate: cache-hit-rate ≥ 75% and cost/thread reduction ≥ 50%.
Ramp: 10% → 50% → 100% if metrics hold.
Rollback triggers: correctness drop > 3%, user-reported regressions, or cost-per-thread rising > 20%.
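
The canary gate and rollback triggers above can be encoded as two predicates so they are checked the same way every time. Function names are hypothetical; all rates are fractions (0.75 means 75%).

```python
def canary_passes(cache_hit_rate: float, cost_reduction: float) -> bool:
    """Gate for moving past the 10% canary."""
    return cache_hit_rate >= 0.75 and cost_reduction >= 0.50


def should_rollback(correctness_drop: float, cost_increase: float,
                    user_regressions: bool = False) -> bool:
    """Any single trigger is enough to roll back."""
    return correctness_drop > 0.03 or cost_increase > 0.20 or user_regressions


print(canary_passes(cache_hit_rate=0.82, cost_reduction=0.61))
print(should_rollback(correctness_drop=0.01, cost_increase=0.25))
```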

Reference design: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Common problems and quick fixes

Problem: cache misses from tiny serialization differences

Fix: enforce deterministic JSON, strip timestamps, and normalize IDs. Add byte-diff tests and aim for zero byte drift: re-serializing a stable event must produce exactly the same bytes. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
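
A byte-diff test needs little more than a helper that reports where two serializations first diverge. The helper name is hypothetical:

```python
def first_byte_diff(a: bytes, b: bytes):
    """Return the index of the first differing byte, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) == len(b):
        return None
    return min(len(a), len(b))  # one input is a strict prefix of the other


# A single stray space after a colon is enough to break a cached prefix.
stable = b'{"a":1,"b":2}'
drifted = b'{"a": 1,"b":2}'
offset = first_byte_diff(stable, drifted)
print(offset)
```

Reporting the offset, rather than only pass or fail, makes it much faster to find which field introduced the drift.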

Problem: compaction removes context and hurts accuracy

Fix: add a compaction safety window (do not compact if cache-age < 30 minutes). Validation gate: ensure A/B accuracy degradation < 2% before enabling aggressive compaction.

Problem: tools still inject variability into prompts

Fix: move tool runs to SDK functions and append outputs to the append-only log.

Problem: provider cache eviction or routing differs

Fix: add a provider adapter and use conservative TTLs (example recommended_ttl 1800s). Monitor provider-specific cache behavior and route accordingly. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Quick metrics to monitor

cache-hit-rate (target > 75%)
cost-per-thread (target reduction ≥ 50%)
latency tail increase (target < 100 ms)
correctness degradation (alert if > 3%)

First use case for a small team

Scenario: 2–5 person support team where a ticket-triage agent averages 30–50 model calls per ticket. Follow this 6-step rollout (reference: Viktor example). Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Baseline capture: record tokens/thread and cost/thread for 7 days.
Prototype: add cache + byte-stable prefix and run on offline replay threads (prototype ~3 hours).
Smoke tests: run 10 sample threads. Target: cost reduction > 50%.
Canary: enable caching for 10% of live tickets for 48h. Monitor correctness and cache-hit-rate.
Ramp: 10% → 50% → 100% if cache-hit-rate > 75% and correctness holds.
Full ramp + optimize compaction (suggested triggers: 40 calls or 85,000 tokens).

Operational targets for the small team:

cache-hit-rate > 75%
cost-per-thread reduction > 50%
latency tail increase < 100 ms

Aim: Viktor’s 81.8% cost reduction on a 40-step thread is a practical benchmark. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching

Technical notes (optional)

Byte-stability: use deterministic JSON, stable field ordering, and removal of ephemeral fields for reliable cache hits. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
Compaction algorithms: in-cache LLM summarization, windowed snapshots, or checkpoint compression. Trade-offs: accuracy vs read-cost; Viktor runs summarization inside the thread's cache so history reads are cheap. Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
Provider differences: caches behave differently (explicit breakpoints vs automatic TTLs vs routing). Build an adapter layer and tune TTLs per-provider.
Security: encrypt cache values at rest, enforce ACLs, and redact PII where needed.

What to do next (production checklist)

Assumptions / Hypotheses

Model APIs are stateless; callers re-send history and repeated tokens drive cost. Viktor’s example: 2.17M input tokens vs 85K transcript on a 40-step thread and cost drop from $11.35 → $2.07 (81.8%). Source: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
Hypothesis: byte-stable prefixes + SDK-first tools will yield cache-hit-rate > 80% on representative threads; validate with A/B testing and a canary (10% traffic for 48h).

Risks / Mitigations

Risk: correctness loss from over-aggressive compaction.
- Mitigation: compaction safety window (do not compact if cache-age < 30 minutes), A/B tests, rollback if correctness drop > 3%.
Risk: cache misses from serialization drift.
- Mitigation: deterministic serializer, byte-diff tests, and stable event schemas.
Risk: provider cache eviction surprises.
- Mitigation: provider adapter and conservative TTLs (example hot window 30m; recommended_ttl 1800s).
Risk: PII leakage in cache.
- Mitigation: encryption-at-rest, strict ACLs, redaction policies.

Next steps

Build a 3-hour prototype to measure tokens-per-thread on your heaviest agent; aim to reproduce the baseline numbers.
Implement a canonical serializer and provider adapter; canary at 10% traffic and target cache-hit-rate ≥ 75%.
Stage rollout: 10% → 50% → 100% with gates (cache-hit-rate, correctness, cost-per-thread).

Post-launch alerts to configure:

cache-hit-rate < 60%
cost-per-thread rises > 20%
correctness drop > 3%
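
These thresholds could be evaluated by a small function in whatever monitoring job you already run. The function and alert names are hypothetical; inputs are fractions.

```python
def active_alerts(cache_hit_rate: float, cost_increase: float,
                  correctness_drop: float) -> list:
    """Return the names of the post-launch alerts that are currently firing."""
    alerts = []
    if cache_hit_rate < 0.60:
        alerts.append("cache-hit-rate")
    if cost_increase > 0.20:
        alerts.append("cost-per-thread")
    if correctness_drop > 0.03:
        alerts.append("correctness")
    return alerts


print(active_alerts(cache_hit_rate=0.55, cost_increase=0.05, correctness_drop=0.04))
```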

Core reference and further reading: https://viktor.com/blog/how-we-built-viktor-around-prompt-caching
