TL;DR in plain English
- Most debugging agents fail because they receive raw, uncurated observability data; curation must happen before the model sees the data (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
- Build a curation pipeline that turns traces, logs, and metrics into a compact "incident package" containing a bounded time window, a small set of relevant traces/spans, a few log snippets, and minimal metadata so the agent reasons instead of filtering noise (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
- Start read-only, require human approval for remediation, and iterate thresholds based on reviewer feedback (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Methodology note: this document follows the curation-first principle described above and moves concrete numeric thresholds to the assumptions section for trial validation (https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
What you will build and why it helps
You will build a small curation pipeline that: (1) selects a bounded window around an incident, (2) fetches and ranks traces and logs within that window, (3) extracts 1–N high-signal spans and a few log snippets, (4) assembles an incident package with metadata and a one-line human note, and (5) exposes an expand-on-demand link to raw data. This reduces the signal-to-noise problem so the agent spends compute and tokens on reasoning rather than filtering (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Example decision frame (illustrative — use your trial to set numeric thresholds):
| Decision | Purpose | Example outcome | |---|---:|---| | Window selection | Limit scope to likely causal events | narrow time range around anomaly | | Trace limit | Reduce token cost and focus model | top-ranked traces only | | Span selection | Surface likely failure points | error spans, long-latency spans | | Log snippets | Provide representative evidence | 1–N lines with error or correlating IDs |
Reference: the need to structure and scope data before sending it to the agent is described here: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/.
Why this helps (brief): curated packages present context-rich, scoped inputs so models don't treat every trace/line equally. That materially reduces wasted token budget and reduces plausible-but-wrong fixes.
Before you start (time, cost, prerequisites)
Read this section and then validate concrete thresholds in a controlled trial (see Assumptions / Hypotheses below) (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Prerequisites:
- Read-only API credentials for traces, logs, and metrics (least privilege).
- Storage for incident packages (S3/GCS or a small DB) and a retention policy.
- A test model endpoint or sandbox and a human review channel (Slack, email, or GitHub).
- A set of historical incidents for validation and iteration.
Quick checklist to start:
- [ ] Obtain read-only observability API keys
- [ ] Configure storage with retention
- [ ] Provision a model sandbox and reviewer channel
- [ ] Collect historical incidents for validation (sample set)
Each item above maps to the curation-first rationale in this post: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/.
Step-by-step setup and implementation
- Pick an initial incident slice to protect (one service and one incident class). Keep the scope narrow for the first trials (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
- Define a curation flow: select a bounded time window, query traces/logs with strict limits, rank traces/spans by signal (errors, latency, exception counts), deduplicate, redact PII, and assemble the incident package with metadata and a human_note.
- Expose the curated package to the agent in read-only mode and require a reviewer to accept any suggested remediation.
- Provide an expand-on-demand path to raw data for reviewers; keep that path time-limited and size-limited.
- Collect reviewer feedback and iterate thresholds until reviewers consistently find the root cause in a high fraction of packages.
Example command (replace with your observability tool; this is illustrative):
# fetch top error traces for service "payments" in a narrow window
obsctl traces query --service payments --from now-5m --filter "status:500" --limit 10 > traces.json
Example compact incident-package (conceptual JSON; tailor fields to your schema):
{
"incident_id": "inv-YYYYMMDD-001",
"window": "2026-06-26T10:00:00Z/2026-06-26T10:05:00Z",
"top_spans": [{"service":"payments","span_id":"s1","error":true}],
"log_snippets": ["ERROR: payment timeout for order 123"],
"enrichments": {"deployment":"v1.2.3"},
"human_note": "Suspect network timeout on payments-worker"
}
Notes: keep the package schema compact and include a short README explaining why the package was assembled that way; reviewers should see the rationale alongside the data (https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Common problems and quick fixes
Problem: the agent chases irrelevant spans.
- Quick fix: narrow the package scope and suppress patterns from noisy services; add a suppression list for health-check noise.
Problem: logs overwhelm the model and increase cost.
- Quick fix: replace full logs with a 1–2 sentence auto-summary plus a small set of representative log snippets; provide raw logs on demand.
Problem: timestamps are inconsistent across tools.
- Quick fix: compute and store a timestamp_confidence field and surface it to reviewers; if confidence is low, mark the package as "low timestamp confidence" so the reviewer knows to expand the raw data.
Problem: agent lacks necessary human context.
- Quick fix: always attach a one-line human_note and deployment/rollout tags; require a human approval step before any automated remediation.
These mitigations follow the curation-first approach discussed here: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/.
First use case for a small team
Aim for maximum value with minimal effort: protect one high-impact service and run read-only with human-in-the-loop review. The source emphasizes that curation needs to happen before the agent sees the data (https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Minimal implementation plan for a 1–3 person team:
- Day 0: pick the critical service and collect 5–10 historical incidents.
- Day 1: script a basic pipeline that queries a short window, selects top traces, extracts a few spans and log snippets, writes a JSON package to storage, and posts to a reviewer channel.
- Week 1: run read-only, gather reviewer feedback, and tune the ranking/deduplication rules.
Operational safety: keep automation disabled until you have stable reviewer metrics; provide an expand-on-demand raw-data link for reviewers to investigate deeper.
Reference and motivation: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/.
Technical notes (optional)
Keep schema and tooling simple. Persist only curated snippets to reduce model input size and cost; redact PII at curation time with deterministic rules (https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Example redaction config (YAML):
redaction:
email: 'regex: ".+@.+\\..+"'
token: 'regex: "(api|auth)_?token=\\w+"'
keep_days: 30
Audit and metrics to collect (examples):
- packages_created_per_day
- reviewer_coverage_fraction
- token_spend_per_package
- agent_suggestion_accept_rate
Log every package creation and reviewer decision so you can trace regressions and measure whether curation improved signal-to-noise (https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
What to do next (production checklist)
Assumptions / Hypotheses
- Agents perform poorly when fed raw, uncurated observability data; curation-first reduces irrelevant attention and wasted token budget (source: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/).
Numeric thresholds to validate during trials (treat these as hypotheses to test, not prescriptive mandates):
- window lengths to try: 2 minutes, 5 minutes, 10 minutes (2, 5, 10)
- top traces to present: 3, 10, 20
- spans per package: 1–5
- log lines per trace: 1–10
- token budgets to experiment with: 512, 2,048, 8,192 tokens
- reviewer coverage target: root cause present in >= 80% of packages (80%)
- initial canary traffic for automation: 5% then 25% then full
- cost guardrail to alert on during experiments: $50/day
Validate each number against reviewer feedback and historical incident coverage before making them production defaults.
Risks / Mitigations
-
Risk: over-pruning hides the root cause.
- Mitigation: provide expand-on-demand to raw data, log package provenance, and require a human approval step before remediation.
-
Risk: PII or secrets leak in snippets.
- Mitigation: automated redaction at curation time, enforce retention policy, and keep audit logs for 30 days.
-
Risk: automation causes regressions.
- Mitigation: canary at small traffic slices (5% → 25% → 100%), monitor a false-positive trigger (example rollback if false-positive rate > 20%), and require human sign-off for high-risk actions.
Next steps
- Run the curation pipeline against 10 historical incidents and measure coverage; target: validate the reviewer_coverage hypothesis.
- Start a 7-day read-only trial and collect daily reviewer feedback.
- Configure feature-flagged canary rollouts for any automated actions (5% → 25% → full) and define rollback triggers.
- Instrument SLIs: token_spend_per_package, agent_suggestion_accept_rate, false_positive_rate; set alerts on validated thresholds.
Production checklist:
- [ ] Implement one curation rule and a safe read-only output path
- [ ] Run 7 days read-only and collect reviewer feedback
- [ ] Validate >= 80% coverage on historical incidents (experiment)
- [ ] Configure canary flags (5% → 25% → 100%)
- [ ] Define rollback triggers and automated flips
For the original motivation and deeper explanation of the curation-first principle, read: https://www.multiplayer.app/blog/how-to-curate-observability-data-for-ai-agents/.