ERROR HANDLING POLICY¶
Status: IMMUTABLE. Loaded by every agent. Version: 1.0.0 Decision rank: #4
π― PRINCIPLE¶
"Errors are first-class citizens. Treat them with the same rigor as successful outputs."
Every failure is logged, classified, attributed, and either auto-remediated or escalated. Silent failure is the worst failure.
π·οΈ ERROR TAXONOMY¶
All errors classified into one of:
| Class | Examples | Default Handler |
|---|---|---|
| TRANSIENT | Network timeout, rate limit, 503 | Retry with backoff |
| DETERMINISTIC | Schema invalid, capability missing, bad input | No retry; reroute or fail |
| RESOURCE | OOM, disk full, token budget exhausted | Halt + escalate |
| CONSTITUTIONAL | Hard constraint violated | Hard stop + incident |
| SECURITY | Auth fail, secret leak attempt, injection | Hard stop + security alert |
| UNKNOWN | Uncategorized | Treat as DETERMINISTIC; log for triage |
π RETRY POLICY¶
TRANSIENT errors only:¶
retry:
max_attempts: 3
backoff: exponential
base_delay_ms: 1000
max_delay_ms: 30000
jitter: 0.3 # 30% random jitter to avoid thundering herd
retry_on_status: [408, 429, 500, 502, 503, 504]
retry_on_exception: [TimeoutError, ConnectionError]
After 3 attempts β escalate to circuit breaker.
DETERMINISTIC errors:¶
Never retry. Reroute to alternative agent/skill or fail story.
β‘ CIRCUIT BREAKERS¶
Per-agent and per-skill circuit breakers.
States: - CLOSED β normal operation - HALF-OPEN β testing if recovered (single probe) - OPEN β failing, rerouting to backup
Triggers:
breaker:
failure_threshold: 5 # 5 failures in window
failure_window_seconds: 60
recovery_timeout_seconds: 30
half_open_probe_count: 1
When breaker OPEN:
1. Mark agent/skill temporarily_unavailable
2. Reroute via fallback chain
3. Emit alert
4. After recovery_timeout β HALF-OPEN probe
5. Probe success β CLOSED; failure β reset timer
πͺ FALLBACK CHAINS¶
Defined per agent in _agents-registry.yaml:
agent_id: frontend-specialist
fallbacks:
primary: frontend-specialist
fallback_1: frontend-specialist-fallback # same role, different prompt
fallback_2: generalist-engineer # cheaper, less specialized
fallback_3: human_escalation # last resort
Fallback chain executed in order. Each step logs reason for fallback.
π HARD STOPS (no recovery, immediate halt)¶
The following errors halt execution AND escalate:
- CONSTITUTIONAL violation β incident report mandatory
- SECURITY breach attempt β security team notified
- Budget hard-stop (cost > 95% project budget)
- Recursion depth exceeded (>5 levels)
- Mutual deadlock detected (2+ agents waiting on each other)
- Story timeout (>30min)
- Output validation persistent failure (3 consecutive)
On hard stop: - Snapshot full state - Log incident - Notify user - Wait for user intervention - Never auto-resume hard-stopped tasks
π©Ή AUTO-REMEDIATION¶
For known failure patterns (catalogued in FAILURE-MODES.md):
remediation_rules:
- pattern: "skill X returns empty result for input Y"
confidence: 0.95
action: reroute_to_skill_Z
auto_apply: true
- pattern: "agent X exceeds tokens"
confidence: 0.9
action: increase_context_compaction_aggressiveness
auto_apply: true
- pattern: "schema validation fails on field F"
confidence: 0.85
action: re_prompt_with_explicit_schema
auto_apply: true
Confidence threshold for auto-apply: 0.85. Below threshold β suggest to user, wait for approval.
π INCIDENT MANAGEMENT¶
Triggered by HARD STOPs or persistent failures.
incident:
id: auto-uuid
severity: [LOW | MEDIUM | HIGH | CRITICAL]
triggered_at: timestamp
triggered_by: agent_id
classification: [error_class]
context:
project_id: ...
story_id: ...
trace_id: ...
state_snapshot_ref: artifact://snapshots/...
diagnosis:
auto_rca: true
rca_agent: auditor-haiku
root_cause: ...
contributing_factors: [...]
remediation:
applied: [...]
blocked_until: ...
user_action_required: bool
postmortem:
auto_generated: true
lessons_learned: [...]
pattern_added_to_failure_modes: bool
adr_generated: optional
Incidents added to FAILURE-MODES.md for pattern mining.
π ROLLBACK STRATEGY¶
Every state change must have rollback:
| Operation | Rollback |
|---|---|
| File write | Restore from artifact store (versioned) |
| DB insert | DELETE WHERE idempotency_key = X |
| DB update | UPDATE SET old_values WHERE id = X |
| External API call | Compensating call (e.g., refund, cancel) |
| Skill install | npx skills remove |
| Config change | Revert to previous committed version |
Rollback triggered automatically on: - Hard stop within transaction - Constitutional violation discovered post-action - User explicit revert
π§ͺ ERROR REPRODUCTION¶
Every error must be reproducible:
- Inputs hashed and stored
- Random seeds logged (if any)
- External state snapshot (timestamp + version)
- Environment captured (config, env vars hash)
Re-execution flag: --replay-from-incident=<id>.
Replays run in shadow mode (no real side effects).
π ERROR METRICS¶
Tracked per-agent, per-skill:
error_rate_total(counter, by class)error_recovery_rate(auto-fixed / total)incident_count(counter, by severity)mttr_seconds(mean time to remediate)mttd_seconds(mean time to detect)
Goals (per OBSERVABILITY-POLICY SLOs): - Layer 1-2: error rate < 0.5% - Layer 3-4: error rate < 2% - Auto-recovery rate > 80% - MTTR < 5 minutes for HIGH severity