Skip to content

ERROR HANDLING POLICY

Status: IMMUTABLE. Loaded by every agent. Version: 1.0.0 Decision rank: #4


🎯 PRINCIPLE

"Errors are first-class citizens. Treat them with the same rigor as successful outputs."

Every failure is logged, classified, attributed, and either auto-remediated or escalated. Silent failure is the worst failure.


🏷️ ERROR TAXONOMY

All errors classified into one of:

Class Examples Default Handler
TRANSIENT Network timeout, rate limit, 503 Retry with backoff
DETERMINISTIC Schema invalid, capability missing, bad input No retry; reroute or fail
RESOURCE OOM, disk full, token budget exhausted Halt + escalate
CONSTITUTIONAL Hard constraint violated Hard stop + incident
SECURITY Auth fail, secret leak attempt, injection Hard stop + security alert
UNKNOWN Uncategorized Treat as DETERMINISTIC; log for triage

πŸ” RETRY POLICY

TRANSIENT errors only:

retry:
  max_attempts: 3
  backoff: exponential
  base_delay_ms: 1000
  max_delay_ms: 30000
  jitter: 0.3   # 30% random jitter to avoid thundering herd
  retry_on_status: [408, 429, 500, 502, 503, 504]
  retry_on_exception: [TimeoutError, ConnectionError]

After 3 attempts β†’ escalate to circuit breaker.

DETERMINISTIC errors:

Never retry. Reroute to alternative agent/skill or fail story.


⚑ CIRCUIT BREAKERS

Per-agent and per-skill circuit breakers.

States: - CLOSED β€” normal operation - HALF-OPEN β€” testing if recovered (single probe) - OPEN β€” failing, rerouting to backup

Triggers:

breaker:
  failure_threshold: 5         # 5 failures in window
  failure_window_seconds: 60
  recovery_timeout_seconds: 30
  half_open_probe_count: 1

When breaker OPEN: 1. Mark agent/skill temporarily_unavailable 2. Reroute via fallback chain 3. Emit alert 4. After recovery_timeout β†’ HALF-OPEN probe 5. Probe success β†’ CLOSED; failure β†’ reset timer


πŸͺ‚ FALLBACK CHAINS

Defined per agent in _agents-registry.yaml:

agent_id: frontend-specialist
fallbacks:
  primary: frontend-specialist
  fallback_1: frontend-specialist-fallback   # same role, different prompt
  fallback_2: generalist-engineer            # cheaper, less specialized
  fallback_3: human_escalation               # last resort

Fallback chain executed in order. Each step logs reason for fallback.


πŸ›‘ HARD STOPS (no recovery, immediate halt)

The following errors halt execution AND escalate:

  1. CONSTITUTIONAL violation β€” incident report mandatory
  2. SECURITY breach attempt β€” security team notified
  3. Budget hard-stop (cost > 95% project budget)
  4. Recursion depth exceeded (>5 levels)
  5. Mutual deadlock detected (2+ agents waiting on each other)
  6. Story timeout (>30min)
  7. Output validation persistent failure (3 consecutive)

On hard stop: - Snapshot full state - Log incident - Notify user - Wait for user intervention - Never auto-resume hard-stopped tasks


🩹 AUTO-REMEDIATION

For known failure patterns (catalogued in FAILURE-MODES.md):

remediation_rules:
  - pattern: "skill X returns empty result for input Y"
    confidence: 0.95
    action: reroute_to_skill_Z
    auto_apply: true

  - pattern: "agent X exceeds tokens"
    confidence: 0.9
    action: increase_context_compaction_aggressiveness
    auto_apply: true

  - pattern: "schema validation fails on field F"
    confidence: 0.85
    action: re_prompt_with_explicit_schema
    auto_apply: true

Confidence threshold for auto-apply: 0.85. Below threshold β†’ suggest to user, wait for approval.


πŸ“‹ INCIDENT MANAGEMENT

Triggered by HARD STOPs or persistent failures.

incident:
  id: auto-uuid
  severity: [LOW | MEDIUM | HIGH | CRITICAL]
  triggered_at: timestamp
  triggered_by: agent_id
  classification: [error_class]

  context:
    project_id: ...
    story_id: ...
    trace_id: ...
    state_snapshot_ref: artifact://snapshots/...

  diagnosis:
    auto_rca: true
    rca_agent: auditor-haiku
    root_cause: ...
    contributing_factors: [...]

  remediation:
    applied: [...]
    blocked_until: ...
    user_action_required: bool

  postmortem:
    auto_generated: true
    lessons_learned: [...]
    pattern_added_to_failure_modes: bool
    adr_generated: optional

Incidents added to FAILURE-MODES.md for pattern mining.


πŸ”„ ROLLBACK STRATEGY

Every state change must have rollback:

Operation Rollback
File write Restore from artifact store (versioned)
DB insert DELETE WHERE idempotency_key = X
DB update UPDATE SET old_values WHERE id = X
External API call Compensating call (e.g., refund, cancel)
Skill install npx skills remove
Config change Revert to previous committed version

Rollback triggered automatically on: - Hard stop within transaction - Constitutional violation discovered post-action - User explicit revert


πŸ§ͺ ERROR REPRODUCTION

Every error must be reproducible:

  • Inputs hashed and stored
  • Random seeds logged (if any)
  • External state snapshot (timestamp + version)
  • Environment captured (config, env vars hash)

Re-execution flag: --replay-from-incident=<id>.

Replays run in shadow mode (no real side effects).


πŸ“Š ERROR METRICS

Tracked per-agent, per-skill:

  • error_rate_total (counter, by class)
  • error_recovery_rate (auto-fixed / total)
  • incident_count (counter, by severity)
  • mttr_seconds (mean time to remediate)
  • mttd_seconds (mean time to detect)

Goals (per OBSERVABILITY-POLICY SLOs): - Layer 1-2: error rate < 0.5% - Layer 3-4: error rate < 2% - Auto-recovery rate > 80% - MTTR < 5 minutes for HIGH severity