ERROR HANDLING POLICY¶

Status: IMMUTABLE. Loaded by every agent. Version: 1.0.0 Decision rank: #4

🎯 PRINCIPLE¶

"Errors are first-class citizens. Treat them with the same rigor as successful outputs."

Every failure is logged, classified, attributed, and either auto-remediated or escalated. Silent failure is the worst failure.

🏷️ ERROR TAXONOMY¶

All errors classified into one of:

Class	Examples	Default Handler
TRANSIENT	Network timeout, rate limit, 503	Retry with backoff
DETERMINISTIC	Schema invalid, capability missing, bad input	No retry; reroute or fail
RESOURCE	OOM, disk full, token budget exhausted	Halt + escalate
CONSTITUTIONAL	Hard constraint violated	Hard stop + incident
SECURITY	Auth fail, secret leak attempt, injection	Hard stop + security alert
UNKNOWN	Uncategorized	Treat as DETERMINISTIC; log for triage

🔁 RETRY POLICY¶

TRANSIENT errors only:¶

retry:
  max_attempts: 3
  backoff: exponential
  base_delay_ms: 1000
  max_delay_ms: 30000
  jitter: 0.3   # 30% random jitter to avoid thundering herd
  retry_on_status: [408, 429, 500, 502, 503, 504]
  retry_on_exception: [TimeoutError, ConnectionError]

After 3 attempts → escalate to circuit breaker.

DETERMINISTIC errors:¶

Never retry. Reroute to alternative agent/skill or fail story.

⚡ CIRCUIT BREAKERS¶

Per-agent and per-skill circuit breakers.

States: - CLOSED — normal operation - HALF-OPEN — testing if recovered (single probe) - OPEN — failing, rerouting to backup

Triggers:

breaker:
  failure_threshold: 5         # 5 failures in window
  failure_window_seconds: 60
  recovery_timeout_seconds: 30
  half_open_probe_count: 1

When breaker OPEN: 1. Mark agent/skill temporarily_unavailable 2. Reroute via fallback chain 3. Emit alert 4. After recovery_timeout → HALF-OPEN probe 5. Probe success → CLOSED; failure → reset timer

🪂 FALLBACK CHAINS¶

Defined per agent in _agents-registry.yaml:

agent_id: frontend-specialist
fallbacks:
  primary: frontend-specialist
  fallback_1: frontend-specialist-fallback   # same role, different prompt
  fallback_2: generalist-engineer            # cheaper, less specialized
  fallback_3: human_escalation               # last resort

Fallback chain executed in order. Each step logs reason for fallback.

🛑 HARD STOPS (no recovery, immediate halt)¶

The following errors halt execution AND escalate:

CONSTITUTIONAL violation — incident report mandatory
SECURITY breach attempt — security team notified
Budget hard-stop (cost > 95% project budget)
Recursion depth exceeded (>5 levels)
Mutual deadlock detected (2+ agents waiting on each other)
Story timeout (>30min)
Output validation persistent failure (3 consecutive)

On hard stop: - Snapshot full state - Log incident - Notify user - Wait for user intervention - Never auto-resume hard-stopped tasks

🩹 AUTO-REMEDIATION¶

For known failure patterns (catalogued in FAILURE-MODES.md):

remediation_rules:
  - pattern: "skill X returns empty result for input Y"
    confidence: 0.95
    action: reroute_to_skill_Z
    auto_apply: true

  - pattern: "agent X exceeds tokens"
    confidence: 0.9
    action: increase_context_compaction_aggressiveness
    auto_apply: true

  - pattern: "schema validation fails on field F"
    confidence: 0.85
    action: re_prompt_with_explicit_schema
    auto_apply: true

Confidence threshold for auto-apply: 0.85. Below threshold → suggest to user, wait for approval.

📋 INCIDENT MANAGEMENT¶

Triggered by HARD STOPs or persistent failures.

incident:
  id: auto-uuid
  severity: [LOW | MEDIUM | HIGH | CRITICAL]
  triggered_at: timestamp
  triggered_by: agent_id
  classification: [error_class]

  context:
    project_id: ...
    story_id: ...
    trace_id: ...
    state_snapshot_ref: artifact://snapshots/...

  diagnosis:
    auto_rca: true
    rca_agent: auditor-haiku
    root_cause: ...
    contributing_factors: [...]

  remediation:
    applied: [...]
    blocked_until: ...
    user_action_required: bool

  postmortem:
    auto_generated: true
    lessons_learned: [...]
    pattern_added_to_failure_modes: bool
    adr_generated: optional

Incidents added to FAILURE-MODES.md for pattern mining.

🔄 ROLLBACK STRATEGY¶

Every state change must have rollback:

Operation	Rollback
File write	Restore from artifact store (versioned)
DB insert	DELETE WHERE idempotency_key = X
DB update	UPDATE SET old_values WHERE id = X
External API call	Compensating call (e.g., refund, cancel)
Skill install	`npx skills remove`
Config change	Revert to previous committed version

Rollback triggered automatically on: - Hard stop within transaction - Constitutional violation discovered post-action - User explicit revert

🧪 ERROR REPRODUCTION¶

Every error must be reproducible:

Inputs hashed and stored
Random seeds logged (if any)
External state snapshot (timestamp + version)
Environment captured (config, env vars hash)

Re-execution flag: --replay-from-incident=<id>.

Replays run in shadow mode (no real side effects).

📊 ERROR METRICS¶

Tracked per-agent, per-skill:

error_rate_total (counter, by class)
error_recovery_rate (auto-fixed / total)
incident_count (counter, by severity)
mttr_seconds (mean time to remediate)
mttd_seconds (mean time to detect)

Goals (per OBSERVABILITY-POLICY SLOs): - Layer 1-2: error rate < 0.5% - Layer 3-4: error rate < 2% - Auto-recovery rate > 80% - MTTR < 5 minutes for HIGH severity