Skip to content

OBSERVABILITY POLICY

Status: IMMUTABLE. Loaded by every agent. Version: 1.0.0 Decision rank: #5


🧠 PRINCIPLE

"If it happened, it's logged. If it's not logged, it didn't happen."

Every action by every agent emits structured telemetry. Non-negotiable.


πŸ“Š THREE PILLARS

1. METRICS (Prometheus / OpenTelemetry)

Per-agent (cardinality: agent_id + skill_id + layer):

Metric Type Unit
agent_invocations_total counter count
agent_invocation_duration_ms histogram ms
agent_tokens_input_total counter tokens
agent_tokens_output_total counter tokens
agent_cost_dollars_total counter USD
agent_success_total counter count
agent_failure_total counter count
agent_quality_score gauge 0-1
agent_circuit_breaker_state gauge 0/1/2 (closed/open/half)
cache_hit_total counter count (cache_type label)

Per-skill:

Metric Type
skill_invocations_total counter
skill_avg_latency_ms gauge
skill_avg_cost_dollars gauge
skill_quality_score_avg gauge
skill_deprecation_status gauge (0=ok, 1=suspect, 2=broken)

Per-orchestrator:

Metric Type
routing_decisions_total counter (target_agent label)
routing_fallback_total counter
budget_remaining_dollars gauge (project_id)
stories_in_progress gauge

2. TRACES (OpenTelemetry)

Every invocation creates a span. Spans propagate via context capsule.

Trace: user_request_xyz
β”œβ”€ Span: global_orchestrator.parse_intent (5ms)
β”œβ”€ Span: global_orchestrator.route (3ms)
β”œβ”€ Span: domain_orch_frontend.plan (120ms)
β”‚  └─ Span: task_orch.execute_story (1200ms)
β”‚     β”œβ”€ Span: specialist_frontend.execute (800ms)
β”‚     β”‚  β”œβ”€ Span: skill.shadcn (300ms)
β”‚     β”‚  └─ Span: skill.tailwind (200ms)
β”‚     β”œβ”€ Span: auditor.validate (50ms)
β”‚     └─ Span: artifact_store.write (30ms)
└─ Span: global_orchestrator.aggregate (10ms)

Required span attributes: - agent.id, agent.layer, agent.version - skill.id (if applicable), skill.version - tokens.input, tokens.output, cost.dollars - idempotency_key - parent_span_id, trace_id - outcome (success/failure/partial)

3. LOGS (structured JSON, Postgres + ELK)

Every action emits a log event.

Mandatory fields:

{
  "timestamp": "2026-05-18T14:32:11.234Z",
  "trace_id": "abc123",
  "span_id": "def456",
  "agent_id": "frontend-specialist",
  "agent_layer": 4,
  "agent_version": "1.2.3",
  "action": "invoked_skill",
  "skill_id": "shadcn-ui",
  "inputs_hash": "sha256:...",
  "outputs_hash": "sha256:...",
  "why": "task requires UI component and registry shows shadcn as top-quality option",
  "duration_ms": 1240,
  "tokens_input": 2400,
  "tokens_output": 890,
  "cost_dollars": 0.0072,
  "outcome": "success",
  "quality_score": 0.92,
  "confidence": 0.95
}

Log levels: - DEBUG β€” verbose, only when env=dev - INFO β€” every action (default) - WARN β€” anomaly detected, not blocking - ERROR β€” failure that triggered retry - CRITICAL β€” failure that escalated to user


πŸ—„οΈ RETENTION

Tier Hot Warm Cold Forever
Metrics 30d (full res) 90d (1m res) 1y (1h res) aggregate
Traces 7d 30d (sampled 10%) 90d (sampled 1%) β€”
Logs (INFO+) 7d (full) 30d (compressed) 1y (Glacier) β€”
Logs (ERROR+) 30d (full) 1y (compressed) indefinite β€”
Audit 30d (full) 1y (compressed) indefinite (Glacier) yes

🎯 SLOs (Service Level Objectives)

Per-layer targets:

Metric Layer 1 Layer 2 Layer 3 Layer 4 Layer 5
Availability 99.95% 99.9% 99.5% 99% 99%
Latency p95 <100ms <500ms <2s <30s <10s
Error rate <0.1% <0.5% <1% <2% <2%
Quality score avg n/a >0.85 >0.85 >0.85 >0.80
Cost per invocation <$0.001 <$0.01 <$0.10 <$1.00 <$0.10

SLO breach β†’ alert β†’ potential rollback to last known good version.


🚨 ALERTING

Alerts emitted to: - Slack channel #mult-agentes-alerts (immediate) - PagerDuty (if severity CRITICAL) - Audit log (always)

Severity tiers: - INFO β€” informational, no action needed - WARN β€” investigate within 24h - HIGH β€” investigate within 1h - CRITICAL β€” page immediately

Alert categories: - SLO breach (any pillar) - Budget exhaustion (>80% used) - Quality score drop (>20% week-over-week) - Skill deprecation detected - Constitutional violation - Anomaly (statistical 3Οƒ deviation) - Security event (see SECURITY-CONSTITUTION.md)


πŸ”¬ EXPLAINABILITY

Every action MUST be answerable to:

"Why did agent X do Y at time Z?"

via single query against audit log returning: - agent_id, action, inputs, skill_used, why, outcome, cost

Target: 95% of queries answerable in <50ms (single Postgres index).


πŸ§ͺ OBSERVABILITY OF OBSERVABILITY

Meta-monitoring (so we know logging itself works): - Heartbeat metric observability_health_score per layer - Lost-event detector (gap analysis on trace_ids) - Cost-of-observability tracked separately (must stay <5% of total framework cost)