OBSERVABILITY POLICY¶

Status: IMMUTABLE. Loaded by every agent. Version: 1.0.0 Decision rank: #5

🧠 PRINCIPLE¶

"If it happened, it's logged. If it's not logged, it didn't happen."

Every action by every agent emits structured telemetry. Non-negotiable.

📊 THREE PILLARS¶

1. METRICS (Prometheus / OpenTelemetry)¶

Per-agent (cardinality: agent_id + skill_id + layer):

Metric	Type	Unit
`agent_invocations_total`	counter	count
`agent_invocation_duration_ms`	histogram	ms
`agent_tokens_input_total`	counter	tokens
`agent_tokens_output_total`	counter	tokens
`agent_cost_dollars_total`	counter	USD
`agent_success_total`	counter	count
`agent_failure_total`	counter	count
`agent_quality_score`	gauge	0-1
`agent_circuit_breaker_state`	gauge	0/1/2 (closed/open/half)
`cache_hit_total`	counter	count (cache_type label)

Per-skill:

Metric	Type
`skill_invocations_total`	counter
`skill_avg_latency_ms`	gauge
`skill_avg_cost_dollars`	gauge
`skill_quality_score_avg`	gauge
`skill_deprecation_status`	gauge (0=ok, 1=suspect, 2=broken)

Per-orchestrator:

Metric	Type
`routing_decisions_total`	counter (target_agent label)
`routing_fallback_total`	counter
`budget_remaining_dollars`	gauge (project_id)
`stories_in_progress`	gauge

2. TRACES (OpenTelemetry)¶

Every invocation creates a span. Spans propagate via context capsule.

Trace: user_request_xyz
├─ Span: global_orchestrator.parse_intent (5ms)
├─ Span: global_orchestrator.route (3ms)
├─ Span: domain_orch_frontend.plan (120ms)
│  └─ Span: task_orch.execute_story (1200ms)
│     ├─ Span: specialist_frontend.execute (800ms)
│     │  ├─ Span: skill.shadcn (300ms)
│     │  └─ Span: skill.tailwind (200ms)
│     ├─ Span: auditor.validate (50ms)
│     └─ Span: artifact_store.write (30ms)
└─ Span: global_orchestrator.aggregate (10ms)

Required span attributes: - agent.id, agent.layer, agent.version - skill.id (if applicable), skill.version - tokens.input, tokens.output, cost.dollars - idempotency_key - parent_span_id, trace_id - outcome (success/failure/partial)

3. LOGS (structured JSON, Postgres + ELK)¶

Every action emits a log event.

Mandatory fields:

{
  "timestamp": "2026-05-18T14:32:11.234Z",
  "trace_id": "abc123",
  "span_id": "def456",
  "agent_id": "frontend-specialist",
  "agent_layer": 4,
  "agent_version": "1.2.3",
  "action": "invoked_skill",
  "skill_id": "shadcn-ui",
  "inputs_hash": "sha256:...",
  "outputs_hash": "sha256:...",
  "why": "task requires UI component and registry shows shadcn as top-quality option",
  "duration_ms": 1240,
  "tokens_input": 2400,
  "tokens_output": 890,
  "cost_dollars": 0.0072,
  "outcome": "success",
  "quality_score": 0.92,
  "confidence": 0.95
}

Log levels: - DEBUG — verbose, only when env=dev - INFO — every action (default) - WARN — anomaly detected, not blocking - ERROR — failure that triggered retry - CRITICAL — failure that escalated to user

🗄️ RETENTION¶

Tier	Hot	Warm	Cold	Forever
Metrics	30d (full res)	90d (1m res)	1y (1h res)	aggregate
Traces	7d	30d (sampled 10%)	90d (sampled 1%)	—
Logs (INFO+)	7d (full)	30d (compressed)	1y (Glacier)	—
Logs (ERROR+)	30d (full)	1y (compressed)	indefinite	—
Audit	30d (full)	1y (compressed)	indefinite (Glacier)	yes

🎯 SLOs (Service Level Objectives)¶

Per-layer targets:

Metric	Layer 1	Layer 2	Layer 3	Layer 4	Layer 5
Availability	99.95%	99.9%	99.5%	99%	99%
Latency p95	<100ms	<500ms	<2s	<30s	<10s
Error rate	<0.1%	<0.5%	<1%	<2%	<2%
Quality score avg	n/a	>0.85	>0.85	>0.85	>0.80
Cost per invocation	<$0.001	<$0.01	<$0.10	<$1.00	<$0.10

SLO breach → alert → potential rollback to last known good version.

🚨 ALERTING¶

Alerts emitted to: - Slack channel #mult-agentes-alerts (immediate) - PagerDuty (if severity CRITICAL) - Audit log (always)

Severity tiers: - INFO — informational, no action needed - WARN — investigate within 24h - HIGH — investigate within 1h - CRITICAL — page immediately

Alert categories: - SLO breach (any pillar) - Budget exhaustion (>80% used) - Quality score drop (>20% week-over-week) - Skill deprecation detected - Constitutional violation - Anomaly (statistical 3σ deviation) - Security event (see SECURITY-CONSTITUTION.md)

🔬 EXPLAINABILITY¶

Every action MUST be answerable to:

"Why did agent X do Y at time Z?"

via single query against audit log returning: - agent_id, action, inputs, skill_used, why, outcome, cost

Target: 95% of queries answerable in <50ms (single Postgres index).

🧪 OBSERVABILITY OF OBSERVABILITY¶

Meta-monitoring (so we know logging itself works): - Heartbeat metric observability_health_score per layer - Lost-event detector (gap analysis on trace_ids) - Cost-of-observability tracked separately (must stay <5% of total framework cost)