OBSERVABILITY POLICY
Status: IMMUTABLE. Loaded by every agent. Version: 1.0.0. Decision rank: #5
PRINCIPLE
"If it happened, it's logged. If it's not logged, it didn't happen."
Every action by every agent emits structured telemetry. Non-negotiable.
THREE PILLARS
1. METRICS (Prometheus / OpenTelemetry)
Per-agent (cardinality: agent_id + skill_id + layer):
| Metric | Type | Unit |
|---|---|---|
| agent_invocations_total | counter | count |
| agent_invocation_duration_ms | histogram | ms |
| agent_tokens_input_total | counter | tokens |
| agent_tokens_output_total | counter | tokens |
| agent_cost_dollars_total | counter | USD |
| agent_success_total | counter | count |
| agent_failure_total | counter | count |
| agent_quality_score | gauge | 0-1 |
| agent_circuit_breaker_state | gauge | 0/1/2 (closed/open/half) |
| cache_hit_total | counter | count (cache_type label) |
Per-skill:
| Metric | Type |
|---|---|
| skill_invocations_total | counter |
| skill_avg_latency_ms | gauge |
| skill_avg_cost_dollars | gauge |
| skill_quality_score_avg | gauge |
| skill_deprecation_status | gauge (0=ok, 1=suspect, 2=broken) |
Per-orchestrator:
| Metric | Type |
|---|---|
| routing_decisions_total | counter (target_agent label) |
| routing_fallback_total | counter |
| budget_remaining_dollars | gauge (project_id) |
| stories_in_progress | gauge |
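To illustrate the per-agent label set, the sketch below keeps a labelled counter in memory and renders it in the Prometheus text exposition format. It uses only the Python standard library. The `LabelledCounter` class is hypothetical and stands in for whichever metrics client the framework actually uses.

```python
from collections import defaultdict

# Label names taken from the per-agent cardinality rule above.
AGENT_LABELS = ("agent_id", "skill_id", "layer")


class LabelledCounter:
    """Hypothetical in-memory counter; not a real metrics client."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)  # label values -> running total

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters only go up")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Render all samples in the Prometheus text exposition format."""
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


invocations = LabelledCounter("agent_invocations_total", AGENT_LABELS)
invocations.inc(agent_id="frontend-specialist", skill_id="shadcn-ui", layer="4")
invocations.inc(agent_id="frontend-specialist", skill_id="shadcn-ui", layer="4")
print(invocations.render())
```

Rejecting any call that does not supply exactly the three agreed labels is one way to keep cardinality bounded to agent_id + skill_id + layer.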
2. TRACES (OpenTelemetry)
Every invocation creates a span. Spans propagate via the context capsule.
Trace: user_request_xyz
├─ Span: global_orchestrator.parse_intent (5ms)
├─ Span: global_orchestrator.route (3ms)
├─ Span: domain_orch_frontend.plan (120ms)
│  └─ Span: task_orch.execute_story (1200ms)
│     ├─ Span: specialist_frontend.execute (800ms)
│     │  ├─ Span: skill.shadcn (300ms)
│     │  └─ Span: skill.tailwind (200ms)
│     ├─ Span: auditor.validate (50ms)
│     └─ Span: artifact_store.write (30ms)
└─ Span: global_orchestrator.aggregate (10ms)
Required span attributes:
- agent.id, agent.layer, agent.version
- skill.id (if applicable), skill.version
- tokens.input, tokens.output, cost.dollars
- idempotency_key
- parent_span_id, trace_id
- outcome (success/failure/partial)
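The required attributes above can be checked mechanically before a span is exported. The following is a minimal sketch, assuming spans are represented as plain dictionaries. The `child_context` helper, the shape of the context capsule, and the example `idempotency_key` value are hypothetical and not taken from the framework.

```python
import uuid

# Attributes every span must carry, per the list above.
REQUIRED_ATTRIBUTES = (
    "agent.id", "agent.layer", "agent.version",
    "tokens.input", "tokens.output", "cost.dollars",
    "idempotency_key", "parent_span_id", "trace_id", "outcome",
)
# Only required when the span invokes a skill.
SKILL_ATTRIBUTES = ("skill.id", "skill.version")
VALID_OUTCOMES = {"success", "failure", "partial"}


def child_context(parent):
    """Derive the context a child span inherits (hypothetical capsule shape)."""
    return {
        "trace_id": parent["trace_id"],       # unchanged across the whole trace
        "parent_span_id": parent["span_id"],  # links the child to its parent
        "span_id": uuid.uuid4().hex[:16],     # fresh id for the child span
    }


def validate_span(attributes, uses_skill=False):
    """Return a list of problems; an empty list means the span is compliant."""
    required = REQUIRED_ATTRIBUTES + (SKILL_ATTRIBUTES if uses_skill else ())
    problems = [f"missing attribute: {name}" for name in required
                if name not in attributes]
    outcome = attributes.get("outcome")
    if outcome is not None and outcome not in VALID_OUTCOMES:
        problems.append(f"invalid outcome: {outcome}")
    return problems


root = {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16]}
span = {
    **child_context(root),
    "agent.id": "frontend-specialist", "agent.layer": 4, "agent.version": "1.2.3",
    "tokens.input": 2400, "tokens.output": 890, "cost.dollars": 0.0072,
    "idempotency_key": "story-42-attempt-1",  # hypothetical value
    "outcome": "success",
}
print(validate_span(span))                   # no problems
print(validate_span(span, uses_skill=True))  # skill.id and skill.version missing
```

The check only tests that each key is present, so a root span can still satisfy it by carrying `parent_span_id` with a null value.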
3. LOGS (structured JSON, Postgres + ELK)
Every action emits a log event.
Mandatory fields:
{
  "timestamp": "2026-05-18T14:32:11.234Z",
  "trace_id": "abc123",
  "span_id": "def456",
  "agent_id": "frontend-specialist",
  "agent_layer": 4,
  "agent_version": "1.2.3",
  "action": "invoked_skill",
  "skill_id": "shadcn-ui",
  "inputs_hash": "sha256:...",
  "outputs_hash": "sha256:...",
  "why": "task requires UI component and registry shows shadcn as top-quality option",
  "duration_ms": 1240,
  "tokens_input": 2400,
  "tokens_output": 890,
  "cost_dollars": 0.0072,
  "outcome": "success",
  "quality_score": 0.92,
  "confidence": 0.95
}
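One way to produce an event of this shape is sketched below using only the Python standard library. The `build_log_event` signature, the grouping of fields into `context`, `usage`, and `result`, and the example input and output payloads are assumptions made for illustration. The policy itself only fixes the field names.

```python
import hashlib
import json
from datetime import datetime, timezone

MANDATORY_FIELDS = (
    "timestamp", "trace_id", "span_id", "agent_id", "agent_layer",
    "agent_version", "action", "skill_id", "inputs_hash", "outputs_hash",
    "why", "duration_ms", "tokens_input", "tokens_output", "cost_dollars",
    "outcome", "quality_score", "confidence",
)


def content_hash(payload):
    """Hash a JSON-serialisable payload; keys are sorted so equal payloads match."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_log_event(context, action, skill_id, inputs, outputs, why, usage, result):
    """Return one log line as a JSON string, or raise if a field is missing."""
    now = datetime.now(timezone.utc)
    event = {
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        **context,  # trace_id, span_id, agent_id, agent_layer, agent_version
        "action": action,
        "skill_id": skill_id,
        "inputs_hash": content_hash(inputs),
        "outputs_hash": content_hash(outputs),
        "why": why,
        **usage,    # duration_ms, tokens_input, tokens_output, cost_dollars
        **result,   # outcome, quality_score, confidence
    }
    missing = [name for name in MANDATORY_FIELDS if name not in event]
    if missing:
        raise ValueError(f"log event is missing mandatory fields: {missing}")
    return json.dumps(event)


line = build_log_event(
    context={"trace_id": "abc123", "span_id": "def456",
             "agent_id": "frontend-specialist", "agent_layer": 4,
             "agent_version": "1.2.3"},
    action="invoked_skill",
    skill_id="shadcn-ui",
    inputs={"component": "button"},     # hypothetical payload
    outputs={"files": ["button.tsx"]},  # hypothetical payload
    why="task requires UI component and registry shows shadcn as top-quality option",
    usage={"duration_ms": 1240, "tokens_input": 2400,
           "tokens_output": 890, "cost_dollars": 0.0072},
    result={"outcome": "success", "quality_score": 0.92, "confidence": 0.95},
)
print(line)
```

Hashing a canonical JSON form rather than storing raw inputs keeps the log line small while still letting two events be compared for identical inputs.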
Log levels:
- DEBUG: verbose, only when env=dev
- INFO: every action (default)
- WARN: anomaly detected, not blocking
- ERROR: failure that triggered retry
- CRITICAL: failure that escalated to user
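The level rules above reduce to two small decisions, sketched here. The function names are hypothetical, and any environment name other than `dev` (such as `prod` in the usage lines) is an assumption.

```python
LEVELS = ("DEBUG", "INFO", "WARN", "ERROR", "CRITICAL")


def should_emit(level, env):
    """DEBUG is emitted only when env is dev; every other level is always emitted."""
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {level}")
    return level != "DEBUG" or env == "dev"


def failure_level(escalated_to_user):
    """A failure that was retried logs ERROR; one escalated to the user logs CRITICAL."""
    return "CRITICAL" if escalated_to_user else "ERROR"


print(should_emit("DEBUG", "prod"))  # DEBUG is suppressed outside dev
print(should_emit("INFO", "prod"))   # INFO is the default and always emitted
print(failure_level(escalated_to_user=True))
```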
RETENTION
| Data type | Hot | Warm | Cold | Forever |
|---|---|---|---|---|
| Metrics | 30d (full res) | 90d (1m res) | 1y (1h res) | aggregate |
| Traces | 7d | 30d (sampled 10%) | 90d (sampled 1%) | - |
| Logs (INFO+) | 7d (full) | 30d (compressed) | 1y (Glacier) | - |
| Logs (ERROR+) | 30d (full) | 1y (compressed) | indefinite | - |
| Audit | 30d (full) | 1y (compressed) | indefinite (Glacier) | yes |
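The retention table can be read as a lookup from record age to storage tier. The sketch below assumes the durations are cumulative ages (a trace is hot until day 7, warm until day 30, cold until day 90) and that 1y means 365 days. The table states neither, so treat both as assumptions. The signal keys and the `expired` label are hypothetical names.

```python
import math

# Upper age bound in days per tier, read from the table above.
# ASSUMPTION: durations are cumulative ages, and 1y = 365 days.
RETENTION_DAYS = {
    "metrics":    {"hot": 30, "warm": 90,  "cold": 365},
    "traces":     {"hot": 7,  "warm": 30,  "cold": 90},
    "logs_info":  {"hot": 7,  "warm": 30,  "cold": 365},
    "logs_error": {"hot": 30, "warm": 365, "cold": math.inf},
    "audit":      {"hot": 30, "warm": 365, "cold": math.inf},
}


def retention_tier(signal, age_days):
    """Return the tier that holds a record of the given age."""
    for tier, limit in RETENTION_DAYS[signal].items():
        if age_days <= limit:
            return tier
    # Past the cold tier: metrics survive only as aggregates, the rest are dropped.
    return "aggregate" if signal == "metrics" else "expired"


print(retention_tier("traces", 10))    # warm
print(retention_tier("traces", 100))   # expired
print(retention_tier("metrics", 400))  # aggregate
print(retention_tier("audit", 5000))   # cold
```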
SLOs (Service Level Objectives)
Per-layer targets:
| Metric | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 |
|---|---|---|---|---|---|
| Availability | 99.95% | 99.9% | 99.5% | 99% | 99% |
| Latency p95 | <100ms | <500ms | <2s | <30s | <10s |
| Error rate | <0.1% | <0.5% | <1% | <2% | <2% |
| Quality score avg | n/a | >0.85 | >0.85 | >0.85 | >0.80 |
| Cost per invocation | <$0.001 | <$0.01 | <$0.10 | <$1.00 | <$0.10 |
SLO breach → alert → potential rollback to last known good version.
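A breach check against the per-layer targets might look like the sketch below. The `SloTarget` type and `slo_breaches` function are hypothetical. The thresholds are copied from the table, with percentages expressed as fractions and latencies in seconds. Treating availability as a lower bound and the other limits as strict upper bounds is an interpretation of the table's `<` and `>` signs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SloTarget:
    availability_min: float       # fraction, e.g. 0.999 for 99.9%
    latency_p95_max_s: float      # seconds
    error_rate_max: float         # fraction
    quality_min: Optional[float]  # None where the table says n/a
    cost_max_usd: float           # per invocation


SLO_TARGETS = {
    1: SloTarget(0.9995, 0.1, 0.001, None, 0.001),
    2: SloTarget(0.999, 0.5, 0.005, 0.85, 0.01),
    3: SloTarget(0.995, 2.0, 0.01, 0.85, 0.10),
    4: SloTarget(0.99, 30.0, 0.02, 0.85, 1.00),
    5: SloTarget(0.99, 10.0, 0.02, 0.80, 0.10),
}


def slo_breaches(layer, availability, latency_p95_s, error_rate, quality, cost_usd):
    """Return the names of the SLOs that the measured values violate."""
    target = SLO_TARGETS[layer]
    breaches = []
    if availability < target.availability_min:
        breaches.append("availability")
    if latency_p95_s >= target.latency_p95_max_s:
        breaches.append("latency_p95")
    if error_rate >= target.error_rate_max:
        breaches.append("error_rate")
    if target.quality_min is not None and quality is not None:
        if quality <= target.quality_min:
            breaches.append("quality_score")
    if cost_usd >= target.cost_max_usd:
        breaches.append("cost_per_invocation")
    return breaches


# A Layer 4 agent with a 3% error rate breaches only the error-rate SLO.
print(slo_breaches(4, availability=0.995, latency_p95_s=12.0,
                   error_rate=0.03, quality=0.90, cost_usd=0.40))
```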
ALERTING
Alerts emitted to:
- Slack channel #mult-agentes-alerts (immediate)
- PagerDuty (if severity CRITICAL)
- Audit log (always)
Severity tiers:
- INFO: informational, no action needed
- WARN: investigate within 24h
- HIGH: investigate within 1h
- CRITICAL: page immediately
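The destinations and severity tiers above combine into a simple routing rule, sketched below. The destination strings and the function name are hypothetical. The sketch also assumes that Slack receives every alert regardless of severity, which the list implies but does not state outright.

```python
SEVERITIES = ("INFO", "WARN", "HIGH", "CRITICAL")

# Hours allowed before someone must look at the alert; None means no action needed.
RESPONSE_WINDOW_HOURS = {"INFO": None, "WARN": 24, "HIGH": 1, "CRITICAL": 0}


def alert_destinations(severity):
    """Return where an alert of the given severity is sent."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    # ASSUMPTION: Slack and the audit log receive every alert.
    destinations = ["slack:#mult-agentes-alerts", "audit_log"]
    if severity == "CRITICAL":
        destinations.append("pagerduty")
    return destinations


print(alert_destinations("WARN"))
print(alert_destinations("CRITICAL"))
```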
Alert categories:
- SLO breach (any pillar)
- Budget exhaustion (>80% used)
- Quality score drop (>20% week-over-week)
- Skill deprecation detected
- Constitutional violation
- Anomaly (statistical 3σ deviation)
- Security event (see SECURITY-CONSTITUTION.md)
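Three of the categories above carry numeric thresholds and can be sketched directly. The function names are hypothetical. The quality drop is interpreted here as a relative change against the previous week, and the anomaly rule as a three-standard-deviation test against a recent history window. Both are assumptions, since the policy does not define either calculation.

```python
import statistics


def budget_exhausted(spent_usd, budget_usd, threshold=0.80):
    """True once more than 80% of the budget has been used."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    return spent_usd / budget_usd > threshold


def quality_dropped(last_week_avg, this_week_avg, max_drop=0.20):
    """True when quality fell by more than 20% relative to last week."""
    if last_week_avg <= 0:
        return False  # no meaningful baseline to compare against
    return (last_week_avg - this_week_avg) / last_week_avg > max_drop


def is_anomaly(history, value, sigmas=3.0):
    """True when value lies more than `sigmas` standard deviations from the mean."""
    if len(history) < 2:
        return False  # a standard deviation needs at least two samples
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) > sigmas * stdev


latencies_ms = [100, 102, 98, 101, 99]
print(budget_exhausted(spent_usd=85, budget_usd=100))
print(quality_dropped(last_week_avg=0.90, this_week_avg=0.70))
print(is_anomaly(latencies_ms, 130))
```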
EXPLAINABILITY
For every action, the following question MUST be answerable:
"Why did agent X do Y at time Z?"
with a single query against the audit log that returns:
- agent_id, action, inputs, skill_used, why, outcome, cost
Target: 95% of queries answered in <50ms (served by a single Postgres index).
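The single-query requirement can be illustrated with a composite index on agent, action, and time. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Postgres so that it runs without a server. The table name, column names, and index name are hypothetical, `inputs_hash` stands in for `inputs`, and the row is the sample log event from the LOGS section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE audit_log (
    timestamp    TEXT NOT NULL,  -- ISO-8601 UTC, so string order is time order
    agent_id     TEXT NOT NULL,
    action       TEXT NOT NULL,
    inputs_hash  TEXT,
    skill_id     TEXT,
    why          TEXT NOT NULL,
    outcome      TEXT NOT NULL,
    cost_dollars REAL
);
-- One composite index serves the whole "agent X, action Y, time Z" lookup.
CREATE INDEX idx_audit_agent_action_time
    ON audit_log (agent_id, action, timestamp);
""")
conn.execute(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("2026-05-18T14:32:11.234Z", "frontend-specialist", "invoked_skill",
     "sha256:...", "shadcn-ui",
     "task requires UI component and registry shows shadcn as top-quality option",
     "success", 0.0072),
)


def why_did(conn, agent_id, action, at):
    """Return the most recent matching action at or before `at`, or None."""
    row = conn.execute(
        "SELECT agent_id, action, inputs_hash, skill_id, why, outcome, cost_dollars "
        "FROM audit_log "
        "WHERE agent_id = ? AND action = ? AND timestamp <= ? "
        "ORDER BY timestamp DESC LIMIT 1",
        (agent_id, action, at),
    ).fetchone()
    return dict(row) if row is not None else None


print(why_did(conn, "frontend-specialist", "invoked_skill",
              "2026-05-18T14:32:12.000Z"))
```

Putting the two equality columns first and the timestamp last lets the database satisfy both the filter and the ordering from the index alone.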
OBSERVABILITY OF OBSERVABILITY
Meta-monitoring (so we know logging itself works):
- Heartbeat metric observability_health_score per layer
- Lost-event detector (gap analysis on trace_ids)
- Cost-of-observability tracked separately (must stay <5% of total framework cost)
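The meta-monitoring checks above can be sketched as follows. The policy does not define how `observability_health_score` is computed or what gap analysis compares, so everything here is an assumption: the lost-event detector compares trace IDs that were started against trace IDs that reached the log store, and the health score is the fraction that arrived.

```python
def lost_trace_ids(started, logged):
    """Trace IDs that were started but never reached the log store."""
    return set(started) - set(logged)


def observability_health_score(started, logged):
    """Hypothetical score: fraction of started traces that were logged (0 to 1)."""
    started = set(started)
    if not started:
        return 1.0  # nothing was expected, so nothing was lost
    return 1.0 - len(lost_trace_ids(started, logged)) / len(started)


def observability_cost_within_limit(observability_cost_usd, total_cost_usd, limit=0.05):
    """True while observability stays under 5% of total framework cost."""
    if total_cost_usd <= 0:
        raise ValueError("total cost must be positive")
    return observability_cost_usd / total_cost_usd < limit


started = {"t1", "t2", "t3", "t4"}
logged = {"t1", "t2", "t4"}
print(lost_trace_ids(started, logged))              # {'t3'}
print(observability_health_score(started, logged))  # 0.75
print(observability_cost_within_limit(4.0, 100.0))
```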