Skip to content

MODEL-MANAGEMENT — LLM Versioning + Upgrade Policy

Status: live · Version: 1.0.0 · Camada: 9

Purpose

How we version, pin, evaluate, and upgrade LLMs across the agent fleet. LLM choice is consequential — wrong upgrade can regress quality silently.

Model registry (current)

Agent layer Model (current) Notes
L1 Cortex haiku-4-5 cheap classifier
L2 Domain orchs sonnet-4-6 reasoning over decomposition
L3 Task orchs sonnet-4-6 per-story planning
L4 Specialists sonnet-4-6 default per-agent override allowed
L5 Workers haiku-4-5 atomic tasks
planner-opus opus-4-7 sparing use
auditor-haiku haiku-4-5 universal reviewer

Upgrade triggers

Trigger Action
Provider releases new minor (4.6 → 4.7) A/B eval on representative tasks; promote if delta-acc ≥ +1% AND delta-cost ≤ +30%
Provider releases major (4.x → 5.x) RFC + ADR; full eval suite
Pricing change re-evaluate cost-per-task; may downgrade fleet
Quality regression detected rollback to previous pin

Pinning

Each agent's runtime.model field in AGENT-MANIFEST is the source of truth. Changing it requires: - ADR for major version - DECISIONS log for minor - Cost projection update in BUDGET.md - 24h observation before promotion to all agents in layer

Evaluation suite

Per-layer eval task set (when implemented in Wave 4): - L1: intent classification accuracy on 500 samples - L2: routing decision quality - L4: skill outputs reviewed by planner-opus - Workers: deterministic task pass-rate

Rollback

1. Quality regression flagged (success_rate < threshold, or postmortem identifies cause)
2. Manifest `runtime.model` reverts to last_known_good
3. CHANGELOG entry; ADR if it was an active choice
4. Investigation: was it a model issue or our prompt?

Multi-provider fallback (planned)

Primary: Anthropic. Fallback: OpenAI / Vertex (when configured).

Switch triggers: - Anthropic outage > 5min - Anthropic rate-limit response - Cost spike beyond threshold (auto-throttle to cheaper)

Fallback is per-capsule decision via cortex; logged in audit.

Cross-references

  • AGENT-MANIFEST (C4 template)
  • BUDGET (C1 template) — cost projections
  • COST-THERMOSTAT (C15 planned) — automated cost-based fallback
  • MODEL-THROTTLE-PROTOCOL (C15 planned)
  • TECHNOLOGY-RADAR (C16 planned) — scans new model releases