AITGP M3.6-Art11 v1.0 Reviewed 2026-04-06 Open Access
M3.6 Capstone: Enterprise Transformation Architecture

Measuring AI Reliability: SLOs, Drift, and Incident MTTR


6 min read Article 11 of 11
Figure 276. The three canonical Reliability metrics: SLO attainment (99.9%, spanning availability, latency, and quality), drift rate (<5% of monitored features drifted per 7-day window), and incident MTTR (2 h Sev-1 mean time to recover).

Why this dimension matters

The trust contract. Every AI service makes an implicit promise to its users: it will be there, it will be fast enough, and the answers will be about as good as they were yesterday. Reliability is the dimension where that promise is either kept or quietly broken. Users forgive occasional failures; they do not forgive silent degradation.

The drift problem is unique to AI. Traditional software fails loudly. AI software degrades silently. A fraud model with a drifting feature distribution produces the same number of decisions at the same latency — and quietly stops catching new fraud patterns. Without drift metrics, the outage is invisible until the business notices the bleed.

Incident response is a cost center unless measured. MTTR converts incident response from “we did our best” into a number that leadership can compare period over period and attribute to investment.

What good looks like

  • Every production AI service has published SLOs for availability, latency, and quality, and an error budget policy tied to them.
  • Drift detectors run on every production feature and on model outputs, with owners and alerts.
  • Incidents have a structured post-mortem process and MTTR is measured and trended.
  • Reliability metrics sit next to the infrastructure SRE dashboards and are reviewed on the same cadence.

Core metrics

Metric 1: SLO attainment

Definition. The percentage of the measurement window during which the service met its stated service-level objectives across three pillars: availability, latency, and quality.

Formula. slo_attainment = (minutes_in_slo / total_minutes) × 100 per SLO, plus a composite “all-SLOs-met” rollup.

Cadence. Continuous; reported weekly and monthly.

Owner. Service owner with SRE.

Three pillars.

  • Availability SLO. Successful responses divided by total requests. Typical target 99.9% for customer-facing, 99.5% for internal.
  • Latency SLO. Percentile latency under a stated threshold (e.g., “p95 response under 2 seconds”). For generative systems, measure both time-to-first-token and total completion time.
  • Quality SLO. For deterministic systems, accuracy on a golden set. For generative systems, a rubric-based quality score on a sampled subset of production traffic rated by an LLM-as-judge scorer calibrated to humans.
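The attainment formula and three-pillar structure above can be sketched in a few lines. The minute-sample schema and field names below are illustrative assumptions, not part of the framework; the thresholds default to the article's examples (p95 under 2 seconds, 90% quality).

```python
from dataclasses import dataclass

@dataclass
class MinuteSample:
    """One minute of service-boundary SLI observations (hypothetical schema)."""
    available: bool       # availability SLI met this minute
    p95_latency_s: float  # observed p95 latency in seconds
    quality: float        # sampled quality score, 0.0 to 1.0

def slo_attainment(samples, latency_slo_s=2.0, quality_slo=0.90):
    """Per-SLO attainment plus the composite 'all-SLOs-met' rollup:
    slo_attainment = (minutes_in_slo / total_minutes) * 100."""
    total = len(samples)
    in_avail = sum(s.available for s in samples)
    in_latency = sum(s.p95_latency_s <= latency_slo_s for s in samples)
    in_quality = sum(s.quality >= quality_slo for s in samples)
    in_all = sum(
        s.available and s.p95_latency_s <= latency_slo_s and s.quality >= quality_slo
        for s in samples
    )
    pct = lambda n: 100.0 * n / total
    return {
        "availability": pct(in_avail),
        "latency": pct(in_latency),
        "quality": pct(in_quality),
        "all_slos_met": pct(in_all),
    }
```

Note that the composite rollup is computed per minute, not by multiplying the three percentages: a minute counts toward "all SLOs met" only if all three pillars held simultaneously in that minute.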

Metric 2: Drift rate

Definition. The magnitude and frequency of statistically significant distribution shifts in model inputs, model outputs, or the relationship between them, measured against a stable baseline.

Formula. For feature drift, Population Stability Index (PSI) per feature, or Kolmogorov–Smirnov / Jensen–Shannon divergence, compared to a baseline window. A feature is “drifted” if its PSI exceeds 0.2 or its JS divergence exceeds a configured threshold. drift_rate = (drifted_features / total_monitored_features) × 100.

Cadence. Daily on batch systems, continuous on streaming systems.

Owner. Model owner.

Three drift classes. (1) Data drift — the input distribution changes. (2) Concept drift — the relationship between inputs and the true label changes. (3) Label drift — the distribution of ground-truth labels changes. All three must be monitored; each has a different remediation pattern.
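The PSI-based formula above can be sketched as follows, assuming features have already been binned into baseline and current histograms (the binning strategy itself is an upstream choice not covered here):

```python
import math

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index between two binned distributions:
    PSI = sum over bins of (cur_frac - base_frac) * ln(cur_frac / base_frac).
    A small eps guards against empty bins."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_frac = max(b / b_total, eps)
        c_frac = max(c / c_total, eps)
        score += (c_frac - b_frac) * math.log(c_frac / b_frac)
    return score

def drift_rate(feature_psis, threshold=0.2):
    """drift_rate = (drifted_features / total_monitored_features) * 100,
    using the PSI > 0.2 drift threshold stated in the formula above."""
    drifted = sum(p > threshold for p in feature_psis.values())
    return 100.0 * drifted / len(feature_psis)
```

A feature whose distribution is unchanged scores a PSI near zero; a feature whose mass has shifted across bins scores well above the 0.2 threshold and counts toward the drift rate.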

Metric 3: Incident MTTR

Definition. The mean elapsed time from incident detection to incident resolution for AI-system incidents of stated severity.

Formula. mttr = sum(resolution_time_i) / count(incidents), reported by severity tier.

Cadence. Per incident; aggregated monthly.

Owner. Incident commander function, with service owner accountability.

Companion metrics. MTTR alone is insufficient. Pair it with mean-time-to-detect (MTTD — drift and silent-failure problems show up in long MTTD), change-failure rate (percentage of releases that produced an incident), and the ratio of user-reported to system-reported incidents (if users report more than the monitoring catches, your detection is broken).
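The MTTR formula, reported by severity tier as described above, can be sketched like this; the incident tuple layout and the change-failure-rate helper are illustrative assumptions, not a standard schema:

```python
from collections import defaultdict
from datetime import datetime

def mttr_by_severity(incidents):
    """mttr = sum(resolution_time_i) / count(incidents), per severity tier.
    Each incident is (severity, detected_at, resolved_at); result is hours."""
    buckets = defaultdict(list)
    for sev, detected, resolved in incidents:
        buckets[sev].append((resolved - detected).total_seconds() / 3600.0)
    return {sev: sum(hours) / len(hours) for sev, hours in buckets.items()}

def change_failure_rate(total_releases, incident_causing_releases):
    """Companion metric: percentage of releases that produced an incident."""
    return 100.0 * incident_causing_releases / total_releases
```

Measuring from detection (not occurrence) to resolution is what makes MTTD a necessary companion: a long gap between occurrence and detection never shows up in MTTR.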

How to measure — step by step

  1. Write the SLOs. For each service, document the three SLOs, the measurement window, the error budget, and the policy that triggers when the budget is burned. An SLO without an error-budget policy is a suggestion.
  2. Instrument the pipeline. Emit availability, latency, and quality signals at the service boundary, not at the model boundary — users experience the composed system.
  3. Baseline the drift detectors. Pick a stable window post-release, compute baselines, and store them. Recompute baselines only on a documented schedule or on approved model updates — never silently.
  4. Register the incident classification. Sev-1 through Sev-4 with time budgets, escalation paths, and a mandatory post-mortem for Sev-1 and Sev-2.
  5. Release gate. A release candidate that regresses p95 latency by more than 10% or quality score by more than 2% is blocked at the gate.
  6. Runtime detection. Drift alerts, SLO burn-rate alerts, and anomaly alerts all land in the same on-call queue so correlated failures are visible.
  7. Post-mortem loop. Every Sev-1/Sev-2 produces a written post-mortem with a root cause, a remediation, and a date. Reliability metrics are only credible if post-mortems actually change the system.

Targets and thresholds

  • Availability SLO. 99.9% customer-facing; 99.5% internal. Error budget 0.1% / 0.5% per 30 days.
  • Latency SLO. Use case dependent; publish both target and alert thresholds.
  • Quality SLO. Customer-facing generative systems: at least 90% of outputs rated acceptable or better. Deterministic classifiers: within 1% of release-candidate accuracy.
  • Drift rate. Fewer than 5% of monitored features drifted in any 7-day window; any concept drift on a label-critical feature triggers re-evaluation.
  • MTTR. Sev-1 under 2 hours, Sev-2 under 8 hours, Sev-3 under 3 business days. MTTD under 15 minutes for Sev-1.
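The availability targets above imply concrete error budgets; a minimal sketch of the conversion:

```python
def error_budget_minutes(availability_slo, window_days=30):
    """Allowed out-of-SLO minutes for a given availability target
    over a rolling window (30 days by default)."""
    return (1.0 - availability_slo) * window_days * 24 * 60

# A 99.9% target over 30 days allows about 43 minutes of downtime;
# a 99.5% target allows about 216 minutes (3.6 hours).
```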

Common pitfalls

Silent degradation that looks healthy. A system can hit every infrastructure SLO while the quality of its outputs collapses. Quality must be an SLO, not an afterthought.

Drift alerts with no owner. A drift alert that fires to nobody is worse than no alert — it trains the team to ignore the channel.

Aspirational SLOs. An SLO the team cannot meet is not an SLO. Start with what the system actually delivers, publish it, and improve it.

Ignoring MTTD. A team that closes incidents fast but detects them slowly is producing customer harm masked by a healthy MTTR number.

Post-mortems without follow-through. If remediation actions from post-mortems are not tracked to closure, reliability metrics will be noise rather than signal.
