Measuring AI Safety: Content, Jailbreak, and Grounding Metrics

FlowRidge

MEASURING AI SAFETY

Figure 277. Safety coverage across content, jailbreak, and grounding failure modes.

Definition

Safety is the COMPEL Trust & Performance dimension that asks whether an AI system can be trusted to avoid producing harmful, deceptive, or boundary-violating output under both normal use and adversarial pressure. This article defines the three canonical Safety metrics — content-safety pass rate, score, and grounding rate — and explains how to measure each one under a structured evaluation regime that feeds release gates, production monitoring, and quarterly red-team cycles. The methodology is anchored to the NIST AI RMF “Safe” characteristic, ISO/IEC 23894 risk management, and industry-standard eval suites such as HELM, TruthfulQA, and MLCommons AILuminate.

Why this dimension matters

The containment gap. Article M1.5-Art12 established the containment boundary for autonomous AI: the technical and procedural controls that keep an AI system inside its authorized behavior envelope. Safety measurement is how you know whether the containment boundary holds. Without metrics, containment is a diagram on a whiteboard; with metrics, it is an operational property with a trend line.

The release decision. The question “is this model safe enough to ship” is not a yes/no. It is a set of thresholds on measurable properties. A team that ships based on “we ran some red-team prompts and it felt okay” is not making a safety decision — it is making a political one. Safety metrics convert the release decision into an audit-defensible judgment.

The adversary does not sleep. Prompt-injection techniques and jailbreak vectors evolve weekly. A safety program that measures once at launch and never again is measuring the past, not the present. Continuous measurement is what converts safety from a point-in-time assurance into a durable property.

What good looks like

Every production model has an approved evaluation harness that runs on every release candidate and on a rotating schedule in production.
The harness includes three independent suites: a content-safety suite, a jailbreak suite, and a grounding/hallucination suite.
Thresholds are published and gated — a release candidate that fails a safety gate cannot ship without an approved, time-boxed exception.
Red-team exercises run on a quarterly cadence against the same models, and their findings feed back into the harness.
Safety metrics roll up to the board alongside value, reliability, and compliance.

Core metrics

Metric 1: Content-safety pass rate

Definition. The percentage of model outputs on a standardized harmful-content evaluation suite that are rated safe by the scoring policy within a measurement window.

Formula. content_safety_pass_rate = (safe_outputs / total_evaluated_outputs) × 100.

Cadence. Measured on every release candidate and weekly in production on a rotating sample.

Owner. Model owner with review by the AI safety lead.

Scoring. Prefer an automated scorer (a safety classifier) calibrated against a human-rated gold set. Report both the automated pass rate and the human agreement rate, because the automated scorer drifts.

Suite composition. The harmful-content suite must cover the taxonomies defined in NIST AI 600-1 (generative AI profile) and MLCommons AILuminate: violent crime, sexual content involving minors, hate, self-harm, privacy violation, specialized advice (medical, legal, financial), intellectual property, and defamation. Each category should have at least 100 prompts.

Metric 2: Jailbreak resistance score

Definition. The percentage of adversarial prompts in a fixed jailbreak evaluation suite that the model successfully refuses or neutralizes.

Formula. jailbreak_resistance = (refused_attacks / total_attacks) × 100.

Cadence. On every release candidate; weekly on production.

Suite composition. Mix of published academic attacks (PAIR, GCG, DAN variants), internal red-team finds, and a rotating “unseen” set of 50 new attacks per quarter that the model has never seen. Never let the attack set become static — a model tuned to pass a fixed suite will fail on the next attack class.

Owner. AI safety lead.

Metric 3: Grounding rate

Definition. The percentage of generative outputs whose factual claims can be traced to a verifiable source in the provided context, for models operating in a retrieval-augmented or tool-grounded mode.

Formula. grounding_rate = (claims_with_valid_citation / total_claims_evaluated) × 100.

Cadence. On every release candidate; continuous sampling in production.

Owner. Model owner.

Why it matters. For RAG systems, grounding rate is a safety metric, not a quality metric. An ungrounded answer that sounds authoritative is a hallucination that a user will act on — the safety harm is the downstream decision, not the incorrect sentence.

How to measure — step by step

Define the behavior envelope. Write down, per use case, what the model is permitted to do, forbidden to do, and required to do. Safety metrics only have meaning against an explicit envelope.
Assemble the harness. Build or adopt the three suites. Pin each suite to a version. Store the suites in the evidence repository so auditors can rerun them.
Calibrate scorers. For every automated scorer, maintain a gold set rated by humans. Report the scorer’s agreement rate with humans alongside the metric — if agreement drops below 90%, pause the gate until recalibration.
Run the gate. Every release candidate runs the full harness. Results are stored, signed, and linked to the release artifact.
Sample production. Production traffic is sampled daily, with rare or sensitive categories oversampled. A safety failure on live traffic triggers an incident ticket with the same severity as a reliability incident.
Red-team quarterly. A dedicated team runs unstructured adversarial exercises. Findings become new suite items and the suite is re-versioned.
Report to the board. The three metrics, their trend lines, and the open red-team findings appear on the trust scorecard.

Targets and thresholds

Content-safety pass rate. 99% for consumer-facing and high-risk enterprise models, 97% for internal tooling. Alert threshold 0.5pp below target.
Jailbreak resistance. 95% on the published academic suite, 90% on the rotating unseen set. Any regression greater than 2pp from the prior release is a blocker.
Grounding rate. 95% for RAG systems in regulated domains (healthcare, legal, financial), 90% for general enterprise.
Scorer–human agreement. 90% minimum before a scorer may gate releases.

Common pitfalls

Teaching to the test. A model fine-tuned on the exact evaluation suite will pass everything and fail everything new. Rotate the suite, hold out a secret subset, and refresh attacks quarterly.

Treating a single number as the story. A 98% content-safety pass rate means 2% of outputs produced harm. On 10 million queries a day, that is 200,000 harms. Always pair the percentage with an absolute volume estimate.

No owner for the harness. If nobody owns the evaluation harness, it rots. The AI safety lead must own its version, its execution, and its drift.

Ignoring scorer drift. Safety classifiers are themselves models and they drift. Measure scorer agreement monthly and recalibrate when it slips.

Conflating safety with alignment. Safety is measurable behavior on a defined envelope. Alignment is a research property. Do not use one to excuse the absence of the other.

M1.5Safety Boundaries and Containment for Autonomous AI M3.3AI Security Architecture M3.3Measuring AI Security M2.5Designing Measurement Frameworks for Agentic AI Systems M3.4Measuring AI Responsibility