Measuring AI Security: Injection Resistance, Leakage, and Integrity Metrics

FlowRidge

MEASURING AI SECURITY

Preventive

Stop abuse before it happens

Input filteringPrompt firewallsAccess controlSecure SDLC

Detective

Identify adversarial behavior

Injection testsLeakage monitorsRed-team evalsAudit logs

Corrective

Respond and remediate

Incident responseRollbackModel quarantinePatch release

Assurance

Prove integrity

Model signingSupply-chain attestationSBOMVerification rate

Figure 280. Layered AI security controls across prevention, detection, correction, and compensation.

Definition

Security is the COMPEL Trust & Performance dimension that asks whether an AI system resists adversarial abuse, protects the data it handles, and can prove that the model running in production is the model that was approved. This article defines the three canonical Security metrics — prompt- score, data leakage rate, and model integrity verification rate — and explains how to measure each one using evaluation harnesses, adversarial testing, and supply-chain attestation. The methodology is anchored to the OWASP Top 10 for LLM Applications (2025), MITRE ATLAS, NIST SP 800-218A (SSDF for AI), NIST AI 600-1, and ISO/IEC 27001 Annex A controls.

Why this dimension matters

The architecture-to-metric gap. Article M3.3-Art05 established the security architecture for enterprise AI — identity, isolation, data protection, and supply-chain controls. Architecture is a set of intentions; metrics are the evidence that those intentions hold under load. Without measurement, a security architecture is unfalsifiable and therefore untrustworthy.

The attack surface is bigger than the model. Prompt injection enters through user input, retrieved documents, tool outputs, and image modalities. Data leakage happens through training data memorization, system-prompt echo, and overly verbose tool traces. Model integrity failures happen when an attacker swaps a signed artifact for an unsigned one in the registry. A security metric program must cover all three surfaces or it covers none.

The regulator’s question. Auditors do not ask “is your AI secure.” They ask “what is your prompt-injection resistance score on the current release, and how has it trended.” A team that cannot answer that question is not operating a security program.

What good looks like

Every production model has a security evaluation harness covering injection, leakage, and integrity, tied to release gates.
Attack suites evolve — at least 25% of the injection suite is rotated per quarter.
Data-leakage probes run in both directions — probing what the model remembers from training and what the system accidentally reveals at inference time.
Every deployed model artifact is signed and the signature is verified at load time on every host.
Red-team findings feed the harness and the harness feeds the incident response runbook.

Core metrics

Metric 1: Prompt-injection resistance score

Definition. The percentage of adversarial prompt-injection payloads in a versioned test suite that the system successfully rejects or neutralizes without executing the injected instruction.

Formula. injection_resistance = (neutralized_attacks / total_attacks) × 100.

Cadence. On every release candidate; weekly in production against the live stack (model + guardrails + retrieval + tools).

Owner. AI security lead.

Suite composition. Must cover the OWASP LLM Top 10 (LLM01 Prompt Injection — direct, indirect, multimodal), MITRE ATLAS techniques, and a rotating unseen set. Test the composed system, not the raw model — a model that fails the raw suite but passes when wrapped in guardrails is an acceptable defense-in-depth outcome, provided the guardrails are themselves tested.

Scoring nuance. A “success” for the attacker is not just jailbreaking safety content — it is any unauthorized action: exfiltrating data, invoking an unauthorized tool, writing to an unauthorized path, or altering the conversation state in a way the legitimate user did not request.

Metric 2: Data leakage rate

Definition. The percentage of probe queries that cause the system to reveal data it was not authorized to reveal — including training-data memorization, retrieval of out-of-scope documents, echo of system prompts or secrets, and cross-tenant or cross-session bleed.

Formula. leakage_rate = (probes_that_leaked / total_probes) × 100 — lower is better; target approaches zero.

Cadence. Release candidate plus monthly production probe.

Owner. Data protection officer jointly with AI security lead.

Probe categories. (1) Training data extraction attacks (canaries planted in training data; the canary recovery rate is the memorization metric). (2) System-prompt exfiltration attempts. (3) Cross-tenant retrieval probes (a query from tenant A designed to retrieve tenant B’s documents). (4) Secret echo tests (API keys and PII inserted into context and the response scanned for echo). (5) Tool-output leakage (sensitive tool responses appearing in places the user should not see).

Metric 3: Model integrity verification rate

Definition. The percentage of model load events on production hosts where the signed artifact hash matches the approved release hash, verified against a trusted signing authority.

Formula. integrity_rate = (verified_loads / total_loads) × 100.

Cadence. Continuous — every load event.

Owner. Platform engineering, reported to the AI security lead.

Target. 100%. Any deviation is an incident.

Scope. Covers base model weights, fine-tune adapters, guardrail models, embedding models, and tool-handler code. All must be signed; all must be verified on load. Unsigned artifacts must be blocked by the platform, not just logged.

How to measure — step by step

Map the attack surface. For each production system, list every input channel (user, retrieved document, tool, image, audio), every output channel, and every privileged action the model can trigger. The injection suite must cover all input channels.
Build the harness. Adopt published suites (OWASP, ATLAS, Garak) plus internal red-team finds. Version everything.
Instrument the pipeline. Every inference must emit enough telemetry to evaluate leakage probes post-hoc — the context that was sent to the model, the context the model retrieved, and the tools it invoked.
Wire the signing authority. Use Sigstore, cosign, or an equivalent to sign every model artifact at build time. Configure the model loader to fail closed if the signature cannot be verified.
Run the gate. Release candidate blocked if injection resistance drops more than 2pp from prior release, leakage rate exceeds target, or any integrity check fails.
Rotate attacks. Every quarter, retire the oldest 25% of injection attacks and replace them with new ones from red-team and open-source feeds.
Report to the board. Three numbers on the trust scorecard, plus the count of open red-team findings.

Targets and thresholds

Injection resistance. 95% on the published suite, 88% on the rotating unseen set. Regression greater than 2pp from prior release is a blocker.
Leakage rate. Training-data memorization less than 0.1% on canary probes. System-prompt exfiltration 0%. Cross-tenant leakage 0% — any occurrence is a Sev-1 incident.
Integrity verification. 100%. Anything less is an incident.
Red-team cycle. Quarterly minimum for production systems; monthly for systems handling regulated data.

Common pitfalls

Testing the raw model instead of the system. The adversary attacks what is deployed, not what came from the model provider. Test the composed stack.

Static attack suites. An attack suite that never changes trains the model to pass it. Rotate.

Logging secrets into your own probe data. Leakage probes need synthetic secrets, not real ones. A real secret in a probe dataset becomes a new leak source.

Signing at build time, not verifying at load time. Signatures that are never checked provide no security. Verification must fail closed and must be instrumented.

Confusing security and safety. Safety is “does the model refuse to produce harm.” Security is “can an adversary force the model to act against its owner’s intent, exfiltrate data, or tamper with the artifact.” Both dimensions are required and neither substitutes for the other.

M3.3AI Security Architecture M3.3Measuring AI Safety M1.3Technology Pillar Domains — Integration and Security M3.4AI Risk Governance at Enterprise Scale M4.3NIST AI RMF Implementation at Enterprise Scale