AITP M2.5-Art14 v1.0 Reviewed 2026-04-06 Open Access
M2.5 Measurement, Evaluation, and Value Realization

Measuring AI Sustainability: Energy, Carbon, and Cost per Inference


7 min read Article 14 of 16 Evaluate
[Figure: measurement points across the AI model lifecycle. Training: baseline kWh per training run. Fine-tuning: incremental footprint, marginal kWh + carbon. Deployment: inference telemetry, kWh per 1,000 inferences. Operate: carbon attribution, gCO₂e per 1,000 inferences. Retire: cost + SCI rollup for FinOps and SCI reporting.]
Figure 281. Energy, carbon, and cost measurement points across the AI model lifecycle.

Why this dimension matters

The invisible cost of autonomy. A single agentic workflow can consume hundreds or thousands of tokens across multiple tool calls before producing an answer. Multiplied across an enterprise user base, this produces a compute footprint that is invisible to procurement and untraceable to any single budget line. Sustainability measurement is how you make that footprint visible.

Carbon disclosure is becoming mandatory. The EU Corporate Sustainability Reporting Directive (CSRD), SEC climate disclosure rules, and California SB 253/SB 261 require Scope 2 and increasingly Scope 3 emissions reporting. Cloud AI usage falls under Scope 3 for most buyers. A company that cannot report the carbon footprint of its AI stack will have an auditor problem within two reporting cycles.

Finance wants the number. Beyond environmental reporting, finance wants cost per inference tied to business value. The question “does this workflow pay for itself” cannot be answered without a unit-cost metric. Sustainability and FinOps converge on the same measurement.

What good looks like

  • Every production AI workload has a named cost owner and a monthly cost-per-inference trend.
  • Energy and carbon are estimated using a documented method and the estimate is reconciled with provider reports.
  • Optimization decisions (model choice, caching, batching) are made on evidence, not vibes.
  • Sustainability metrics appear on the same FinOps dashboard as cloud spend and are reviewed in monthly cost reviews.

Core metrics

Metric 1: Energy per 1,000 inferences (kWh/1k)

Definition. The electrical energy consumed per 1,000 model inference calls, attributed to the compute infrastructure used.

Formula. energy_per_1k = (total_kwh_in_window / total_inferences_in_window) × 1000.

Cadence. Monthly; reported by model and by workload.

Owner. Platform engineering with FinOps.

Data sources. (1) For managed providers (OpenAI, Anthropic, Google, Azure OpenAI), use the provider’s published energy per token where available and multiply by token counts from your observability pipeline. (2) For self-hosted inference, measure GPU power draw via NVIDIA SMI / DCGM and attribute to requests via the scheduler. (3) For hybrid, combine both. Document the method, the assumptions, and the uncertainty.

Honesty note. Measured energy per inference carries real uncertainty — often ±30% on published vendor figures. Report the figure with its uncertainty band; a precise number implies a precision that does not exist.
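The formula above can be sketched directly from token telemetry. The per-token energy figures and model names below are illustrative assumptions, not published vendor numbers; substitute your documented values and record their source and uncertainty.

```python
# Sketch: energy per 1,000 inferences from token telemetry.
# The kWh-per-1,000-tokens figures are ASSUMED placeholders.
ASSUMED_KWH_PER_1K_TOKENS = {
    "hosted-model-a": 0.0003,   # hypothetical hosted-API figure
    "self-hosted-b": 0.0007,    # hypothetical DCGM-derived figure
}

def energy_per_1k(calls: list[dict]) -> float:
    """calls: [{'model': str, 'tokens': int}, ...] from the observability store."""
    total_kwh = sum(
        c["tokens"] / 1000 * ASSUMED_KWH_PER_1K_TOKENS[c["model"]]
        for c in calls
    )
    return total_kwh / len(calls) * 1000  # kWh per 1,000 inferences

calls = [
    {"model": "hosted-model-a", "tokens": 1200},
    {"model": "self-hosted-b", "tokens": 800},
]
print(round(energy_per_1k(calls), 6))  # 0.46 kWh/1k for these example calls
```

In practice the per-model figures would come from a versioned lookup table, so the ±30% uncertainty band can be carried alongside each value.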

Metric 2: Carbon attribution per 1,000 inferences (gCO₂e/1k)

Definition. The grams of CO₂-equivalent emissions attributed to 1,000 inference calls, using the GHG Protocol Scope 2 market-based methodology where contractual instruments exist, and the location-based methodology where they do not.

Formula. carbon_per_1k = energy_per_1k × grid_emissions_factor_gco2e_per_kwh.

Cadence. Monthly; reconciled quarterly with provider sustainability reports.

Owner. Sustainability lead with platform engineering.

Grid factor source. Use the region-specific hourly factor from an authoritative source (Electricity Maps, WattTime, or provider-published region factors). Record which one you used and its vintage. Global averages are acceptable only when region data is unavailable, and must be flagged.

SCI conformance. For external reporting, follow the Green Software Foundation’s Software Carbon Intensity specification, which defines how to bound the system, select the functional unit, and exclude energy that would have been consumed regardless.
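Applying the Metric 2 formula is a single multiplication; the discipline is in sourcing the grid factor. A minimal sketch, with an assumed factor standing in for a region-specific, dated value from Electricity Maps, WattTime, or a provider report:

```python
# Sketch: carbon per 1,000 inferences (Metric 2 formula applied directly).
# The grid factor here is an ILLUSTRATIVE placeholder, not a real
# regional value; record the source and vintage of the factor you use.

def carbon_per_1k(energy_per_1k_kwh: float, grid_factor_gco2e_per_kwh: float) -> float:
    return energy_per_1k_kwh * grid_factor_gco2e_per_kwh

# e.g. 0.46 kWh per 1,000 inferences on an assumed 400 gCO2e/kWh grid
print(carbon_per_1k(0.46, 400))  # ~184 gCO2e per 1,000 inferences
```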

Metric 3: Cost per inference

Definition. The fully loaded dollar cost of a single inference call, including model fees, retrieval fees, guardrail fees, orchestration overhead, and amortized platform cost.

Formula. cost_per_inference = (total_cost_in_window / total_inferences_in_window).

Cadence. Weekly; reported by workload.

Owner. FinOps lead with workload owner.

What to include. Model API fees, embedding generation, vector database reads/writes, guardrail model calls, tool-handler compute, logging and observability, and a share of platform/engineering overhead. A cost number that excludes these components is not a cost number — it is an invoice.
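The component list above can be made concrete as a cost rollup. The component names and dollar amounts below are illustrative assumptions; the point is that model API fees are one line among several, not the whole number:

```python
# Sketch: all-in cost per inference over a reporting window.
# Component names and amounts are ASSUMED example values.

def cost_per_inference(window_costs: dict[str, float], inferences: int) -> float:
    """window_costs: dollars per cost component for the window."""
    return sum(window_costs.values()) / inferences

window = {
    "model_api_fees": 4200.00,
    "embeddings": 310.00,
    "vector_db": 520.00,
    "guardrail_calls": 180.00,
    "tool_handler_compute": 260.00,
    "logging_observability": 140.00,
    "platform_overhead_share": 390.00,
}
print(round(cost_per_inference(window, 600_000), 4))  # -> 0.01 dollars per call
```

Note that model API fees are 70% of this example's total; a number built from the invoice alone would understate unit cost by nearly a third.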

How to measure — step by step

  1. Agree the scope. What counts as “an inference”? For chat, one user turn is typical. For agents, one completed user request (which may involve many model calls) is typical. Document the choice — every downstream number depends on it.
  2. Stand up a token-counting pipeline. Every model call emits its input tokens, output tokens, model ID, region, and latency into an observability store. Without this, nothing else is possible.
  3. Reconcile with vendor bills. Monthly, sum the pipeline’s attributed cost and reconcile against the cloud and model-provider invoices. Unreconciled variance greater than 5% means the telemetry is incomplete.
  4. Attribute energy. For hosted APIs, multiply tokens by the provider’s published per-token energy. For self-hosted, capture GPU power via DCGM and attribute proportional to utilization.
  5. Apply the grid factor. Use the region and hour-of-day factor. Document the source and the vintage.
  6. Publish the unit metrics. Energy, carbon, and cost per 1,000 inferences, per workload, per month, with trend arrows.
  7. Tie to outcome. The business metric is not cost per inference in isolation — it is cost per successful business outcome. Divide cost by completed workflow to get the number that matters.
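Steps 3 and 7 reduce to two small checks that can run in the monthly pipeline. A minimal sketch, with illustrative numbers:

```python
# Sketch of steps 3 and 7: reconcile telemetry-attributed cost against
# the invoice, then divide cost by completed workflows. Amounts are
# ASSUMED example values.

def reconciliation_variance(telemetry_cost: float, invoiced_cost: float) -> float:
    """Fractional variance; above 0.05 flags incomplete telemetry (step 3)."""
    return abs(telemetry_cost - invoiced_cost) / invoiced_cost

def cost_per_outcome(total_cost: float, completed_workflows: int) -> float:
    """Cost per successful business outcome (step 7) -- the number that matters."""
    return total_cost / completed_workflows

variance = reconciliation_variance(5800.0, 6000.0)
assert variance < 0.05  # within the 5% reconciliation threshold
print(round(cost_per_outcome(6000.0, 12_000), 2))  # dollars per completed workflow
```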

Targets and thresholds

  • Trend direction. Cost per inference and energy per inference should trend down release over release. A stable number means the optimization program is not working; a rising number requires investigation.
  • Variance from budget. Monthly cost per workload within 10% of the approved FinOps budget.
  • Carbon intensity reduction. Year-over-year reduction tied to the enterprise decarbonization target and publicly reported where required.
  • Reconciliation variance. Under 5% between telemetry-attributed cost and provider-billed cost.

Common pitfalls

Using a vendor’s average efficiency figure as if it applied to your workload. Energy per token depends on model, batch size, region, and time of day. A global average is a starting point, not an answer.

Counting compute but not storage and retrieval. Vector databases, embedding storage, and logging pipelines can rival inference cost in RAG-heavy workloads. Include them.

Celebrating cost reduction that pushed the workload to a worse grid. Moving inference from a low-carbon region to a cheaper high-carbon region reduces cost and increases emissions. Optimize on the composite metric, not just the dollar.

Reporting precision that does not exist. 483.27 gCO₂e per 1,000 inferences, reported to two decimal places, is false precision. Report 480 gCO₂e ±120 and show the assumptions.
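One way to enforce this in reporting code is to round both the value and its band before publishing. The two-significant-figure choice below is a judgment call, not part of any standard:

```python
import math

# Sketch: format a measurement with its uncertainty band instead of
# false precision. Rounds to two significant figures (an assumption,
# not a prescribed convention).

def with_band(value: float, rel_uncertainty: float) -> str:
    def sig2(x: float) -> float:
        return round(x, -int(math.floor(math.log10(abs(x)))) + 1)
    band = value * rel_uncertainty
    return f"{sig2(value):g} gCO2e +/-{sig2(band):g}"

print(with_band(483.27, 0.25))  # "480 gCO2e +/-120"
```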

Treating sustainability as a compliance-only exercise. Sustainability and FinOps run on the same telemetry. Integrate them and the numbers earn their place in the weekly review.
