AI Spend Is Making Cloud Waste Trend Up Again
By Shayan Ghasemnezhad on February 11, 2026 · 3 min read
GPU instances and inference endpoints have reopened the cloud cost problem that FinOps was starting to solve. Governance needs to catch up.
Cloud cost management was making progress. Teams were tagging resources, right-sizing instances, and buying Savings Plans. Then AI workloads arrived—and the cost curve bent upward again. GPU instances cost 10–40x their CPU equivalents. Inference endpoints run 24/7 whether or not anyone is asking questions. Training jobs can burn through five figures in a weekend. The FinOps playbook that worked for web applications needs new chapters.
Why AI Workloads Break Traditional FinOps
Traditional cloud cost management assumes relatively predictable, steady-state workloads. You provision a fleet of instances, they run services, and cost scales roughly with traffic. AI workloads violate every part of this assumption. Training jobs are bursty and unpredictable—a hyperparameter sweep might spin up 50 GPU instances for six hours, then nothing for two weeks. Inference demand is hard to forecast because product teams are still discovering what users do with AI features.
The unit economics are also different. A single p4d.24xlarge instance costs roughly €28 per hour. A team running fine-tuning experiments without cost guardrails can spend more in a day than their entire monthly EC2 budget for non-AI workloads. And unlike CPU instances, GPU instances have limited Savings Plan coverage and sparse spot availability.
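The arithmetic behind that warning is worth making explicit. A back-of-envelope sketch, using the approximate p4d.24xlarge rate quoted above and the hyperparameter-sweep example from the previous paragraph (all figures illustrative):

```python
# Back-of-envelope cost of a GPU hyperparameter sweep.
# The hourly rate is the approximate p4d.24xlarge figure quoted above;
# instance count and duration match the example sweep in the text.

P4D_HOURLY_EUR = 28.0

def sweep_cost(instances: int, hours: float, rate: float = P4D_HOURLY_EUR) -> float:
    """Total on-demand cost of running `instances` in parallel for `hours`."""
    return instances * hours * rate

print(f"50 instances x 6 h: EUR {sweep_cost(50, 6):,.0f}")  # EUR 8,400
```

One six-hour sweep is already most of the way to five figures; two in a weekend gets there.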
Inference Economics: The Hidden Steady-State Cost
Training gets the headlines, but inference is where the ongoing cost lives. A self-hosted model endpoint on a g5.2xlarge costs approximately €1.20 per hour, roughly €870 per month if it runs continuously. If your AI feature handles 50 requests per hour, that is €0.024 per request. At 5 requests per hour, it is €0.24 per request: an order of magnitude difference in unit cost for the same infrastructure.
The decision between self-hosted inference and managed API (OpenAI, Anthropic, Bedrock) is fundamentally a utilisation question. Managed APIs charge per token with no idle cost. Self-hosted endpoints have high fixed cost and low marginal cost. The crossover point depends on volume, latency requirements, and data residency constraints.
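The crossover can be computed directly. A minimal sketch: the endpoint rate is the g5.2xlarge figure above, while the managed-API rate and average request size are hypothetical placeholders to be replaced with your provider's actual pricing:

```python
# Break-even between a fixed-cost endpoint and a per-token managed API.
# ENDPOINT_EUR_PER_HOUR comes from the text; the API rate and tokens per
# request are hypothetical placeholders, not real provider pricing.

ENDPOINT_EUR_PER_HOUR = 1.20
API_EUR_PER_1K_TOKENS = 0.002      # hypothetical
TOKENS_PER_REQUEST = 1_000         # hypothetical average

def self_hosted_per_request(requests_per_hour: float) -> float:
    """Idle-inclusive unit cost of the always-on endpoint."""
    return ENDPOINT_EUR_PER_HOUR / requests_per_hour

def api_per_request() -> float:
    """Managed API unit cost: purely marginal, no idle component."""
    return API_EUR_PER_1K_TOKENS * TOKENS_PER_REQUEST / 1_000

def breakeven_requests_per_hour() -> float:
    """Volume above which the fixed-cost endpoint wins on unit cost."""
    return ENDPOINT_EUR_PER_HOUR / api_per_request()

print(breakeven_requests_per_hour())  # 600.0 at these example rates
```

Below the break-even volume the managed API is cheaper per request even at a much higher per-token price, because you pay nothing for idle capacity.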
Governance Framework for AI Spend
Build governance around three controls: budgets, approval gates, and automated shutdown.
- Budgets: Set per-team monthly GPU budgets. Track burn rate daily, not monthly. A team that has spent 80% of its budget by day 15 needs a conversation, not a surprise at month-end.
- Approval gates: Require explicit sign-off for training jobs above a cost threshold (e.g. €500). This is not about slowing people down—it is about making cost a conscious input to experiment design.
- Automated shutdown: Training jobs should have hard time limits. Idle inference endpoints should scale to zero. Use SageMaker Serverless Inference or Lambda-based inference for low-traffic features.
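The burn-rate check in the first bullet is simple enough to automate on day one. A sketch: the 1.2x over-pace tolerance is an illustrative threshold, not a standard:

```python
# Flag teams whose GPU spend is running ahead of a linear monthly pace.
# The 1.2x tolerance is an illustrative threshold; tune it per team.

def burn_rate_alert(spent: float, budget: float, day: int,
                    days_in_month: int = 30, tolerance: float = 1.2) -> bool:
    """True when the fraction of budget consumed exceeds the fraction
    of the month elapsed by more than `tolerance`."""
    consumed = spent / budget
    elapsed = day / days_in_month
    return consumed > elapsed * tolerance

# The example from above: 80% of budget gone by day 15.
print(burn_rate_alert(spent=8000, budget=10000, day=15))  # True
```

Wire the output into a daily Slack or email digest and the month-end surprise becomes a mid-month conversation.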
A hard runtime cap belongs in the job definition itself:

```python
# SageMaker training job with a max-runtime guard
import boto3

sm = boto3.client('sagemaker')
sm.create_training_job(
    TrainingJobName='fine-tune-guard-example',
    StoppingCondition={
        'MaxRuntimeInSeconds': 14400,   # 4 hour hard limit
        # MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds: it counts
        # runtime plus spot wait, and applies with managed spot training.
        'MaxWaitTimeInSeconds': 18000,  # 4 h runtime + up to 1 h spot wait
    },
    # ... other config ...
)
```
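Scaling idle endpoints to zero can start as a simple reaper. A sketch that assumes invocation counts have already been pulled (for example from CloudWatch's Invocations metric for each endpoint); the deletion call is injected so the logic stays testable, and in production it would wrap SageMaker's delete_endpoint:

```python
# Reap inference endpoints with no recent traffic.
# `endpoint_metrics` maps endpoint name -> recent hourly invocation
# counts (e.g. from CloudWatch); fetching them is omitted here.

def is_idle(invocations: list[int]) -> bool:
    """True when every recent sample shows zero traffic."""
    return len(invocations) > 0 and all(count == 0 for count in invocations)

def reap_idle_endpoints(endpoint_metrics: dict[str, list[int]],
                        delete=None) -> list[str]:
    """Return idle endpoint names; if `delete` is given (e.g. a wrapper
    around SageMaker's delete_endpoint), tear them down as well."""
    idle = [name for name, counts in endpoint_metrics.items()
            if is_idle(counts)]
    if delete is not None:
        for name in idle:
            delete(name)
    return idle

print(reap_idle_endpoints({'chat-v2': [0, 0, 0], 'search': [12, 4, 9]}))
# ['chat-v2']
```

Run it on a schedule in report-only mode first; once the team trusts the output, pass a real deleter.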
Decision Framework
For each AI workload, answer four questions:
- Expected utilisation: if below 30%, use a managed API instead of self-hosting.
- Data residency: if data cannot leave your VPC, self-hosting or Bedrock is mandatory.
- Latency budget: if sub-100ms, you need a dedicated endpoint; if 2–5 seconds is acceptable, serverless inference works.
- Experimentation cadence: if the team is running daily experiments, invest in a shared training cluster with scheduling; if monthly, on-demand is fine.
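The first three questions reduce to a routing function. The thresholds below mirror the framework above; the return labels are illustrative:

```python
# Serving-mode decision for an inference workload.
# Thresholds (30% utilisation, 100 ms) come from the framework above;
# the label strings are illustrative, not product names.

def choose_serving(utilisation: float, data_must_stay_private: bool,
                   latency_budget_ms: float) -> str:
    if data_must_stay_private:
        return "self-hosted or Bedrock"    # residency is non-negotiable
    if latency_budget_ms < 100:
        return "dedicated endpoint"        # only warm GPUs hit sub-100ms
    if utilisation < 0.30:
        return "managed API"               # idle fixed cost would dominate
    return "serverless or dedicated endpoint"

print(choose_serving(0.10, False, 3000))  # managed API
```

Note the ordering: residency overrides everything, latency overrides cost, and utilisation decides only when the harder constraints allow a choice.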
Failure Modes
The worst failure is invisible spend. A data scientist spins up a notebook instance on a Friday afternoon, runs a training job, and forgets to shut down the instance. It runs for three weeks. This happens in every organisation that does not enforce auto-stop policies on development instances.
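The auto-stop policy itself is a few lines. A sketch with an illustrative 12-hour cap; a scheduled job would pair this check with the SageMaker call that stops the notebook instance:

```python
from datetime import datetime, timedelta, timezone

# Illustrative cap: no development instance runs longer than 12 hours.
MAX_DEV_RUNTIME = timedelta(hours=12)

def should_stop(launched_at: datetime, now: datetime,
                limit: timedelta = MAX_DEV_RUNTIME) -> bool:
    """True when an instance has outlived the auto-stop window."""
    return now - launched_at > limit

# The Friday-afternoon scenario: checked again on Monday morning.
friday = datetime(2026, 2, 6, 17, 0, tzinfo=timezone.utc)
monday = datetime(2026, 2, 9, 9, 0, tzinfo=timezone.utc)
print(should_stop(friday, monday))  # True: the runaway is caught in days
```

With the check running hourly, the three-week runaway becomes a twelve-hour one.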
Model sprawl is the AI equivalent of server sprawl. Teams deploy multiple model endpoints for different features without a shared registry. Each endpoint has its own GPU allocation. Consolidate where possible—a single endpoint serving multiple use cases with routing logic is cheaper than three endpoints at 15% utilisation each.
AI cost governance is not optional—it is the difference between AI features that improve margin and AI features that erode it. Build the visibility and controls before the spend forces you to.