Kubernetes Cost Control: Requests, Limits, and the Traps That Inflate Bills

By Shayan Ghasemnezhad on March 25, 2025 · 3 min read

autoscaling · cost-control · finops · kubernetes

Misconfigured resource requests are the top driver of Kubernetes overspend. How to right-size, autoscale, and allocate costs per namespace.

Kubernetes makes it easy to deploy workloads and remarkably easy to overpay for them. The abstraction that simplifies operations also hides the connection between what you request, what you use, and what you pay for. Most Kubernetes clusters run at 20–40% actual utilisation—the rest is allocated but idle capacity that appears on your cloud bill every month.

Requests vs. Limits: The Misunderstanding That Costs Money

Resource requests are what the scheduler uses to place pods on nodes. If you request 2 CPU and 4Gi memory, the scheduler reserves that capacity on a node—whether your pod uses it or not. Resource limits are the ceiling: the maximum your pod can burst to before it gets throttled (CPU) or killed (memory).

The common mistake is setting requests equal to limits. This eliminates burstability and forces the cluster to reserve peak capacity for every pod at all times. If your service uses 200m CPU at steady state and spikes to 1 CPU during deployments, a request of 1 CPU wastes 800m continuously. Multiply that across 50 services and you are paying for a cluster that is four times larger than it needs to be.
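In a pod spec, the anti-pattern looks like this (the numbers are illustrative, matching the 200m-steady-state service above):

```yaml
# Anti-pattern: requests == limits. Guaranteed QoS, but zero burst headroom —
# the scheduler reserves a full CPU per replica for a service that idles at ~200m.
resources:
  requests:
    cpu: '1'
    memory: 1Gi
  limits:
    cpu: '1'
    memory: 1Gi
```

Setting requests below limits moves the pod to the Burstable QoS class: it can still spike to the limit, but the scheduler only reserves the steady-state amount.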

Right-Sizing Methodology

Use actual metrics, not guesswork. Pull P95 and P99 CPU and memory usage from Prometheus over a 14-day window that includes peak traffic. Set requests at P95 steady-state usage plus a 20% buffer. Set limits at P99 peak usage plus a 30% buffer. This gives pods room to burst without reserving capacity that will never be used.
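If you run the Vertical Pod Autoscaler, its recommender performs a similar percentile analysis on observed usage. A sketch in recommendation-only mode, so it reports suggested requests without evicting pods (the workload name here is hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout            # hypothetical workload
  updatePolicy:
    updateMode: "Off"         # recommend only; never restart pods to resize them
```

Read the recommendations with `kubectl describe vpa checkout-vpa` and apply them manually, keeping the buffers described above.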

```yaml
# Right-sized resource spec based on observed metrics
resources:
  requests:
    cpu: 250m      # P95 steady-state: 200m + 20% buffer, rounded up
    memory: 384Mi  # P95 steady-state: 320Mi + 20%
  limits:
    cpu: '1'       # P99 peak: 750m + 30% buffer
    memory: 512Mi  # P99 peak: 400Mi + ~30%, rounded (OOMKill guard)
```

Autoscaler Configuration Traps

The Horizontal Pod Autoscaler (HPA) scales pods based on observed metrics. The Cluster Autoscaler adds or removes nodes to fit the pods. These two systems interact, and misconfiguring them creates cost traps.

Trap one: HPA target utilisation set too low. If you target 30% CPU utilisation, the HPA scales out aggressively, adding pods that are mostly idle. A target of 60–70% is more cost-efficient for most web services.

Trap two: Cluster Autoscaler with over-provisioned node groups. If your node group uses m5.2xlarge (8 vCPU) but most pods request 250m CPU, you waste capacity on every node. Mix node sizes or use Karpenter for just-in-time provisioning that matches pod requirements to instance types.
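A 70% CPU target in the `autoscaling/v2` API looks like this (the Deployment name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical workload
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # cost-efficient for most web services
```

Note that utilisation is measured against *requests*, so a right-sized request is a prerequisite: a 70% target against an inflated request still scales out too early.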

Trap three: scale-down cooldowns that are too conservative. The default Cluster Autoscaler scale-down delay is 10 minutes. For workloads with daily traffic patterns, nodes spun up for morning peak may not scale down until well after traffic drops. Tune the --scale-down-unneeded-time flag to match your traffic profile.
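In practice this is a flag on the cluster-autoscaler Deployment. A fragment, with illustrative values for a cluster whose traffic drops off sharply:

```yaml
# Fragment of the cluster-autoscaler container spec; timings are illustrative.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
  command:
  - ./cluster-autoscaler
  - --scale-down-unneeded-time=5m    # default is 10m
  - --scale-down-delay-after-add=5m  # default 10m; how long to wait after scale-up
```

Shorter timers reclaim nodes faster but increase churn; validate against your pod startup times before tightening them.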

Namespace Cost Allocation

Cost allocation in Kubernetes requires mapping resource consumption to teams or services. Use namespace-level resource quotas to set upper bounds, and deploy Kubecost or OpenCost for per-namespace cost reports. Without this, cost conversations in platform teams devolve into “who is using all the memory?” with no data to answer.
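A per-team quota is a small manifest. A sketch for a hypothetical `team-checkout` namespace, capping both requests (what is reserved) and limits (what can burst):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-checkout-quota   # hypothetical name
  namespace: team-checkout    # hypothetical namespace
spec:
  hard:
    requests.cpu: '20'        # total CPU the namespace may reserve
    requests.memory: 40Gi
    limits.cpu: '40'          # total CPU the namespace may burst to
    limits.memory: 80Gi
```

With quotas in place, Kubecost or OpenCost reports turn "who is using all the memory?" into a per-namespace number you can put in front of the owning team.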

Decision Framework

Right-sizing is high-impact and low-risk—start there. Then address autoscaler tuning. Then move to cost allocation. The sequence matters because right-sizing reduces the noise in your cost data, making autoscaler tuning and allocation conversations more productive. Do not buy a cost management platform before you have set requests correctly on your top 10 workloads.

Failure Modes

The most expensive failure: no resource requests at all. Pods with no requests or limits fall into the BestEffort QoS class and are evicted first under node pressure, causing cascading restarts. Teams then over-request to compensate, and the cycle continues.
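A LimitRange guards against this by injecting defaults into any container that omits them; a sketch with illustrative values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: team-checkout    # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:                  # applied when a container omits limits
      cpu: 500m
      memory: 256Mi
```

Defaults are a safety net, not a substitute for right-sizing: they keep pods out of BestEffort, but workloads should still get metrics-based requests.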

Memory limits set too tight cause OOMKills that look like application bugs. Engineers add memory to fix the “bug,” which increases cost without addressing the root cause. Always check OOMKill events before approving memory increases—the application may have a leak, not a capacity problem.

Kubernetes cost management is not a one-time exercise. Traffic patterns change, services get rewritten, and new workloads appear. Build a quarterly review cycle: pull request-vs-usage ratios, identify the top 10 over-provisioned workloads, and right-size them. The cluster will thank you with a smaller bill.