Mastering AWS Costs: A CTO's Guide to FinOps
By Shayan Ghasemnezhad on July 11, 2025
Cloud spend scales with product success—until it scales faster. A practical framework for cost visibility, accountability, and control.
Every growing product eventually hits the same inflection point: AWS spend starts climbing faster than revenue. The bill is not the problem—the gap between what you spend and what you understand about that spend is. FinOps gives engineering and finance a shared operating model for cost decisions without throttling delivery.
Why Cost Discipline Breaks Down
Cost management fails when it lives in a spreadsheet that finance maintains and engineering never sees. The people who make architecture decisions—engineers—rarely see cost data in context. The people who see the invoice—finance—cannot evaluate whether a given line item is waste or investment. FinOps bridges this by embedding cost awareness into engineering workflows, not by adding approval gates that slow down shipping.
The tension is real. Optimise too aggressively and you throttle delivery velocity—teams cannot ship features when every resource request triggers a procurement cycle. Ignore costs and you erode the margin that funds next quarter’s roadmap. The right approach treats cloud spend as an engineering metric: visible, measured, and discussed in the same forums where architecture choices happen.
The FinOps Operating Model
FinOps is not a tool or a dashboard. It is an operating model built on three iterative phases: Inform, Optimise, Operate. Most teams try to skip to Optimise—buying Savings Plans or right-sizing instances—without building the visibility that tells them whether those changes are working. That sequence is backwards.
Inform: Visibility comes before action. Tag every resource with team, service, and environment. Use AWS Cost Explorer and Cost and Usage Reports (CUR) piped to Athena for SQL access to granular billing data. Build dashboards that engineering leads actually check—if nobody looks at the data, the tooling is theatre.
Optimise: Start with the highest-impact, lowest-risk changes. Right-size instances using CloudWatch metrics: look for sustained CPU below 20% or memory below 30%. EC2 does not publish memory utilisation by default, so the memory check requires the CloudWatch agent on the instance. Adopt Savings Plans over Reserved Instances for flexibility. Clean up unattached EBS volumes, idle load balancers, and forgotten NAT gateways. A single unused NAT gateway in eu-west-1 costs roughly €33/month in hourly charges for doing nothing.
Operate: The hard part is not finding savings—it is keeping them. Set budget alerts per team. Include cost impact in architecture review templates. Run a monthly cost review where engineering leads present their team’s spend delta and explain the drivers. Make it a 15-minute standing agenda item, not a quarterly panic.
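The monthly review described above needs little more than last month's and this month's totals per team. The following is a minimal sketch under stated assumptions: the names `SpendDelta`, `spend_deltas`, and `needs_explanation` and the 10% threshold are hypothetical, and the per-team totals are assumed to come from whatever billing export you already have.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpendDelta:
    team: str
    previous: float  # last month's spend
    current: float   # this month's spend

    @property
    def change(self) -> float:
        return self.current - self.previous

    @property
    def change_pct(self) -> Optional[float]:
        # Undefined when the team had no spend last month.
        if self.previous == 0:
            return None
        return 100.0 * self.change / self.previous


def spend_deltas(previous: dict, current: dict) -> list:
    """Pair up per-team totals from two months, largest increase first."""
    teams = set(previous) | set(current)
    deltas = [SpendDelta(t, previous.get(t, 0.0), current.get(t, 0.0)) for t in teams]
    return sorted(deltas, key=lambda d: d.change, reverse=True)


def needs_explanation(delta: SpendDelta, threshold_pct: float = 10.0) -> bool:
    """Flag increases above the (hypothetical) threshold, and any brand-new spend."""
    if delta.change <= 0:
        return False
    pct = delta.change_pct
    return pct is None or pct > threshold_pct


if __name__ == "__main__":
    last_month = {"payments": 1000.0, "search": 500.0}
    this_month = {"payments": 1250.0, "search": 480.0, "ml": 300.0}
    for d in spend_deltas(last_month, this_month):
        flag = "explain" if needs_explanation(d) else "ok"
        print(f"{d.team:10s} {d.change:+8.2f}  {flag}")
```

Sorting by absolute increase rather than by percentage keeps the 15-minute agenda item focused on the changes that move the bill most.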
Tagging: Boring and Non-Negotiable
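Untagged resources are invisible resources. If you cannot attribute spend to a team or service, every cost conversation devolves into guesswork. Enforce tagging at the CI/CD layer: reject deployments that create resources without the required tag set. Use AWS Organizations tag policies to block non-compliant tag keys and values at the API level for the resource types listed under enforced_for. Tag policies do not by themselves require that a tag is present, which is why the CI/CD check still matters.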
{
  "tags": {
    "Team": {
      "tag_key": { "@@assign": "Team" },
      "enforced_for": {
        "@@assign": ["ec2:instance", "rds:db", "s3:bucket"]
      }
    },
    "Environment": {
      "tag_key": { "@@assign": "Environment" },
      "tag_value": {
        "@@assign": ["production", "staging", "dev"]
      }
    }
  }
}
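The CI/CD-side check can be sketched as a small validation function that a pipeline step runs against the tags a deployment is about to apply. This is an illustrative sketch, not a specific tool's API: `REQUIRED_TAGS`, `ALLOWED_VALUES`, and `tag_violations` are hypothetical names, and the keys and values simply mirror the policy above.

```python
# Hypothetical required tag set, mirroring the tag policy above.
REQUIRED_TAGS = ("Team", "Environment")
ALLOWED_VALUES = {"Environment": {"production", "staging", "dev"}}


def tag_violations(tags: dict) -> list:
    """Return human-readable problems; an empty list means the tags pass."""
    problems = []
    for key in REQUIRED_TAGS:
        # Tag keys are case-sensitive, so "team" does not satisfy "Team".
        value = (tags.get(key) or "").strip()
        if not value:
            problems.append(f"missing required tag: {key}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    import sys

    # Example input; a real pipeline would read tags from its deployment plan.
    problems = tag_violations({"Environment": "prod"})
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)
```

A non-zero exit code is enough for most pipelines to reject the deployment, which keeps the check independent of any particular CI system.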
Reserved Capacity Without Lock-In Risk
Savings Plans deserve a disciplined approach. Commit only to what you can forecast with 90% confidence. Start with Compute Savings Plans, which apply across instance families, regions, and even Fargate and Lambda, rather than EC2 Instance Savings Plans, which lock you to a specific instance family in a specific region. Review utilisation monthly and adjust commitment at each renewal window. A commitment that covers 40–60% of compute spend, with the rest left on demand, is a reasonable starting point for most Series A to Series C workloads.
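One way to make "90% confidence" concrete is to look at historical hourly compute spend and pick the level that was met or exceeded in 90% of hours. The sketch below assumes exactly that interpretation, which is mine rather than the article's; `baseline_hourly_spend` and `coverage_ratio` are hypothetical helper names, and the input is assumed to be on-demand-equivalent spend per hour.

```python
import math


def baseline_hourly_spend(hourly_spend: list, confidence_pct: int = 90) -> float:
    """Spend level that at least `confidence_pct` percent of hours met or exceeded."""
    if not hourly_spend:
        raise ValueError("need at least one hourly data point")
    if not 1 <= confidence_pct <= 100:
        raise ValueError("confidence_pct must be between 1 and 100")
    ordered = sorted(hourly_spend, reverse=True)
    # Smallest index k such that (k + 1) hours cover the requested share.
    k = math.ceil(confidence_pct * len(ordered) / 100) - 1
    return ordered[k]


def coverage_ratio(commitment: float, hourly_spend: list) -> float:
    """Commitment as a fraction of average hourly spend."""
    average = sum(hourly_spend) / len(hourly_spend)
    return commitment / average if average else 0.0


if __name__ == "__main__":
    # Ten sample hours with one traffic spike (illustrative numbers only).
    hours = [10, 12, 11, 9, 30, 14, 10, 13, 12, 8]
    base = baseline_hourly_spend(hours)
    print(base, round(coverage_ratio(base, hours), 2))
```

Savings Plans commitments are expressed in discounted spend per hour, so the on-demand baseline would still need to be converted using the plan's discount before purchase. If the resulting ratio lands far above the 40–60% range, that is a signal to recheck the forecast rather than to commit more.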
Decision Framework
When evaluating any cost change, ask three questions. First: what is the annual saving? Second: what is the engineering effort to implement? Third: what is the blast radius if it goes wrong? A change that saves €2,000 per year but requires two weeks of engineering time and risks a production incident is not worth it. A 10-minute tag cleanup that saves €500 per month is.
Prioritise changes that are reversible over those that are not. Turning off an unused staging environment is reversible in minutes. Migrating a database to a smaller instance class during peak season is not. Sequence matters.
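The questions above reduce to a payback calculation plus a reversibility flag. Here is a rough sketch of that arithmetic; the `CostChange` type, the €600 day rate, and the eight-hour working day are assumptions chosen for illustration, not figures from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class CostChange:
    name: str
    annual_saving_eur: float
    effort_days: float
    reversible: bool


def payback_months(change: CostChange, day_rate_eur: float = 600.0) -> float:
    """Months of savings needed to repay the engineering effort."""
    if change.annual_saving_eur <= 0:
        return float("inf")
    effort_cost = change.effort_days * day_rate_eur
    return 12.0 * effort_cost / change.annual_saving_eur


def rank(changes: list, day_rate_eur: float = 600.0) -> list:
    """Reversible changes first, then shortest payback."""
    return sorted(
        changes,
        key=lambda c: (not c.reversible, payback_months(c, day_rate_eur)),
    )


if __name__ == "__main__":
    candidates = [
        CostChange("two-week refactor", 2_000.0, 10.0, reversible=False),
        # 10 minutes of an 8-hour day; saves 500 EUR/month = 6,000 EUR/year.
        CostChange("tag cleanup", 6_000.0, 10 / 480, reversible=True),
    ]
    for c in rank(candidates):
        print(f"{c.name}: payback {payback_months(c):.2f} months")
```

With these assumptions the two-week change takes 36 months to pay for itself, while the cleanup pays back almost immediately, which matches the verdicts in the text.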
Implementation Notes
Start with CUR data in Athena. This gives you SQL access to line-item billing without third-party tooling. Build a QuickSight dashboard or use Grafana with the Athena plugin. Layer in AWS Budgets for alerting—one budget per team, one per environment. Third-party tools like Vantage or Kubecost add value once the basics are solid, not before. Do not buy a €40k/year FinOps platform when you have not tagged your resources yet.
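Once CUR rows are queryable, the first useful view is usually cost per team plus the share of spend that carries no team tag. The sketch below works on rows already fetched from Athena as dictionaries. It assumes the CUR columns `line_item_unblended_cost` and `resource_tags_user_team` (the latter exists only if a `Team` tag is activated as a cost allocation tag), and the helper names are hypothetical.

```python
from collections import defaultdict

UNTAGGED = "(untagged)"


def cost_by_team(rows: list) -> dict:
    """Sum unblended cost per team tag; empty or missing tags go to UNTAGGED."""
    totals = defaultdict(float)
    for row in rows:
        team = (row.get("resource_tags_user_team") or "").strip() or UNTAGGED
        # Query results may arrive as strings, so convert explicitly.
        totals[team] += float(row["line_item_unblended_cost"])
    return dict(totals)


def untagged_share(totals: dict) -> float:
    """Fraction of total spend that cannot be attributed to a team."""
    total = sum(totals.values())
    return totals.get(UNTAGGED, 0.0) / total if total else 0.0


if __name__ == "__main__":
    sample = [
        {"resource_tags_user_team": "payments", "line_item_unblended_cost": "40.0"},
        {"resource_tags_user_team": "search", "line_item_unblended_cost": "35.0"},
        {"resource_tags_user_team": "", "line_item_unblended_cost": "25.0"},
    ]
    totals = cost_by_team(sample)
    print(totals, f"untagged: {untagged_share(totals):.0%}")
```

Tracking the untagged share over time gives a simple measure of whether tagging is good enough to justify more advanced tooling.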
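# Quick check: daily average CPU for one instance over the last 14 days.
# Consistently low values (for example under 10%) suggest a right-sizing candidate.
# Note: `date -d` is GNU date syntax; on macOS use `date -u -v-14d` instead.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-0example1234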
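The command returns JSON with a `Datapoints` array, so the threshold check itself still has to be applied to the output. A minimal sketch of that step follows; `is_rightsizing_candidate`, the 10% threshold, and the minimum of seven datapoints are hypothetical choices, and the input is assumed to be the command's JSON output captured as a string.

```python
import json


def is_rightsizing_candidate(
    cli_output: str,
    threshold_pct: float = 10.0,
    min_datapoints: int = 7,
) -> bool:
    """True when every daily average CPU value is below the threshold."""
    datapoints = json.loads(cli_output).get("Datapoints", [])
    averages = [point["Average"] for point in datapoints]
    # Too little history (for example a new instance) is not enough evidence.
    if len(averages) < min_datapoints:
        return False
    return max(averages) < threshold_pct


if __name__ == "__main__":
    # Illustrative output shape with 14 quiet days.
    quiet = json.dumps({
        "Label": "CPUUtilization",
        "Datapoints": [{"Average": 3.0, "Unit": "Percent"} for _ in range(14)],
    })
    print(is_rightsizing_candidate(quiet))
```

Daily averages hide short peaks, so it is worth also checking the `Maximum` statistic before resizing anything that serves bursty traffic.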
Failure Modes
Over-optimisation is the silent killer. Teams that right-size too aggressively hit capacity walls during traffic spikes. Teams that commit to Savings Plans based on optimistic forecasts end up paying for unused capacity when product direction shifts. Leave headroom.
Tag drift is another common failure. Tagging works at deployment time, but resources created manually in the console or through ad-hoc scripts bypass enforcement. Run weekly tag compliance reports and surface violations in Slack—social pressure outperforms policy documents.
The subtlest failure: treating cost reduction as a one-time project rather than an ongoing practice. Costs creep back within three months if the operating cadence stops. Build the habit, not the initiative.
Cost discipline is a competitive advantage, not a constraint. Teams that understand their cost structure make better architecture decisions, negotiate vendor contracts from a position of knowledge, and protect the margin that funds product investment. The tools are commodity. The operating model is what separates teams that manage cost from teams that react to it.