DevOps
Pulled a churning B2B SaaS out of recurring outage cycles and onto a calm 99.99% baseline with IaC, multi-AZ failover, and a runbook the on-call rotation actually trusts.
99.993%
Uptime
A logistics-adjacent B2B SaaS with about 40 enterprise customers across DACH and the UK. The product was a single Rails monolith fronted by a small React app, deployed to a hand-managed cluster of EC2 instances. There was no Infrastructure as Code, no real observability beyond CloudWatch defaults, and incident response lived in a private Slack channel where tribal knowledge died with whoever was on holiday. The CEO had quietly stopped quoting uptime in sales calls because she could not back it up.
The challenge
The company was hitting outages that affected paid-tier customers every other week. The customer success team was burning hours apologising, two anchor accounts had renewal clauses tied to uptime, and the engineering team was deploying defensively because nobody was confident a normal release wouldn’t break production.
Approach
Reviewed six months of incidents to identify the top three failure modes — single-AZ database, manual deploys, and silent third-party degradations.
Codified the entire production estate in Terraform; nothing in production was allowed to be a snowflake again.
Moved Postgres to RDS Multi-AZ with read replicas and rebuilt the application tier behind an ALB spanning three AZs.
Added structured logging, RED-method dashboards, and SLO-based alerting that pages on burn rate, not on raw thresholds.
Wrote and rehearsed runbooks for the six most likely incidents, and introduced blameless post-mortems and a weekly reliability review.
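To make "pages on burn rate, not on raw thresholds" concrete, here is a minimal sketch of a two-window burn-rate check. It is an illustration, not the alerting rule used in this engagement. The `Window` type, the function names, the 99.99% target and the 14.4 threshold are all assumptions. A burn rate of 14.4 corresponds to spending 2% of a 30-day error budget in one hour, a commonly cited paging threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one look-back window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows;
    10.0 means the budget would be gone in a tenth of the SLO period.
    """
    if window.total == 0:
        return 0.0  # no traffic, so nothing was burned
    error_ratio = window.errors / window.total
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.0001 for 99.99%
    return error_ratio / budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float = 0.9999, threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so a spike that has ended does not page.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)
```

With these assumed numbers, a one-hour window of 100,000 requests with 200 errors burns at roughly 20 times the allowed rate. It pages only if the most recent few minutes are still failing at a similar rate.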
The solution
A high-availability rebuild across multiple AWS Availability Zones, with the entire estate moved into Terraform and a monitoring stack that pages on symptoms users feel rather than on raw infrastructure metrics. Incident response time dropped from hours to minutes because the runbooks finally matched reality.
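As a rough illustration of symptom-based monitoring, the sketch below derives the three RED signals (request rate, error ratio, duration) from structured request logs. The JSON field names `status` and `duration_ms`, and the choice of a nearest-rank 95th percentile, are assumptions made for the example and not the schema used in this engagement.

```python
import json
import math


def red_summary(log_lines, window_seconds):
    """Compute Rate, Errors and Duration from JSON request log lines.

    Each line is assumed to be a JSON object with a numeric `status`
    and a `duration_ms` field (hypothetical field names).
    """
    durations = []
    errors = 0
    for line in log_lines:
        event = json.loads(line)
        durations.append(float(event["duration_ms"]))
        if int(event["status"]) >= 500:  # count server-side failures only
            errors += 1

    total = len(durations)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}

    durations.sort()
    rank = math.ceil(0.95 * total)  # nearest-rank percentile, 1-based
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }
```

All three numbers describe what a user experiences at the edge of the service, which is why they make better paging signals than CPU or memory readings on individual hosts.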
Deployments became boring on purpose: rolling releases through a real CI/CD pipeline, automated database migrations behind feature flags, and a deploy freeze window for the customers with the highest-risk SLAs. The engineering team got their evenings back.
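The rolling-release idea can be sketched as a loop that updates a small batch, checks health, and halts before touching the rest of the fleet. This is a simplified model and not the pipeline built here. The `deploy` and `is_healthy` callables are hypothetical stand-ins for whatever the CI/CD system and load balancer health checks provide.

```python
from typing import Callable, List, Sequence


def rolling_release(instances: Sequence[str],
                    deploy: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    batch_size: int = 1) -> List[str]:
    """Deploy in batches and stop at the first unhealthy instance.

    Returns the instances that were updated and passed their health
    check. Raises RuntimeError as soon as one check fails, leaving the
    remaining instances on the previous version.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    updated: List[str] = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for instance in batch:
            deploy(instance)
        for instance in batch:
            if not is_healthy(instance):
                raise RuntimeError(
                    f"{instance} failed its health check; "
                    f"release halted after {len(updated)} healthy instances"
                )
            updated.append(instance)
    return updated
```

Keeping the batch small limits how much capacity a bad release can take out, at the cost of a slower rollout.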
Key metrics: Uptime, MTTR, Deploy Frequency
Reflections
Uptime is mostly a leadership problem dressed up as a technology problem. The tools weren’t exotic; the discipline was. Once the team trusted the runbooks and the SLOs, they shipped more, not less — five deploys a day became normal, and the incident channel went weeks at a time without a real-life page.