DevOps

B2B SaaS Achieves 99.99% Uptime

Pulled a B2B SaaS out of recurring outage cycles and onto a calm 99.99% baseline with IaC, multi-AZ failover, and a runbook the on-call rotation actually trusts.

Client: B2B SaaS Scale-up
Role: Platform Lead
Duration: 10 weeks

Context

A logistics-adjacent B2B SaaS with about 40 enterprise customers across DACH and the UK. The product was a single Rails monolith fronted by a small React app, deployed to a hand-managed cluster of EC2 instances. There was no Infrastructure as Code, no real observability beyond CloudWatch defaults, and incident response lived in a private Slack channel where tribal knowledge died with whoever was on holiday. The CEO had quietly stopped quoting uptime in sales calls because she could not back it up.

The challenge

The platform was suffering paid-tier outages every other week. Customer success was burning hours apologising, two anchor accounts had renewal clauses tied to uptime, and the engineering team was deploying defensively because nobody was confident that a normal release would leave production intact.

Approach

Reviewed six months of incidents to identify the top three failure modes — single-AZ database, manual deploys, and silent third-party degradations.
Codified the entire production estate in Terraform; nothing in production was allowed to be a snowflake again.
Moved Postgres to RDS Multi-AZ with read replicas and rebuilt the application tier behind an ALB across three AZs.
Added structured logging, RED-method dashboards, and SLO-based alerting that pages on burn rate, not on raw thresholds.
Wrote and rehearsed runbooks for the six most likely incidents; introduced blameless post-mortems and a weekly reliability review.
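
To make "pages on burn rate, not on raw thresholds" concrete, here is a minimal sketch of a two-window burn-rate check. It is an illustration, not the alerting configuration used in the engagement: the `Window` type, the function names, and the 14.4 threshold (a commonly cited starting point for a fast-burn page on a 30-day SLO) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 would consume exactly the whole budget over the SLO period.
    """
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. about 0.0001 for a 99.99% SLO
    error_ratio = window.errors / window.total
    return error_ratio / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn above the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so a problem that already recovered stays quiet.
    The default threshold is an assumed value, not one from the case study.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)
```

With a 99.99% target, a sustained 0.2% error ratio is a burn rate of roughly 20 and would page, while the same long-window figure with a recovered short window would not.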

The solution

A high-availability rebuild on AWS Multi-AZ with the entire estate moved into Terraform, and a monitoring stack that pages on symptoms users feel rather than on raw infra metrics. Incident response time dropped from hours to minutes because the runbooks finally matched reality.

Deployments became boring on purpose — a rolling release through a real CI/CD pipeline, automated database migrations behind feature flags, and a freeze window for the highest-risk customer SLAs. The engineering team got their evenings back.
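
A freeze window is easiest to trust when the pipeline enforces it rather than a calendar reminder. The sketch below shows one way such a gate could look as a pre-deploy step. The `FreezeWindow` type, the function names, and the idea of running this before a release are assumptions for illustration; the case study does not describe how the freeze was implemented.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FreezeWindow:
    """A period during which releases are blocked (hypothetical type)."""
    name: str
    start: datetime  # inclusive, timezone-aware
    end: datetime    # exclusive, timezone-aware


def active_freezes(now: datetime, windows: list[FreezeWindow]) -> list[FreezeWindow]:
    """Return every freeze window that covers the given moment."""
    return [w for w in windows if w.start <= now < w.end]


def can_deploy(now: datetime, windows: list[FreezeWindow]) -> bool:
    """A release may proceed only when no freeze window is active."""
    return not active_freezes(now, windows)
```

A pipeline step could call `can_deploy(datetime.now(timezone.utc), windows)` and fail the job with the names of the active freezes, which keeps the rule visible in the same place as every other release check.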

Outcome

99.993%

Uptime

4 mins

MTTR

5x/day

Deploy Frequency
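
For a sense of scale, the short calculation below converts an availability percentage into a downtime allowance. The measurement period behind the 99.993% figure is not stated here, so the 30-day and 365-day periods are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime permitted at a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for pct in (99.99, 99.993):
    monthly = allowed_downtime_minutes(pct, 30)
    yearly = allowed_downtime_minutes(pct, 365)
    print(f"{pct}%: {monthly:.2f} min per 30 days, {yearly:.1f} min per year")

# 99.99%: 4.32 min per 30 days, 52.6 min per year
# 99.993%: 3.02 min per 30 days, 36.8 min per year
```

At these levels a single incident resolved in about four minutes uses most of a 30-day budget at 99.99%, which is why recovery time matters as much as incident count.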

Tech stack

AWS
Terraform
RDS Multi-AZ
ECS
Datadog
PagerDuty
GitHub Actions

Reflections

Uptime is mostly a leadership problem dressed up as a technology problem. The tools weren’t exotic; the discipline was. Once the team trusted the runbooks and the SLOs, they shipped more, not less: five deploys a day became normal, and the incident channel went weeks at a time without a real page.
