Incident Response for Startups: What to Have Ready Before the First Page
By Shayan Ghasemnezhad on August 14, 2025 · 3 min read
Your first major incident will happen. The difference between 20 minutes of downtime and 4 hours is what you prepared before the page fired.
Startups do not think about incident response until they have an incident. Then everyone is in a Slack channel, nobody knows who is in charge, three people are making changes to production simultaneously, and the CEO is asking “what is happening?” every five minutes. The incident takes four hours to resolve and two days to recover from. Most of that time is coordination failure, not technical complexity.
The Minimum Viable Incident Process
You do not need PagerDuty’s full incident management framework. You need four things: severity levels, a response checklist, a communication template, and a post-incident review process. These can live in a single Notion page or a markdown file in your repo. What matters is that everyone knows they exist and where to find them.
Severity Levels
Define three severity levels. More than three invites debate about classification during an incident, when you should be fixing it.
- SEV1 — Service down: Core product functionality is unavailable for all or most users. All hands on deck. CEO gets notified. External communication within 30 minutes.
- SEV2 — Degraded: Service is impaired but not fully down. Some users affected, or a non-critical feature is out. Engineering lead notified. External communication if user-facing impact exceeds 30 minutes.
- SEV3 — Minor: Internal issue, no user impact. Tracked as a bug. Fixed in normal sprint work.
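One way to keep these rules close to the tooling is to encode them as data. The following is a minimal sketch, not a prescribed implementation; the `SEVERITY_POLICY` structure and `policy_for` helper are illustrative names, and the notification targets mirror the levels above.

```python
# Severity policy encoded as data, so the rules live next to the
# alerting pipeline rather than only in a doc. Names are illustrative.
SEVERITY_POLICY = {
    "SEV1": {
        "description": "Core functionality unavailable for all or most users",
        "notify": ["engineering", "ceo"],
        "external_comms_within_minutes": 30,
    },
    "SEV2": {
        "description": "Service impaired but not fully down",
        "notify": ["engineering-lead"],
        # External comms only if user-facing impact exceeds 30 minutes.
        "external_comms_within_minutes": 30,
    },
    "SEV3": {
        "description": "Internal issue, no user impact",
        "notify": [],  # tracked as a normal bug, fixed in sprint work
        "external_comms_within_minutes": None,
    },
}

def policy_for(severity: str) -> dict:
    """Look up the response policy for a declared severity level."""
    return SEVERITY_POLICY[severity]
```

A bot or on-call script can then read `policy_for("SEV1")["notify"]` instead of someone remembering who to page mid-incident.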
The Response Checklist
When an alert fires or a user reports an issue, the first responder follows a checklist:
- Acknowledge: Claim the incident in the Slack channel. “I am looking at this.”
- Classify: Assign a severity level.
- Communicate: Post a status update to the stakeholder channel.
- Investigate: Check dashboards, logs, recent deployments.
- Mitigate: Prioritise restoring service over finding root cause. Roll back if a recent deployment is suspect.
- Resolve: Confirm service is restored. Post final status update.
- Follow up: Schedule post-incident review within 48 hours.
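The checklist above can be sketched as a small tracker that enforces the ordering, so a responder cannot jump to investigation before acknowledging and classifying. This is a hypothetical sketch; the `Incident` class and step names are illustrative.

```python
# Minimal tracker for the response checklist: steps must be completed
# in order. Step names mirror the list above; the class is illustrative.
CHECKLIST = [
    "acknowledge", "classify", "communicate",
    "investigate", "mitigate", "resolve", "follow_up",
]

class Incident:
    def __init__(self, title: str):
        self.title = title
        self.completed: list[str] = []

    def complete(self, step: str) -> None:
        """Mark the next checklist step done; reject out-of-order steps."""
        expected = CHECKLIST[len(self.completed)]
        if step != expected:
            raise ValueError(f"next step is {expected!r}, not {step!r}")
        self.completed.append(step)

    @property
    def resolved(self) -> bool:
        return "resolve" in self.completed
```

Even if you never automate this, walking the list in order is the point: mitigation before root-cause analysis, communication before deep investigation.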
Communication Templates
Writing clear communication under stress is hard. Templates remove that burden. Prepare three templates: initial acknowledgement (“We are aware of an issue affecting [X]. We are investigating.”), progress update (“We have identified the cause as [X]. Estimated resolution: [time].”), and resolution (“The issue affecting [X] has been resolved. [Brief explanation].”).
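The bracketed fields translate directly into format strings. A minimal sketch, with the template keys and `render` helper as illustrative names:

```python
# The three status templates as plain format strings. The placeholder
# names (scope, cause, eta, explanation) are illustrative.
TEMPLATES = {
    "initial": "We are aware of an issue affecting {scope}. We are investigating.",
    "progress": "We have identified the cause as {cause}. Estimated resolution: {eta}.",
    "resolution": "The issue affecting {scope} has been resolved. {explanation}",
}

def render(kind: str, **fields: str) -> str:
    """Fill a status template; raises KeyError if a required field is missing."""
    return TEMPLATES[kind].format(**fields)
```

The failure mode this prevents: a KeyError at send time is better than a published update with a blank where the affected system should be.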
Post these to your status page, Slack, and anywhere customers check. The goal is to reduce inbound “is it down?” queries, which consume responder attention during the incident.
Post-Incident Reviews
A post-incident review (PIR) is not a blame session. It is a learning session. Cover four questions: What happened? (Timeline of events.) Why did it happen? (Contributing factors, not root cause—complex systems rarely have a single root cause.) How did we respond? (What worked, what did not.) What will we change? (Specific, assignable action items with deadlines.)
Write the PIR in a shared document. Publish it to the team. The goal is institutional learning: the next incident should be easier, faster, or prevented entirely because of what you learned from this one.
Implementation Notes
Set up basic alerting before you need it: uptime checks (Pingdom, Better Uptime, or a CloudWatch synthetic canary), error rate monitoring (Sentry, Datadog), and infrastructure alerts (CPU, memory, disk). Route alerts to a Slack channel. Assign an on-call rotation—even if it is just two engineers alternating weeks.
```yaml
# Minimal CloudWatch alarm for API error rate.
# Assumes an SNS topic resource named AlertTopic is defined in the same
# template and routed to Slack (e.g. via AWS Chatbot or a Lambda).
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ApiErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: api-5xx-rate
      Namespace: AWS/ApiGateway
      MetricName: 5XXError
      Statistic: Sum
      Period: 300            # 5-minute windows
      EvaluationPeriods: 2   # alarm after two consecutive breaches
      Threshold: 10          # more than 10 5xx responses per window
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching  # no traffic is not an outage
      AlarmActions:
        - !Ref AlertTopic
```
Failure Modes
The most common failure: alert fatigue. Too many alerts, most of which are not actionable, train the team to ignore them. Every alert should have a corresponding runbook entry that explains what to check and what to do. If you cannot write a runbook for an alert, the alert should not exist.
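The "no runbook, no alert" rule can be made mechanical: refuse to register an alert that does not carry a runbook entry. A hypothetical sketch; `register_alert` and `RUNBOOKS` are illustrative names, not a real monitoring API.

```python
# Guard against alert fatigue: an alert cannot be registered without a
# non-empty runbook entry. Names here are illustrative.
RUNBOOKS: dict[str, str] = {}

def register_alert(alert_name: str, runbook: str) -> None:
    """Register an alert only if it comes with a runbook entry."""
    if not runbook.strip():
        raise ValueError(
            f"alert {alert_name!r} has no runbook; do not create it"
        )
    RUNBOOKS[alert_name] = runbook
```

In practice this might be a CI check on your alerting config rather than runtime code, but the constraint is the same: every alert ships with instructions for the person it wakes up.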
The second failure: skipping post-incident reviews because the team is too busy. The cost of not learning from incidents is repeat incidents. A 30-minute PIR that produces two action items is worth more than a week of feature work if it prevents the next four-hour outage.
Your first major incident will happen. The preparation you do now—severity levels, checklists, templates, review process—is the difference between a controlled response and organised chaos. An afternoon of preparation saves days of recovery.