By Arjun Mehta
It's 2am. Your pager goes off. A critical system is down. Users can't process payments. Revenue is draining by the minute. Your team scrambles to fix it.
This is incident management in action.
Incident management isn't just about fixing things fast when they break. It's about having a process for:
- Detecting problems early (before they impact many users)
- Responding efficiently (getting the right people working on the right thing)
- Recovering quickly (minimizing the time systems are down)
- Learning thoroughly (understanding what happened and preventing it next time)
Teams with good incident management can recover from an outage in minutes. Teams without it might take hours. The difference is process, not heroics.
The Incident Management Lifecycle
Detection
You can't fix what you don't know about. Detection means knowing something is broken before customers complain.
Good detection systems monitor:
- Error rates (are requests failing?)
- Latency (are requests slow?)
- Business metrics (are payments processing?)
- Health checks (is the database responsive?)
When metrics exceed thresholds, alerts fire. A human (on-call engineer) is notified.
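One of those signals, the health check, can be sketched in a few lines. This is a minimal illustration, not any particular monitoring tool's API: the probe callable and the timeout value are hypothetical stand-ins for a real dependency check.

```python
import time

def health_check(probe, timeout_s: float = 1.0) -> bool:
    """Run a probe callable; report healthy only if it succeeds within the timeout."""
    start = time.monotonic()
    try:
        probe()                      # e.g. a cheap query against the database
    except Exception:
        return False                 # the dependency errored: unhealthy
    return (time.monotonic() - start) <= timeout_s  # slow counts as unhealthy too

# A fast, succeeding probe is healthy; a failing one is not.
print(health_check(lambda: None))    # -> True
print(health_check(lambda: 1 / 0))   # -> False
```

A real system would run checks like this on a schedule and feed the results into the alerting pipeline described next.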
Response
Someone gets paged. They:
- Acknowledge the alert (so others know someone is handling it)
- Assess severity (is this critical or minor?)
- Gather information (logs, metrics, recent changes)
- Escalate if needed (call in experts if they can't handle it alone)
- Communicate (update status page, notify stakeholders)
Mitigation
Fast mitigation is more important than perfect fixes.
If your new deployment broke something, roll it back. You fix the bug and redeploy later.
If a dependency is failing, serve from a cache or fallback. You restore normal behavior later.
If a database is overloaded, enable rate limiting. You scale it later.
The goal: restore service as fast as possible. Get users unblocked. Then fix the root cause.
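The cache-fallback mitigation above can be sketched like this. It's an illustrative pattern, not a production implementation: `fetch_rates` and the module-level cache are hypothetical stand-ins for a real dependency and a real cache layer.

```python
_cache: dict[str, float] = {}

def get_exchange_rate(currency: str, fetch_rates) -> float:
    """Try the live dependency; on failure, serve the last known value."""
    try:
        rate = fetch_rates(currency)   # live call to the dependency
        _cache[currency] = rate        # refresh the fallback data on success
        return rate
    except Exception:
        if currency in _cache:
            return _cache[currency]    # stale but usable: users stay unblocked
        raise                          # no fallback available; surface the error

# First call succeeds and primes the cache; second call survives an outage.
print(get_exchange_rate("EUR", lambda c: 1.08))   # -> 1.08
print(get_exchange_rate("EUR", lambda c: 1 / 0))  # dependency down -> 1.08
```

Serving stale data is often acceptable for minutes; failing every request rarely is. That trade-off is the essence of mitigation-before-fix.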
Recovery
Service is restored. Users can use the system again. Revenue is flowing.
Now you take stock: what happened, how did we detect it, what did we do to recover?
Post-Incident Review
This is where learning happens. A few hours after the incident, the team gathers to discuss:
Timeline: When did the problem start? When did we detect it? When was it fixed?
Root Cause: Why did this happen? A bug in new code? Unexpected load? Dependency failure?
Impact: How many users were affected? How long? How much revenue was lost?
Response: Did our alerting work? Did people respond quickly? What went well? What was confusing?
Follow-ups: What can we do to prevent this next time?
Critically, this review is blameless. The goal isn't to find who caused the problem. The goal is to find how our systems, processes, and practices failed. Because if a human can make a mistake, your systems should catch it.
Building for Incident Management
1. Monitoring and Alerting
You need visibility into your system:
HTTP Error Rate > 5% for 1 minute
-> Alert: "Error Rate High"
Payment Processing Latency > 5 seconds
-> Alert: "Payment Slowness"
Database Replication Lag > 10 seconds
-> Alert: "Replication Lag High"
Alerts should be:
- Actionable: If the alert fires, someone can do something about it
- Reliable: Actual problems trigger alerts, but false alarms are rare
- Specific: Don't just say "something is wrong." Say "database replication lag is 30 seconds."
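The rules above, with the specificity requirement baked into the messages, can be sketched as a small evaluation loop. Metric names, thresholds, and message templates here are illustrative, not from any specific alerting tool.

```python
def check_thresholds(metrics: dict) -> list[str]:
    """Return a specific, actionable alert message for each rule a metric violates."""
    rules = [
        # (metric key, threshold, message template)
        ("error_rate", 0.05, "Error rate high: {:.1%} of requests failing"),
        ("p99_latency_s", 5.0, "Payment slowness: p99 latency {:.1f}s"),
        ("replication_lag_s", 10.0, "Replication lag high: {:.0f}s behind"),
    ]
    alerts = []
    for key, threshold, template in rules:
        value = metrics.get(key, 0.0)
        if value > threshold:
            alerts.append(template.format(value))  # includes the actual number
    return alerts

# One metric over threshold fires one alert that says exactly what is wrong.
print(check_thresholds({"error_rate": 0.08, "p99_latency_s": 1.2}))
# -> ['Error rate high: 8.0% of requests failing']
```

Note the messages carry the measured value, not just "something is wrong" — that's the specificity rule in practice.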
2. On-Call Rotations
Someone is always on-call to respond to alerts. Design rotations to avoid burnout:
- Each engineer is on-call one week per month
- During on-call, they should be responsive but not constantly working
- If on-call is constantly firing, you have reliability problems (fix those, don't just accept being paged constantly)
3. Runbooks
When an alert fires, the on-call engineer needs guidance:
Alert: "Payment Processing Latency High"
1. Check the payment processor status page
- If they're down, acknowledge it on your status page and wait for them to recover
2. Check payment service logs
- Look for errors connecting to payment processor
3. Check database performance
- Query slow log for slow queries
4. If payment service is healthy:
- Increase timeout slightly
- Check downstream services for backlog
5. If nothing helps:
- Page payment service owner (see contacts)
Runbooks don't need to be perfect. They just need to guide someone unfamiliar with a system through diagnosis and mitigation.
4. War Rooms
For critical incidents, gather the relevant people in a chat room or call:
- Incident Commander: Coordinates the response, gathers information, makes calls
- Technical Leads: Diagnose and fix the problem
- Communications Lead: Updates status page and customers
Everyone has a role. Clear communication prevents chaos.
5. Status Page
Customers need to know what's happening:
Status Page:
2:15am - "We're aware of an issue with payment processing. Engineers are investigating. More updates in 5 minutes."
2:20am - "We've identified the issue and are working on a fix. Estimated recovery in 10 minutes."
2:28am - "Issue resolved. All systems operating normally. Incident review starting."
Post updates at a regular cadence during an incident, and keep whatever cadence you promise. Honesty is important. Users can handle problems, but they hate not knowing what's happening.
Learning from Incidents
The post-incident review is where you get better.
Good post-incident reviews:
- Are blameless (focus on systems, not people)
- Identify root causes, not just symptoms
- Generate action items
- Are documented
- Track follow-ups
Example:
Incident: Payment Processing Outage (22 minutes)
Timeline:
- 2:15am: Alerts fire for high error rate
- 2:18am: On-call investigates, finds the payment processor unreachable
- 2:22am: Escalates to payment processor team
- 2:35am: Payment processor recovers
- 2:37am: Our system recovers
Root Cause:
Payment processor had an internal issue causing them to drop connections. This was their problem, not ours.
But we had no fallback. When they were down, our service failed completely.
Impact:
- 22 minutes of outage
- 500 transactions failed
- Estimated $50k revenue impact
Follow-ups:
1. Implement circuit breaker for payment processor calls
- If processor is down, fail gracefully and queue transactions
2. Add monitoring for processor health
- Alert before customers are impacted
3. Add retry logic with exponential backoff
- Many transient failures resolve in seconds
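Follow-up 3 is simple enough to sketch directly. This is an illustrative retry helper, not the team's actual fix: the attempt count and delays are placeholder values, and a real version would retry only on errors known to be transient.

```python
import time

def call_with_retries(operation, attempts: int = 4, base_delay_s: float = 0.5):
    """Retry a flaky call, doubling the wait between attempts (0.5s, 1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                                   # out of retries; let callers handle it
            time.sleep(base_delay_s * (2 ** attempt))   # exponential backoff

# A call that fails twice, then succeeds, completes transparently.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("processor dropped the connection")
    return "charged"

print(call_with_retries(flaky, base_delay_s=0.01))  # -> charged
```

The backoff matters: retrying instantly can hammer a recovering dependency, which is also why follow-up 1 pairs retries with a circuit breaker that stops calling a processor that is clearly down.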
Follow-ups need owners and deadlines. Otherwise they're just ideas.
Preventing Incidents
The best incident is the one that never happens.
Chaos Engineering: Deliberately break things in production (in a controlled way) to find weaknesses.
Load Testing: Before peak season, simulate heavy load and find limits.
Canary Deployments: Deploy to 1% of servers first, monitor for issues before deploying to all.
Feature Flags: Deploy features hidden, gradually roll out. If something breaks, disable it instantly.
Runbook Testing: Periodically run through runbooks to ensure they still work. Assign someone.
Regular Drills: Simulate incidents (like payment processor going down) and practice response.
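The feature-flag idea above — deploy hidden, roll out gradually, disable instantly — can be sketched with deterministic user bucketing. Flag names and percentages here are illustrative; real systems use a flag service rather than an in-process dict.

```python
import hashlib

FLAGS = {"new_checkout": 10}  # feature -> percent of users enabled

def is_enabled(feature: str, user_id: str) -> bool:
    """Hash user+feature into a stable 0-99 bucket so rollout is consistent per user."""
    percent = FLAGS.get(feature, 0)  # unknown flags default to off
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# The same user always gets the same answer for a given rollout level.
print(is_enabled("new_checkout", "user-42"))
FLAGS["new_checkout"] = 0                      # kill switch: off for everyone
print(is_enabled("new_checkout", "user-42"))   # -> False
```

Because the bucket is derived from a hash rather than a random draw, raising the percentage only adds users — nobody flips back and forth between rollout levels.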
On-Call Culture
On-call is stressful. Good teams make it manageable:
Reasonable alerting: Only page for actual emergencies. Warn on trends, but don't page constantly.
Compensation: If on-call is burdensome, compensate people (extra pay, comp time, etc.).
Support: Don't leave on-call engineers alone. Senior engineers should mentor juniors through their first rotations.
Sustainable pace: If incidents are constant, the problem isn't on-call rotation. It's system reliability. Fix that.
Blameless culture: Engineers should feel safe reporting incidents. If they fear blame, they'll hide problems.
Tools for Incident Management
Alerting:
- PagerDuty: On-call scheduling and alert routing
- Opsgenie: Similar to PagerDuty
- Alertmanager: Open-source, integrates with Prometheus
Communication:
- Slack: Real-time team communication
- Status page tools: Statuspage.io, Incident.io
Post-Incident:
- Google Docs: Simple, works for writing reviews
- Incident.io: Dedicated incident tracking and reviews
Building Reliability Over Time
Good incident management isn't about being perfect. It's about:
- Detecting problems fast
- Responding efficiently
- Recovering quickly
- Learning from every incident
Over time, this process identifies weak spots. You fix them. You have fewer incidents. On-call becomes less burdensome.
Teams that excel at incident management end up with remarkably reliable systems, not because they're perfect, but because they learn from every failure.
Frequently Asked Questions
Q: How often should we have incidents?
A: It depends on your system and risk tolerance. A critical payment system might aim for 99.99% uptime (roughly 53 minutes of downtime per year). A less critical system might accept 99.9% (about 8.8 hours per year). Some incidents are inevitable. The goal is minimizing them and recovering fast.
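The arithmetic behind those uptime targets is worth seeing once: the downtime budget is just the complement of the target applied to a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for target in (0.999, 0.9999):
    budget_min = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.2%} uptime -> {budget_min:.0f} minutes of downtime/year")
# 99.90% uptime -> 526 minutes of downtime/year
# 99.99% uptime -> 53 minutes of downtime/year
```

Each extra nine cuts the budget by a factor of ten, which is why 99.99% systems need fast detection and recovery, not just fewer incidents.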
Q: Should everyone be on-call?
A: Ideally, yes. Developers should understand production. But junior engineers might not be ready. Structure your on-call so experienced engineers mentor less experienced ones.
Q: What's the difference between an incident and a bug?
A: A bug is a defect in your code. An incident is a user-facing problem in production. A bug in code that isn't live isn't an incident; a bug that impacts users is.
Q: How long should post-incident reviews take?
A: Start within 24 hours while details are fresh. Initial review might be 30 minutes. Deep dives on complex incidents might take longer. Document it, assign follow-ups, move on.