By Arjun Mehta
I remember a production incident: a service was timing out, causing cascading failures. Resolution time: 45 minutes. Post-mortem question: "Why did this cause cascading failures?"
Answer: "Because the service that depends on this one doesn't have a timeout."
This wasn't about monitoring or alerting (both were good). It was about system design. The system allowed a failure in one part to break everything downstream. The post-mortem identified the structural issue, and we fixed it. The next time that service timed out, the impact was contained.
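A minimal sketch of that kind of fix in Python, assuming the dependent service calls its upstream over HTTP with the requests library; the service name, endpoint, and two-second budget are illustrative, not our actual values.

```python
import requests

def fetch_profile(user_id: str) -> dict | None:
    """Call the upstream service with an explicit timeout so a hang there
    fails fast here instead of stalling every request behind it."""
    try:
        resp = requests.get(
            f"https://profile-service.internal/users/{user_id}",  # hypothetical dependency
            timeout=2.0,  # seconds; illustrative budget, tune per dependency
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # Degrade gracefully (serve cached or partial data) instead of cascading.
        return None
```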
A good incident management process doesn't prevent incidents. It accelerates resolution and prevents recurrence.
Incident Management in 60 Seconds
Incidents happen. Good incident management: detects quickly, communicates clearly, resolves fast, and learns thoroughly. The full lifecycle: detection (alert fires), triage (how bad?), communication (who needs to know?), resolution (fix it), and post-mortem (what should change so this doesn't happen again?). Most teams are good at the first four. Post-mortems are where learning happens, and they're often neglected.
Why Incident Management Matters Now
First: impact. Fast resolution means less downtime, less customer impact, less business loss.
Second: engineering culture. How you handle incidents shapes the team. Blame-focused post-mortems create fear and hiding. Blameless post-mortems create psychological safety and learning.
Third: predictability. Good incident management is predictable. You know your on-call engineer will triage in 5 minutes, resolve in 30 (or escalate), and schedule a post-mortem. This predictability reduces stress.
The Incident Lifecycle
Detection: An alert fires. A customer reports a problem. A metric exceeds a threshold. The earlier detection happens, the better. Ideally automated alerts, not customer reports.
Triage: What's broken? How bad? Is it affecting production? Is it affecting customers? Triage should happen in minutes. Severity levels: "SEV1 - user-facing, no workaround" (emergency), "SEV2 - user-facing, workaround exists," "SEV3 - not user-facing," "SEV4 - cosmetic."
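Severity definitions stick better when they're encoded where the tooling can use them. A minimal sketch, assuming a Python helper inside the incident tooling; the enum mirrors the definitions above, and the paging rule is an illustrative policy, not a prescription.

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1  # user-facing, no workaround: emergency
    SEV2 = 2  # user-facing, workaround exists
    SEV3 = 3  # not user-facing
    SEV4 = 4  # cosmetic

def triage(user_facing: bool, workaround_exists: bool, cosmetic_only: bool) -> Severity:
    """Map the triage questions onto a severity level."""
    if cosmetic_only:
        return Severity.SEV4
    if not user_facing:
        return Severity.SEV3
    return Severity.SEV2 if workaround_exists else Severity.SEV1

def page_immediately(sev: Severity) -> bool:
    # Illustrative policy: SEV1 and SEV2 page the on-call engineer right away.
    return sev <= Severity.SEV2
```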
Communication: Everyone who might be needed is notified (on-call engineer, on-call manager, relevant service owners). Updates every 10 minutes if it's not resolved. Status page updated (if user-facing).
Resolution: Root cause found, fix applied, verification that it's fixed.
Post-mortem: Scheduled within 48 hours (while it's fresh), or within a week at the latest (later is better than never, but sooner is better).
What Makes Incident Response Fast and Effective
Clear severity levels. Everyone knows what SEV1 means - no ambiguity.
On-call rotations that don't burn out engineers. A 2-week rotation is fine. A 24/7 rotation that's also your day job is not.
Runbooks that are maintained. A runbook saying "if service X fails, follow these steps" should be tested. Runbooks that are never tested are worse than no runbook - you follow it, it doesn't work, now you're panicking and distracted.
Communication channels that reduce noise. Too many channels (Slack, email, PagerDuty, war room, incident tracking) mean messages get lost. Use one channel for incident discussion, with a clear owner communicating to the status page and stakeholders.
Escalation paths. If the on-call engineer can't resolve in 30 minutes, escalate. Don't let one person spin for 2 hours.
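The 30-minute escalation rule works best as a timer, not a judgment call made under stress. A minimal sketch, assuming the incident tooling records when the incident started; the paging print at the end is a hypothetical stand-in for a real pager call.

```python
from datetime import datetime, timedelta, timezone

ESCALATE_AFTER = timedelta(minutes=30)  # from the guideline above

def should_escalate(started_at: datetime, resolved: bool) -> bool:
    """Escalate any unresolved incident older than the threshold."""
    return not resolved and datetime.now(timezone.utc) - started_at >= ESCALATE_AFTER

# Example: an incident opened 45 minutes ago and still unresolved.
started_at = datetime.now(timezone.utc) - timedelta(minutes=45)
if should_escalate(started_at, resolved=False):
    print("Escalate: page the secondary on-call / on-call manager")  # stand-in for paging
```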
The Post-Mortem: The Most Valuable and Most Neglected Part
A post-mortem has one job: identify what should change to prevent recurrence. Not to assign blame, and not even to identify the root cause (though that helps). The job is prevention.
Most post-mortems stop at "root cause." Ours was "service timeout." That's not actionable. Why did the timeout cause failures? "Because downstream services don't have timeouts." Now it's actionable.
A good post-mortem answers four questions:
What made this hard to detect? (Our alert threshold was too high. We detected 15 minutes after customers saw problems.)
What made it hard to diagnose? (No logs showing that service A timed out. We had to guess.)
What made it hard to resolve? (Rollback required database migrations. Took 10 minutes just to execute the rollback plan.)
What structural change prevents recurrence? (Add timeout to downstream services. Lower alert threshold. Add better logging. Simplify rollback process.)
Most post-mortems answer the first three. The fourth question is where prevention happens.
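One way to keep that fourth question from being skipped: make it a required field in the post-mortem template. A sketch using a Python dataclass; the field names simply mirror the four questions, and the example values come from the incident above.

```python
from dataclasses import dataclass, field

@dataclass
class PostMortem:
    summary: str
    hard_to_detect_because: str    # question 1
    hard_to_diagnose_because: str  # question 2
    hard_to_resolve_because: str   # question 3
    structural_changes: list[str] = field(default_factory=list)  # question 4

    def is_complete(self) -> bool:
        # No structural change identified means the post-mortem isn't done.
        return bool(self.structural_changes)

pm = PostMortem(
    summary="Service timeout cascaded to downstream failures",
    hard_to_detect_because="Alert threshold too high; detected 15 minutes after customers",
    hard_to_diagnose_because="No logs showing the upstream timeout",
    hard_to_resolve_because="Rollback required database migrations",
    structural_changes=["Add timeouts to downstream services", "Lower alert threshold"],
)
assert pm.is_complete()
```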
How to Run a Blameless Post-Mortem
Assume everyone acted with good intent, given the information they had at the time. The engineer who deployed the code thought it was safe. It wasn't. That's not failure - that's the human condition. The goal is systems that catch these problems.
Focus on systemic issues, not individual failures. "The engineer didn't notice the alert" is a person problem. "The alert was too low-priority and was buried in notification noise" is a system problem. Fix the system.
Involve everyone. The engineer who fixed it, the on-call manager, engineers from dependent services. Diverse perspectives catch system issues.
Document and share. The post-mortem is only valuable if it results in action and if the lessons are shared.
Common Incident Patterns and What They Reveal
Cascading failures: One service fails, takes down everything downstream. Reveals: lack of timeouts, lack of circuit breakers, tight coupling. (A minimal circuit-breaker sketch follows this list.)
Database bloat: Queries slow down, the service times out, users see downtime. Reveals: no data retention policy, no query performance testing, database design issues.
Config errors: Change to config file breaks the system. Reveals: config changes not tested, no gradual rollout for config changes, no way to quickly revert.
Resource exhaustion: Memory or CPU fills up, service crashes. Reveals: no resource limits, no autoscaling, no alerting on resource usage.
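The cascading-failure pattern usually has a two-part fix: timeouts (as in the sketch near the top) plus a circuit breaker, so a dependency that is already down stops being called at all. A minimal count-based sketch; real implementations add thread safety and half-open probing, which are deliberately omitted here.

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; allow calls again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: call the dependency normally
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            self.opened_at = None  # cooldown over: close the circuit and try again
            self.failures = 0
            return True
        return False  # circuit open: fail fast instead of piling onto a dead dependency

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```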
How Glue Helps
When incident response starts, Glue accelerates diagnosis by showing: which code changed recently, which services are affected, what the dependencies are, and who owns the affected code. Instead of "let me grep for where this service is called," Glue tells you immediately.
This reduces MTTR. Faster diagnosis means faster resolution.
Glue also helps with post-mortems. "What made this hard to diagnose?" Glue shows whether the code change that caused the incident was visible in monitoring and alerting. If a service changed but no metrics changed, that's a signal, and adding the missing visibility becomes a preventive action.
Frequently Asked Questions
Q: How much monitoring is too much?
A: Monitor what matters. User impact (latency, errors, throughput), system health (CPU, memory, disk), and business metrics (revenue, feature usage). Avoid monitoring every internal metric - that's noise.
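A sketch of "monitor what matters" for the user-impact slice, assuming the Python prometheus_client library; the metric names are illustrative.

```python
from prometheus_client import Counter, Histogram

# User impact: latency, errors, throughput.
REQUEST_LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")
REQUEST_ERRORS = Counter("http_request_errors_total", "Requests that ended in an error")
REQUESTS_TOTAL = Counter("http_requests_total", "Total requests served (throughput)")

def handle_request(process) -> None:
    REQUESTS_TOTAL.inc()
    with REQUEST_LATENCY.time():  # records the duration of the block
        try:
            process()
        except Exception:
            REQUEST_ERRORS.inc()
            raise
```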
Q: Should post-mortems be mandatory?
A: Yes, for any incident that affects customers or is SEV1/2. Minor incidents (SEV3) might not need a formal post-mortem, but they warrant at least a brief writeup of what happened and what changed.
Q: How often should we do incident response drills?
A: Quarterly at minimum. Simulate an incident, run the response process, see what breaks. Most teams skip drills until they have a real incident and then are surprised by what doesn't work.
Related Reading
- What Is Engineering Intelligence?
- What Is Codebase Intelligence?
- The CTO's Guide to Product Visibility
- Glue for Competitive Gap Analysis
- What Is Product Knowledge Management?