Stop writing great runbooks. Instead, fix your production issues.
When production has an issue, someone gets paged. A common approach to handling pages in production looks like:
- Make sure every page has a clear runbook
- Make sure everyone is trained up on how to run that runbook
Runbooks are a set of manual tasks. On-call literature often has things like “no alert should exist without a runbook.” The runbook is exalted.
The problem is that runbooks are documented failure: failure to fix things properly, failure to automate solutions, failure to prioritize. Furthermore, runbooks are often praised for being clear and easy to use. In reality, the clearer and easier a runbook is to execute, the more unforgivable it is that you haven’t automated the solution.
The Catch-22 of runbooks is: any runbook that is easy to execute should not exist.
If the runbook says press a button, make something do that automatically when the page goes off. If the runbook says read a graph and then press a button, make something do that automatically when the page goes off. There is almost nothing in an on-call rotation that is both easy to execute and impossible to automate.
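As a rough illustration of automating a "read a graph, then press a button" runbook, here is a minimal Python sketch. Everything in it is hypothetical: the `Page` type, the `handle_page` function, the "last three samples above a threshold" rule, and the `press_button` and `escalate` callbacks are stand-ins for whatever your alerting and remediation tooling actually provides.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Page:
    """Hypothetical page payload: the alert name plus the recent points
    from the graph the runbook tells the on-call engineer to read."""
    alert_name: str
    metric_samples: Sequence[float]


def handle_page(
    page: Page,
    threshold: float,
    press_button: Callable[[Page], None],
    escalate: Callable[[Page], None],
) -> str:
    """Automate 'read a graph, then press a button'.

    Assumed rule: if the last three samples are all above the threshold,
    run the remediation. Anything else goes to a human.
    """
    recent = list(page.metric_samples)[-3:]
    if recent and min(recent) > threshold:
        press_button(page)  # the step the runbook asked a human to do
        return "remediated"
    escalate(page)  # unexpected shape: this is what on-call is for
    return "escalated"


if __name__ == "__main__":
    page = Page("queue_depth_high", [10, 950, 980, 990])
    outcome = handle_page(
        page,
        threshold=900,
        press_button=lambda p: print(f"restarting consumer for {p.alert_name}"),
        escalate=lambda p: print(f"paging a human for {p.alert_name}"),
    )
    print(outcome)
```

The point of the sketch is the split: the easy, well-understood case is handled by code, and only the cases the code does not recognize reach a person.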
So is there any place for runbooks?
Runbooks should be training. Instead of treating runbooks as reactive playbooks, treat them as training: a good runbook gives you the confidence and know-how to quickly and effectively solve unforeseen production issues when they happen.
And that makes sense. Other people who go on-call - e.g. doctors, firefighters, detectives - aren’t pulling out binders of playbooks when time is of the essence.
Here are five more thoughts on healthy on-call rotations before we go:
- Pages are not a way to learn about the system; they’re a call to action. If you don’t take action on a page, get rid of it.
- The ways to get rid of a known, problematic page are: relax the alert threshold, introduce a product limitation, fix a bug, or automate handling of the issue. Every single known page can be avoided by one of these methods.
- For teams that don’t have follow-the-sun coverage, you can modify your paging thresholds off-hours. Your team should and will react faster while they’re at work and you can codify that.
- PMs should both know the paging volume and prioritize based on it.
- Managers should be in paging rotations.
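The off-hours threshold idea from the list above can be codified in a few lines. This is a minimal Python sketch under stated assumptions: the 09:00 to 18:00 weekday window, the function names, and the example threshold values are all hypothetical, and which direction to move the threshold off-hours is a decision for your team.

```python
from datetime import datetime, time

# Assumed working window in the team's local time (hypothetical values).
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def paging_threshold(
    now: datetime,
    work_hours_threshold: float,
    off_hours_threshold: float,
) -> float:
    """Pick the paging threshold based on whether the team is at work."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_window = WORK_START <= now.time() < WORK_END
    if is_weekday and in_window:
        return work_hours_threshold
    return off_hours_threshold


def should_page(
    value: float,
    now: datetime,
    work_hours_threshold: float,
    off_hours_threshold: float,
) -> bool:
    """Page only when the metric exceeds the threshold for this time of day."""
    return value > paging_threshold(now, work_hours_threshold, off_hours_threshold)


if __name__ == "__main__":
    # Example: a 3% error rate pages during the workday but not at 03:00,
    # assuming thresholds of 2% (work hours) and 5% (off-hours).
    print(should_page(0.03, datetime(2024, 1, 8, 10, 0), 0.02, 0.05))
    print(should_page(0.03, datetime(2024, 1, 8, 3, 0), 0.02, 0.05))
```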