This is my light-hearted attempt to help engineers get the most value out of a downtime incident.
Introduction
So you had an incident? Condolences.
On the bright side, perhaps this is an opportunity to learn and make things better?
Process
I will offer you a series of questions that you should ask yourself about the incident and its root cause.
The goal here is to suggest a way forward based on the incident’s underlying cause.
Please note that the questions are ordered by priority.
Questionnaire
- Many incidents are caused by a mistake. Assuming this incident was caused by a mistake, is it a mistake that you could personally have made?
- If No, please proceed to question 3
- Is there any knowledge, which you now have or still need to acquire, that would have prevented this incident?
- If Yes
- How do you plan to acquire this knowledge, and by when?
- How do you plan to disseminate this knowledge to your colleagues and other teams?
- Did you have an automated test that should have prevented the failure?
- If Yes, why did the existing tests not catch it?
If your answer includes the words “other service” or “data store”, or some other reference to an external resource, please proceed to question 5.
- Have you added a test to confirm the issue and prevent recurrence?
- If No, why not?
There are no acceptable answers to this, except perhaps, “I am in an ambulance on the way to the hospital.”
- Are you able to test your service without any external resources (data sources can be exempted)?
- If No, please consider that with adequate test coverage, and tests that are appropriately isolated from external services, the remaining failures are typically reduced to configuration and external dependency issues.
- Do you assume that all external resources could fail (return an error or become slow), and have you adjusted your code accordingly?
- If No, has this work been identified and scheduled?
You should strongly consider adding timeouts, circuit breakers, retries, and, when possible, fallbacks to all external calls.
- Was this incident caused by some practice (approach to work) that you or your team use?
- If Yes, what are you doing to address this?
Have you considered proposing new practices to your team/manager?
- Was the incident caused by human error?
- If Yes, can we remove the human from the process?
Can we remove their ability to make a mistake without making the process unduly arduous?
- Is there anything we could have done to detect this issue faster?
- If Yes, is this work scheduled?
We should be looking for automated monitoring and alerts. Ideally, issues are fixed before the users even notice.
- Is there anything we could have done to restore full functionality faster?
- If Yes, is this work scheduled?
Frequently this will mean rolling back to a previous deployment or using a feature flag to turn the troublesome feature on or off.
- What else can we do to prevent this issue from happening again?
- It is rare that nothing can be done; that said, sometimes the cost of prevention is high.
This is a conversation that should be had with the relevant stakeholders.
The business might decide that the cost of the fix outweighs the risk of the issue happening again.
- Is it possible that other teams could have the same issue?
- If Yes, please consider documenting and sharing your experiences.
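The questions about testing without external resources and about assuming that every external resource can fail lend themselves to a short illustration. What follows is only a minimal sketch in Python, not a prescribed implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_resilience` are hypothetical, and the sketch assumes the external call is a function that accepts a `timeout` argument and raises an exception on failure.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests need no real waiting
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping external call")
            # Cool-down elapsed: allow one trial call; one failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(**kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_resilience(fetch: Callable[..., Any], *, breaker: CircuitBreaker,
                         timeout: float = 2.0, retries: int = 2,
                         backoff: float = 0.1,
                         fallback: Optional[Callable[[], Any]] = None,
                         sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `fetch(timeout=...)` with retries, a breaker, and a fallback."""
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return breaker.call(fetch, timeout=timeout)
        except CircuitOpenError as err:
            last_error = err
            break  # retrying is pointless while the circuit is open
        except Exception as err:
            last_error = err
            if attempt < retries:
                sleep(backoff * (2 ** attempt))  # exponential backoff
    if fallback is not None:
        return fallback()
    assert last_error is not None
    raise last_error


# Usage: a fake dependency stands in for the real service, so the
# behaviour can be exercised without any external resource.
def always_down(timeout: float) -> str:
    raise ConnectionError("dependency unavailable")


breaker = CircuitBreaker(max_failures=3)
value = call_with_resilience(always_down, breaker=breaker,
                             fallback=lambda: "cached value",
                             sleep=lambda seconds: None)
print(value)
```

Because the clock, the sleep function, and the dependency itself are all passed in, a test can simulate a failing or recovering service without touching the network. In real code the `timeout` would be handed to whatever client library makes the call.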
While it is not possible to avoid all downtime incidents, there is seldom an incident that we cannot learn from.
If you found this useful, check out the companion article, Post-Incident Questionnaire for Managers.