The road to the half Ironman

2016 is closing with a lot of highlights, but one lowlight has been haunting me for two years now – not having completed the half Ironman. 70.3 miles in total – a 1.2 mi swim, a 56 mi bike ride and a good old 13.1 mi run. I can’t wait to cross this one off the bucket list and become a regular triathlete!

Better Goals

It had been a while since I had actually spent time on the blog, and I think I now know why – it wasn’t even a written-down goal! So here it is: the goal to blog more regularly, for the simple reason of learning more and documenting the growth!

Understanding Incident Management

Recently, I’ve been involved with incident management for large-scale services, and I feel it’s a part of the tech industry that is still largely unexplored and could do with improvement. The next few posts are going to be focused on this topic – starting from API-level monitoring up to processes for incident management. You would be surprised where the challenges lie. Let’s start with our basic flow:

  1. Discover Error
  2. Run Recovery
  3. Did Recovery succeed?
  4. Any other recoveries?
  5. All/partial recoveries fail
  6. Escalate to Engineer
  7. Resolve Error
  8. [Engineer] Update Recovery
  9. Post Mortem

Again, this is our basic flow, and there are many different ways the process can branch. Because incident management generalizes across many industries, there are also standards around it that I’ll go over in later posts. Today, let’s focus on steps 5–7.
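
Before zooming in, here is the whole flow as a minimal Python sketch. Everything in it is a placeholder of my own: the error object, the list of recovery callables, escalate_to_engineer and post_mortem only exist to show the shape of the loop, not any real tooling.

```python
def handle_incident(error, recoveries, escalate_to_engineer, post_mortem):
    """Walk the basic flow: try the known recoveries, escalate if they all fail."""
    # step 1 (discover error) has already happened: we were handed `error`
    for recovery in recoveries:                       # steps 2 and 4: run each known recovery
        if recovery(error):                           # step 3: did recovery succeed?
            post_mortem(error, resolved_by=recovery)  # steps 7 and 9: resolved, write it up
            return True
    # step 5: all (or partial) recoveries failed
    fix = escalate_to_engineer(error)                 # steps 6-8: engineer resolves the error
    post_mortem(error, resolved_by=fix)               # and updates the recoveries; step 9
    return False
```

Steps 5–7 live in the fall-through at the bottom of that loop, and that is where the rest of this post stays.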

Most times, if we know a certain set of actions will resolve the problem, or reduce noise without impacting performance, we will add it as a recovery step for the error – i.e. if A failed, run B to fix A. This is the simplest of solutions and of course begs the theoretical question of “A should never fail; the focus should be on fixing A”. I’m not going to get into that; I’m going to focus on other, more important questions: Did running B change how A runs? Did anyone notice B run? How did A look while B was running? How long did it run for? The list goes on.

Now, partial recovery of a component is very common, and the problem with that is we don’t have well-defined success criteria. Components need well-defined Red-Yellow-Green statuses – that is how most components operate: Red means total stoppage/failure, Yellow means degraded or partial failure, and Green means flowing/operational. That’s step 1: identify how my component works, and always define success criteria. The criteria are your minimum bar; any performance or behavior out of bounds of them is either a partial or total failure.
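
To make that concrete, here is a small sketch of the Red-Yellow-Green idea. The Health enum, the error_rate metric and the 5% threshold are all made-up examples of a success criterion, not values from any real system.

```python
from enum import Enum

class Health(Enum):
    GREEN = "operational"    # within the defined success criteria
    YELLOW = "degraded"      # partial failure: out of bounds but still serving
    RED = "stopped"          # total stoppage/failure

def component_health(error_rate, stopped):
    """Map observed behavior onto a well-defined Red-Yellow-Green status."""
    if stopped:
        return Health.RED
    if error_rate > 0.05:    # example success criterion: no more than 5% errors
        return Health.YELLOW
    return Health.GREEN
```

The point is less the thresholds and more that the mapping is written down at all: once it exists, “partial failure” stops being a judgment call.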

Escalating to the engineer. This is the easy part – my assumption is that everyone knows who owns which area.
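
If it helps, the “who owns which area” part can be as simple as a lookup table. The area names and on-call rotations below are invented purely for illustration.

```python
# Invented area names and rotations, purely for illustration.
OWNERS = {
    "api-gateway": "platform-oncall",
    "billing": "payments-oncall",
}

def escalate(area):
    """Return who to page for an area, with a catch-all rotation as fallback."""
    return OWNERS.get(area, "default-oncall")
```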

Last but not least – resolving the error. This is a time of great relief, and one can only hope there was minimal customer impact or, worse, SLA impact. After the engineer’s work is done, the root cause will boil down to either a configuration issue or a code bug. If it’s a configuration issue, SOPs must be put in place to prevent another failure; if it’s a code bug, the scenario should be appropriately addressed. Actions must be taken even after a partial failure.
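
A tiny sketch of that two-way classification, with invented follow-up actions, just to show that every resolution ends in a concrete action item:

```python
from enum import Enum

class RootCause(Enum):
    CONFIGURATION = "configuration"
    CODE_BUG = "code bug"

def follow_up(root_cause):
    """Every resolution, even after a partial failure, ends with an action item."""
    if root_cause is RootCause.CONFIGURATION:
        return "Put an SOP in place so the same misconfiguration cannot recur."
    return "Fix the bug and add a regression test for the failing scenario."
```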