Blog

  • Understanding Incident Management

    Recently, I’ve been involved with Incident Management for large-scale services, and I feel it’s a part of the tech industry that is still largely unexplored and could do with improvement. The next few posts will focus on this topic – starting from API-level monitoring and going up to the processes for incident management. You would be surprised where the challenges lie. Let’s start with our basic flow:

    1. Discover Error
    2. Run Recovery
    3. Did Recovery succeed?
    4. Any other recoveries?
    5. All/Partial recovery fail
    6. Escalate to Engineer
    7. Resolve Error
    8. [Engineer] Update Recovery
    9. Post Mortem

    Again, this is our basic flow, and there are many variations of it. Because incident management generalizes across many industries, there are also standards around it that I’ll go over in later posts. Today – let’s focus on steps 5 – 7.
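
    To make steps 2 – 6 concrete, here is a minimal sketch of that loop in Python. Every name in it (the Recovery type, handle_incident, the print-based “paging”) is hypothetical scaffolding invented for illustration, not part of any real tooling – the point is only how the steps chain together.

    ```python
    # Hypothetical sketch of the basic incident flow above.
    # All names are made up for illustration; a real system would wire these
    # steps into monitoring, runbooks, and paging tools.
    from dataclasses import dataclass
    from typing import Callable, List


    @dataclass
    class Recovery:
        name: str
        run: Callable[[], bool]  # returns True if the error was resolved


    def handle_incident(error: str, recoveries: List[Recovery]) -> None:
        # Steps 2-4: try each known recovery until one succeeds.
        for recovery in recoveries:
            if recovery.run():
                print(f"{error}: resolved by recovery '{recovery.name}'")
                return

        # Steps 5-6: all (or some) recoveries failed -> escalate to an engineer.
        print(f"{error}: recoveries failed, escalating to the on-call engineer")
        # Steps 7-9 (resolve, update recoveries, post-mortem) happen with the
        # engineer, outside this automated loop.
    ```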

    Most times, if we know a certain set of actions will resolve the problem, or reduce noise without impacting performance, we will add it as a recovery step for the error – i.e. if A failed, run B to fix A. This is the simplest of solutions and of course begs the theoretical question of “A should never fail, focus should be on fixing A”. I’m not going to get into that; I’m going to focus on other, more important questions – Did running B change how A runs? Did anyone notice B run? How did A look when B was run? How long did it run for? The list goes on.

    Now, partial recovery of a component is very common, and the problem with that is we don’t have a well-defined success criterion. Components need to have well-defined Red-Yellow-Green statuses. That is how most components operate: Red means total stoppage/failure, Yellow means degraded or partial failure, and Green means flowing/operational. That’s step 1 – identify how my component works. Always define success criteria. This is your minimum bar; any performance or behavior out of bounds of it is either a partial or total failure.
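
    As a rough illustration of what a well-defined success criterion might look like, here is a hypothetical Red-Yellow-Green check. The metrics (success rate, p99 latency) and the thresholds are invented for this example; the point is that Green, Yellow, and Red are decided by explicit bounds, not by gut feel.

    ```python
    from enum import Enum


    class Status(Enum):
        GREEN = "flowing / operational"
        YELLOW = "degraded / partial failure"
        RED = "total stoppage / failure"


    def component_status(success_rate: float, p99_latency_ms: float) -> Status:
        # Hypothetical success criteria - the minimum bar for Green.
        # Anything out of these bounds is a partial or total failure.
        if success_rate >= 0.999 and p99_latency_ms <= 500:
            return Status.GREEN
        if success_rate >= 0.95:
            return Status.YELLOW
        return Status.RED


    print(component_status(success_rate=0.97, p99_latency_ms=800))  # Status.YELLOW
    ```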

    Escalating to the Engineer. This is the easy part – my assumption is that everyone knows who owns which area.
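
    For completeness, a tiny and entirely hypothetical ownership lookup – the areas and on-call aliases are made up, but this is the shape of “knowing who owns which area”:

    ```python
    # Hypothetical map of component/area -> owning on-call rotation.
    OWNERS = {
        "api-gateway": "platform-oncall",
        "billing": "payments-oncall",
    }


    def escalate(area: str) -> str:
        # Fall back to a catch-all rotation if the area has no explicit owner.
        return OWNERS.get(area, "incident-manager-oncall")
    ```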

    Last but not least – resolving the error. This is a time of great relief, and one can only hope there was minimal customer – or worse, SLA – impact. After the work done by the Engineer, it will boil down to either a configuration issue or a code bug. If it’s a configuration issue, SOPs must be put in place to prevent another failure, and if it’s a code bug, the scenario should be appropriately addressed. Actions must be taken even after a partial failure.
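
    A small sketch of that classification, again purely illustrative – the two root-cause categories and the follow-up actions are assumptions about how one might record the outcome, not a prescribed format:

    ```python
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class Resolution:
        root_cause: str  # assumed categories: "configuration" or "code_bug"
        follow_up_actions: List[str] = field(default_factory=list)


    def record_follow_up(resolution: Resolution) -> None:
        # Even a partial failure must produce follow-up actions.
        if resolution.root_cause == "configuration":
            resolution.follow_up_actions.append("write or update the SOP to prevent recurrence")
        elif resolution.root_cause == "code_bug":
            resolution.follow_up_actions.append("fix the bug and add the scenario to the recovery steps")
    ```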

  • Sometimes taking a break from the norm is the right thing

    So the last couple of weeks have seen some changes in my daily routine compared to, say, a year ago. These have been positive changes, and there are too many people to thank for that, but I’m sure they know who they are. Starting from the running, to the cycling, to the moving, and to the relaxing 🙂

    And I’ve been more in flux lately, so it’s the old anchors that were put in place, and a forced break coming up, that are helping me keep a straight course.


  • Life and its Guiding Principles

    There are some things that I’m proud of – I don’t drink and I don’t go to strip clubs.

  • Socrates (469 – 399 BC)

    No citizen has a right to be an amateur in the matter of physical training…what a disgrace it is for a man to grow old without ever seeing the beauty and strength of which his body is capable.