Author: Haider

  • Understanding Incident Management

    Recently, I’ve been involved with Incident Management for large-scale services, and I feel it’s a part of the tech industry that is still largely unexplored and could do with improvement. The next few posts are going to focus on this topic, starting from API-level monitoring and working up to processes for incident management. You would be surprised where the challenges lie. Let’s start with our basic flow:

    1. Discover Error
    2. Run Recovery
    3. Did Recovery succeed?
    4. Any other recoveries?
    5. All/Partial recovery fail
    6. Escalate to Engineer
    7. Resolve Error
    8. [Engineer] Update Recovery
    9. Post Mortem
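
    As a rough illustration, steps 2 through 6 of the flow above could be sketched in Python like this. Every name here (`handle_error`, `is_healthy`, `escalate`) is hypothetical and made up for the example; it is a minimal sketch that assumes recoveries are plain callables tried in order, not a description of any real incident tooling.

```python
from typing import Callable, List

# All names below are hypothetical, for illustration only.
Recovery = Callable[[], None]      # a recovery action, e.g. "run B to fix A"
HealthCheck = Callable[[], bool]   # returns True once the component is healthy


def handle_error(recoveries: List[Recovery],
                 is_healthy: HealthCheck,
                 escalate: Callable[[str], None]) -> str:
    """Try each known recovery in order; escalate if none fully succeeds."""
    for recovery in recoveries:
        try:
            recovery()             # step 2: run recovery
        except Exception:
            continue               # a recovery that raises counts as failed
        if is_healthy():           # step 3: did recovery succeed?
            return "recovered"
        # step 4: fall through to the next recovery, if any
    # step 5: all recoveries failed or only partially succeeded
    escalate("all known recoveries failed or only partially succeeded")  # step 6
    return "escalated"
```

    Steps 7 to 9 (resolve, update the recovery, post mortem) are human work and are deliberately left out of the sketch.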

    Again, this is our basic flow. There are many different ways this problem could flow, and because of its ability to generalize across many industries, there are also standards around incident management that I’ll go over in my posts. Today, let’s focus on steps 5 – 7.

    Most times, if we know a certain set of actions will resolve the problem, or reduce noise without impacting performance, we will add it as a recovery step for the error, i.e. if A failed, run B to fix A. This is the simplest of solutions, and of course it raises the theoretical objection that “A should never fail; the focus should be on fixing A”. I’m not going to get into that. I’m going to focus on other, more important questions: Did running B change how A runs? Did anyone notice B run? How did A look when B was run? How long did it run for? The list goes on. Now, partial recovery of a component is very common, and the problem with it is that we don’t have well-defined success criteria. Components need to have well-defined Red-Yellow-Green statuses. That is how most components operate: Red means total stoppage/failure, Yellow means degraded or partial failure, and Green is flowing/operational. That’s step 1: identify how my component works. Always define success criteria. This is your minimum bar; any perf/behavior outside these bounds is either a partial or total failure.
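
    To make the Red-Yellow-Green idea concrete, here is a minimal sketch in Python. It assumes a single made-up metric (an error rate) and made-up thresholds; `SuccessCriteria`, `RecoveryRecord` and `run_recovery` are hypothetical names for this example only. The point is that once success criteria are written down, a recovery run can record how the component looked before and after, and how long the recovery took.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    GREEN = "green"    # flowing/operational
    YELLOW = "yellow"  # degraded or partial failure
    RED = "red"        # total stoppage/failure


@dataclass
class SuccessCriteria:
    # Hypothetical thresholds; real ones depend on the component.
    max_error_rate_green: float = 0.01
    max_error_rate_yellow: float = 0.50

    def classify(self, error_rate: float) -> Status:
        if error_rate <= self.max_error_rate_green:
            return Status.GREEN
        if error_rate <= self.max_error_rate_yellow:
            return Status.YELLOW
        return Status.RED


@dataclass
class RecoveryRecord:
    before: Status           # how did A look before B ran?
    after: Status            # how did A look after B ran?
    duration_seconds: float  # how long did B run for?

    @property
    def outcome(self) -> str:
        """Describe the component's state after the recovery ran."""
        if self.after is Status.GREEN:
            return "full"
        if self.after is Status.YELLOW:
            return "partial"
        return "failed"


def run_recovery(recovery: Callable[[], None],
                 read_error_rate: Callable[[], float],
                 criteria: SuccessCriteria) -> RecoveryRecord:
    """Run one recovery and record status before/after plus its duration."""
    before = criteria.classify(read_error_rate())
    started = time.monotonic()
    recovery()
    duration = time.monotonic() - started
    after = criteria.classify(read_error_rate())
    return RecoveryRecord(before, after, duration)
```

    With a record like this, a Red-to-Yellow result shows up explicitly as a partial recovery instead of being silently counted as a success.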

    Escalating to Engineering. This is the easy part; my assumption is that everyone knows who owns which area.

    Last but not least: resolving the error. This is a time of great relief, and one can only hope there was minimal customer impact or, worse, SLA impact. After the work done by the Engineer, it will boil down to either a configuration issue or a code bug. If it’s a configuration issue, SOPs must be put in place to prevent another failure, and if it’s a code bug, the scenario should be appropriately addressed. Action must be taken even after a partial failure.

  • Sometimes taking a break from the norm is the right thing

    So the last couple of weeks have seen some changes in my daily routine compared to, say, a year ago. These have been positive changes, and there are too many people to thank for that, but I’m sure they know who they are. Starting from the running, to the cycling, to the moving, and to the relaxing 🙂

    And I’ve been more in flux lately, so it was the old anchors that were put in place and a forced break coming up that’s helping me keep a straight course.


  • Socrates (469 – 399 BC)

    No citizen has a right to be an amateur in the matter of physical training…what a disgrace it is for a man to grow old without ever seeing the beauty and strength of which his body is capable.

  • So, 2013 is the year I’m going to lose weight.

    As I stated 2 posts ago, which also happens to be 3 months ago… I was going to blog more. Here it is:

    This year is going to be defined as the year I lost weight. Don’t get me wrong, 2012 was a good year too, but it won’t even hold a candle to 2013. Does that mean setting the bar at 45lbs….? Who knows, my public goal is to get to 180 by June-ish.

    Current milestones ahead for this awesome year:

    6/22 – Seattle Rock’n’Roll Marathon
    7/13 – Seattle To Portland bike ride

    Ongoing projects that help keep a good steady state are Soccer and Kickboxing, so time has become the limiting factor 🙁 I’m going to try to move to a morning routine for the runs, but I don’t think I’ll be able to do that. Instead, what I’m thinking of doing is moving to an earlier work day and leaving at 5 on the dot… Should be an interesting experiment, and definitely difficult to do for the first few weeks.
