I'm currently reading Blink by Malcolm Gladwell and it's a fascinating book. No time for a full review right now, but here's a quote that caught my eye (he's talking about an algorithm for deciding the severity of heart-attack-like symptoms in ER patients, and has just listed several high-heart-attack-risk lifestyle factors):
- ... It certainly seems like he ought to be admitted to the coronary care unit right away. But the algorithm says he shouldn't be.
What Goldman's algorithm indicates, though, is that the role of those other factors is so small in determining what is happening to the man right now that an accurate diagnosis can be made without them. In fact <snip> that information is more than useless. It's harmful. It confuses the issues. What screws up doctors when they are trying to predict heart attacks is that they take too much information into account.
(The book, by the way, attempts to explain intuition and how it is that we can get such strong (and often correct) intuitions without being able to understand exactly why. It also attempts to analyse the cases in which our intuition is terribly wrong. See also this entry by Trevor for more about intuition.)
This is cool because our hunch over the past few years has been that it will only take a few metrics to actually predict a given failure scenario, but deciding which ones to pick is the hard thing. So the kinds of systems we are trying to build end up being quite similar to what (I just found out) humans are doing. We're constantly taking hundreds or thousands of input variables (subtle changes in a person's face or 'body language', things seen in the periphery of our vision, etc.) and doing some realtime statistical analysis on them. Except our consciousness is never burdened with any of that. Our subconscious builds and refines these elaborate statistical models over time. Then, it can bubble up signals (in the form of intuition) to our conscious mind with very limited information because it has already made models about which variables are important enough to matter.
How does this apply to metrics and monitoring? It's infeasible and foolhardy to track the state of every possible instrumentable variable in your system in realtime and use that to drive failure detection and root cause analysis. But:
1. You may be able to design a system that can collect lots of metrics and analyse them in an 'offline' manner without impacting your system.
2. The output of (1), a list of 'important' metrics, is fed into an alarming/monitoring system.
3. Whenever an alarm is diagnosed (or confirmed), the result of that is fed back into (1) to correct or reinforce the prediction.
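To make that loop a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the metric names, the sample numbers, and the scoring rule (how far a metric's mean shifts between good and bad intervals, relative to its overall spread) are made up, and a real system would want something statistically sturdier.

```python
from statistics import mean, pstdev


def split(history):
    """Separate metric snapshots by their confirmed label."""
    good = [row for row, is_bad in history if not is_bad]
    bad = [row for row, is_bad in history if is_bad]
    return good, bad


def rank_metrics(history, top_k=1):
    """Step 1 (offline): keep the metrics whose mean differs most between
    good and bad intervals, measured in units of the metric's spread."""
    good, bad = split(history)
    if not good or not bad:
        return []  # nothing to compare yet
    scores = {}
    for name in history[0][0]:
        spread = pstdev([row[name] for row, _ in history])
        gap = abs(mean([r[name] for r in bad]) - mean([r[name] for r in good]))
        scores[name] = gap / spread if spread > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def should_alarm(row, history, important):
    """Step 2 (online): look only at the important metrics and alarm when
    one of them sits closer to its 'bad' mean than to its 'good' mean."""
    good, bad = split(history)
    for name in important:
        good_mean = mean([r[name] for r in good])
        bad_mean = mean([r[name] for r in bad])
        if abs(row[name] - bad_mean) < abs(row[name] - good_mean):
            return True
    return False


def feedback(history, row, confirmed_bad, top_k=1):
    """Step 3: record the diagnosed outcome, then redo step 1."""
    history.append((row, confirmed_bad))
    return rank_metrics(history, top_k)


# Hypothetical labelled history: (metric snapshot, was it confirmed bad?)
history = [
    ({"cpu": 20, "queue_depth": 3, "disk_free": 70}, False),
    ({"cpu": 25, "queue_depth": 4, "disk_free": 69}, False),
    ({"cpu": 22, "queue_depth": 40, "disk_free": 68}, True),
    ({"cpu": 24, "queue_depth": 45, "disk_free": 70}, True),
]

important = rank_metrics(history)
print(important)
print(should_alarm({"cpu": 90, "queue_depth": 5, "disk_free": 10}, history, important))
```

With this sample data only queue_depth survives the ranking, so the alarm ignores the scary-looking CPU and disk numbers in the last snapshot. That is the Goldman point in miniature: the extra variables are not consulted at all once the offline step has decided they don't matter.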
If failure detection is like the pit in your stomach or lump in your throat, and root cause analysis is like the logical reasoning that we sometimes go through when making decisions, then maybe we have to accept that failure detection is a much faster process than root cause analysis. Our group has always looked at those as two different processes, but never acknowledged that they may require different amounts of information.
On one level, that looks hopeless; "what good is it to know that something is wrong if you don't know what it is?" But we do that all the time. A lot of us learn to trust our instincts (don't walk down that alley) even if we can't tell exactly what's wrong (it's well lit, there are people around, but it just feels shady).
How could that help in managing distributed systems? The only example I can think of right now is: if a host 'feels like it's unhealthy' it could just take itself out of a load balancer without knowing what was wrong.
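A rough sketch of that idea in Python, under some loud assumptions: the load balancer is assumed to poll a health check and drop hosts that answer 503, and the `GutFeeling` class, the metric names, and the thresholds are all hypothetical. Note that the host never works out what is wrong; it only notices that a few signals have been off for a while.

```python
class GutFeeling:
    """Tracks whether a host 'feels unhealthy' from a handful of signals."""

    def __init__(self, baseline, tolerance=0.5, patience=3):
        self.baseline = baseline    # metric name -> value considered normal
        self.tolerance = tolerance  # allowed fractional deviation from normal
        self.patience = patience    # consecutive odd readings before acting
        self.strikes = 0

    def observe(self, reading):
        """Feed in one reading; returns True while the host should stay in rotation."""
        odd = any(
            abs(reading[name] - normal) > self.tolerance * abs(normal)
            for name, normal in self.baseline.items()
        )
        # A single normal reading resets the count, so brief blips are ignored.
        self.strikes = self.strikes + 1 if odd else 0
        return self.in_rotation()

    def in_rotation(self):
        return self.strikes < self.patience


def health_status(gut):
    """What a health-check handler would return to the load balancer
    (assumed convention: 200 keeps the host in rotation, 503 removes it)."""
    return 200 if gut.in_rotation() else 503


gut = GutFeeling({"latency_ms": 100, "gc_pause_ms": 20})
for _ in range(3):
    gut.observe({"latency_ms": 400, "gc_pause_ms": 25})
print(health_status(gut))
```

In this sketch the host comes back as soon as one reading looks normal again, which is probably too eager for a real deployment; it's only meant to show detection acting without any diagnosis.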
What it does tell me is that it may be worth completely separating the processes of detection and root-cause analysis, so that the feedback in (3) above is not "the root cause of this disturbance was xyz", nor is the list of 'important' metrics in (2) a list for each possible root cause (i.e. you don't output something that says that metrics A and B are important for predicting a disk crash, but metrics D and F predict a web server failure and metrics C and E predict that your application is deadlocked). That per-cause approach is how antivirus software is modeled: it builds up fingerprints of different viruses and tries to match them, doing both detection and root-cause analysis in a single step. (OK, maybe modern antivirus software does more than that, but stick with me for a moment.)
Instead, maybe the right but counterintuitive (no pun intended) thing to do here is to only store whether or not "Bad" things happened, and store the set of metrics which are good predictors of "Bad"ness. You'd probably need more than a binary notion of Badness. This doesn't get us closer to solving problems, but maybe it can help reduce downtime in the first place, because we've got a very good early-warning system.
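Here's a small Python sketch of what such a store might look like, with a graded Badness score between 0 and 1 instead of a yes/no flag. The `BadnessLog` name, the metrics, and the choice of plain Pearson correlation as the "good predictor" test are all my assumptions for illustration; the point is only that nothing about the cause is ever recorded.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


class BadnessLog:
    """Stores how bad things were, never why."""

    def __init__(self):
        self.rows = []  # (metric snapshot, badness in [0, 1])

    def record(self, metrics, badness):
        self.rows.append((dict(metrics), float(badness)))

    def predictors(self, min_strength=0.8):
        """Metrics whose values track Badness closely, with their correlation."""
        if not self.rows:
            return {}
        badness = [b for _, b in self.rows]
        found = {}
        for name in self.rows[0][0]:
            r = pearson([m[name] for m, _ in self.rows], badness)
            if abs(r) >= min_strength:
                found[name] = r
        return found


log = BadnessLog()
log.record({"retries": 1, "heap_mb": 500}, 0.0)
log.record({"retries": 2, "heap_mb": 480}, 0.1)
log.record({"retries": 8, "heap_mb": 510}, 0.6)
log.record({"retries": 12, "heap_mb": 490}, 1.0)
print(log.predictors())
```

On this made-up data only `retries` comes out as a predictor. Whether the retries are caused by a dying disk or a deadlocked app is left entirely to whoever (or whatever) does the root-cause analysis later.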
It'll be interesting to see if any more insights come out of watching a large system running (the group I've been working with in Bangalore is getting closer to releasing an internal, scaled-down version of what will eventually be a large self-healing distributed system). Since I've been in development mode for the past few months (vs. supporting a live system), I feel a little unqualified to rant too much about this stuff. :)