Sunday, June 12, 2005

lightpost

Posting will be light for the next few weeks.
In a few hours I fly to Delhi.
Tomorrow morning I fly to Leh, Ladakh for a 5 day trek in the Indian Himalayas.
After that I head back to Delhi and on to Pune for about 10 days during which time I will relinquish my bachelorhood.
I may be online and checking mail from Pune but probably not posting. Will be back to the grind sometime in July.

prediction

I wrote parts of Amazon's remote-service-invocation framework a few years ago, after which a few of us worked on collecting detailed metrics from the framework so as to gauge the health of the system and try to predict business-impacting problems before they actually, well, impacted business.

I'm currently reading Blink by Malcolm Gladwell and it's a fascinating book. No time for a full review right now. But here's a quote that caught my eye (he's talking about an algorithm for deciding the severity of heart-attack-like symptoms in ER patients, and has just listed several high-heart-attack-risk lifestyle factors):

    ... It certainly seems like he ought to be admitted to the coronary care unit right away. But the algorithm says he shouldn't be.
    <snip>
    What Goldman's algorithm indicates, though, is that the role of those other factors is so small in determining what is happening to the man right now that an accurate diagnosis can be made without them. In fact <snip> that information is more than useless. It's harmful. It confuses the issues. What screws up doctors when they are trying to predict heart attacks is that they take too much information into account.


(The book, by the way, attempts to explain intuition and how it is that we can get such strong (and often correct) intuitions without being able to understand exactly why. It also attempts to analyse the cases in which our intuition is terribly wrong. See also this entry by Trevor for more about intuition.)

This is cool because our hunch over the past few years has been that it will only take a few metrics to actually predict a given failure scenario, but deciding which ones to pick is the hard thing. So the kinds of systems we are trying to build end up being quite similar to what (I just found out) humans are doing. We're constantly taking hundreds or thousands of input variables (subtle changes in a person's face or 'body language', things seen in the periphery of our vision, etc.) and doing some realtime statistical analysis on them. Except our consciousness is never burdened with any of that. Our subconscious builds and refines these elaborate statistical models over time. Then, it can bubble up signals (in the form of intuition) to our conscious mind with very limited information because it has already made models about which variables are important enough to matter.

How does this apply to metrics and monitoring? It's infeasible and foolhardy to track the state of every possible instrumentable variable in your system in realtime and use that to drive failure detection and root cause analysis. But
  1. you may be able to design a system that can collect lots of metrics and analyse them in an 'offline' manner without impacting your system.
  2. the output of (1), a list of 'important' metrics, is fed into an alarming/monitoring system
  3. whenever an alarm is diagnosed (or confirmed) the result of that is fed back into (1) to correct or reinforce the prediction.
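
Here's a rough sketch of how that loop might hang together. This is only an illustration and not the system we built: the class and function names, the metric names and the thresholds are all made up, and I'm assuming plain correlation with a bad/good label is a good enough way to rank metrics offline.

```python
from statistics import mean, pstdev


def _correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


class MetricSelector:
    """Offline analyser (hypothetical): ranks metrics by how well they track bad events."""

    def __init__(self):
        self.samples = []  # list of (metrics dict, 1.0 if bad else 0.0)

    def feed_back(self, metrics, was_bad):
        # Step 3: every diagnosed or confirmed alarm becomes another data point.
        self.samples.append((dict(metrics), 1.0 if was_bad else 0.0))

    def important_metrics(self, top_n=2):
        # Step 1: score each metric by |correlation| with the bad/good label.
        labels = [bad for _, bad in self.samples]
        names = sorted({name for m, _ in self.samples for name in m})
        scores = {}
        for name in names:
            values = [m.get(name, 0.0) for m, _ in self.samples]
            scores[name] = abs(_correlation(values, labels))
        return sorted(scores, key=scores.get, reverse=True)[:top_n]


def check_alarm(metrics, watched, thresholds):
    # Step 2: the online monitor only looks at the short list of watched metrics.
    return [name for name in watched if metrics.get(name, 0.0) > thresholds[name]]
```

The point of splitting it this way is that only check_alarm has to run in realtime; the expensive ranking can happen whenever there is spare capacity.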


If failure detection is like the pit in your stomach or lump in your throat, and root cause analysis is like the logical reasoning that we sometimes go through when making decisions, then maybe we have to accept that failure detection is a much faster process than root cause analysis. Our group has always looked at those as two different processes, but never acknowledged that they may require different amounts of information.

On one level, that looks hopeless; "what good is it to know that something is wrong if you don't know what it is?" But we do that all the time. A lot of us learn to trust our instincts (don't walk down that alley) even if we can't tell exactly what's wrong (it's well lit, there are people around, but it just feels shady).

How could that help in managing distributed systems? The only example I can think of right now is: if a host 'feels like it's unhealthy' it could just take itself out of a load balancer without knowing what was wrong.
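
A minimal sketch of what that could look like, assuming the host watches a single signal (say, request latency) against a baseline it learned earlier, and that the load balancer polls some health check. The class, its parameters and the 200/503 convention are my own assumptions for illustration, not any particular load balancer's contract.

```python
from collections import deque


class SelfCheckingHost:
    """A host that steps out of rotation when recent behaviour drifts from its baseline."""

    def __init__(self, baseline_mean, baseline_stdev, max_sigmas=3.0, window=5):
        # baseline_stdev is assumed to be greater than zero.
        self.baseline_mean = baseline_mean
        self.baseline_stdev = baseline_stdev
        self.max_sigmas = max_sigmas
        self.recent = deque(maxlen=window)  # rolling window of observations
        self.in_rotation = True

    def observe(self, value):
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return  # not enough evidence yet to have a "feeling"
        average = sum(self.recent) / len(self.recent)
        sigmas = abs(average - self.baseline_mean) / self.baseline_stdev
        # No diagnosis is attempted: "feels wrong" is enough to step aside.
        self.in_rotation = sigmas <= self.max_sigmas

    def health_check(self):
        # What a periodic load balancer probe would see (hypothetical convention).
        return 200 if self.in_rotation else 503
```

Note that the host never says why it is unhealthy, and it rejoins on its own once the rolling average drifts back toward the baseline.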

What it does tell me is that it may be worth completely separating the process of detection and root-cause analysis. So that the feedback in (3) above is not "the root cause of this disturbance was xyz", nor is the list of 'important' metrics in (2) a list for each possible root cause. (i.e. you don't output something that says that metrics A and B are important for predicting a disk crash, but metrics D and F predict a web server failure and metrics C and E predict that your application is deadlocked). That's how antivirus software is modeled. It builds up fingerprints of different viruses and tries to match the fingerprint. It does both detection and root-cause-analysis in a single step. (OK, maybe modern antivirus software does more than that, but stick with me for a moment).

Instead, maybe the right but counterintuitive (no pun intended) thing to do here is to only store whether or not "Bad" things happened, and store the set of metrics which are good predictors of "Bad"ness. You'd probably need more than a binary notion of Badness. This doesn't get us closer to solving problems, but maybe it can help reduce downtime in the first place, because we've got a very good early-warning system.

It'll be interesting to see if any more insights come out of watching a large system running (the group I've been working with in Bangalore is getting closer to releasing an internal, scaled-down version of what will eventually be a large self-healing distributed system). Since I've been in development mode for the past few months (vs. supporting a live system), I feel a little unqualified to rant too much about this stuff. :)

Saturday, June 11, 2005

d.s.

Someone left comments (twice!) asking me to follow up on my earlier idea of starting a distributed systems blog. Thanks! But I realized that, although I do a lot of work in the area, I don't know enough to do justice to a blog - I still have many miles to walk.

So every time I run across something interesting, I'll be sure to post on it. But until then, I'll be sticking to reading and learning and growing.

Friday, June 10, 2005

FC reading.

Catching up again on some FC reading.
About Despair, Inc.
    The point is that most people should work to make money. They shouldn't expect a company to make them happy. A company can be friendly and good, but it can't really make you happy. At the same time, it shouldn't insult you. It shouldn't say, 'We're a family and have values,' and then act like Enron."
    <snip>
    Jamie Malanowski is the features editor at Playboy. He's happy in his work.

I love the last line about the author of the article! :)

An awesome post about being a bouncer:
    The bouncer ethos, in point of fact, stands in diametric opposition to that of any other position in the service industry. Simply put, if you, as a bouncer, stand there and take crap from the customers, you won't be employed for very long, because everyone on the staff will consider you a pussy, and they won't want you around. Therefore, when people -- as they invariably will -- act like assholes, I'm getting paid to fulfill the one, singular fantasy harbored by everyone who has ever served a drink or waited on a table:

    I can do it back.


On cubicles sucking:
    The solution, Tompkin says, is to customize space to various types of work. Give those who need uninterrupted time a quiet place to work and those who need to collaborate a more social space. That may mean a glass-walled office for heads-down work, and a variety of gathering places for group work. "As the workforce becomes more mobile," Tompkin says, "the office will be the main tool companies use to build a shared culture."

I totally agree. I wish we'd have a healthy mix between offices and cubes. Or even moderately sized rooms with cubes in them, instead of a huge room full of cubes. That way I can only potentially get distracted by 4-6 other people instead of 70.

Sounds like GE Durham has taken a page out of Gore's (makers of Gore-Tex) book (the reference to Gore is from Malcolm Gladwell's The Tipping Point). Lots of great stuff here:
    GE/Durham has more than 170 employees but just one boss: the plant manager. Everyone in the place reports to her. Which means that on a day-to-day basis, the people who work here have no boss. They essentially run themselves.
    <snip>
    So how can something so complicated, so demanding, so fraught with risk, be trusted to people who answer only to themselves? Trust is a funny thing. It is the mystery -- and the genius -- of what goes on at GE/Durham.
    <snip>
    "The interview, now that was one heck of an experience," he says. "It lasted eight hours. I talked to five different people. I participated in three group activities with other job candidates. I even had to do a presentation: I had 15 minutes to prepare a 5-minute presentation."
    <snip>
    At GE/Durham, candidates are rated in 11 areas. "Only one of those involves technical competence or experience," says Keith McKee, 27, a tech-3 on Team Raven. "You have to be above the bar in all 11 of the areas: helping skills, team skills, communication skills, diversity, flexibility, coaching ability, work ethic, and so forth. Even if just one thing out of the 11 knocks you down, you don't come to work here."


Some of the stuff here reminds me of what we learnt in the storytelling training I attended recently.

Thursday, June 09, 2005

cliched conversations

As a foreigner in the US, I was always struck by the very predictable 'conversations' that people had:
- so how was your labor day?
- what're you doing this weekend?
- how was the weekend?
- got plans for july 4th?
- going anywhere for memorial day?

I put the word conversations in quotes because it seemed a lot of times that people weren't even interested in the answers; the questions were just asked because of a strange notion of politeness.

I'm sure it's a universal phenomenon and not at all restricted to the US. But I can't come up with similar examples in Bangalore.

Except for one... for the past three weeks, people have constantly been asking me how preparations are going for the wedding and if I'm all set. Worse still, the same person will ask me the same question two days in a row, as if my 'preparations' change on a minute-to-minute basis.

Just like the conversations above, a lot of times it feels like people are just asking me to be polite. Especially given that I've been telling them that my mom is doing all the preparations. (An upside of having the wedding in India is that my parents and family are doing a lot of the preparations).

Actually I'm kind of nervous that I haven't had to do much. You know that dream where you show up in school and everyone's laughing and you look down to realize you forgot to wear your pants? I have that same feeling. Like I'm going to show up in Pune and realize that I forgot to do something really basic. :)

Anyways, I'm leaving for Delhi this Monday, after which I'll spend a week trekking in gorgeous Ladakh. From there I fly straight to Pune to tie the proverbial knot.

Tuesday, June 07, 2005

phishing

Here's an interesting way to beat phishers and their scams:
    If you get phishing e-mail, go the web sites and enter false data. Make up everything -- name, sign-on name, password, credit card numbers, everything. Instead of one million messages yielding 100 good replies, now the phisher will have one million messages yielding 100,000 replies of which 100 are good, but WHICH 100?

    This technique kills phishing two ways. It certainly increases the phishing labor requirement by about 10,000X. But even more importantly, if banks and e-commerce sites limit the number of failed sign-on attempts from a single IP address to, say, 10 per day, theft as an outcome of phishing becomes close to impossible.
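
For what it's worth, the per-IP limit in that second paragraph is simple to sketch. This is only a toy, in-memory version: the class name and methods are hypothetical, the limit of 10 comes from the quote, and a real site would need shared storage and some way to expire old counts.

```python
class FailedSignOnLimiter:
    """Counts failed sign-on attempts per IP address per day."""

    def __init__(self, max_failures_per_day=10):
        self.max_failures = max_failures_per_day
        self.failures = {}  # (ip, day) -> number of failed attempts

    def allow_attempt(self, ip, day):
        # An IP that has used up its failures for the day is refused outright.
        return self.failures.get((ip, day), 0) < self.max_failures

    def record_failure(self, ip, day):
        key = (ip, day)
        self.failures[key] = self.failures.get(key, 0) + 1
```

With this in place, a phisher working through a pile of mostly fake credentials from one address gets only a handful of tries per day before being locked out.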

we're not that special.

A post by Ming talks about some experiments on monkeys by economists.

We're just another species in the evolutionary race that Nature is hosting.

There's nothing special about humans. There's nothing inferior about the other one hundred million species on the planet.

Monday, June 06, 2005

danger: darwin harmful

Darwin got two honorable mentions here. The damn idiots. Now I'm ashamed at not having read all the books on that list.

Thursday, June 02, 2005

lost moments

I was driving back from work just now, at about 2:30am. I saw this amazing sight on the road. There were 4 people riding bicycles in a rectangular formation in the middle of the road. Across their heads was draped a huge something made of plywood. I couldn't get close enough to see, but it might have been a billboard. I'm guessing the distance between the front and rear cyclists was about 12-15 feet, and the distance between two adjacent cyclists was about 8-10 feet.

Only in India. I wish I'd had a camera with me!