the other side of twitter failures

by Christopher Blizzard

Update: I just added twitter.com to /etc/hosts and pointed it at a site that doesn’t have a webserver. Works for now until twitter comes back.

Having some lunch and I thought it might be worth a small post while my burrito cools.

I just had to disable polling on whoisi because twitter is down. Again. Whoisi’s polling system, in case you were wondering, is basically as dumb as a wooden post right now. I’m not trying to pretend that I thought it would work forever, nor that it was very good. But it works well as long as the internet is pretty healthy and the failures are spread evenly among sites on the web.

Here’s how the poll system works right now for each site in the database: refresh every site every n minutes where n is a random number between 1 and 30. That’s it. And it does that for everything. No backoff, no per-site limits, etc. It’s easy to plug that kind of thing into the code, but it’s yet another thing on the “not yet done list.” Designed to be smart, but without the brains behind it.
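As a rough illustration (a sketch, not whoisi's actual code), the schedule described above fits in a few lines of Python. The function names, the doubling rule, and the 480-minute cap are all hypothetical; the second function only shows what a simple backoff could look like if it were plugged in.

```python
import random

MIN_MINUTES = 1
MAX_MINUTES = 30


def next_refresh_delay(rng=random):
    """Minutes until the next refresh: a random integer from 1 to 30."""
    return rng.randint(MIN_MINUTES, MAX_MINUTES)


def next_refresh_delay_with_backoff(consecutive_failures, rng=random, cap_minutes=480):
    """Hypothetical backoff: double the random delay for each consecutive
    failure of a site, but never wait longer than cap_minutes."""
    base = rng.randint(MIN_MINUTES, MAX_MINUTES)
    return min(base * (2 ** consecutive_failures), cap_minutes)
```

With no failures the second function behaves like the first; a site that keeps failing drifts toward the cap instead of being retried every few minutes.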

You also have to understand how jobs are run. Jobs come from two sources, the “master service” (which I’ll describe in a later post) and the web site. But they all run through the same job queue. So when you try and add a new person it tries to go out and make a little preview of the site. That job has to compete with site refreshes that are also underway. The limit on the number of jobs that can be run at the same time is also dumb. Right now it’s 50 at once. Not 3/sec or 50 waiting for I/O, just 50 in progress.
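The "50 in progress" limit can be sketched with a semaphore. This is only an illustration under assumptions: it uses Python's asyncio rather than whatever whoisi really runs on, and `run_jobs` and `MAX_IN_PROGRESS` are made-up names. Previews and refreshes would both go through the same call, which is why they compete for slots.

```python
import asyncio

MAX_IN_PROGRESS = 50  # the limit from the post: 50 jobs in progress at once


async def run_jobs(jobs, limit=MAX_IN_PROGRESS):
    """Run every job (a zero-argument coroutine function) with at most
    `limit` in progress at a time. Returns (results, peak_in_progress)."""
    semaphore = asyncio.Semaphore(limit)
    in_progress = 0
    peak = 0

    async def run_one(job):
        nonlocal in_progress, peak
        async with semaphore:  # wait here until one of the slots is free
            in_progress += 1
            peak = max(peak, in_progress)
            try:
                return await job()
            finally:
                in_progress -= 1

    results = await asyncio.gather(*(run_one(job) for job in jobs))
    return results, peak
```

Note that a slot is held for as long as the job takes, so a job that hangs on a dead server blocks a slot just as firmly as one doing real work.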

So when you have a few hundred twitter accounts you’re polling and they fail by having to time out, the queue gets backed up. Given how many people are adding accounts right now, I thought it was better for the site to stay responsive than for feeds to keep refreshing. It’s a tough choice but it’s how it is until twitter recovers from whatever its latest pain is.
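A back-of-the-envelope way to see the backlog, assuming the worst case where every failing poll is queued at once and each one holds a slot until it times out. The 30-second timeout and the count of 300 accounts below are illustrative assumptions, not numbers from whoisi.

```python
import math


def seconds_slots_are_blocked(failing_jobs, seconds_per_failure, slots=50):
    """Worst-case time the queue spends on failing jobs when they all
    arrive together and each takes seconds_per_failure to give up."""
    waves = math.ceil(failing_jobs / slots)  # batches of `slots` jobs
    return waves * seconds_per_failure


# 300 failing polls, each hanging for an assumed 30-second timeout:
slow = seconds_slots_are_blocked(300, 30)
# The same 300 polls if each failure came back within a second:
fast = seconds_slots_are_blocked(300, 1)
```

Under these assumptions the slow case ties up every slot for 180 seconds, while the fast case clears in 6, and a preview job submitted in the meantime waits behind all of it.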

I wish that twitter would fail by giving an immediate 500 or even a connection refused. The slow death of waiting for a response is basically the worst possible thing that can happen. Fail faster. Please.
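On the client side, the closest thing to "fail faster" is a hard deadline on each job. The sketch below is hypothetical and not something whoisi does today: it wraps a job in `asyncio.wait_for` so a hung request gives its slot back after `deadline_seconds` instead of waiting out the full network timeout. The name `run_with_deadline` and the choice to return `None` on timeout are assumptions.

```python
import asyncio


async def run_with_deadline(job, deadline_seconds):
    """Run job (a zero-argument coroutine function); give up and return
    None if it has not finished within deadline_seconds."""
    try:
        return await asyncio.wait_for(job(), timeout=deadline_seconds)
    except asyncio.TimeoutError:
        # The job was cancelled; the caller can count this as a failure.
        return None
```

This only limits the damage. An immediate 500 or a refused connection from the server would still be cheaper, since the job would not hold a slot for even the length of the deadline.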

Not that I should throw stones for even a second, given how dumb my code is. But it’s a lesson in what happens when a (dare I say important?) service dies.