I was one of the many affected by yesterday’s Gmail outage.
The reason for all of these problems?
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
It would seem that Gmail and other Google services lead a pretty fragile existence if a few overloaded servers can cascade through the rest of the system and cause a massive outage. Flexibility and scalability are a two-way street: the same automatic redirection of traffic that makes it easy to add capacity can spread an overload just as easily.
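To make the cascade concrete, here is a toy simulation of the failure mode described above. It is only a sketch under my own assumptions, not a model of Google's actual routers: the function name `simulate_cascade`, the capacity numbers, and the rule that load is split evenly among whichever routers are still accepting traffic are all invented for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Return how many routers are still accepting traffic after each round.

    Hypothetical model: the total load is split evenly across the active
    routers, and any router whose share exceeds its capacity stops
    accepting traffic, pushing its share onto the rest.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can handle its share: stable
        active = survivors
        history.append(len(active))
    return history


# Ten routers with made-up capacities from 100 to 145 (1225 in total).
routers = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_cascade(routers, 950))   # [10]
print(simulate_cascade(routers, 1010))  # [10, 9, 7, 1, 0]
```

In the second run the total load (1010) is still well under the combined capacity (1225), yet one router tipping over is enough to take out all ten within a few rounds. That is the "slightly underestimated the load" scenario in miniature.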
Realistically, will this outage force me to change my mail provider? No. Will I continue to use (and be productive with) the Google Tools? Of course. But they took a hefty hit on their stock price yesterday because of this outage. It will be interesting to see the short- and longer-term ramifications of yesterday’s problems.