Gmail down

Gmail went down for the majority of its tens of millions of users on Tuesday, September 1, 2009. The Boston Globe’s story is available here, and a Google News search for “gmail” returns many valuable results, at least for now. When it hit me, I assumed one of my Gmail Labs features had broken, until I found that others were experiencing it too. Google said it had taken some servers offline for a routine upgrade and underestimated the extra load the updates placed on its request routers. The Gmail team was alerted to the problem within seconds but had to continue the maintenance. Google’s report is here.

It says that:

As a result, at approximately 12:30 PDT, a few request routers became overloaded and responded by refusing all incoming requests.

On Tuesday, September 1, a small portion of Gmail’s web capacity was taken offline during a routine upgrade and service update. This is normal operating procedure as the Gmail web interface runs in multiple locations, and Gmail’s request routing automatically directs users’ requests to available servers. However, we underestimated the increased load that some of the new updates placed on request routing.

As a result, at approximately 12:30 PDT, a few request routers became overloaded and responded by refusing all incoming requests. This response transferred the load to the other request routers, and as the effect rippled through the system, almost all of the request routers became overloaded. As a result, users could not access Gmail through the web interface since their requests could not be routed to a Gmail server. Gmail processing and access through the IMAP/POP interfaces continued as usual because these processes use different request systems.

Upon receiving the error alerts, the Gmail Engineering team immediately began analyzing the issue and initiated a series of actions to help alleviate the symptoms. After determining the root cause to be insufficient available capacity, the Engineering team deployed a large-scale addition of request routers through Google’s flexible capacity server systems. As they distributed incoming traffic across the expanded pool of request routers, access to the Gmail web interface returned to normal.
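The cascade described in the report can be reproduced with a toy model. This is a hypothetical sketch, not Google's design: it assumes traffic is spread evenly across the routers still accepting requests, and that a router whose share exceeds its capacity refuses everything from then on, as the report says the overloaded routers did. The function name and the router capacities are made up for illustration.

```python
def simulate_cascade(total_load: float, capacities: list[float]) -> list[bool]:
    """Return, for each router, whether it still accepts requests once the
    cascade has settled.

    Toy model (an assumption, not Google's actual design): load is split
    evenly across the routers still accepting requests, and any router whose
    share exceeds its capacity refuses *all* requests from then on.
    """
    up = [True] * len(capacities)
    while True:
        alive = [i for i, ok in enumerate(up) if ok]
        if not alive:
            return up  # every router has failed
        share = total_load / len(alive)
        overloaded = [i for i in alive if share > capacities[i]]
        if not overloaded:
            return up  # stable: each remaining router copes with its share
        for i in overloaded:
            # The router refuses everything, so its share moves to the others.
            up[i] = False


# Hypothetical pool of ten routers with a total capacity of 1000 units.
CAPS = [80, 90, 100, 100, 100, 100, 100, 100, 110, 120]

print(sum(simulate_cascade(800, CAPS)))  # → 10 routers still up
print(sum(simulate_cascade(900, CAPS)))  # → 0 routers still up
```

At 900 units the pool is running at only 90% of its total capacity, yet the two weakest routers tip over first, their refused traffic overloads the next tier, and every router ends up down: a few routers fail, and the effect ripples through the system.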

And they responded by:

After determining the root cause to be insufficient available capacity, the Engineering team deployed a large-scale addition of request routers through Google’s flexible capacity server systems. As they distributed incoming traffic across the expanded pool of request routers, access to the Gmail web interface returned to normal.
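Adding request routers and spreading traffic across the larger pool is, at bottom, a capacity-planning question. The sketch below is an illustration under stated assumptions, not Google's method: it sizes a pool of identical routers so each stays under a target utilization even with some routers taken offline for maintenance. The function name, the 70% default target, and the numbers are all hypothetical.

```python
import math


def routers_needed(peak_load: float, per_router_capacity: float,
                   target_utilization: float = 0.7,
                   offline_for_maintenance: int = 0) -> int:
    """Hypothetical sizing rule: enough routers to carry peak_load at no more
    than target_utilization of each router's capacity, plus extra routers to
    cover the ones taken offline for maintenance."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_per_router = per_router_capacity * target_utilization
    return math.ceil(peak_load / usable_per_router) + offline_for_maintenance


print(routers_needed(900, 100))                             # → 13
print(routers_needed(900, 100, offline_for_maintenance=2))  # → 15
```

The headroom below 100% per router is what absorbs an underestimate like the one in the report, where the updates themselves added load to request routing.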

It seems that request routers which respond to too much traffic by refusing all of it, pushing their whole load onto their neighbors, are now a problem of insufficient capacity.
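The complaint is about the overload policy, not the head count. The sketch below contrasts two policies for a single router; the policy names and the function are hypothetical, and neither is claimed to be what Gmail's routers actually implement. "refuse_all" mirrors the behavior in the report, while "shed_excess" is a common load-shedding alternative that serves up to capacity and rejects only the remainder.

```python
from typing import Literal

Policy = Literal["refuse_all", "shed_excess"]


def handle(load: float, capacity: float, policy: Policy) -> tuple[float, float]:
    """Return (served, rejected) for one router under a hypothetical policy.

    In this sketch, the rejected amount is the load that moves on to other
    routers when clients retry elsewhere.
    """
    if load <= capacity:
        return load, 0.0  # not overloaded: serve everything
    if policy == "refuse_all":
        return 0.0, load  # the behavior described in the report
    if policy == "shed_excess":
        return capacity, load - capacity  # serve up to capacity, reject the rest
    raise ValueError(f"unknown policy: {policy}")


# A router with capacity 100 receiving 110 units of traffic (10% over).
print(handle(110, 100, "refuse_all"))   # → (0.0, 110)
print(handle(110, 100, "shed_excess"))  # → (100, 10)
```

A router 10% over capacity hands 110 units to its neighbors under the first policy and only 10 under the second. That difference is what lets one overloaded router take down the next, or keeps the overload local.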

Next up: SPAM, SPAM, SPAM, SPAM, SPAM…
