Note: I followed this story because I've been considering outsourcing (cloud computing) several core applications and services. When a provider like Google has a massive outage like this, it requires that I further scrutinize hosted service providers.
---
Updated at 12:25 p.m. PDT with word that Google has confirmed an
error on its end caused the outage, and at 3:30 p.m. PDT with Google's
comment on McAfee's description of the events.
Widespread outages involving several Google services--including
search, Google Docs, and Gmail--were caused by an upgrade gone awry
inside of Google, according to engineers.
Dmitri Alperovitch, vice president of threat research for McAfee, said
that Google this morning attempted to make changes to key Internet
routing numbers--known as autonomous system numbers--as
part of its ongoing transition from an older networking standard to a
newer one called IPv6. An unknown "bug" inside Google's network
involving some sort of hardware failure or glitch prevented Internet
service providers from finding Google's new ASNs on the Internet--effectively sealing it off from many customers, he said.
Not all Internet users were affected, but some that use larger
providers--such as AT&T or Verizon--appeared to be
disproportionately hurt because large ISPs "peer" with Google, or
interconnect their networks with Google's networks in order to improve
speed and reduce bandwidth costs, Alperovitch said. Not all customers at
those providers were affected, and smaller ISPs that didn't
interconnect their networks were able to route around the problem. But
just like when a bad car accident shuts down a key highway, the ripple effects were felt elsewhere.
The outage began at 8:13 a.m. PDT, according to McAfee's data, and was
fixed by 9:14 a.m. PDT. The issue was discussed in forums dedicated
to ISPs and their engineers, such as the North American Network Operators Group. McAfee's customers reported the issue to the security company, which monitors network traffic for some customers.
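
For readers who want the length of the outage window, the two times McAfee reported can be subtracted directly. The snippet below is only an illustrative sketch: the variable names are made up for this example, and it assumes both timestamps fall on the same morning in the same time zone (PDT), as the article states.

```python
from datetime import datetime

# Start and end times as reported in McAfee's data (both PDT, same morning).
# Written in 24-hour form so parsing does not depend on locale AM/PM names.
TIME_FORMAT = "%H:%M"
outage_start = datetime.strptime("08:13", TIME_FORMAT)
outage_end = datetime.strptime("09:14", TIME_FORMAT)

# Difference between the two timestamps, expressed in whole minutes.
outage_minutes = int((outage_end - outage_start).total_seconds() // 60)
print(f"Outage window: {outage_minutes} minutes")
```

By McAfee's figures, that works out to 61 minutes, or just over an hour.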
Google is a major fan of IPv6
and makes many of its services available through the new network
technology. However, IPv6 has been slow to arrive overall, in part
because it's a very difficult transition from the current IPv4 network.
Google spokesman Eitan Bencuya wouldn't confirm what caused the
problem but said the company plans to detail what happened in a company
blog to be published "shortly."
Updated at 12:25 p.m. PDT: Google has confirmed that "an error in
one of our systems caused us to direct some of our Web traffic through
Asia, which created a traffic jam." In a blog post, the company did not
elaborate on what caused the error but said just 14 percent of users were affected.
"We've been working hard to make our services ultrafast and 'always
on,' so it's especially embarrassing when a glitch like this one
happens. We're very sorry that it happened, and you can be sure that
we'll be working even harder to make sure that a similar problem won't
happen again," Google wrote.
Updated at 3:30 p.m. PDT: Google has denied that work on the
transition to IPv6 is to blame for this morning's outage, but will not
specify what was to blame. "This issue is unrelated to any work we are
doing in transitioning toward supporting IPv6," a spokesman said.
McAfee said it obtained its information from Google on a private
mailing list for networking professionals of which Google is a member,
but declined to provide a copy of the thread in question.