today's extended gmail outage


Dan Kearns

Aug 6, 2008, 5:42:46 PM
to cloud-c...@googlegroups.com

If the CC group had a survey tool we could speculate more efficiently:
  • I support Google since gmail is a pioneering service and it's so new and until more companies are in this space... oh wait.
  • This is why you need an SLA...
  • They must be using (or should be using): rails, coherence, etc. (your pick)
  • The outage is probably due to processing required by a national security letter, but of course they can't tell us that
  • The singularity was reached, and decided "reduce" is more important than "map"
  • Google infrastructure is a large and complicated system, and.....
;-)

On a serious note... as (for example) trustsaas.com doesn't seem to have detected the outage, and obviously my gmail isn't affected, it does bring up the question of whether 3rd party non-cooperative monitoring can actually be capable of correctly determining the availability status of any sufficiently large, complicated, federated service.

-d

Your users may be experiencing issues accessing Google Apps services- As of
1:40:50 PM PDT on August 6, 2008

Services impacted: Email
Our team is working quickly to resolve this situation as soon as possible.
We will continue to post updates here as we learn more.

Sam Johnston

Aug 7, 2008, 2:40:16 AM
to cloud-c...@googlegroups.com
On Wed, Aug 6, 2008 at 11:42 PM, Dan Kearns <d...@thekearns.org> wrote:

> On a serious note... as (for example) trustsaas.com doesn't seem to have detected the outage, and obviously my gmail isn't affected, it does bring up the question of whether 3rd party non-cooperative monitoring can actually be capable of correctly determining the availability status of any sufficiently large, complicated, federated service.

Actually we did get them (this was their first outage since we started making our monitoring public last month), just not for the entire time - did they 'reboot' gmail I wonder or were our users just briefly affected? ;) We were aware of the issue through the partner channel at Google so we could have done manual notifications and also could have set up customer specific probes which would have reliably picked up outages affecting their entire domain.

The way we (currently) work is that we pick as sensitive a point as we can find (ideally something dependent on as many systems as possible - databases, authentication, logic, etc.), and then we poke it something like half a million times a year (1/min). If our poke (in whatever form) fails then we try again from somewhere else, and if that fails we (actually our underlying provider, Pingdom) mark that minute down. That's why our uptime scores are precise. We could have opted to try for a statistically significant sample but (a) what is a statistically significant sample for gmail, (b) we'd probably just get blacklisted and (c) it would push up the price of the service (we're still working those details out - it's only a month old!). That's why our uptime scores aren't necessarily 100% accurate (though they're better than anything else that was out there before and they will definitely improve over time).
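
As a rough sketch of the poke-and-confirm logic described above (this is not Pingdom's actual implementation; the function names and the idea of passing probes in as callables are assumptions made for illustration), a minute is only marked down when the first probe fails and a retry from a second location also fails:

```python
from typing import Callable, Iterable

# A probe is any callable that returns True when the check succeeded.
# Real probes would make a network request; that part is left out here.
Probe = Callable[[], bool]


def minute_is_down(primary: Probe, secondary: Probe) -> bool:
    """Mark the minute down only if both probe locations fail."""
    if primary():
        return False          # first poke succeeded, no retry needed
    return not secondary()    # retry from somewhere else before marking down


def uptime_percent(minutes_down: Iterable[bool]) -> float:
    """Uptime as the share of sampled minutes that were not marked down."""
    flags = list(minutes_down)
    if not flags:
        raise ValueError("no samples to compute uptime from")
    return 100.0 * (len(flags) - sum(flags)) / len(flags)
```

At one poke per minute that is 60 * 24 * 365 = 525,600 samples a year, so a single minute marked down moves the yearly score by roughly 0.0002 percentage points.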

On the other hand, thanks to the very nature of cloud computing, failures are rarely of the 'epic fail' variety - they come in the form of degraded service and/or outages for subsets of users (apparently not the users used for the probes this time round). The result is that we can miss some outages and 'unfairly' punish providers for others, and short of having defined standards for what response times are acceptable (e.g. SLAs for individual requests) we'll hang around for a few seconds longer than your average user waiting for a response - this should tighten up over time.

Our belief is that cloud computing services should always be available (given the architecture 100% uptime is achievable and before today Gmail did have 100% uptime, at least while we were publicly watching them). Accordingly any failure should be considered an 'epic fail' (after all you have no control over whether the downed mailbox is going to be the cleaner or the CEO). This sets the bar for providers where it should be, and like Amazon we believe 'any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect'.

Sam

--
Sam Johnston
Australian Online Solutions
http://www.aos.net.au/

David Kavanagh

Aug 7, 2008, 8:24:41 AM
to cloud-c...@googlegroups.com
I have a corporate gmail account and a personal one. Neither went down for me.

Paco NATHAN

Aug 7, 2008, 8:33:07 AM
to cloud-c...@googlegroups.com
My accounts there were working as well.

Paco

Peter Shenkin

Aug 7, 2008, 10:06:28 AM
to cloud-c...@googlegroups.com
I was unable to find anything on the web about a gmail outage at that time.

The initial message posted here seemed to refer to a failure of email
services in the Google Apps context, which is something different.

That message read:

> Your users may be experiencing issues accessing Google Apps services- As of
> 1:40:50 PM PDT on August 6, 2008
>
> Services impacted: Email
> Our team is working quickly to resolve this situation as soon as possible.
> We will continue to post updates here as we learn more.

-P.

Jason N. Meiers

Aug 7, 2008, 11:27:01 AM
to cloud-c...@googlegroups.com
This is an interesting video of how Google designs large architectures in the cloud. Check out minute 35:23, which covers how they do release management for 1 million servers. It may make sense to have actual utility analysis and diagnostics for your cloud application rather than just pinging google.com.

Min 35:23

Q: How does Google push changes to applications in production?
A: " ... applications just pick up changes eventually ... few weeks ..."

http://video.google.com/videoplay?docid=-5699448884004201579

Hope this helps.

JM

Monitoring-as-a-Service(TM)

http://www.utilitystatus.com

Bob Lozano

Aug 7, 2008, 5:33:03 PM
to cloud-c...@googlegroups.com
Gmail Suffers 15-Hour Partial Fail
Gmail is recovering from a fairly long-lasting partial fail.

The first posting came at 1:06 p.m. Wednesday:
"The Gmail Team is currently aware of a subset of users being affected by the 502 error on login. Our engineers are actively investigating the issue, and we will provide updates as soon as we have them. We appreciate your patience, and we apologize for any inconvenience this may have caused."



--
Bob Lozano

Appistry
www.appistry.com/blogs/bob (professional blog)
hopeitis.com (personal blog)

Cameron

Aug 7, 2008, 3:38:53 PM
to Cloud Computing
> .. it does bring up the question of whether 3rd party non-cooperative
> monitoring can actually be capable of correctly determining the
> availability status of any sufficiently large, complicated, federated
> service.

it brings up the question of what a "service" is when "the service" is
much bigger than "the system" ;-)

this seems eerily familiar considering the S3 outage .. systems that
easily span beyond the boundaries of data centers ..

Peace,

Cameron Purdy | Oracle
http://www.oracle.com/technology/products/coherence/index.html

On Aug 6, 5:42 pm, "Dan Kearns" <d...@thekearns.org> wrote:
> If the CC group had a survey tool we could speculate more efficiently:
>
> - I support Google since gmail is a pioneering service and it's so new
> and until more companies are in this space... oh wait.
> - This is why you need an SLA...
> - They must be using (or should be using): rails, coherence, etc. (your
> pick)
> - The outage is probably due to processing required by a national
> security letter, but of course they can't tell us that
> - The singularity was reached, and decided "reduce" is more important
> than "map"
> - Google infrastructure is a large and complicated system, and.....

Bob Lozano

Aug 7, 2008, 6:17:28 PM
to cloud-c...@googlegroups.com
I definitely agree - what does a partial outage mean? How do you measure it accurately?

It'll be easy to catch the big, explosive total outages - you can read about them just about anywhere - but what about the rolling outage, sporadically affecting say 20% of the customers?

Those are far more difficult to detect (much less accurately characterize), and consequently it is far harder to write a meaningful, enforceable SLA for them and live by it.
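
To put a rough number on why partial outages slip past outside monitoring, here is a back-of-the-envelope sketch. It assumes (and this is only an assumption, not something known about how Gmail outages are distributed) that each probe account is affected independently with the same probability as any other user; the function name is made up for illustration.

```python
def detection_probability(affected_fraction: float, probe_accounts: int) -> float:
    """Chance that at least one probe account is among the affected users.

    Assumes each probe account is hit independently with probability
    `affected_fraction`, which real outages (often clustered by shard or
    data center) may well violate.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if probe_accounts < 0:
        raise ValueError("probe_accounts must be non-negative")
    # P(at least one hit) = 1 - P(every probe account is unaffected)
    return 1.0 - (1.0 - affected_fraction) ** probe_accounts
```

Under that assumption, an outage hitting 20% of customers is seen by a single probe account only 20% of the time, and it takes about ten independent probe accounts to get to roughly an 89% chance of noticing it at all.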

Bob

jamesurquhart

Aug 8, 2008, 12:05:29 PM
to Cloud Computing
It just occurred to me that an interesting feature that a cloud
provider could offer would be a running list of their support
organization's top priority calls. What is garnering the most
attention from the technical team at any given moment? That way, if
an outage call gets escalated because it has been verified, it would
immediately show up at the top of the list (one would assume).

Another way would be to have an RSS feed of all cases escalated to
"Severity 1" (perhaps with customer details blacked out), which
anyone could subscribe to and be informed of critical support events.
The events would be deemed critical just by the fact that they were
raised to top priority.

Just brainstorming. Maybe these are both bad ideas.

James

Cesar

Aug 8, 2008, 4:35:54 PM
to Cloud Computing
The gmail outage affected a subset of users...I was one of them. It
lasted over 15 hours on Wednesday, August 6th. All users affected got
the infamous "502 Temporary Error". There wasn't a lot of press
coverage, probably because it was only a "subset" of users...perhaps a
few thousand? However, following the apps blog that day, there were
complaints from users with corporate accounts who were unable to
access gmail for hours. Communication from the gmail team was
sporadic and general, at best. Given the level of redundancy and
fault tolerance inherent in Google's architecture, it is puzzling why
this outage lasted so long and why it took the ops team so many
hours to identify the problem. We did get an apology, but no
explanation of the cause. Lastly, this type of outage (502 Error) has
occurred a number of times in the past few months with gmail, to
different subsets of users. I do agree that in cloud computing 100%
availability is of paramount importance...failures such as these (12+
hours), even for a small number of users, are a failure of the system.

-Cesar

On Aug 7, 3:17 pm, "Bob Lozano" <bobloz...@gmail.com> wrote:

David

Aug 9, 2008, 7:45:39 AM
to Cloud Computing

Application monitoring is never complete when performed from outside.
I assume services like TrustSaas will find it hard to monitor ALL
Gmail accounts. They can, however, report a general outage of a
service.

These monitoring services use synthetic transaction technologies in a
polling mode (e.g. mimicking common Gmail transactions such as
creating a new account, logging in, sending/receiving an email).
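
A minimal sketch of what such a synthetic transaction check might look like follows. The step names and the runner function are hypothetical, and the real steps would drive a browser or mail client, which is left out here. Each step runs in order and the check reports the first step that fails.

```python
from typing import Callable, List, Optional, Tuple

# Each step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_synthetic_transaction(steps: List[Step]) -> Optional[str]:
    """Run the steps in order and return the name of the first failing step.

    Returns None when every step succeeds. An exception raised by a step
    is treated as a failure of that step.
    """
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False
        if not ok:
            return name       # later steps depend on this one, so stop here
    return None
```

A polling monitor would call this once per interval with steps such as ("login", ...), ("send", ...), ("receive", ...) and record the returned step name as the point of failure.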

To get a complete picture, one must also monitor the SaaS system from
the inside either by using the SaaS vendor's API or understanding the
SaaS architecture and monitoring the underlying systems via
performance and availability KPIs (Key Performance Indicators).


David Habusha | Omroyo
http://www.omroyo.com

On Aug 7, 10:38 pm, Cameron <cameron.pu...@gmail.com> wrote:
> > .. it does bring up the question of whether 3rd party non-cooperative
> > monitoring can actually be capable of correctly determining the
> > availability status of any sufficiently large, complicated, federated
> > service.
>
> it brings up the question of what a "service" is when "the service" is
> much bigger than "the system" ;-)
>
> this seems eerily familiar considering the S3 outage .. systems that
> easily span beyond the boundaries of data centers ..
>
> Peace,
>
> Cameron Purdy | Oracle
> http://www.oracle.com/technology/products/coherence/index.html

Krishna Sankar (ksankar)

Aug 9, 2008, 12:12:25 PM
to cloud-c...@googlegroups.com
Without any comments ;o)

http://www.gridtoday.com/grid/2448025.html

" ... these technologies and trends offer users flexibility,
adaptability, maximum utilization and, perhaps most importantly, access
to resources when and where they are needed.

But they are not grid computing. As complementary as
virtualization and grid computing might be, no one would call them one
and the same. While cloud computing and grid computing share a common
architecture, they are not necessarily interchangeable.

And unlike grid computing, which seems to have settled into its
niche powering CPU-intensive, and often time-sensitive, tasks like film
rendering and electronic trading -- a fine niche to be in, no doubt --
enterprises are leveraging this new batch of technologies company-wide.
When it comes to cloud computing, there are those among us who predict
it will become the dominant computing paradigm within the next decade."

Cheers
<k/>

Reuven Cohen

Aug 9, 2008, 12:18:40 PM
to cloud-c...@googlegroups.com


I had a chat with Dennis Barker about this very topic about two months
ago; this was right around the time I started to formulate my "Grid is
Dead" theory.
I'm happy to see that others are thinking along the same lines.

Btw, I never said parallelism or distributed computing is dead, just
the term Grid.

Ruv
