Massive EC2 outage


Jeff Schnitzer

Apr 21, 2011, 1:02:54 PM
to Google App Engine
I'm not suggesting schadenfreude here, but for all those folks
doubting the viability of appengine for reliability reasons:

http://eu.techcrunch.com/2011/04/21/amazon-ec2-goes-down-taking-with-it-reddit-foursquare-and-quora/

Amazon's Northern Virginia datacenter tripped and fell over early
this morning, and several major sites (Foursquare and Quora) are
still down more than *eight hours* later. Ouch.

You can read the gory details here: http://status.aws.amazon.com/

Jeff

Derrick Schneider

Apr 21, 2011, 1:04:52 PM
to google-a...@googlegroups.com
While it's certainly unfortunate for companies that have bet on EC2, they'd be smarter to distribute their servers a bit so that a single data center outage does not take out their entire company.


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.




--
Writer. Programmer. Puzzle Designer.
http://www.obsessionwithfood.com

Jeff Schnitzer

Apr 21, 2011, 1:14:24 PM
to google-a...@googlegroups.com
Easy to say, not so easy to do :-)

All that frantic running-around-in-panic that GAE engineers and ops
people do when something goes wonky inside the datastore? That would
be you right now, trying to figure out how to fail over your database
(and any other persistent data) to a different datacenter... and then
when the original comes back up you get to merge any lost
transactions.

I've been there and I'm happy to leave this job to the professionals.

Jeff

Derrick Schneider

Apr 21, 2011, 1:31:38 PM
to google-a...@googlegroups.com
Not easy, sure, though EC2 has plenty of support for different availability zones (as does RDS), so it's not impossible. You do pay normal bandwidth charges between availability zones, as opposed to getting it for free.

So I guess it boils down to figuring out how much outages cost you. In the case of these businesses, probably not enough to warrant investing the time and skills to have plans and infrastructure for small-scale disasters like this, given that they're relatively uncommon.
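
The trade-off above can be sketched as simple arithmetic: compare the expected yearly loss from outages with the yearly cost of running redundant infrastructure. This is only a rough sketch; the function names and every number below are hypothetical and are not taken from any of the affected companies.

```python
def expected_outage_cost(outages_per_year, hours_per_outage, loss_per_hour):
    """Expected yearly loss from outages, in the same currency as loss_per_hour."""
    return outages_per_year * hours_per_outage * loss_per_hour


def redundancy_is_worth_it(outages_per_year, hours_per_outage, loss_per_hour,
                           redundancy_cost_per_year):
    """True when the expected outage loss exceeds the yearly cost of redundancy."""
    expected = expected_outage_cost(outages_per_year, hours_per_outage, loss_per_hour)
    return expected > redundancy_cost_per_year


# Hypothetical inputs: one 12-hour outage every two years, losing 1,000 per hour,
# against 20,000 per year for a multi-zone setup.
print(redundancy_is_worth_it(0.5, 12, 1_000, 20_000))
```

With these made-up inputs the expected loss is 6,000 per year, well under the 20,000 redundancy cost, which is the situation described above. A business losing far more per hour would land on the other side.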

Derrick

Calvin

Apr 21, 2011, 2:16:57 PM
to google-a...@googlegroups.com
Schadenfreude was exactly the word I was going to use when I considered posting this news.  I love that word.

I've got to say that I feel really good after reading the outage postmortem yesterday and then hearing this news this morning.

Whenever there's an App Engine outage, or increase in error rate, it seems really dire to us because we get a lot of people charging to this forum to alert people to the outage. But yesterday's postmortem explained that while the March 8th outage lasted a while, it didn't affect everyone, and Google was aware of the problem pretty quickly.

For me the bottom line is that I'd rather have a Googler's pager go off when there's a problem than have to set up elaborate failsafes on EC2.

saidimu apale

Apr 21, 2011, 4:58:41 PM
to google-a...@googlegroups.com

> I've been there and I'm happy to leave this job to the professionals.

It sure helps that you have a batphone to the GAE datastore team. Or is that a different Jeff Schnitzer?

saidimu

saidimu apale

Apr 21, 2011, 5:01:22 PM
to google-a...@googlegroups.com
It's just a matter of time before GAE has another major outage. Will that materially change your decision to be on GAE? I don't think it should; the cloud is young and has not (yet, if ever) reached "dial-tone" reliability.

saidimu


vlad

Apr 21, 2011, 5:20:46 PM
to google-a...@googlegroups.com
I wish appengine outages would make national news. When that happens GAE is a success story.

Jeff Schnitzer

Apr 21, 2011, 6:12:17 PM
to google-a...@googlegroups.com
On Thu, Apr 21, 2011 at 1:58 PM, saidimu apale <sai...@gmail.com> wrote:
>
> It sure helps that you have a batphone to the GAE datastore team. Or is that
> a different Jeff Schnitzer?

Don't interpret that too literally. A couple of the Google developers
have been receptive to some API changes that would make Objectify work
better (Future interception, and a way to expose entity version
timestamps), but my apps run on the same hardware yours do. When I
have production questions, I post them here.

Jeff

saidimu apale

Apr 21, 2011, 7:09:58 PM
to google-a...@googlegroups.com
FWIW, here's DotCloud's take on the outage, which is still affecting them:

http://blog.dotcloud.com/working-around-the-ec2-outage

saidimu

Robert Kluin

Apr 22, 2011, 1:33:17 AM
to google-a...@googlegroups.com
Hopefully they can get everything sorted out soon.

Depends on your app, but having a bit of downtime sprinkled across the
year in small doses is probably easier to handle than a massive outage
that shuts down a site for a day.

Any chance a Googler could comment about how well distributed and
independent the HR datacenters are? Would HR apps still be up if the
state (or maybe the western half of the US) holding the primary HR
datacenter suddenly lost all power and connectivity?


Robert

Philip

Apr 22, 2011, 4:53:32 AM
to Google App Engine
At least Amazon has announced that Skynet has nothing to do with this
outage: https://forums.aws.amazon.com/message.jspa?messageID=238872#jive-message-238934

What surprises me is that this outage is not covered by any SLA,
according to Amazon. I don't think it's smart to refuse compensation,
especially since Amazon is most likely insured for such an outage.

On Apr 21, 7:02 pm, Jeff Schnitzer <j...@infohazard.org> wrote:
> I'm not suggesting schadenfreude here, but for all those folks
> doubting the viability of appengine for reliability reasons:
>
> http://eu.techcrunch.com/2011/04/21/amazon-ec2-goes-down-taking-with-...

Alfred

Apr 22, 2011, 3:09:19 PM
to Google App Engine
(I swear I didn't pay Robert to ask this question)

The High Replication Datastore can lose multiple data centers and
still function (without any data loss, downtime or elevated error
rates). It is also not dependent on a single geographical region. Of
course the availability of an App Engine application is dependent on
the entire App Engine stack (not just the datastore), so if a problem
affects other parts of the stack there might be a short period of
user-visible issues (likely just the time it takes to realize there is
a problem, which is typically pretty short).

Matt and I will be talking about how this works in detail in the
Google I/O session: "More 9s Please: Under The Covers of the High
Replication Datastore".

- Alfred

On Apr 21, 10:33 pm, Robert Kluin <robert.kl...@gmail.com> wrote:
> Hopefully they can get everything sorted out soon.
>
> Depends on your app, but having a bit of downtime sprinkled across the
> year in small doses is probably easier to handle than a massive outage
> that shuts down a site for a day.
>
> Any chance a Googler could comment about how well distributed and
> independent the HR datacenters are?  Would HR apps still be up if the
> state (or maybe the western half of the US) holding the primary HR
> datacenter suddenly lost all power and connectivity?
>
> Robert
>
> On Fri, Apr 22, 2011 at 08:09, saidimu apale <said...@gmail.com> wrote:
> > FWIW, here's DotCloud's take on the outage which is still affecting them:
> >http://blog.dotcloud.com/working-around-the-ec2-outage
>
> > saidimu

kowsik

Apr 21, 2011, 2:17:30 PM
to google-a...@googlegroups.com
It's less about the data center going down and more about app design
and distributing it across regions. All the apps that relied on PaaS
vendors that provisioned only in the us-east-1 region are hurting
right now (including us). If only PaaS offerings enabled regional
affinity along with DNS fail-over to alternate regions, that would be
awesome. Does GAE do this?
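
The fail-over idea above can be illustrated with a minimal sketch: given regions in order of preference and some health check, route to the first region that responds. Nothing here is a real PaaS or DNS API. `pick_region`, the `is_healthy` callback, and the choice of us-west-1 as the alternate region are all assumptions made for illustration; a real setup would plug in actual health probes and update DNS records with the chosen region.

```python
from typing import Callable, Optional, Sequence


def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check.

    Returns None when every region fails the check.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical scenario: the preferred region is down, the alternate is up.
down = {"us-east-1"}
chosen = pick_region(["us-east-1", "us-west-1"], lambda r: r not in down)
print(chosen)
```

Note that this only covers routing. As discussed earlier in the thread, the harder part is failing over persistent data and merging it back afterwards.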

K.
---
http://blitz.io (currently down)
http://twitter.com/pcapr


Ikai Lan (Google)

Apr 25, 2011, 2:31:34 PM
to Google App Engine
In the event of a primary data center outage, we would fail over to a secondary data center. If you are using the High Replication datastore, you should not experience any downtime while this happens. In addition, any data that is successfully committed is guaranteed to have been written.

Ikai Lan 
Developer Programs Engineer, Google App Engine

Brandon Wirtz

Apr 25, 2011, 2:54:39 PM
to google-a...@googlegroups.com

If aliens attack and shut down all the communication satellites will High Replication protect us?

Geoffrey Spear

Apr 25, 2011, 3:41:01 PM
to Google App Engine


On Apr 25, 2:54 pm, "Brandon Wirtz" <drak...@digerat.com> wrote:
> If aliens attack and shut down all the communication satellites will High
> Replication protect us?

Unless they take out lots of fiber optic cables as well, they won't
have much effect. Satellites introduce way too much latency.

Alfred Fuller

Apr 25, 2011, 4:15:28 PM
to google-a...@googlegroups.com
Ya, I would worry more about orbital to ground attacks, and even then they would probably target high population areas, military bases or governing bodies (or the power plants/facilities near these targets). In this case I think the odds are more that users will lose the ability to access your site from their end, rather than a high replication datastore issue.



Brandon Wirtz

Apr 25, 2011, 7:21:56 PM
to google-a...@googlegroups.com

I was hoping to get some Googler to reply so I could quote them in an article that said “Google says GAE Engineered to Survive Attacks By Aliens”

kowsik

Apr 25, 2011, 7:29:29 PM
to google-a...@googlegroups.com
If aliens attacked, they would probably cover up all the clouds with
their war ships. So, doubt if replication would work, unless it's a
basement private cloud. :)

K.
---
http://blitz.io
http://twitter.com/pcapr

Stephen

Apr 26, 2011, 5:27:05 AM
to google-a...@googlegroups.com
On Tue, Apr 26, 2011 at 12:21 AM, Brandon Wirtz <dra...@digerat.com> wrote:
> I was hoping to get some Googler to reply so I could quote them in an
> article that said “Google says GAE Engineered to Survive Attacks By Aliens”


On the prospect of Appengine suffering an outage due to alien attack:

"the odds are more that users will lose the ability to access your
site from their end, rather than a high replication datastore issue"

-- Alfred Fuller, Google Appengine Team
