Amazon outage and ramifications

Rakesh

Jul 2, 2012, 5:35:09 AM

to javaposse

Hi,

I'm sure you've all heard about the outage over the last few days at Amazon, specifically in an availability zone of their US-East-1 (Northern Virginia) region.

What I'm struggling to find out is why it caused such massive outages at Netflix and Heroku.

I'm assuming that neither Heroku nor Netflix was incompetent enough to use just one zone.

Anyone able to shed light on this?

Rakesh

Steel City Phantom

Jul 2, 2012, 6:33:32 AM

to java...@googlegroups.com

I don't know if it's still relevant, but MAE East is located in Tysons Corner, Virginia. MAE East at one point routed over 50% of US internet traffic and over 80% of overseas traffic via the Global Crossing fiber-optic lines running to Europe. Take out that building and, yeah, you have major outages.

But again, this was 10 years ago; I don't know if it's still accurate today.

--
You received this message because you are subscribed to the Google Groups "Java Posse" group.
To post to this group, send email to java...@googlegroups.com.
To unsubscribe from this group, send email to javaposse+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/javaposse?hl=en.

--
You want it fast, cheap, or right. Pick two!!

Robert Casto

Jul 2, 2012, 8:26:44 AM

to java...@googlegroups.com

Amazon's backup systems don't seem to have worked, and many people said the automated failover to other data centers didn't work either. It was also supposed to be a short outage but turned into a very long one that corrupted EBS volumes. It took hours to recover databases and file systems. A power outage of this type should have been a non-event. A data center near me lost power, went to generators, and didn't have any issues at all. I'm wondering if they grew the data center larger than their backup systems could handle.

Robert Casto
www.robertcasto.com
www.sellerstoolbox.com

www.lakotaeastbands.org

Steel City Phantom

Jul 2, 2012, 8:43:48 AM

to java...@googlegroups.com

They wouldn't be the first to grow too fast, and far from the last. It's a lot easier to sell an executive team on a network of 100,000 primary servers than on 100,000 primary servers plus 75,000 backups.

Granted, that's only one configuration; there are dozens of different configurations that come to mind.

Casper Bang

Jul 2, 2012, 3:28:05 PM

to java...@googlegroups.com

I wonder if this is also related to the leap second, which caused Reddit (using Cassandra, on the JVM), Mozilla (using Hadoop, also based on the JVM), Foursquare, LinkedIn, StumbleUpon and Gawker to show hiccups.

Fabrizio Giudici

Jul 3, 2012, 5:13:57 AM

to java...@googlegroups.com, Casper Bang

On Mon, 02 Jul 2012 21:28:05 +0200, Casper Bang <caspe...@gmail.com>
wrote:

> I wonder if this is also related to the leap second, which caused Reddit
> (using Cassandra, on the JVM), Mozilla (using Hadoop, also based on the
> JVM), Foursquare, LinkedIn, StumbleUpon and Gawker to show hiccups.

... which somewhat surprised me, not because I can't figure out how a leap
second can break things, but because there have been at least two dozen
leap seconds since 1970, so I presume a good number of them fell in the
last ten years, when Linux and Java already had a relevant share. So,
what's the news? Did something new really happen, or did things also break
in the past, just without this much news coverage?

--
Fabrizio Giudici - Java Architect, Project Manager
Tidalwave s.a.s. - "We make Java work. Everywhere."
fabrizio...@tidalwave.it
http://tidalwave.it - http://fabriziogiudici.it

Casper Bang

Jul 3, 2012, 5:23:51 AM

to java...@googlegroups.com, Casper Bang

Where I work, we're used to dealing with time-series data, and this really sounds like developers making the mistake of assuming that there are always 60 seconds in a minute. Working with time is not as trivial as it sounds once you start rolling up data views and interpolating within and between steps. In other words, I doubt that Java or the JVM is directly to blame here; it sounds more like erroneous assumptions, but detail on the matter is limited so far.
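
To illustrate the "60 seconds in a minute" assumption, here is a minimal Java sketch of a per-minute roll-up. It is not taken from any of the affected systems. The class name `MinuteRollup`, the method `averagePerMinute`, and the sample data are all hypothetical, and it assumes a POSIX-style clock that repeats a second value when a leap second is inserted, so that one minute bucket can receive 61 one-per-second samples.

```java
import java.util.Map;
import java.util.TreeMap;

public class MinuteRollup {
    /**
     * Averages values per minute bucket. The divisor is the number of samples
     * actually observed in the bucket, not a hard-coded 60.
     */
    static Map<Long, Double> averagePerMinute(long[] epochSeconds, double[] values) {
        if (epochSeconds.length != values.length) {
            throw new IllegalArgumentException("timestamps and values must have the same length");
        }
        Map<Long, double[]> acc = new TreeMap<>(); // minute start -> {sum, count}
        for (int i = 0; i < epochSeconds.length; i++) {
            long minuteStart = Math.floorDiv(epochSeconds[i], 60L) * 60L;
            double[] a = acc.computeIfAbsent(minuteStart, k -> new double[2]);
            a[0] += values[i];
            a[1] += 1;
        }
        Map<Long, Double> result = new TreeMap<>();
        acc.forEach((minute, a) -> result.put(minute, a[0] / a[1]));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: 61 readings in the minute starting 2012-06-30T23:59:00Z,
        // with the last second value repeated (an assumption about the clock source).
        long minuteStart = 1_341_100_740L;
        long[] ts = new long[61];
        double[] vs = new double[61];
        for (int i = 0; i < 61; i++) {
            ts[i] = minuteStart + Math.min(i, 59);
            vs[i] = 10.0;
        }
        System.out.println(averagePerMinute(ts, vs));
    }
}
```

A roll-up that divided the bucket sum by a fixed 60 would report a slightly inflated average for that one minute, which is the kind of quiet error that fixed-length assumptions tend to produce.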

Fabrizio Giudici

Jul 3, 2012, 5:30:37 AM

to java...@googlegroups.com, Casper Bang

On Tue, 03 Jul 2012 11:23:51 +0200, Casper Bang <caspe...@gmail.com>
wrote:

> Where I work, we're used to dealing with time-series data, and this really
> sounds like developers making the mistake of assuming that there are always
> 60 seconds in a minute. Working with time is not as trivial as it sounds
> once you start rolling up data views and interpolating within and between
> steps. In other words, I doubt that Java or the JVM is directly to blame
> here; it sounds more like erroneous assumptions, but detail on the matter
> is limited so far.

If you read around, Google blogged a few weeks ago that their strategy is
to slow down the company NTP clock so that at the end of the day with the
leap second they are still in sync with the global time, but their systems
never saw the leap second:

http://www.ciol.com/News/News-Reports/Leap-Second-How-Google-saved-its-websites/164026/0/

This sounds like an admission that there's no safe code around to deal with
leap seconds :-)
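
The clock-slowing strategy can be sketched roughly as follows. This is only an illustration of the general idea, not Google's actual implementation: it assumes a simple linear ramp over a 24-hour window, and the names `LeapSmear` and `smearedOffsetSeconds` are made up for this example.

```java
public class LeapSmear {
    /**
     * Returns how much of the extra second has been absorbed so far, assuming
     * the second is spread evenly (linearly) across the smear window.
     *
     * @param secondsIntoWindow time elapsed since the smear began
     * @param windowSeconds     total length of the smear window
     */
    static double smearedOffsetSeconds(double secondsIntoWindow, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        // Clamp so the offset is 0 before the window and exactly 1 second after it.
        double t = Math.max(0.0, Math.min(secondsIntoWindow, windowSeconds));
        return t / windowSeconds;
    }

    public static void main(String[] args) {
        double window = 86_400.0; // assumed 24-hour window
        for (double t : new double[] {0, 21_600, 43_200, 86_400}) {
            System.out.printf("t=%.0f s -> offset=%.3f s%n", t, smearedOffsetSeconds(t, window));
        }
    }
}
```

Under these assumptions each reported second is only about 1/86400 longer than a real one, so clients following such a clock never see a repeated or 61st second.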

Also, that paper's reference to the "experience of 2005" perhaps confirms
that the problem is not new, but that there was less news coverage a
few years ago.

Casper Bang

Jul 3, 2012, 10:06:11 AM

to java...@googlegroups.com, Casper Bang

> Also, that paper's reference to the "experience of 2005" perhaps confirms
> that the problem is not new, but that there was less news coverage a
> few years ago.

It also happened in 2007 apparently, and it's not isolated to just one issue (but rather, 5):

http://landslidecoding.blogspot.dk/2012/07/linuxs-leap-second-deadlocks.html

Fabrizio Giudici

Jul 3, 2012, 11:24:49 AM

to java...@googlegroups.com, Casper Bang

On Tue, 03 Jul 2012 16:06:11 +0200, Casper Bang <caspe...@gmail.com>
wrote:

More fun: my (rather large) hosting provider sent me an email inviting me
to check my Linux boxes and, if necessary, perform a soft reboot. They
detected a rather large spike in power consumption in their farm and
found that several Linux boxes had CPUs that went crazy.

Actually I've checked and I have some parts of the kernel (ksoftirqd) that
are sucking most of the CPU.

Scary!

Fabrizio Giudici

Jul 3, 2012, 11:29:54 AM

to java...@googlegroups.com, Casper Bang

On Tue, 03 Jul 2012 17:24:49 +0200, Fabrizio Giudici
<Fabrizio...@tidalwave.it> wrote:

> Scary!

For reference:

http://www.h-online.com/open/news/item/Leap-second-Linux-can-freeze-1629805.html

Casper Bang

Jul 3, 2012, 12:11:00 PM

to java...@googlegroups.com, Casper Bang

> More fun: my (rather large) hosting provider sent me an email inviting me
> to check my Linux boxes and, if necessary, perform a soft reboot. They
> detected a rather large spike in power consumption in their farm and
> found that several Linux boxes had CPUs that went crazy.
>
> Actually I've checked and I have some parts of the kernel (ksoftirqd) that
> are sucking most of the CPU.
>
> Scary!

Yeah I've gotten a similar notification:

"During the night of 30.06.2012 to 01.07.2012 our internal
monitoring systems registered an increase in the level of
IT power usage by approximately one megawatt.

The reason for this huge surge is the additional switched
leap second which can lead to permanent CPU load on Linux
servers."

Interesting times when we need to start writing unit tests that assert power-consumption levels! I'm still searching for more detail. Considering how much of my stuff runs one form of Linux or another (router, NAS, phone, HD recorder etc.), it would be interesting to get an idea of just how widespread this issue is.

Fabrizio Giudici

Jul 3, 2012, 4:12:50 PM

to java...@googlegroups.com, Casper Bang

On Tue, 03 Jul 2012 18:11:00 +0200, Casper Bang <caspe...@gmail.com>
wrote:

> Yeah I've gotten a similar notification:
>
> "During the night of 30.06.2012 to 01.07.2012 our internal
> monitoring systems registered an increase in the level of
> IT power usage by approximately one megawatt.

We share the same provider :-)

> Interesting times when we need to start writing unit tests that assert
> power-consumption levels! I'm still searching for more detail. Considering
> how much of my stuff runs one form of Linux or another (router, NAS, phone,
> HD recorder etc.), it would be interesting to get an idea of just how
> widespread this issue is.

Same thoughts. As I said, rather scary.

Carl of the Posse

Jul 5, 2012, 2:28:32 PM

to java...@googlegroups.com

Amazon posted a nice summary of what went wrong with their systems:

http://aws.amazon.com/message/67457/

Problems with backup power, and then most importantly, problems with load balancing control were what made the zone outage hard to work around.

We (Netflix) might post a blog explaining how that affected us, the internal issues that resulted and what we are doing about it. I'll reply to this group if we do.

Ricky Clarkson

Jul 5, 2012, 4:54:40 PM

to java...@googlegroups.com

If "Release It! Design and Deploy Production-Ready Software" gets another edition I'm sure this will be in it as an example. It covers this kind of failure cascade beautifully.

--
Skype: ricky_clarkson


Kevin Wright

Jul 5, 2012, 5:48:19 PM

to java...@googlegroups.com

Interestingly, there was another failure mode not outlined in that summary. Mostly because it was "by design" and can only be considered a failure mode for users of Amazon's cloud, but not a failure of the cloud itself.

What we (Zeebox) noticed was that machines in other regions were going down as well. Most notably in us-west, but even some in eu-west. The machines in question were "spot" instances, where you bid a price and the real-time value of an instance is based on demand. If the value is below your bid, then you have a machine. When it goes over your bid, you lose it.

It's an ideal model in many circumstances. I'll leave you to decide whether it worked exactly as it should in this instance, or if it can be classed as another level of the cascading failures :)

As us-east went down, people turned to spot instances to make up for the lost capacity. In turn, this drove up the price, and anyone who had a spot instance happily doing its thing found themselves outbid and machine-less.

And yes, it happened to us; though I'll add that we don't use spot instances for anything which would affect the user experience! They're better suited for continuous load testing and other similar tasks where a vanishing machine isn't too painful.
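
The bid-versus-price rule described above can be sketched in a few lines of Java. This is a toy model under stated assumptions, not Amazon's actual pricing mechanism: the class `SpotMarket`, the method `survivingBids`, and the prices are invented for illustration, and the sketch assumes that a bid exactly equal to the spot price keeps its instance.

```java
import java.util.ArrayList;
import java.util.List;

public class SpotMarket {
    /**
     * Returns the indices of the bids that still hold an instance at the given
     * spot price. An instance is lost once the price rises above its bid.
     */
    static List<Integer> survivingBids(double[] bids, double spotPrice) {
        List<Integer> survivors = new ArrayList<>();
        for (int i = 0; i < bids.length; i++) {
            if (spotPrice <= bids[i]) { // assumption: a tie keeps the instance
                survivors.add(i);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        double[] bids = {0.05, 0.10, 0.50}; // hypothetical bids in dollars per hour
        System.out.println("Quiet market:   " + survivingBids(bids, 0.04));
        // Demand surges as customers replace lost capacity, pushing the price up.
        System.out.println("After failover: " + survivingBids(bids, 0.30));
    }
}
```

In this model nothing is wrong with the instances that disappear; a demand spike elsewhere is enough to push the price past the lower bids.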

Robert Casto

Jul 5, 2012, 10:12:33 PM

to java...@googlegroups.com

I think you meant that people were spinning up boxes (not spot instances), and so everyone with a spot instance was losing theirs. I would have to say that market forces were at work here, so that part of the system worked correctly. What was bad is that the EBS volumes were losing power without being told to shut down properly. Thus, when they came back up they were marked inconsistent, so you had to rebuild the volume. Everyone else was doing the same thing, so there was huge contention for resources. Every problem they experience just makes the service all that much better. My colo wouldn't have had this trouble, but then they are not trying to house Pinterest, Netflix, and every other company out there.


Joe Sondow

Jul 6, 2012, 12:57:47 PM

to java...@googlegroups.com

The Netflix blog post is up now.

Lessons Netflix Learned from the AWS Storm

http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
