AWS unreliable?


Justin Collum

Dec 27, 2012, 12:35:59 AM
to Portland ALT.NET
Seems like there have been a number of high-profile outages on AWS lately.


Amazon (NSDQ:AMZN) Web Services (AWS) couldn't have picked a worse time to have a service outage for its Netflix customers. Just as thousands of Netflix users were settling in on Christmas Eve to watch a movie at home, AWS went down.

"We're sorry for the Christmas Eve outage. Terrible timing! Engineers are working on it now. Stay tuned to @Netflixhelps for updates," Netflix tweeted at 4:25 PST, Dec. 25.

Lee Harding

Dec 27, 2012, 11:02:02 AM
to pdxalt...@googlegroups.com
Hyperbole aside, it is ironic that the customer playing a marquee role onstage at the re:Invent conference was suffering a high-profile outage a few weeks later. If you are really, really concerned with having your site available every second of every day, then support this UserVoice request for static website support for Visual Studio -- it's the only way to get there:




Adron Hall

Dec 27, 2012, 6:01:53 PM
to pdxalt...@googlegroups.com
? Not to be a hipster, but stop waiting for Microsoft. Just grab some Node.js/Generation/GitHub thing, whatever it's called, or one of the kazillion solutions already out there for that. There's a ton of them.

Also, not sure how static site generation is going to help in a serious outage like what happened.

Also, AWS was only "partly" down. Netflix is usually prepared, but what happened was a strange mix of EBS issues, and Netflix's auto-recover/magic monkey didn't fling the sites back online when it was supposed to. "Outage" is a strong way to put it as a reflection on AWS and the cloud; it was more of a malfunction. This was by no means what happened back in April of 2011, and even then it was only 1 data center.

Summary: if you're worried about HA, you've got to be prepared for failure all the time. Netflix recovered amazingly fast given what happened. It's doubtful this will ever happen again in the same way, on either Netflix's or AWS's side of things.

But anyway, enough defending and harping.  ;)

Static sites == good if you have no real dynamic interaction or content. i.e. it wouldn't have worked for Netflix.
Dynamic content == need a good multi-datacenter replication ability in place. Unfortunately, very few applications are prepared and even fewer technologies actually focus on this. Some of the good ones, though, are multi-node systems that you don't have to know are multi-node, such as Cloud Foundry/Iron Foundry and, on the back end, Riak or static cached Redis + Riak.

Oh, and on that note, yeah, I work for Basho if you guys want to talk about hardcore data resiliency during outages like this.  ;)  You guys should sign up for this too: http://www.meetup.com/Portland-Riak/events/95049492/

Seriously, it'll be a lot of fun. Come hang out, have some pizza/food/whatever we have and probably beer/drinks etc. and talk .NET, ALT.NET, Big Data and Riak awesomeness.

-Adron

Adron B Hall
Iron Foundry Project http://www.ironfoundry.org

Troy Howard

Dec 27, 2012, 7:17:45 PM
to pdxalt...@googlegroups.com
OK, so AWS wasn't down, per se. Specifically, some of the ELBs fell over. ELBs are "Elastic Load Balancers", which are basically just routing machines that balance incoming load across the backends that handle the traffic. You can terminate SSL there, and create sticky sessions to keep users going to the same backend node, which is great for caching performance.
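
To make the routing idea concrete, here is a toy sketch in Python of round-robin balancing with sticky sessions. It is only an illustration: `StickyBalancer`, the `web-1`/`web-2` backend names, and the session IDs are all made up, and none of this reflects how ELB is actually implemented.

```python
class StickyBalancer:
    """Toy load balancer: round-robin for new sessions, sticky afterwards."""

    def __init__(self, backends):
        self.backends = list(backends)  # currently healthy backends
        self._next = 0                  # round-robin counter
        self._sessions = {}             # session id -> assigned backend

    def remove(self, backend):
        """Take a failed backend out of rotation."""
        self.backends.remove(backend)

    def route(self, session_id):
        """Return the backend for this session, reassigning if it died."""
        backend = self._sessions.get(session_id)
        if backend not in self.backends:
            if not self.backends:
                raise RuntimeError("no healthy backends")
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            self._sessions[session_id] = backend
        return backend


lb = StickyBalancer(["web-1", "web-2"])
first = lb.route("alice")
again = lb.route("alice")   # same backend as before: the session is sticky
lb.remove(first)
moved = lb.route("alice")   # reassigned because the old backend is gone
```

The sticky part is what helps caching: a returning user keeps hitting a node that already has their data warm.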

Netflix, unfortunately, relies on AWS's ELBs for their frontend. They COULD have built their own load balancing layer, using HAProxy or the like, but getting the feature set and tight integration that ELBs provide would require a lot of work. 

Here are some interesting details about ELBs for more reading: http://harish11g.blogspot.com/2012/07/aws-elastic-load-balancing-elb-amazon.html

ELBs are supposed to be fault-tolerant. But as point #4 in the above article mentions, ELBs aren't so great at flash traffic. I imagine that's what happened to Netflix. A *sudden* rush of connections caused many of the ELB instances to die, and that left a significant portion of the backend webservers unavailable to serve content, since the routing path to them was broken. This is most poignant when all of the ELBs in a certain zone go down, because there's not even any way to bring those instances back into the fold in an outage situation, given AWS's architecture.

Netflix has a *very* resilient architecture. They invented the Chaos Monkey, which is a system specifically designed to trash, at random, their infrastructure, *IN PRODUCTION* to make sure it's resilient. So Netflix is always killing instances, all the time, just to make sure it can always deal with it. They trusted AWS ELB to be as fault tolerant as their own systems, but apparently, that trust was misguided. 
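
As a rough illustration of the Chaos Monkey idea (this is not Netflix's tool, just a toy sketch), the loop boils down to: pick a random instance, kill it, and check that you still have enough capacity. The function name `chaos_round`, the instance names, and the `min_healthy` threshold are all hypothetical.

```python
import random


def chaos_round(instances, min_healthy, rng=random):
    """Kill one random instance from the set and report whether the
    remaining capacity still meets the required floor."""
    if not instances:
        return None, False
    victim = rng.choice(sorted(instances))  # sorted() gives a stable sequence
    instances.discard(victim)               # stand-in for terminating a VM
    return victim, len(instances) >= min_healthy


fleet = {"i-aaa", "i-bbb", "i-ccc"}
victim, still_ok = chaos_round(fleet, min_healthy=2)
# After one kill, two instances remain, so still_ok is True here.
```

In a real setup the "discard" would be a call to your cloud provider, and the health check would hit the actual service rather than counting names in a set.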

There are grumblings out there that it's possible that "foul play" was involved, since Netflix competes w/ Amazon's video streaming offering. I think that is a silly proposition, but certainly possible.

In the end, for a company that values fault tolerance so highly, it's surprising that they let this very important piece of infrastructure be managed by Amazon instead of rolling their own. They certainly have the resources to build that themselves, and I'd be very surprised if they didn't start working on that RIGHT NOW. :)

Regarding static vs dynamic... There is no magic bullet here, and a VS feature won't make your site 'mo better. The boundary between static and dynamic is the cache and the cache expiry rules. Anything that can go static should, and that means: build your "web app" as a REST API, and build your "web UX" using a client-side JavaScript framework like Knockout/Backbone which talks to that API. Put a reverse proxy in there with a decent cache (Varnish or nginx) and make sure to include reasonable cache expiry headers in your API responses.
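
For the cache expiry headers, a minimal sketch in Python of what an API response might attach so a reverse proxy can cache it. The helper name `cache_headers` and the 300-second lifetime are just assumptions for illustration; pick lifetimes that fit each resource.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_seconds, now=None):
    """Build Cache-Control and Expires headers for a cacheable response."""
    if now is None:
        now = time.time()
    return {
        # Lets shared caches (the reverse proxy) keep the response.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # HTTP-date form of the same deadline, for older caches.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }


headers = cache_headers(300)  # e.g. a catalog listing that may be 5 minutes stale
```

Anything truly per-user or real-time would skip these headers (or send `Cache-Control: no-store`) so the proxy passes it straight through.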

All that said, you'll still need a reliable load balancing layer, and you'll probably just have to build that damned thing yourself. :) 

Well... actually, unless you're building something with the traffic that Netflix gets, you probably don't need a LB layer at all, or can get away with an OTS product or a service like ELB.

Oh and to Justin's original comment: Yes, AWS has been having a ton of reliability problems recently WRT ELB, RDS, and some EBS problems. This is not your imagination. I think they are getting more use and that we're seeing it get unstable because of that. 

For now, because no one is really using them, go run your stuff on Windows Azure, Google Compute Engine, or HP Cloud. That last one, so far, has been *rock solid* for us at AppFog and performs quite well. Not sure how they stack up on price/COGS, but we've had no trouble at all. My personal opinion: avoid Rackspace, as it's low usage and pretty unstable.

Another option is just to use a private VPS and avoid the crowds altogether. Working great for my side project.

Thanks, 
Troy