Now that the dust is settling after the major outage AWS suffered at its
US East Coast facility this week, what lessons can customers actually
learn from it?
Here are the five key lessons we've highlighted to customers:
Lesson 1: Both Cloud and Dedicated Computing Have Single Points of
Failure
Lesson 2: Size is No Protection from Outages without Redundancy
Lesson 3: All Data Centres Are Not Equal
Lesson 4: The Price-Performance-Reliability Metric
Lesson 5: Achieving a Highly Robust Set-up Is Cheaper and Easier in
the Cloud
Customers need openness from vendors about their infrastructure
choices and locations in order to create price-performance-reliability
comparisons between clouds. This is a key development needed if people
are to make the right decisions and create the appropriate strategies
in line with their computing needs in the cloud.
Best wishes,
Robert
CTO
CloudSigma
The full text of our blog post on this subject can be found at
http://www.cloudsigma.com/en/blog/2011/04/23/21-cloud-outages-lessons-learned
--
~~~~~
3rd Annual Cloud Slam 2011 Conference * April 18-22, 2011 * Mountain View, CA * http://cloudslam.org
UP 2010 Conference: http://www.up-con.com
Posting guidelines: http://bit.ly/bL3u3v
Follow us on Twitter @cloudcomp_group
Post Job/Resume at http://cloudjobs.net
Get hundreds of conference sessions and panels on cloud computing on DVD at
http://www.amazon.com/gp/product/B002H07SEC, http://www.amazon.com/gp/product/B004L1755W, http://www.amazon.com/gp/product/B002H0IW1U or GET instant access to downloadable versions at
- http://www.up-con.com/register
- http://cloudslam09.com/content/registration-5.html
- http://cloudslam10.com/content/registration
~~~~~
You received this message because you are subscribed to the Google Groups "Cloud Computing" group.
To post to this group, send email to cloud-c...@googlegroups.com
To unsubscribe from this group, send email to cloud-computi...@googlegroups.com
No IT service bound to the physical unit of a datacenter can overcome the
eventual failure of that hardware. Power will be lost, networks will be
interrupted, equipment will fail, and cloud providers will go out of business.
Building apps that are datacenter-bound is a GUARANTEE that you will have
downtime at some point in the app's lifecycle. If it's an HR app, it's
probably no big deal. If it's an ecommerce app generating a million dollars
a minute in revenue, it will be a very big deal.
The lesson learned by AWS victims and onlookers alike this week should be
that applications designed from the ground up to survive mechanical
failure will have less downtime than their hardware-bound cousins. Netflix
is the most visible of the applications built like this and, in fact, its most
recent outage was not cloud related but, rather, a failure in one of its few
remaining datacenters. Netflix survived this week's AWS outage because it
redesigned its application to remove its hardware dependency. To take this
fault-tolerant strategy a step further, Netflix runs a continual set of
tests to ensure it can deal with infrastructure failure.
If you are going to build applications in the cloud, at a minimum, you need
to have a cloud provider with more than one datacenter in more than one
geography. Otherwise you might as well accept that, sometime in the future,
you will have a significant outage.
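The argument for more than one datacenter can be put in rough numbers. The sketch below is only an illustration: it assumes sites fail independently of one another (an idealization that shared software or control planes can break), and the 0.999 per-site availability is a made-up figure, not a measurement of any provider.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up."""
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # All sites must be down at once for the service to be down.
    return 1.0 - (1.0 - per_site) ** sites


def expected_downtime_hours(availability: float,
                            hours_per_year: float = 8760.0) -> float:
    """Expected hours per year during which no site is available."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)  # 0.999 is a hypothetical figure
        print(n, a, expected_downtime_hours(a))
```

Under these assumptions each extra site multiplies the chance of a total outage by the per-site failure probability, which is why a second geography matters far more than a bigger single site.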
Best wishes,
Khazret, to give credibility to what you say, see http://status.aws.amazon.com/
California EC2 is up, N. Virginia is down.
But why should it be the customers' worry to move things
around?
AWS should have mechanisms to move workloads automatically to other facilities.
It doesn't.
The other key is to avoid using cloud-provider-specific features like RDS. This is killing some AWS customers, as RDS seems to be having an especially hard time recovering, and no other cloud provider offers an RDS equivalent to fail over to.
+1. AWS is the single largest multi-region IaaS vendor that provides a
consistent API (across regions). If the apps had been designed to fail
over to an alternate region, all the basic capabilities needed to solve
this problem already exist (ELB, anycast DNS). Netflix, for
example, runs in 3 different regions (see @adrianco's tweets).
I wouldn't say the same of PaaS vendors (built on top of AWS) that put
all their apps in a single location. That was a fail.
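A minimal sketch of the application-level failover decision described above, in Python. It is not an AWS API: the region labels are hypothetical, and the health check is passed in as a plain function so that the caller decides what "healthy" means (for example, an HTTP probe of its own service).

```python
from typing import Callable, Optional, Sequence


def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in preference order, that passes its check."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as a failed check.
            continue
    return None  # No region is healthy; the caller must handle this case.


if __name__ == "__main__":
    # Hypothetical region labels and a stubbed health check for illustration.
    status = {"us-east": False, "us-west": True, "eu-west": True}
    print(pick_region(["us-east", "us-west", "eu-west"], lambda r: status[r]))
```

The selection logic is the easy part; the application also has to have its data and configuration already present in the fallback region for the switch to be useful.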
Ray
-----Original Message-----
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of kowsik
Sent: Saturday, April 23, 2011 5:31 PM
To: cloud-c...@googlegroups.com
Subject: Re: [ Cloud Computing ] Re: 5 Key Lessons for Customers of the
Cloud
--
Since there are no fully-baked open source solutions or ratified standards documents, all cloud platforms are "proprietary" today. Saying that AWS is "one of the most proprietary" makes no sense.
The AWS API represents the nearest thing to a de facto standard, however. It's supported by OpenStack, Eucalyptus, and others. So if you define "proprietary" as "leading to lock-in", AWS is (relatively) non-proprietary!
In practice, none of the leading public cloud operators has much incentive to make it easy for customers to leave. This means that interoperation and high-level abstraction are going to be driven by customer groups (like TM Forum) and orchestration vendors like RightScale.
Personally, I'd like to see a two-tier API standard emerge: EC2/EBS/... for the lower, procedural interface, and something like the Oracle Cloud Resource Model (http://www.oracle.com/us/corporate/press/184426) for high-level declarative deployments. I think that the Oracle - formerly Sun - model is more powerful and expressive than either vCloud or CloudFormation, but there's still a lot of work to be done in this area.
Now, if they host in 3 regions, quite simply this means the cost is 3X
versus 1X. Say they got a deal. They still pay, say, 2X, and those would
be two HUGE Xs.
Netflix has a group of rock stars who claim credit for the success of
Netflix's port to AWS and now for its survival of the AWS failure. But not
every company has the same capabilities. And even if they have the same
capabilities (like Zencoder and Quora, both with superb engineering
talent), they do not have the same $ to spend.
The Zencoder blog entry is a must-read:
http://blog.zencoder.com/2011/04/22/skynet-ec2-and-zencoder/
As a trainee engineer, I was told the following definition of an
engineer: "An engineer is someone who can build with $1M the same
thing that any fool can with $100M"
Cheers,
Miha
" The problem with EBS is that it doesn't have a particularly steady state. To explain why we need to look at the underlying architecture. I don't know the details of how EBS is implemented, but there is enough information available to explain how it behaves."
Miha, the problem with EBS is that, like all other parts of the AWS architecture, it’s shared. So the question is, if you know this and don’t design your app, or at least your DR strategy, with this in mind, what’s broken: AWS or your app’s architecture?
I would argue it’s the latter, and it reminds me of the first move from mainframe to client/server apps. There was much experimentation and invention before multi-tiered, distributed application stacks came to be. That’s what’s happening now with the cloud. The learning curve is steep, and teams like reddit and Quora are getting hard lessons now.
The lesson here is that lessons will be the norm for some time until popular design patterns emerge and people understand how to build apps for the cloud.
Ray
From: cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On Behalf Of Miha Ahronovitz
Sent: Sunday, April 24, 2011 7:32 PM
To: cloud-c...@googlegroups.com
Subject: Re: [ Cloud Computing ] Re: 5 Key Lessons for Customers of the Cloud
Adrian thanks. It is magic what you guys did. Thanks for teaching us it does not cost so much more to run in three zones.
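One way the "not so much more" claim can hold, sketched in Python. This is back-of-the-envelope capacity arithmetic, not necessarily the reasoning Adrian gave: it assumes load is spread evenly across zones, that the surviving zones must carry full peak load after a failure, and that cost scales linearly with provisioned capacity.

```python
def overprovision_factor(zones: int, zones_lost: int = 1) -> float:
    """Total capacity needed, as a multiple of peak load, so that the
    remaining zones can still carry peak load after `zones_lost` fail."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if zones_lost < 0 or zones_lost >= zones:
        raise ValueError("zones_lost must be in the range 0..zones-1")
    # Each zone is sized at peak / (zones - zones_lost); there are `zones` of them.
    return zones / (zones - zones_lost)


if __name__ == "__main__":
    for z in (2, 3, 4):
        print(z, overprovision_factor(z))
```

Under these assumptions, three zones that tolerate the loss of one need 1.5X the single-site capacity rather than 3X, while a two-site set-up needs 2X.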
--
Kowsik,
Very well said. One key lesson is that customers should not auto-magically assume that, because they have ported their apps and data to an IaaS or PaaS provider, the basic best practices for DR, redundancy and failover no longer apply to achieving continuous system operations in a public cloud.
As for automatically moving things around to other facilities: that is a slippery slope, given that Amazon is an IaaS provider with PoPs in several countries. Customers would need more workflows and tools for audit, etc., to ensure that regulated apps and data were not moved to the wrong PoP. As a company, they have a responsibility to ensure applications are compliant not just with technical SLAs but also with business directives (regulatory, security, or people).
Nice post.
Regards,
Jeanne
--- On Sat, 4/23/11, kowsik <kow...@gmail.com> wrote:
And… it all comes back to the databases, as it always seems to.
Multi-site is great in theory. But whether it’s hot-hot-load-balanced, or hot-cold-cut-over, or something in the middle, the issues of data replication and data integrity continue to be the core problem.
Amazon would likely do what some here have suggested, and automagically move people around when there’s an outage like this, except they likely can’t replicate all the data, maintain integrity, and keep up with the volume of changes on a second-to-second basis. SAN-to-SAN in a single data center is feasible, but from East-to-West coast is just not practical, today, on the scale of what Amazon is dealing with. And even so, this doesn’t really deal with the database/transactional integrity issue, which adds a whole additional layer.
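The coast-to-coast difficulty can be roughed out with some arithmetic, sketched here in Python. The figures are assumptions for illustration only: a 4,000 km path between sites, signals in fiber travelling at about 200,000 km/s (roughly two thirds of the speed of light in vacuum), and a hypothetical write rate and replication lag. Real links add routing, queuing and protocol overhead on top of this floor.

```python
FIBER_KM_PER_S = 200_000.0  # approximate signal speed in optical fiber


def min_round_trip_s(distance_km: float) -> float:
    """Lower bound on round-trip time imposed by distance alone."""
    return 2.0 * distance_km / FIBER_KM_PER_S


def max_serial_sync_commits_per_s(distance_km: float) -> float:
    """Upper bound on commits per second when each commit must wait for
    the remote site to acknowledge before the next one starts."""
    return 1.0 / min_round_trip_s(distance_km)


def writes_at_risk(writes_per_s: float, replication_lag_s: float) -> float:
    """Writes not yet copied to the remote site if the primary fails now
    (asynchronous replication)."""
    return writes_per_s * replication_lag_s


if __name__ == "__main__":
    d = 4000.0  # assumed path length in km
    print(min_round_trip_s(d), max_serial_sync_commits_per_s(d))
    print(writes_at_risk(1000.0, 2.0))  # hypothetical rate and lag
```

Synchronous replication pays the round trip on every dependent commit, and asynchronous replication trades that delay for a window of writes that can be lost on cut-over; neither figure addresses transactional integrity, which remains a separate problem.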
So we agree that IaaS vendors don’t give us free, transparent DR/COOP, and we need to build that in just like we did before. But what about the increasing number of PaaS/SaaS (real) cloud vendors, where we have no insight or control over the infrastructure? Don’t those PaaS/SaaS (real) cloud vendors have to build in their own DR/COOP? And shouldn’t they disclose that to us?
I hope that products like NimbusDB (based on that great description a few weeks back) come to fruition and finally free us from location-centric-with-hacked-replication DB models. But it seems we’re still not there, and so multi-site cloud will still suffer from this currently not-intractable-but-really-pain-in-the-neck problem.