Ray, why should I have to know, on principle, the internals and failure modes of a component of my provider's architecture, which was designed to serve a customer who does not work there?
If you remember, we defined the cloud a long time ago from the customer's perspective. Customers have the "illusion of infinite resources" (elasticity, meaning constant quality of service), they pay only for what they use (predictable billing), and they have no idea about the cloud's internals, nor do they want to climb a steep learning curve to learn them.
AWS, because of EBS, violates this definition of a cloud many times over. A well-designed cloud does not require its customers to learn anything new. They should simply take their apps, as they run them today on their physical servers, and place them on the cloud. Are there providers like that? Yes there are, but they are not as big as AWS. For now.
Miha, the problem with EBS is that, like all other parts of the AWS architecture, it's shared. So the question is, if you know this and don't design your app, or at least your DR strategy, with this in mind, what's broken: AWS or your app's architecture?
I would argue it's the latter, and it reminds me of the first move from mainframe to client/server apps. There was much experimentation and invention before multi-tiered, distributed application stacks came to be. That's what's happening now with the cloud. The learning curve is steep, and teams like reddit and Quora are learning hard lessons now.
The lesson here is that such lessons will be the norm for some time, until popular design patterns emerge and people understand how to build apps for the cloud.
Adrian, thanks. What you guys did is magic. Thanks for teaching us that it does not cost so much more to run in three zones.
However, as you tweeted, Netflix was lucky that AWS failed during the night. During peak daytime hours, the reduction from three zones to two might not have gone so smoothly.
Also, amid the immense flow of explanations and articles coming from all directions, how come such smart teams as Zencoder, reddit, or Quora did not do the same as Netflix? Actually, reddit said: "EBS also has reliability issues. Even before the AWS failure, we had random disks degrading multiple times a week."
" The problem with EBS is that it doesn't have a particularly steady state. To explain why we need to look at the underlying architecture. I don't know the details of how EBS is implemented, but there is enough information available to explain how it behaves."
The same analysis recommends some pragmatic tests: "collect response time and throughput and plot your data over time. You need to run long enough that the performance shows steady state behavior." How many enterprises have teams to run these experiments on a product offered by the most prestigious IaaS and PaaS provider in the known Universe?
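To make that advice concrete, here is a minimal sketch of the kind of steady-state probe being described: repeatedly time synchronous writes to a volume and keep the latencies so they can be plotted over time. The function name, file path, and sample counts are my own placeholders, not anything AWS or the quoted author provides; a real test would run for hours against the actual EBS-backed mount point.

```python
import os
import time

def sample_write_latency(path, block_size=4096, samples=100, interval=0.0):
    """Time synchronous block writes to `path`; return one latency in seconds per sample."""
    buf = os.urandom(block_size)
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # force the write through to the device, not just the page cache
            latencies.append(time.perf_counter() - start)
            if interval:
                time.sleep(interval)  # space samples out to observe behavior over time
    finally:
        os.close(fd)
    return latencies
```

Plotting the returned latencies (and their percentiles) over a long run is what reveals whether the volume ever settles into the steady state the quote is asking for.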
So my remark about the definition of an engineer best fits the team who designed the EBS product to begin with. This is not a finished product. There must be a way to "provide a reliable place to store data that doesn't go away when EC2 instances are dropped" without mounting EBS volumes on a single EC2 instance until it crashes.
BTW, other providers have persistent storage. You know them.
I do recognize your team's contribution; you passed an extraordinary test. But AWS has some intrinsic problems, and EBS is one of them, which, in spite of the warnings coming from all directions, was not fixed. We are all human; we err, and sure, we understand that.
That is not engineering. If AWS opens up and invites third parties to develop solutions for persistent storage on EC2 instances, we will be surprised at what the world can come up with. But if they keep it in house, invoking a monopoly... People who believe they can do everything themselves are punished by the Divine, whom we can never replace. They lose their gift of prophecy, and may fall like an apple from a tree.
2 cents and thanks for keeping the debate interesting.
On 4/24/2011 3:18 PM, AdrianC wrote:
If you think it costs 3X to be in three availability zones then you do need better engineers...
We run a third of our systems in each zone normally; to avoid the bad zone we moved to running half our systems in each of two zones. About the same number of systems total, and the cost of the systems is the dominant cost.
Since the cloud is elastic (remember, that was the point :-) we don't need to pre-allocate capacity in each zone to take the extra load, unlike a datacenter DR solution.
The cost of running in multiple zones is a minor increase in network cost, slightly more latency, and that you have to decide to do it up front. This works for any scale, nothing to do with running large scale. If you don't have enough instances to spread three ways, use
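Adrian's zone math above can be sketched with a trivial calculation. The total of 90 instances is a hypothetical number for illustration, not a Netflix figure; the point is that the total fleet size is the same whether it is spread over three zones or two, so three-zone operation does not cost 3X.

```python
def per_zone_load(total_capacity, zones):
    """Evenly spread the total required capacity across availability zones."""
    return total_capacity / zones

total = 90  # hypothetical fleet size needed to serve traffic

three_way = per_zone_load(total, 3)  # 30 instances in each of three zones
two_way = per_zone_load(total, 2)    # 45 instances in each of two zones after dropping one

# Total instance count, and hence the dominant cost, is unchanged.
assert 3 * three_way == 2 * two_way == total
```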