And… it all comes back to the databases, as it always seems to.
Multi-site is great in theory. But whether it’s hot-hot-load-balanced, or hot-cold-cut-over, or something in the middle, data replication and data integrity continue to be the core problems.
Amazon would likely do what some here have suggested, and automagically move people around when there’s an outage like this, except they likely can’t replicate all the data, maintain integrity, and keep up with the volume of changes on a second-to-second basis. SAN-to-SAN in a single data center is feasible, but from East-to-West coast is just not practical, today, on the scale of what Amazon is dealing with. And even so, this doesn’t really deal with the database/transactional integrity issue, which adds a whole additional layer.
So we agree that IaaS vendors don’t give us free, transparent DR/COOP, and we need to build that in just like we did before. But what about the increasing number of PaaS/SaaS (real) cloud vendors, where we have no insight or control over the infrastructure? Don’t those PaaS/SaaS (real) cloud vendors have to build in their own DR/COOP? And shouldn’t they disclose that to us?
I hope that products like NimbusDB (based on that great description a few weeks back) come to fruition and finally free us from location-centric-with-hacked-replication DB models. But it seems we’re still not there, and so multi-site cloud will still suffer from this currently not-intractable-but-really-pain-in-the-neck problem.
Adrian makes an excellent point that replication doesn't necessarily mean like-for-like increases in cost. Yes, you'll have a degree of base load maintaining data across multiple locations, but you can then elastically move workloads between locations. You can have this baked into a good set-up, so whether it is a shift in relative customer demand or a location going down, the system reacts automatically to maintain service. This was our final point in the post: this is a lot easier to achieve with a good cloud set-up than with dedicated hardware. Actually, with dedicated hardware you would likely see the sort of replication that Miha was talking about, because you don't have the cloud elasticity to play with and so need full redundancy of capacity.
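To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The function names and all the figures (peak, typical load, per-site replication base load) are made up for illustration, and the model is deliberately crude: it only contrasts "every site sized for the full peak" with "sites share the typical load and scale up on demand".

```python
def dedicated_capacity(peak: float, sites: int) -> float:
    """Full redundancy: every site is sized to carry the whole peak alone."""
    return peak * sites


def elastic_steady_capacity(typical: float, sites: int, base_per_site: float) -> float:
    """Elastic set-up: sites share the typical load, and each site also
    carries a fixed base load for keeping data replicated."""
    return typical + base_per_site * sites


# Hypothetical figures, in arbitrary "server units".
PEAK, TYPICAL, SITES, BASE = 100, 60, 2, 5

dedicated = dedicated_capacity(PEAK, SITES)
elastic = elastic_steady_capacity(TYPICAL, SITES, BASE)
print(f"dedicated, always on: {dedicated}")
print(f"elastic, steady state: {elastic}")
```

When a location fails, the elastic set-up still has to scale the surviving site up to the full load, so the saving is in what you pay for the rest of the time, not in what you need during the failure.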
This outage will likely encourage many users of IaaS to look at adopting such strategies, which can only be a good thing, but in doing so they would be in danger of fighting the last war, not the next, to use an old military adage. Single points of failure, be that a location or a system, aren't good. I think the next big disruption will be vendor-specific, meaning an issue with a vendor, be it software or corporate. That's another single point of failure and, regardless of size, absolutely no company is beyond failure or problems. To deny this is, IMHO, hubris. That's a lesson I think everyone should take from the recent financial collapse, the Enron scandal before it, etc. Given that there are many credible choices in the IaaS space, a company of a size that wants a proper multi-location strategy should also implement a multi-vendor strategy.
For example, what happens if a company you are doing business with has their assets frozen? Or, perhaps more likely, an exploit is found in some software they are using. That could bring down ALL their locations and, in an extreme case, potentially destroy all their data too. It isn't likely, but with so much data and computing moving to the cloud, there will definitely be exploits found in some of the plethora of cloud management software suites currently being pushed out, and that is the sort of vendor-specific, multi-location problem you could see.
Don't get me wrong, the cloud is in my opinion far more robust than a dedicated deployment and I don't wish to be accused of fear mongering because I think the track record of IaaS to date speaks for itself. My point is, doesn't it make sense to address all single points of failure that can be identified and understood, especially when they can be relatively easily addressed?
Kowsik, very well said. One key lesson is for customers not to automagically believe that, because they port their apps & data to either an IaaS or PaaS provider, the basic best practices for DR, redundancy and failover need not apply to achieve continuous system ops in a public cloud.
On the point about automatically moving things around to other facilities: that is a slippery slope given that Amazon is an IaaS provider with PoPs in several countries. There would need to be more workflows, audit tools for customers, etc. to ensure that regulated apps and data were not moved to the wrong PoP. As a company they have a responsibility to ensure applications are compliant not just with technical SLAs but also with business directives (regulatory, security, or people)....
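As a rough sketch of the kind of check such a workflow would need before moving anything automatically: filter the candidate regions by where a workload is allowed to live. The `Workload` type, the jurisdiction labels and the region-to-jurisdiction map are all hypothetical, not anything AWS provides.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Workload:
    name: str
    allowed_jurisdictions: FrozenSet[str]  # hypothetical labels, e.g. "US", "EU"


def eligible_targets(
    workload: Workload,
    region_jurisdiction: Dict[str, str],
    healthy_regions: Set[str],
) -> List[str]:
    """Return the healthy regions this workload may legally be moved to."""
    return sorted(
        region
        for region in healthy_regions
        if region_jurisdiction.get(region) in workload.allowed_jurisdictions
    )


# Assumed mapping for illustration only.
REGION_JURISDICTION = {"us-east-1": "US", "us-west-1": "US", "eu-west-1": "EU"}

eu_only = Workload("patient-records", frozenset({"EU"}))
targets = eligible_targets(eu_only, REGION_JURISDICTION, {"us-west-1", "eu-west-1"})
print(targets)
```

An empty result means there is nowhere compliant to go, and the right answer is to stay down or page a human rather than fail over.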
On Sat, Apr 23, 2011 at 12:06 PM, Khazret Sapenov <firstname.lastname@example.org> wrote:
> On Sat, Apr 23, 2011 at 2:58 PM, Miha Ahronovitz <email@example.com> wrote:
>>
>> Khazret, to give credibility to what you say, see
>> http://status.aws.amazon.com/
>> California EC is up, N. Virginia is down.
>
> Thanks, Miha for enhancing credibility of my post, but I've posted this link
> already yesterday and it's referenced everywhere.
>
>>
>> But why it should be the worry of the customers to move things around?
>> AWS should have mechanisms to move automatically to other facilities.
>> It doesn't
>
> Your statement contradicts your employer's notion of IaaS providing only
> bare minimum with all 'non-relevant' functionality outsourced to external
> parties.
> Now you want your competition to make an extra step, that sounds logical,
> but not implemented by many [IaaS] yet.
> Nice.
+1. AWS is the single largest multi-region IaaS vendor that provides a consistent API (across regions). If the apps were designed for failing over to an alternate region, all the bare metal capabilities do already exist to solve this problem (ELB, anycast DNS). Netflix, for example, runs in 3 different regions (see @adrianco's tweets).
I wouldn't say the same of PaaS vendors (built on top of AWS) that put all their apps in a single location. That was a fail.
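For what "designed for failing over to an alternate region" can mean at its simplest, here is a minimal sketch: walk a preference-ordered list of regions and take the first one that passes a health check. The `pick_region` helper and the status snapshot are assumptions for illustration; in practice the health check would probe your own service endpoints, and the hard part remains the data, not this loop.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    preferred: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes its health check."""
    for region in preferred:
        if is_healthy(region):
            return region
    return None  # nothing healthy: caller has to decide what "down" means


# Hypothetical status snapshot mirroring the outage discussed in this thread.
status = {"us-east-1": False, "us-west-1": True, "eu-west-1": True}

active = pick_region(
    ["us-east-1", "us-west-1", "eu-west-1"],
    lambda region: status.get(region, False),
)
print(active)
```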