Some thoughts on offering IaaS (or some things I learned building SimpleDB)

PatRansil

Jun 11, 2009, 6:29:06 PM
to Cloud Computing Interoperability Forum (CCIF)
I don't claim to be an expert but I have a few observations that might
interest others. This post is about things that I think are important
if you are planning to offer IaaS, either as a startup or a new line
of business at an existing company.

As with any new venture, it helps if you start with an 'unfair
advantage' over others who might want to enter the space. For IaaS,
this could mean:
- You already have a huge datacenter that pays for itself. This means
you have economies of scale which can bring your cost down below what
smaller rivals could achieve.
- You have expertise in running datacenters efficiently. Running a DC
is a very complex task and there are *many* decisions to make that
will affect your uptime, cost, and the morale of your staff who need
to keep the system running and fight fires when issues arise.
- You have customers who are asking about and can benefit from IaaS.
- You have expertise in designing scalable self-healing systems. You
understand how to design "so that scale is your friend, not your
enemy" to paraphrase Werner Vogels (Amazon CTO).

Amazon had the advantage of strength in all these areas. Many hosting
companies are strong in the first 3. I will follow with another post
with more details on the last item.

Greg Pfister

Jun 13, 2009, 6:41:33 PM
to Cloud Computing Interoperability Forum (CCIF)
Hi, Pat. I like your list. Couldn't it be summarized as "experience in
developing and running your own private IaaS cloud"?

Your list is still needed, of course, since it indicates what that
means. I'm stating it that way specifically to raise the hackles of
those who regularly post here about how "private clouds don't make
sense / aren't clouds / are the work of the devil / etc."

And several months ago I tired of responding with the question "What
was AWS one day before it was publicly announced?"

Greg Pfister
http://perilsofparallel.blogspot.com/

Pat Ransil

Jun 13, 2009, 7:47:31 PM
to cloud...@googlegroups.com
Greg,
Your summary is pretty close, but I separated some of the items because you could have a top team with this experience and still, if they are starting a new company, lack the economies of scale or the customer list.

The other thing to make explicit is that you could run a private datacenter, or even a public hosting facility, but if you use normal hosting architectures and procedures, the manpower needed to keep things running scales roughly linearly (maybe a bit better) with the hardware. But if you can build the system so that hardware failures are a normal part of every day, and the system has some marginal excess capacity and can 'self-heal,' then scale is your friend and your staffing costs can come down a lot.

At really large scale, hardware fails all the time. When you have petabytes and tens of thousands of computers, that's just normal. But with good design, when a drive or a server goes down, the system realizes that its redundancy count is down by one, so it spins up another without human intervention. Alarms don't go off and nobody gets paged. The next day you look at the dashboard and see where your capacity margin is, and the system has an estimate for when you need to order more hardware. It knows the lead time for ordering and getting new devices online and is tracking your failure rate.

Over time, with scale, this gets pretty accurate. Until something changes, which happens too often: new hardware vendors, say, or a change in financial controls. You also have things like a lightning strike that fries a whole room full of devices. But in my experience, the most frequent unforeseen cause of problems is human error. Humans...can't live with them, can't live without them. ;-)
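As a rough illustration of the ordering estimate described above (capacity margin, tracked failure rate, lead time for new hardware), here is a minimal Python sketch. The function name, its parameters, and the numbers are hypothetical and are not taken from any Amazon system; it also assumes a constant failure rate and constant demand growth, which the post itself notes will not always hold.

```python
import math


def days_until_order(spare_units, failures_per_day, growth_per_day, lead_time_days):
    """Return how many days remain before a hardware order must be placed.

    spare_units: capacity margin currently available (spare drives or servers)
    failures_per_day: observed average number of units lost per day
    growth_per_day: average number of extra units consumed per day by demand
    lead_time_days: time from placing an order to having the hardware online
    A result of 0.0 means the order should already have been placed.
    """
    burn_rate = failures_per_day + growth_per_day
    if burn_rate <= 0:
        return math.inf  # the margin is not shrinking, so nothing needs ordering
    days_of_margin = spare_units / burn_rate
    return max(0.0, days_of_margin - lead_time_days)


# 400 spare units burned at 8 per day last 50 days; with a 30-day lead time
# the order has to go out within 20 days.
print(days_until_order(spare_units=400, failures_per_day=3, growth_per_day=5, lead_time_days=30))
```

A dashboard could recompute this daily from the measured failure rate, which is one way the estimate would become more accurate as the fleet grows.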

Pat

Chris Marino

Jun 13, 2009, 11:23:52 PM
to cloud...@googlegroups.com
Pat, I think your 4 criteria are pretty accurate.

As you say, the hosting companies have 1-3, and not much of 4. The key difference between hosting and IaaS is 4. There was a (long) thread a while back focused on Amazon's Reserved Instances and whether or not that was an indication of the market demanding more of #1-3 than #4.

I argued then (and still believe today) that IaaS and managed hosting were rapidly converging. The ones that can deliver on #4 will get some price premium over the rest, while the ones that can't deliver anything credible on #4 are just going to disappear.

CM

Andrew Badera

Jun 14, 2009, 1:09:16 AM
to cloud...@googlegroups.com
I think that may be one of the most excellent definitions or clear
delineations of IaaS I've encountered to date. Thanks for that.

Thanks-
- Andy Badera
- and...@badera.us
- Google me: http://www.google.com/search?q=andrew+badera
- This email is: [ ] bloggable [x] ask first [ ] private

Rao Dronamraju

Jun 14, 2009, 1:15:43 AM
to cloud...@googlegroups.com

Pat,

I agree with you, and with Greg and Chris earlier, but I think there are a few more things that need to be looked at for IaaS in general and Cloud in particular.

Your #3, “You have customers who are asking about and can benefit from IaaS”.

Now this is a very important point. If we look at present-day markets, who are these customers and who has them?

The guys who have had these customers (SME and Enterprise) wrapped up for at least the last 30+ years are the enterprise computing companies like IBM, HP, Sun/Oracle, Microsoft, SAP, etc.

Amongst these traditional computing players, only Microsoft has mega data center operational experience, mainly because of msn.com, in addition to its own corporate data centers, which all the other players also have. For instance, HP is consolidating something like 30 to 40 of its data centers from across the world into 6 super data centers.

But interestingly, Amazon (to some extent, from a data center size point of view), Google, and Yahoo have the mega data center operational experience but they do not have the customers. So the market scenario is quite interesting. Those players who have the customers do not have the mega-scale data center operational experience, and those players who have the experience do not have the customers. In fact, an interesting thing that we have seen recently is the partnerships between the traditional players and data center players: HP and Verizon, for example. I think IBM signed some partnership agreement with Amazon.

In addition, I do not think enterprise companies are going to move to Public Clouds at all for quite some time. There are too many strong players and forces that will make sure that the path to Public Clouds is through Private Clouds, then Hybrid Clouds, and only then Public Clouds. There is a lot to lose for the traditional computing companies in the PPU (pay-per-use) revenue model. Do you think HP, IBM, Sun and other HW players would like an untested, unpredictable PPU model when they are today able to make decent if not substantial profits through the sale model? Oracle and MS will do everything possible to delay if not totally squash the PPU model. Oracle is making 23% profits on its software, and MS is not making any less. In addition, they also make a LOT of money through services, especially IBM, and HP will get there in 3 to 5 years. On top of that, the CIOs of enterprises are not ready for Public Clouds due to their real or unreal concerns about Security, Compliance, Privacy, etc.

So coming back to your point that you have customers who are asking for IaaS: in the short run it will be SMEs who will be asking for IaaS in the Public Clouds, and Enterprises in the Private Cloud space without PPU. Of course there are also startups and ISVs who will be interested in IaaS in the Public Clouds.

“You have expertise in designing scalable self-healing systems. You understand how to design "so that scale is your friend, not your enemy" to paraphrase Werner Vogels (Amazon CTO).”

This is also a very interesting point. About scalable systems: you can design scalable systems depending on what kind of applications you are running on them. Folks always point to Google as an example of super-scalable COTS HW systems, in use for their map reduce. I think Google is just one, very specific example. The SEARCH application that Google runs is inherently parallelizable and scalable to large proportions because there are no dependencies within the data. Data center applications, however, are not search applications. In fact, very few applications in real life are parallelization-friendly. So the scalability needed when enterprise data centers move to clouds is a different kind of scalability. It is a lot easier to group unrelated workloads onto multiple cores, multiple CPUs, and multiple servers and utilize them to the fullest extent. This is in fact a lot more suitable for multi-tenancy, which is going to be a central aspect of clouds, not only from a scalability perspective but also from an economies-of-scale perspective. It is not any different from how server farms have been utilized in web/internet-scale operations for, say, e-commerce workloads.
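The idea of grouping unrelated workloads onto shared servers can be sketched as a bin-packing problem. The Python below uses first-fit decreasing, which is only one common heuristic and not something the post prescribes; the tenant names, core counts, and server size are made up for illustration.

```python
def pack_workloads(demands, server_capacity):
    """Place independent workloads on identical servers using first-fit decreasing.

    demands: mapping of workload name -> cores required (hypothetical units)
    server_capacity: cores available on each server
    Returns a list of servers, each a list of the workload names placed on it.
    """
    servers = []  # workload names per server
    free = []     # remaining cores per server, same order as `servers`
    # Largest workloads first, so small ones can fill the leftover gaps.
    for name, need in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if need > server_capacity:
            raise ValueError(f"{name} does not fit on a single server")
        for i, room in enumerate(free):
            if need <= room:
                servers[i].append(name)
                free[i] -= need
                break
        else:
            # No existing server has room, so open a new one.
            servers.append([name])
            free.append(server_capacity - need)
    return servers


tenants = {"crm": 6, "payroll": 3, "wiki": 2, "mail": 5, "build": 4}
print(pack_workloads(tenants, server_capacity=10))
```

Because the workloads share no data, the packing needs no knowledge of what each one does, which is the contrast with parallelizing a single application.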

About self-healing: although companies have been talking about self-healing for a long time, autonomic computing technology has not progressed far enough to facilitate a fully automated and managed data center, let alone a cloud. I think there are plenty of opportunities for entrepreneurs here, but today's technology is woefully inadequate to meet the self-healing requirements of a cloud. We have not even achieved a lights-out data center yet; an autonomic data center is a few years off, and so is an autonomic cloud.

Many hosting companies are not really in a position to be Cloud Service Providers. They lack the expertise and service personnel to service Enterprise companies and their solutions. A hosting provider today cannot host, say, Fidelity or Coca-Cola and provide the SOLUTION EXPERTISE that IBM, HP, Sun/Oracle and MS provide today. They are equipped to HOST the HW and SW, not the SOLUTIONS and SERVICES. In fact, one thing they might want to do is hire all the service and technology experts that these bigwigs are foolishly laying off and turn the tables on them.

Anyway, interesting discussion; I thought I'd chime in.

Rao

Greg Pfister

Jun 14, 2009, 12:55:48 PM
to Cloud Computing Interoperability Forum (CCIF)
Pat,

Understood that the "private cloud" has to be large enough to invoke /
require / incorporate failure as a normal occurrence -- good point.

I wonder where the scaling breakpoint lies when you have to do that,
and where that failure support lies.

The reason I ask is that I know of HPC server farms with 1000s of
nodes that do not do that, and still have small staffing counts. They
work by religiously making every node utterly identical, and not
allowing any node-local storage by users. (When I was at IBM, one such
customer would not install a new shipment of nodes because they ran 5%
faster than the ones already installed.)

The grid software can presumably avoid dead nodes when scheduling
jobs, but the jobs themselves are often not failure tolerant.

So in addition to the infrastructure plumbing, the applications
themselves also have to incorporate node-failure tolerance, right?
Whether that's by implicit redundancy, checkpointing, or whatever, it
has to be there. It's not just a characteristic of the underlying
system.
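One way an application can carry its own node-failure tolerance is checkpointing, as mentioned above. The sketch below is a toy Python job that sums a list and saves its progress after every item; the function name, the `fail_at` parameter, and the JSON checkpoint format are all invented for illustration. A real HPC job would write the checkpoint to shared storage rather than a local temp directory, since the farms described here forbid node-local state.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, fail_at=None):
    """Sum `items`, saving progress after each one so a restart can resume.

    fail_at: index at which to simulate a node failure (demonstration only).
    """
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume where the previous run stopped
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
data = list(range(10))
try:
    run_with_checkpoints(data, path, fail_at=6)  # first node dies partway through
except RuntimeError:
    pass
print(run_with_checkpoints(data, path))          # a second node resumes at index 6
```

The scheduler only has to rerun the job somewhere else; the job itself decides how much work is lost, which is the point that the tolerance lives in the application.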

And I suspect the breakpoint lies somewhere between 1,000 and 10,000
nodes.
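A back-of-the-envelope calculation suggests why a breakpoint in that range is plausible. The Python below assumes independent node failures at a constant rate and a per-node MTBF of 50,000 hours; that MTBF is a made-up figure for illustration, not a measurement from this thread.

```python
def expected_failures_per_day(nodes, mtbf_hours):
    """Expected node failures per day, assuming independent failures at a constant rate."""
    return nodes * 24 / mtbf_hours


for nodes in (100, 1_000, 10_000):
    per_day = expected_failures_per_day(nodes, mtbf_hours=50_000)
    print(f"{nodes:>6} nodes: {per_day:.2f} failures/day, one every {1 / per_day:.1f} days")
```

Under that assumption, 1,000 nodes see a failure roughly every two days, while 10,000 nodes see several every day, so somewhere in between failure stops being an event and becomes routine. A different MTBF shifts the numbers but not the shape of the argument.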

Greg Pfister
http://perilsofparallel.blogspot.com/

Pat Ransil

Jun 14, 2009, 11:51:23 PM
to cloud...@googlegroups.com
Greg,
I am not sure there is a real "breakpoint" in the sense that the old model becomes less efficient. If you can run a DC with X computers by using Y people, you can probably multiply both X and Y by the same number and work just as well. But to do significantly better it really helps if you can make everything routine, eliminating firefighting and off-hours paging. This is where I found the self-healing/autonomic properties to be so valuable. You won't ever get *everything* into this category, but things that happen repeatedly are the place to start.

As you point out, forcing uniformity is an effective way to reduce staffing costs. The problem is that your customers want choice. Amazon started out with one EC2 version. Now they have a menu of sizes and CPU/memory configs because customers asked for them. And Amazon still does not have to page anyone when an instance goes down. If you set up your system well, another instance spins up, connects to the EBS volume, and you are off and running with no human intervention required.
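The "another instance spins up" behavior can be pictured as a reconciliation loop that compares the desired instance count with the healthy count. The Python below is a toy in-memory model; the `Pool` class and `reconcile` function are hypothetical and do not call EC2, EBS, or any real cloud API.

```python
import itertools


class Pool:
    """Toy in-memory stand-in for a fleet of instances (not a real cloud API)."""

    def __init__(self, desired):
        self.desired = desired
        self._ids = itertools.count(1)
        self.healthy = set()

    def launch(self):
        instance_id = f"i-{next(self._ids)}"
        self.healthy.add(instance_id)
        return instance_id

    def fail(self, instance_id):
        self.healthy.discard(instance_id)


def reconcile(pool):
    """Launch replacements until the healthy count matches the desired count."""
    launched = []
    while len(pool.healthy) < pool.desired:
        launched.append(pool.launch())
    return launched


pool = Pool(desired=3)
reconcile(pool)          # initial fill: i-1, i-2, i-3
pool.fail("i-2")         # an instance dies; nobody is paged
print(reconcile(pool))   # the next pass launches one replacement
```

Run on a timer, a loop like this turns an instance failure into a routine event: the only human-visible trace is a new instance ID on the dashboard.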

Pat

Pat Ransil

Jun 15, 2009, 2:38:38 AM
to cloud...@googlegroups.com
Rao,
Good point about the companies with the experience running the biggest datacenters not being in the hosting business, and therefore not having a customer base of companies that are a natural fit for the cloud. I also agree that most enterprise companies won't quickly use public clouds for business-critical purposes. Private clouds and hybrid clouds will be important for them. Even in the long term, hybrid solutions may be best for enterprise companies. They have the scale and expertise to run efficient datacenters but can greatly benefit by using public clouds for flex capacity. That way they can purchase only the capacity that they know they will need and use the public cloud for peak usage or if they underestimated.

You are also correct that autonomic computing has a long way to go. Completely self-healing systems are an ideal that we are not likely to ever fully achieve. Still, my experience has been that you can automate some types of failure recovery and that this is significant. I know I ended up getting more nights of sleep because of automated recovery. We built systems that first detected problems and suggested recovery actions but waited for human confirmation before doing anything. It makes sense to run HITL (human in the loop) until you get enough confidence to automate that scenario.
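The human-in-the-loop progression described above (detect, suggest, wait for confirmation, automate once confident) could look something like the following Python sketch. The `RecoveryPolicy` class, its threshold, and the scenario name are hypothetical; counting approved runs is just one simple stand-in for "enough confidence."

```python
class RecoveryPolicy:
    """Run a recovery action with human approval until it has earned trust."""

    def __init__(self, confirmations_needed=3):
        self.confirmations_needed = confirmations_needed
        self.successes = {}  # scenario -> approved runs that completed without raising

    def is_automatic(self, scenario):
        return self.successes.get(scenario, 0) >= self.confirmations_needed

    def handle(self, scenario, action, ask_human):
        """Return 'auto', 'approved', or 'declined' to describe what happened."""
        if self.is_automatic(scenario):
            action()
            return "auto"
        if not ask_human(f"Detected {scenario}. Run suggested recovery?"):
            return "declined"
        action()
        self.successes[scenario] = self.successes.get(scenario, 0) + 1
        return "approved"


log = []


def restart_replica():
    log.append("restarted replica")


def operator_says_yes(prompt):
    return True  # stands in for a person clicking "confirm"


policy = RecoveryPolicy(confirmations_needed=2)
outcomes = [policy.handle("replica-down", restart_replica, operator_says_yes) for _ in range(3)]
print(outcomes)  # ['approved', 'approved', 'auto']
```

Trust is tracked per scenario, so a well-understood failure can run unattended while a new one still waits for a person.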

You point out that few applications are easy to parallelize. That is true, but I am guessing that part of the reason is that our current parallelization systems are not very flexible. Map-Reduce is a great improvement, allowing more work to be parallelized, but we need to go further. As more designers become familiar with Map-Reduce and other recent developments in parallel computing (and not-so-recent work), *we* will build better tools and get better at recasting problems so that they parallelize more easily. Look at what Aster Data is doing with Map-Reduce and SQL as a case in point.
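For readers who have not seen the pattern, here is a minimal single-process Python sketch of the Map-Reduce shape, using word counting as the usual toy problem. It is not how Google's or Aster Data's systems are implemented; the function names and sample documents are made up. The point is that the map step touches one document at a time, which is what lets a framework spread it across machines.

```python
from collections import Counter
from functools import reduce


def map_phase(document):
    """Map step: count words within one document, independently of all others."""
    return Counter(document.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


documents = ["scale is your friend", "scale is not your enemy"]
partials = map(map_phase, documents)  # each call could run on a different machine
totals = reduce(reduce_phase, partials, Counter())
print(totals["scale"], totals["friend"])  # 2 1
```

Recasting a problem for this model mostly means finding a map step with no cross-record dependencies and a reduce step that is associative, so partial results can be merged in any grouping.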

Pat

Ravi

Jun 18, 2009, 6:36:45 PM
to Cloud Computing Interoperability Forum (CCIF)
Rao and Pat,

You both make very good points. Coming from an application hosting
background, I see firsthand how hard it is to manage enterprise
applications. As Pat points out, we are far, far away from completely
autonomic systems. Cloud or no cloud, vendors like Oracle are not
making it any easier for us.

Manageability is a very important consideration for any application to
be cloud capable. This manageability should be encapsulated within the
application itself so that routine tasks such as backups, monitoring,
patching, startup/shutdown follow a standard (which is yet to be
defined as far as I know). In my opinion this is where there is much
challenge and hence much opportunity as well. Manageability By Design
is a very compelling concept.
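To make "manageability encapsulated within the application" concrete, here is a Python sketch of what such a contract might look like. The post notes that no standard exists, so the `Manageable` interface, its method names, and the `patch_safely` routine are all invented for illustration.

```python
from abc import ABC, abstractmethod


class Manageable(ABC):
    """Hypothetical management contract; no such standard is defined in the thread."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def health(self) -> dict: ...

    @abstractmethod
    def backup(self, destination: str) -> str: ...

    @abstractmethod
    def apply_patch(self, version: str) -> None: ...


class DemoApp(Manageable):
    """Toy application that only tracks its state in memory."""

    def __init__(self):
        self.running = False
        self.version = "1.0"

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def health(self) -> dict:
        return {"running": self.running, "version": self.version}

    def backup(self, destination: str) -> str:
        # A real application would write its data here; this only names the backup.
        return f"{destination}/demo-{self.version}.bak"

    def apply_patch(self, version: str) -> None:
        if self.running:
            raise RuntimeError("stop the application before patching")
        self.version = version


def patch_safely(app: Manageable, version: str, backup_destination: str) -> dict:
    """Operator routine that works for any application honoring the contract."""
    app.backup(backup_destination)
    app.stop()
    app.apply_patch(version)
    app.start()
    return app.health()


print(patch_safely(DemoApp(), "1.1", "/backups"))
```

The hosting provider writes `patch_safely` once; each application vendor supplies the implementation, which is where the burden would shift under a Manageability By Design approach.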

(I tried to post it directly from my gmail account, but it didn't
appear. So I am directly replying to the post)

Regards,

Ravi
