More with Moore, more or less


Reuven Cohen

Jun 22, 2008, 7:47:54 PM
to cloud-computing
Recently I've been asked about the benefits of cloud computing in comparison to those of virtualization. Generally my answer has been that they are an ideal match. For the most part, virtualization has been about doing more with less (consolidation). VMware in particular has positioned its products and pricing in a way that encourages you to use the fewest servers possible. In comparison, the interesting thing about cloud computing is that it's about doing more with more. Or, if you're Intel, doing more with Moore.

At Intel's core, they are a company driven by one singular mantra: "Moore's Law". According to Wikipedia, Moore's Law describes an important trend in the history of computer hardware: the number of transistors that can be inexpensively placed on an integrated circuit increases exponentially, doubling approximately every two years. The observation was first made by Intel co-founder Gordon E. Moore in a 1965 paper.

Over the last couple of years we have been working very closely with Intel, specifically in the area of virtualization. During this time we have learned a lot about how they think and what drives them as an organization. In one of my early pitches we described our approach to virtualization as "doing more with Moore", a play on the common phrase "doing more with less" combined with some of the ideas behind "Moore's Law", which is all about growth and greater efficiency. They loved the idea: for the first time someone was looking at virtualization not purely as a way to consolidate a data center but as a way to scale your overall capacity more effectively.

What is interesting about Moore's Law in regard to cloud computing is that it is no longer just about how many transistors you can get on a single CPU, but about how effectively you spread your compute capacity across more than one CPU, whether multi-core chips or hundreds, even thousands, of connected servers. Historically, the faster the CPU gets, the more demanding the applications built for it become. I am curious whether we're on the verge of seeing a similar "Moore's Law" applied to the cloud. And if so, will it follow the same principles? Will we start to see a "Ruv's Law" where the amount of cloud capacity doubles every 18 months, or will we reach a point where there is never enough excess capacity to meet the demand?
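
For illustration, here is a minimal sketch of what such a hypothetical doubling law would imply; the 18-month period and the unit starting point are assumptions, not measurements:

    # Hypothetical "Ruv's Law": aggregate cloud capacity doubles every 18 months.
    # The doubling period and the 1.0 starting point are assumed for illustration.
    def capacity(years, doubling_months=18.0):
        """Relative capacity after `years`, starting from 1.0."""
        return 2 ** (years * 12.0 / doubling_months)

    for y in (0, 2, 4, 6):
        print("year %d: %.1fx initial capacity" % (y, capacity(y)))
    # year 0: 1.0x, year 2: 2.5x, year 4: 6.3x, year 6: 16.0x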

(Original Post:
http://elasticvapor.com/2008/06/more-with-moore-more-or-less.html)
--

Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.

blog > www.elasticvapor.com

Ian Rae

Jun 22, 2008, 8:11:46 PM
to cloud-c...@googlegroups.com
I think to become a law it needs to stand up to many tests; call it a hypothesis for now?

I see the cloud as more of an ecosystem, which is likely to defy any simple, linear rule due to emergent properties. I wouldn't be shocked if it were as hard to predict as the weather ("5 day forecast calls for a 200 msec standard deviation in latency with a 10% probability of the jitters"). I'm only slightly joking: my early experiences with sharing hosted grid computing resources were pretty blustery.

On that note I am much more interested in the phase transitions, critical
junctures where the properties change.

Ian @
http://infreemation.net

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Ray Nugent

Jun 22, 2008, 10:32:35 PM
to cloud-c...@googlegroups.com
More likely you'll see "Ray's Law" (sorry, couldn't help myself ;-) where the amount of computing one can do on a given piece of the cloud increases exponentially while the cost for said piece drops by half...

Ray

Paco NATHAN

Jun 23, 2008, 12:13:15 AM
to cloud-c...@googlegroups.com
Seems like there is some raw data available about trend lines for
cloud computing.

Jeff Dean showed some interesting numbers about cluster sizes and job
throughput in this talk:
http://sites.google.com/site/io/underneath-the-covers-at-google-current-systems-and-future-directions

Maybe add a little historical price data for commodity hardware...

Paco


On Sun, Jun 22, 2008 at 9:32 PM, Ray Nugent <rnu...@yahoo.com> wrote:
> More likely you see "Ray's Law" (sorry, couldn't help myself ;-) where the
> amount of computing one can do on a given piece of the cloud increases
> exponentially while the cost for said piece drops by half...
>
> Ray
>
> ----- Original Message ----
> From: Reuven Cohen <r...@enomaly.com>
...
> At Intel's core, they are a company driven by one singular mantra,
> "Moore's Law". According to wikipedia, Moore's law describes an
> important trend in the history of computer hardware: that the number
> of transistors that can be inexpensively placed on an integrated
> circuit is increasing exponentially, doubling approximately every two
> years. The observation was first made by Intel co-founder Gordon E.
> Moore in a 1965 paper.

...

Theodore Omtzigt

Jun 23, 2008, 9:31:46 AM
to cloud-c...@googlegroups.com
Paco:

That is an absolutely brilliant link!

Attributes                       Aug '04   Mar '05   Mar '06   Sep '07
Number of jobs                       29K       72K      171K    2,217K
Average completion time (secs)       634       934       874       395
Machine years used                   217       981     2,002    11,081
Input data read (TB)               3,288    12,571    52,254   403,152
Intermediate data (TB)               758     2,756     6,743    34,774
Output data written (TB)             193       941     2,970    14,018
Average worker machines              157       232       268       394

2M jobs per month, 400 PBytes read, and an average of roughly 400 worker machines per job. That in a nutshell is why Yahoo and MSN are being run out of town. That is using computing to add value to your business. Google has to be the poster child of "Competing on Analytics".

Of course, it takes serious money to get to that scale and in some way you
can think of this IT scale as the new business reality of distribution
power. Whereas big companies used to own the market because they were able
to distribute physical widgets, now big companies own the market because
they can aggregate information.

Does anyone have practical experience with the actual computes that take place and how much they would cost using EC2 instances? I would be interested to hear what the top 5 algorithms run at Google are, and what the attributes mentioned above are for those 5 algorithms. If we knew that, we could see how the economics map onto other cloud models (IBM, Sun, AWS).

Theo

Khazret Sapenov

Jun 23, 2008, 9:48:32 AM
to cloud-c...@googlegroups.com
I agree with Theodore, pretty interesting presentation. Part of Dean's story is devoted to the anatomy of a Google query, claiming that 1000s of machines are involved. I wonder how many watts one Google query like "restaurants new york" actually burns, and whether there's any intention to optimize the green side of the infrastructure.
 
cheers,
Khazret Sapenov

Theodore Omtzigt

Jun 23, 2008, 10:07:09 AM
to cloud-c...@googlegroups.com
Khazret:
 
  We should be able to do this from first principles: 0.25s, and let's say they use 4000 servers. I think the size of the index server banks is large since they benefit from having the whole index in memory. Whole index in memory + redundancy + 1000s of queries per sec indicates a large cluster for this (I was trying to dig up info on the size of the index but can't find it right now, so this is a bit of a shot from the hip).

0.25s on 4000 servers
1 index server is ~400 Watts, given that these have large memory configs and fastish CPUs
assume the network takes about 1/4 of the power of the cluster it serves

So the query consumes on the order of 500 kJoules. Let's see if we can put this in perspective: a hair dryer is about 1500 Watts, so the query takes as much energy as running the hair dryer for 5.5 minutes. Please double check my math.
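
Checking that arithmetic with a few lines of Python (every input below is one of the assumptions stated above, not a measured value):

    # Double-checking the back-of-the-envelope numbers above.
    servers = 4000
    watts_per_server = 400          # assumed large-memory index server
    network_factor = 1.25           # network adds ~1/4 of cluster power
    query_seconds = 0.25

    cluster_watts = servers * watts_per_server * network_factor  # 2.0 MW
    query_joules = cluster_watts * query_seconds                 # 500 kJ

    hair_dryer_watts = 1500.0
    print(query_joules / 1e3, "kJ per query")                    # 500.0
    print(query_joules / hair_dryer_watts / 60, "minutes of hair drying")  # ~5.6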
 
Theo


From: cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On Behalf Of Khazret Sapenov
Sent: Monday, June 23, 2008 9:49 AM
To: cloud-c...@googlegroups.com
Subject: Re: Top 5 computes at Google?

Reuven Cohen

Jun 23, 2008, 2:46:02 PM
to cloud-c...@googlegroups.com
Paco, thanks for the link. That presentation is very informative; we should invite Jeff@google to join our group!

I actually just read a Wired article that touched on similar ideas. (I was hoping to get quoted in the article, but it was not to be; interesting nonetheless.)

http://www.wired.com/science/discoveries/magazine/16-07/pb_intro

reuven

--

Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.

www.enomaly.com :: 416 848 6036 x 1
skype: ruv.net // aol: ruv6

blog > www.elasticvapor.com
-
Get Linked in> http://linkedin.com/pub/0/b72/7b4

Khazret Sapenov

Jun 23, 2008, 8:56:03 PM
to cloud-c...@googlegroups.com
On Mon, Jun 23, 2008 at 10:07 AM, Theodore Omtzigt <th...@stillwater-sc.com> wrote:
Khazret:
 
  We should be able to do this from first principles: 0.25s, and let's say they use 4000 servers. I think the size of the index server banks is large since they benefit from having the whole index in memory. Whole index in memory + redundancy + 1000s of queries per sec indicates a large cluster for this (I was trying to dig up info on the size of the index but can't find it right now, so this is a bit of a shot from the hip).

0.25s on 4000 servers
1 index server is ~400 Watts, given that these have large memory configs and fastish CPUs
assume the network takes about 1/4 of the power of the cluster it serves

So the query consumes on the order of 500 kJoules. Let's see if we can put this in perspective: a hair dryer is about 1500 Watts, so the query takes as much energy as running the hair dryer for 5.5 minutes. Please double check my math.
 
Theo
 
If we take your numbers as a starting point, then extrapolation gives us the following energy consumption for September 2007:

500 kilojoules is about 138.88 watt-hours, so 2,217,000 jobs/month gives about 307.9 megawatt-hours,

which equals about 205,264 hours of pretty American women standing in front of the mirror with a hairdryer. :)
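
Spelling out the same chain of assumptions (Theo's 500 kJ figure and the Sep '07 job count from Dean's table; nothing here is measured):

    # Extrapolating the assumed 500 kJ/query figure to the Sep '07 job count.
    wh_per_query = 500000.0 / 3600        # ~138.9 Wh
    jobs_per_month = 2217000              # Sep '07, from Dean's table

    mwh_per_month = jobs_per_month * wh_per_query / 1e6
    dryer_hours = jobs_per_month * wh_per_query / 1500    # 1500 W hairdryer

    print(round(mwh_per_month, 1), "MWh per month")       # ~307.9
    print(round(dryer_hours), "hairdryer-hours")          # ~205,000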

Greg Pfister

Jun 23, 2008, 9:00:52 PM
to Cloud Computing
Reuven,

I understand what you are saying, but I fear I must respectfully
disagree. In fact, I think you have it precisely backwards -- unless,
perhaps, you are specifically talking about Enomaly, as opposed to
cloud computing in general.

Moore's law is about larger numbers of transistors per device, per
system. Cloud computing, like clusters, farms, and grids, is about
using multiple systems. A key example is Google. As Paco's excellent
reference to "behind the scenes" at Google pointed out, Google is all
about using low-end systems. LOTS of low-end systems. Not the high-end
systems that are at any time the current best expression of Moore's
law. "Single machine performance is not interesting" (p. 4). I've seen
no indication they're the slightest bit interested in virtualization.

While cloud computing can *use* virtualization to pack more separate
systems into a smaller volume (with lower power, etc.), it is the
virtualization alone that exploits more processors per chip, more bits
per RAM DIP, etc.

Enomaly's implementation may tame the management of virtualization,
making it simpler to use. But taming, or eliminating, or outsourcing
management of many systems is a significant part of cloud computing --
whether those systems are virtual or not.

(Postscript -- other issues: (1) It seems to me that you are
implicitly equating "number of transistors" and "performance." There
used to be a direct relationship between them, but no longer. Big
subject. (2) Intel is by far not the only company "driven" by Moore's
Law. Everybody in the computing industry has been, and is. Clusterers/
farmers/cloudies, however, are far less so because of the multi-system
aspect.)

--
Greg Pfister

Reuven Cohen

Jun 23, 2008, 9:35:29 PM
to cloud-c...@googlegroups.com
I guess what I was trying to say is that, as with Moore's Law, there is an ever increasing amount of computing capacity needed within enterprises. I was curious whether this demand for compute capacity would increase according to a formula similar to Moore's Law. Whether it's a high- or low-end system, virtualized or PXE booted, is not important. The amount of aggregate cloud capacity is the important metric I was attempting to examine. As for Enomaly, of course I have a biased point of view; I started the company.

reuven

--

Geoffrey Fox

Jun 23, 2008, 9:40:16 PM
to cloud-c...@googlegroups.com
On transistors and performance, I note a post I did in May when we discussed "green computing". Namely
"I often used a slide deck from Peter Kogge from 10 years ago plotting
performance per transistor versus number of transistors. Performance
per transistor peaked at around a million transistors.
See http://www.old-npac.org/users/gcf/PPTKoggepimtalk/seporgimagedir/037.jpg
As chips went from one million to one hundred million transistors, the architecture
improvements did get a factor of 10 or so in performance but not linear
in number of transistors."

So for years Intel tried its hardest to support sequential computing (as parallel computing was/is considered too hard) but in a way that was highly inefficient; once you agree to parallelism -- as Google does -- you can move back to simpler, slower chips but with better performance per hairdryer-hour, to use the unit introduced earlier.
-- 
:
: Geoffrey Fox  g...@indiana.edu FAX 8128567972 http://www.infomall.org
: Phones Cell 812-219-4643 Home 8123239196 Lab 8128567977
: SkypeIn 812-669-0772 with voicemail, International cell 8123910207

Sassa NF

Jun 27, 2008, 6:58:53 AM
to cloud-c...@googlegroups.com
There is a curious law about chip size and performance.

To solve a problem of complexity N in time T, a chip of area A should satisfy the relationship A*T^2 >= c*f(N), where c is a constant and f(N) is a function of the problem complexity.

The curious thing here is that a chip of area n^2*A that solves the problem in time T/n has the same A*T^2 cost as the single-unit design: n^2*A * (T/n)^2 = A*T^2. So the same efficiency can be achieved by making n^2 units of area A, each solving the problem in time T (i.e. n times slower than the large unit). However, the system of n^2 units is capable of splitting the work into n^2 pieces, perhaps ending up finishing the whole job in T/n^2, i.e. n times faster than the large unit... Get it?
:-)

I wonder if the same goes for grids/clouds.
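
A toy calculation of that comparison (the constants are arbitrary, and perfect splitting of the work into n^2 pieces is assumed):

    # One large chip vs n^2 small units under the A*T^2 bound.
    A, T, n = 1.0, 1.0, 4

    big_cost = (n**2 * A) * (T / n)**2    # large chip: area n^2*A, time T/n
    small_cost = A * T**2                 # small unit: area A, time T
    print(big_cost, small_cost)           # both 1.0: the same A*T^2

    big_time = T / n                      # 0.25
    parallel_time = T / n**2              # 0.0625, if the work splits evenly
    print(big_time / parallel_time)       # 4.0: the n^2 units win by n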


Sassa

2008/6/24 Geoffrey Fox <g...@grids.ucs.indiana.edu>:

Khazret Sapenov

Jun 27, 2008, 9:50:52 AM
to cloud-c...@googlegroups.com
On Fri, Jun 27, 2008 at 6:58 AM, Sassa NF <sassa.nf@gmail.com> wrote:

There is a curious law about the chip size and the performance.

To solve a problem of complexity N in time T, the chip area A design
should satisfy a relationship A*T^2 >= c*f(N), where c is a constant,
and f(N) is a function of the problem complexity.

The curious thing here is that if you have a chip of size n^2*A that
solves the problem in time T/n, the efficiency of the single-unit
design is A*T^2. The same efficiency can be achieved by making n^2
units of area A solving the problem in time T (i.e. n times slower
than the large unit). However, the system of n^2 units is capable of
splitting the work into n^2 pieces, perhaps ending up finishing the
whole job in T/n^2, i.e. n times faster than the large unit... Get it?
:-)

I wonder if the same goes for grids/clouds.


Sassa
 
In my opinion, the general approach to this class of problems would be to solve a discrete optimization problem: achieve the best outcome (maximization/minimization) for specified requirements/constraints represented as linear equations.

It might be expressed in canonical form as:

Maximize/Minimize c^T x   (1)
subject to Ax <= b        (2)

where x is the vector of variables, c and b are vectors of known coefficients, and A is a known matrix of coefficients. (1) is the objective function; the inequalities (2) are the constraints.
 
For example, in the context of cloud/grid, here is what I have encountered in my work:

- optimal fit between cloud resources and workloads for known constraints (bandwidth, disk space, power, total cost of ownership, consolidation ratio, etc.);

- power and cooling costs for green computing initiatives.

As a product of the optimization, we receive a specific plan (e.g. a consolidation plan) that minimizes resource contention and maximizes utilization of the cloud infrastructure.
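
As a toy instance of the canonical form above, using scipy's linprog; the workload values and resource limits are invented for illustration:

    # Maximize the value of placed workloads subject to resource limits.
    # linprog minimizes, so the objective coefficients are negated.
    from scipy.optimize import linprog

    c = [-3.0, -5.0]             # value per unit of workload types 1 and 2
    A_ub = [[2.0, 4.0],          # bandwidth consumed per unit of each type
            [3.0, 2.0]]          # power consumed per unit of each type
    b_ub = [100.0, 90.0]         # available bandwidth, available power

    res = linprog(c=c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
    print(res.x, -res.fun)       # optimal placement (20, 15), value 135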
 
cheers,
--
Khaz Sapenov,
Director of Research & Development
Enomaly Labs

US Phone: 212-461-4988 x5
Canada Phone: 416-848-6036 x5
E-mail: kh...@enomaly.com
Get Linked in> http://www.linkedin.com/in/sapenov  

Paco NATHAN

Jun 27, 2008, 7:47:53 PM
to cloud-c...@googlegroups.com
Be careful what you optimize for...

I think those two posts are on the right track, in terms of optimization problems. The constraints in that system of equations become interesting -- and in practice go beyond what O(N) complexity generally considers.

Case in point:

1. Suppose my business has a mission-critical system, which requires 4
servers. That system completes some specific processing in nearly 24
hrs, and it must run each day. Some people would look at that and
pronounce it "Close to full utilization - good." An LP model might
agree, given a superficial set of constraints and cost function.

2. Suppose further that it would be possible to move this same processing off to some other grid, and run it on 48 servers + 2 controllers (same size servers) within approximately 2 hours. Roughly speaking, linear scaling. Slightly more expensive, due to the added controllers.


In practice, however, one must consider other factors beyond the
complexity of the problem "N", the number of nodes "n", and the time
"T" required:

* overall capex vs. opex for the server resources -- particularly
if a large number of servers can be leased at an hourly rate, thereby
reducing capex.

* cost of moving the data -- both the time required and the network cost.

* operational risk in Approach 1 if one of those four servers goes
belly up -- versus the fault tolerance benefits of a cloud-based
framework which can accommodate node failures.

* operational risk in Approach 1 if some aspect of the data or code
involved is incorrect, and therefore "full utilization" implies the
job cannot be rerun within the 24 hour requirement.

* opportunity cost if the problem scales up suddenly, such that
Approach 1 requires either more than 4 servers or more than 24 hours
to run.

Considering those trade-offs in the constraints and cost function
allows an optimization problem to solve for minimized risk.
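
To make those trade-offs concrete, a back-of-the-envelope comparison; the hourly rates are invented, and only the server counts and run times come from the example above:

    # Approach 1: 4 servers for ~24 h. Approach 2: 48 workers + 2 controllers
    # for ~2 h. The $/server-hour rates are assumptions for illustration.
    owned_rate, leased_rate = 0.10, 0.12

    cost_1 = 4 * 24 * owned_rate            # $9.60
    cost_2 = (48 + 2) * 2 * leased_rate     # $12.00

    print("approach 1: $%.2f, 24 h: a rerun blows the daily window" % cost_1)
    print("approach 2: $%.2f, 2 h: a rerun still fits easily" % cost_2)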

Managing risk is what will convince the decision makers at enterprise
businesses to adopt cloud computing for mission critical systems. But
not O(N) complexity :)

Hope that helps -

Paco

--
Manager, Behavioral Targeting
Adknowledge, Inc.

pna...@adknowledge.com
pac...@cs.stanford.edu


On Fri, Jun 27, 2008 at 8:50 AM, Khazret Sapenov <sap...@gmail.com> wrote:

Greg Pfister

Jun 27, 2008, 9:35:23 PM
to Cloud Computing
Right, and power is proportional to frequency cubed, so more slower
processors always wins in power & heat terms. So we should all use 10
Million 0.01MIPS processors and save the planet.
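
The arithmetic behind that, for the record (assuming power scales as f^3 and the work parallelizes perfectly, which is exactly the catch):

    # Why "more, slower processors" wins on paper: power ~ f^3, throughput ~ f.
    def design(n_procs, freq):
        return n_procs * freq, n_procs * freq**3    # (throughput, power)

    print(design(1, 1.0))    # (1.0, 1.0)   one full-speed processor
    print(design(10, 0.1))   # (1.0, 0.01)  same throughput, 1% the power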

All such statements are totally, undeniably, true. (Maybe not saving
the planet all by ourselves.)

They also ignore natural functional processing boundaries, and how difficult it is to parallelize code to units smaller than that, and how simple it can be to do when the processing matches those boundaries.

Example: When commodity micros got fast enough that each could handle
an entire web request all by itself at reasonable response time, what
happened? The web, that's what. Server farms. Stick a simple
dispatcher in front of a pile of commodity servers with a shared file
system, and away you go.

(Well, OK, each server overlaps disk IO to run other threads in the IO
wait time, but the code the majority of programmers wrote was straight
sequential.)

Think how hard that would be to do if each individual web request had
to somehow be split up into much smaller grains that ran in parallel.

The scaling of most commercial clouds absolutely lives on
relationships like this. It's why most HPC code is so much harder to
parallelize: Most of it does not have lots of independent requests.

There is a long-running repeating pattern in parallel processing where
people propose a gazillion ants to run in parallel to get absolutely
huge performance. On the other hand, Seymour Cray once said "I'd
rather plow a field with four strong horses than with 512 chickens."
It all depends on the application, of course; there are certainly
cases that work well with 512 chickens, or 10E10 bacteria, or
whatever. But most cases work better with the horses.

--
Greg Pfister

On Jun 27, 5:58 am, "Sassa NF" <sassa...@gmail.com> wrote:
> There is a curious law about the chip size and the performance.
>
> To solve a problem of complexity N in time T, the chip area A design
> should satisfy a relationship A*T^2 >= c*f(N), where c is a constant,
> and f(N) is a function of the problem complexity.
>
> The curious thing here is that if you have a chip of size n^2*A that
> solves the problem in time T/n, the efficiency of the single-unit
> design is A*T^2. The same efficiency can be achieved by making n^2
> units of area A solving the problem in time T (i.e. n times slower
> than the large unit). However, the system of n^2 units is capable of
> splitting the work into n^2 pieces, perhaps ending up finishing the
> whole job in T/n^2, i.e. n times faster than the large unit... Get it?
> :-)
>
> I wonder if the same goes for grids/clouds.
>
> Sassa
>
> 2008/6/24 Geoffrey Fox <g...@grids.ucs.indiana.edu>:
>
> > On transistors and performance, I note a post I did in May when we discussed
> > "green computing". Namely
> > "I often used a slide deck from Peter Kogge from 10 years ago plotting
> > performance per transistor versus number of transistors. Performance
> > per transistor peaked at around a million transistors.
> > See http://www.old-npac.org/users/gcf/PPTKoggepimtalk/seporgimagedir/037.jpg
> > : Geoffrey Fox  g...@indiana.edu FAX 8128567972 http://www.infomall.org

Sassa NF

Jun 28, 2008, 6:13:01 AM
to cloud-c...@googlegroups.com
What I thought was that people get 512 horses into a cloud, because
they don't have elephants.


Sassa

2008/6/28 Greg Pfister <greg.p...@gmail.com>:

Sassa NF

Jun 28, 2008, 6:41:32 AM
to cloud-c...@googlegroups.com
2008/6/28 Greg Pfister <greg.p...@gmail.com>:

> Right, and power is proportional to frequency cubed, so more slower
> processors always wins in power & heat terms. So we should all use 10
> Million 0.01MIPS processors and save the planet.

That's an idea for Khaz ;-)


> All such statements are totally, undeniably, true. (Maybe not saving
> the planet all by ourselves.)
>
> They also ignore natural functional processing boundaries, and how
> difficult it is to parallelize code to units smaller than that, and
> how simple can be to do when the processing matches those boundaries.

That's true, and that's why the AMD guy was politely sarcastic about
planting more cores and CPUs on one board, as mentioned in the "The
Register" article quoted here. The OS simply can't find enough jobs to
be done in parallel...


> Example: When commodity micros got fast enough that each could handle
> an entire web request all by itself at reasonable response time, what
> happened? The web, that's what. Server farms. Stick a simple
> dispatcher in front of a pile of commodity servers with a shared file
> system, and away you go.

Why a simple dispatcher in front of a pile of commodity servers
instead of a single mega-server?

I think the wisdom behind the A*T^2 law is that it may be easier to build smaller, slower units (or buy slower computers) than faster units, even though the faster units are larger. And that's where it ends. I don't think it says you can do everything with that, or that the time T of the individual small units will be acceptable, because in the end, to get things working n times faster than a single unit, you need to parallelise processing among n^2 units.


> (Well, OK, each server overlaps disk IO to run other threads in the IO
> wait time, but the code the majority of programmers wrote was straight
> sequential.)
>
> Think how hard that would be to do if each individual web request had
> to somehow be split up into much smaller grains that ran in parallel.

I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?

With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM. So the
question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.


> The scaling of most commercial clouds absolutely lives on
> relationships like this. It's why most HPC code is so much harder to
> parallelize: Most of it does not have lots of independent requests.

That's true, no doubt.


> There is a long-running repeating pattern in parallel processing where
> people propose a gazillion ants to run in parallel to get absolutely
> huge performance. On the other hand, Seymour Cray once said "I'd
> rather plow a field with four strong horses than with 512 chickens."
> It all depends on the application, of course; there are certainly
> cases that work well with 512 chickens, or 10E10 bacteria, or
> whatever. But most cases work better with he horses.

But we don't eat with ladles, we still eat with spoons :-) even if it
requires more trips.

Also, if you were to carry shopping of 100 people for 1 mile, you
could do 100 trips, or you could use a train of shopping trolleys and
just walk home.

As you said, it all depends on the job. And it looks like Google found
that using chickens is enough for the field they are about to plough.


Sassa

timothy norman huber

Jun 28, 2008, 12:08:38 PM
to cloud-c...@googlegroups.com
On Jun 28, 2008, at 3:41 AM, Sassa NF wrote:

I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?

With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM.

You've illustrated the memory gap.  Memory doubles in density every 30 months compared to 18 months for processors.  

Fred Weber, former CTO of AMD (spearheaded development of the Opteron), and Suresh Rajan, former Dir. of Marketing at nVidia, hatched MetaRAM to increase memory footprints utilizing existing memory buses and commodity DRAM parts.

So the question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.

For those that might not follow LKML, read Linux kernel hacker Daniel Phillips' paper "Ramback: Faster than a Speeding Bullet" (http://lwn.net/Articles/272534/). The thread became a 400+ post flame war which ended when Alan Cox conceded Daniel was right.

Here's the intro:
Every little factor of 25 performance increase really helps.

Ramback is a new virtual device with the ability to back a ramdisk
by a real disk, obtaining the performance level of a ramdisk but with
the data durability of a hard disk.  To work this magic, ramback needs
a little help from a UPS.  In a typical test, ramback reduced a 25
second file operation[1] to under one second including sync.  Even
greater gains are possible for seek-intensive applications.

The difference between ramback and an ordinary ramdisk is: when the 
machine powers down the data does not vanish because it is continuously 
saved to backing store.  When line power returns, the backing store
repopulates the ramdisk while allowing application io to proceed 
concurrently.  Once fully populated, a little green light winks on and 
file operations once again run at ramdisk speed.
Daniel's code is rev 0.0, and is buggy like any new code, but the backbone is solid.  Motivated parties should foster development of this open-source gem.   Accelerating durable writes could dramatically accelerate cloud-based transactional systems.

Last bit-  x64 PCIe chipsets are just now hitting the market.  Big Honking multi-processor servers  loaded with RAM can now have multiple x4 IB or 10GbE NICs.   IB is pretty cool, especially when you consider the RDMA capabilities.  

T



Timothy Huber
Strategic Account Development


181 Metro Drive, Suite 400 
San Jose, CA 95110 


Greg Pfister

Jun 28, 2008, 3:30:46 PM
to Cloud Computing
On Jun 28, 5:41 am, "Sassa NF" <sassa...@gmail.com> wrote:
> 2008/6/28 Greg Pfister <greg.pfis...@gmail.com>:
>
> > Right, and power is proportional to frequency cubed, so more slower

Actually, now I'm not sure whether it's cubed or squared, but either
one works in this discussion.

> > processors always wins in power & heat terms. So we should all use 10
> > Million 0.01MIPS processors and save the planet.
>
> That's an idea for Khaz ;-)
>
> > All such statements are totally, undeniably, true. (Maybe not saving
> > the planet all by ourselves.)
>
> > They also ignore natural functional processing boundaries, and how
> > difficult it is to parallelize code to units smaller than that, and
> > how simple can be to do when the processing matches those boundaries.
>
> That's true, and that's why the AMD guy was politely sarcastic about
> planting more cores and CPUs on one board, as mentioned in the "The
> Register" article quoted here. The OS simply can't find enough jobs to
> be done in parallel...
>
> > Example: When commodity micros got fast enough that each could handle
> > an entire web request all by itself at reasonable response time, what
> > happened? The web, that's what. Server farms. Stick a simple
> > dispatcher in front of a pile of commodity servers with a shared file
> > system, and away you go.
>
> Why a simple dispatcher in front of a pile of commodity servers
> instead of a single mega-server?

Because the pile costs a whole lot less. Compared with, say, a
mainframe, something like 100X less per unit of performance. Maybe
it's not less counting continuing management costs, but lots of people
have a hard time seeing past a 100X initial purchase price delta to
TCO over many years.

> I think the wisdom behind A*T^2 law is that it may be easier to build
> smaller slow units (or buy slower computers) than faster units even if
> they are larger. And that's where it ends. I don't think it says you
> can do everything with that,

The problem I've seen is with people, usually dyed-in-the-wool
engineers who have never written a line of code, who don't see why you
*can't* do everything with that.

> or that the time T of individual small
> units will be acceptable, because in the end to get it working n times
> faster than a single unit, you need to parallelise processing among
> n^2 units.

Yup, that's the point.

> > (Well, OK, each server overlaps disk IO to run other threads in the IO
> > wait time, but the code the majority of programmers wrote was straight
> > sequential.)
>
> > Think how hard that would be to do if each individual web request had
> > to somehow be split up into much smaller grains that ran in parallel.
>
> I don't think that's the intention. On the other hand if you had 100
> servers processing 1000 requests a second, would you rather use that
> or 10 servers that work 10 times faster?

That's kind of the point I was making, but it's interesting to note
that actually, you don't have that choice any more. There is no single
processor server that goes 10 times faster. The big iron (like IBM
Power series) gets a bit more MHz thanks to exotic cooling, but mainly
it's a massive (>64-way), fairly-flat, SMP. So you still parallelize,
just using shared memory instead of multiple systems. And guess what
-- lots of the standard infrastructure code doesn't support 64-way
SMP.

> With 100 servers you potentially get 400GB of RAM alone, whereas with
> 10 servers on the same architecture you would get 40GB RAM.

I'd quibble with that. If it's seriously X times faster, it better
have something near X times more RAM or the added speed isn't useful.

> So the
> question is not only about CPU speed.

I won't argue with that. Lots of factors. IO rate, too.

> With parallelism you get an
> option to use local data store where there is no contention. But you
> also get a headache of updating the state across all of them, meaning
> slower writes.

By local data store, do you mean RAM shared among processors? Better
watch for bottlenecks there, too, in bandwidth, and contention
definitely is an issue. If you have a hot lock, it's a hot lock
whether it's in RAM or in an outboard DB. (With the outboard DB,
you'll feel it sooner, but it's still there.)

> > The scaling of most commercial clouds absolutely lives on
> > relationships like this. It's why most HPC code is so much harder to
> > parallelize: Most of it does not have lots of independent requests.
>
> That's true, no doubt.
>
> > There is a long-running repeating pattern in parallel processing where
> > people propose a gazillion ants to run in parallel to get absolutely
> > huge performance. On the other hand, Seymour Cray once said "I'd
> > rather plow a field with four strong horses than with 512 chickens."
> > It all depends on the application, of course; there are certainly
> > cases that work well with 512 chickens, or 10E10 bacteria, or
> > whatever. But most cases work better with he horses.
>
> But we don't eat with ladles, we still eat with spoons :-) even if it
> requires more trips.

Soup spoons!

> Also, if you were to carry shopping of 100 people for 1 mile, you
> could do 100 trips, or you could use a train of shopping trolleys and
> just walk home.
>
> As you said, it all depends on the job. And it looks like Google found
> that using chickens is enough for the field they are about to plough.

Yes, but I'd add that those chickens have been fed steroids. Google
couldn't have done it in the mid-80s. Microprocessors weren't fast
enough. Now they're the Chickens That Ate Chicago.

--
Greg Pfister

Sassa NF

Jun 29, 2008, 1:37:58 PM
to cloud-c...@googlegroups.com
It doesn't look like this flame will hit the 400+ mark as I already
agree with your points :-)


2008/6/28 Greg Pfister <greg.p...@gmail.com>:
...


>> Why a simple dispatcher in front of a pile of commodity servers
>> instead of a single mega-server?
>
> Because the pile costs a whole lot less. Compared with, say, a
> mainframe, something like 100X less per unit of performance. Maybe
> it's not less counting continuing management costs, but lots of people
> have a hard time seeing past a 100X initial purchase price delta to
> TCO over many years.

sure, that's what I thought too :-)

...


>> With 100 servers you potentially get 400GB of RAM alone, whereas with
>> 10 servers on the same architecture you would get 40GB RAM.
>
> I'd quibble with that. If it's seriously X times faster, it better
> have something near X times more RAM or the added speed isn't useful.

Well, in general I also think so, but in the special case of serving
multitudes of web requests it is not obviously important, because
serving 100 requests sequentially doesn't require 100 times more
memory than for one request.


>> So the
>> question is not only about CPU speed.
>
> I won't argue with that. Lots of factors. IO rate, too.

Yes


>> With parallelism you get an
>> option to use local data store where there is no contention. But you
>> also get a headache of updating the state across all of them, meaning
>> slower writes.
>
> By local data store, do you mean RAM shared among processors? Better
> watch for bottlenecks there, too, in bandwidth, and contention
> definitely is an issue. If you have a hot lock, it's a hot lock
> whether it's in RAM or in an outboard DB. (With the outboard DB,
> you'll feel it sooner, but it's still there.)

Well, I meant a theoretical local data store. On a multi-processor system it is the processor cache. On a multi-computer system it is the RAM of an individual computer. There is no contention in the local data store, because it is local. The contention does appear when you want to copy data between the local data store and the remote store (shared file/memory/DB).


>> > The scaling of most commercial clouds absolutely lives on
>> > relationships like this. It's why most HPC code is so much harder to
>> > parallelize: Most of it does not have lots of independent requests.
>>
>> That's true, no doubt.
>>
>> > There is a long-running repeating pattern in parallel processing where
>> > people propose a gazillion ants to run in parallel to get absolutely
>> > huge performance. On the other hand, Seymour Cray once said "I'd
>> > rather plow a field with four strong horses than with 512 chickens."
>> > It all depends on the application, of course; there are certainly
>> > cases that work well with 512 chickens, or 10E10 bacteria, or
>> > whatever. But most cases work better with he horses.
>>
>> But we don't eat with ladles, we still eat with spoons :-) even if it
>> requires more trips.
>
> Soup spoons!

I was building the analogy of a client computer running a browser - so to speak, a small appetite calls for a little spoon travelling many times. That's why web requests are better served by 512 chickens than by 4 horses.


>> Also, if you were to carry shopping of 100 people for 1 mile, you
>> could do 100 trips, or you could use a train of shopping trolleys and
>> just walk home.
>>
>> As you said, it all depends on the job. And it looks like Google found
>> that using chickens is enough for the field they are about to plough.
>
> Yes, but I'd add that those chickens have been fed steroids. Google
> couldn't have done it in the mid-80s. Microprocessors weren't fast
> enough. Now they're the Chickens That Ate Chicago.

I thought mid-80s were ants :-) and now we have chickens.

Sassa


> --
> Greg Pfister

Stephen Voorhees

Jun 29, 2008, 6:20:34 PM
to cloud-c...@googlegroups.com


From: cloud-c...@googlegroups.com
To: cloud-c...@googlegroups.com
Sent: Sat Jun 28 09:08:38 2008


Subject: Re: More with Moore, more or less


Hall, Jacob

Jun 30, 2008, 1:40:06 PM
to cloud-c...@googlegroups.com
I would prefer fewer servers. Once we virtualize everything, the economics of buying equipment will change. We'll almost always buy the faster technology (high clock rates, faster RAM, faster disks, etc.) because it will provide better performance per watt versus the alternatives. After all, whether you buy a 3GHz or 2GHz processor in a given machine model, the power supply often reserves the same amount of power. Our goal with fabric-based or cloud computing is to optimize the allocation of power.

Density Dynamics (http://www.densitydynamics.com) offers NV-DRAM or volatile DRAM in a 3.5" package. I'm betting that the combination of a high-speed, low-latency interconnect and commodity deployments of compute devices which contain only CPU, RAM, and Unified I/O is the way to go.

Thoughts?

Virtualized Everything Presentation - How to Design to Save Automatically
http://www.unifiedfabric.com/imworld

________________________________

From: cloud-c...@googlegroups.com on behalf of Stephen Voorhees
Sent: Sun 6/29/2008 6:20 PM
To: 'cloud-c...@googlegroups.com'
Subject: Re: More with Moore, more or less


timothy norman huber

Jun 30, 2008, 4:25:23 PM
to cloud-c...@googlegroups.com
Jacob,

what do you think of the following server config comparisons?

1) 4P-128GB vs. 2P-128GB
2) 4P-128GB vs. 4P-256GB

In your opinion, what's optimal for a server config: number of procs, memory size, chipset (PCIe lanes), etc.?

Timothy

tim....@metaram.com
cell 310 795.6599

MetaRAM Inc.

Wes Felter

Jun 30, 2008, 4:35:54 PM
to cloud-c...@googlegroups.com
Hall, Jacob wrote:
> I would prefer fewer servers. Once we virtualize everything, the economics of buying equipment will change. We'll almost always buy the faster technology (high clock rates, faster RAM, faster disks, etc.) because it will provide better performance per watt versus the alternatives. After all, whether you buy a 3GHz or 2GHz processor in a given machine model, the power supply often reserves the same amount of power. Our goal with fabric-based or cloud computing is to optimize the allocation of power.

A rule of thumb I noticed is 2x the sockets for roughly 4x the price, so
your power needs to be quite expensive to justify using large servers.
Also, it's questionable that larger servers provide better perf/W;
SPECpower results so far have shown that 2S servers use slightly more
than double the power of 1S servers; I wouldn't be surprised if the
trend continues at 4S and larger. This is not to mention SMP scalability
which is never perfect. OTOH, larger servers may benefit more from
statistical multiplexing.

I agree that power accounting is essential but I think the solution is
better accounting, not workarounds. You just can't afford to reserve the
same power capacity for things that are different. We need rack power
capping to drive those branch circuits to 80% utilization.

Wes Felter - wes...@felter.org

Hall, Jacob

Jun 30, 2008, 5:07:06 PM
to cloud-c...@googlegroups.com
* Apologies for the quick message if you find errors. Below are some thoughts.

The configs for end point devices will vary based on workload, but generally it's nice for the endpoints to remain consistent. The closer RAM is to the compute, the faster the software will perform (particularly when RDMA is enabled), so large amounts of RAM have benefits... this said... I wouldn't necessarily want 256GB at each end point just to ensure my workloads don't overflow (primarily because 256GB would be powered up all the time and is fairly inefficient based on today's motherboard designs). * Motherboards with larger RAM support are more expensive. Larger machines today also typically reserve larger power supplies, which means more standby energy waste and more reserved power than is typically used.

No real opinion on the size of RAM, but compute devices should only contain what will be used, and the power supply should be sized as such. Not sure about you guys... but reserving 1300 Watts * 2 for each 4P server is a little overkill when less than 60% of the power is actually used.


Fabric Memory (which is scalable like a storage array using 3.5" disks) is more ideal because it can more readily be repurposed and shared. By repurposed, I mean it could look like a LUN, JMS, JDBC/ODBC, etc... or act as an RDMA cache for NFS / SAN. If the purpose is to have lots of RAM, deploying RAM in a server is about the most inefficient way to do so (video, HD controller, lots of PCI-Express slots, larger power supplies, etc... just to serve RAM).

* Attached is the high level concept for a Processor Array.

Processor Array - Basic Idea v1.3.ppt

timothy norman huber

Jun 30, 2008, 6:18:35 PM
to cloud-c...@googlegroups.com
Jacob,

*no errors, only seeking clarifications

How extensively is RDMA deployed in your applications now? How latency-sensitive are your cloud apps?

> If the purpose is to have lots of RAM,
> deploying RAM in a server is about the most inefficient way to do so
> (video, hd controller, lots of PCI-Express slots, larger power
> supplies,
> etc... just to serve RAM).

What would you propose as a more efficient mechanism for deploying lots of RAM than sticking it directly in a server?

regards,

Timothy

> <Processor Array - Basic Idea v1.3.ppt>

timothy norman huber
timoth...@mac.com
310 795-6599

Geoffrey Fox

Jun 30, 2008, 7:22:09 PM
to cloud-c...@googlegroups.com

Paul Harrington

Jun 30, 2008, 9:15:19 PM
to cloud-c...@googlegroups.com, Paul Harrington
Any opinions out there on leveraging iSER and STGT (openfabrics.org) as cloud computing tiered storage?

Realize this is an open-ended question... more of a storage fishing expedition... anyone successfully fishing in the iSER/STGT pond?

Sniper Ninja

Jul 1, 2008, 9:32:19 AM
to cloud-c...@googlegroups.com
>> So the
>> question is not only about CPU speed.
>
> I won't argue with that. Lots of factors. IO rate, too.

That's right. We should also not forget that 80% of the assembly code of any application is MOV instructions, i.e. instructions to move data between internal registers and memory. I think it's a very important fact that ought not to be neglected.

     Benmerar.T.Z


Hall, Jacob

Jul 1, 2008, 8:22:18 AM