More with Moore, more or less


Reuven Cohen

Jun 22, 2008, 7:47:54 PM
to cloud-computing
Recently I've been asked about the benefits of cloud computing
compared to those of virtualization. Generally my answer has been
that they are an ideal match. For the most part virtualization has
been about doing more with less (consolidation). VMware in particular
positioned their products and pricing in a way that encourages you to
use the fewest servers possible. The interesting thing about cloud
computing, in contrast, is that it's about doing more with more. Or,
if you're Intel, doing more with Moore.

At its core, Intel is a company driven by one singular mantra:
"Moore's Law". According to Wikipedia, Moore's Law describes an
important trend in the history of computer hardware: the number
of transistors that can be inexpensively placed on an integrated
circuit increases exponentially, doubling approximately every two
years. The observation was first made by Intel co-founder Gordon E.
Moore in a 1965 paper.

Over the last couple of years we have been working very closely with
Intel, specifically in the area of virtualization. During this time
we have learned a lot about how they think and what drives them as an
organization. In one of my early pitches we described our approach to
virtualization as "doing more with Moore", a play on the common
phrase "doing more with less" combined with some of the ideas behind
Moore's Law, which is all about growth and greater efficiency. They
loved the idea: for the first time someone was looking at
virtualization not purely as a way to consolidate a data center but as
a way to scale your overall capacity more effectively.

What is interesting about Moore's Law in regard to cloud computing is
that it is no longer just about how many transistors you can get on a
single CPU, but more about how effectively you spread your compute
capacity across more than one CPU, be it multi-core chips or
hundreds, even thousands, of connected servers. Historically, the
faster the CPU gets, the more demanding the applications built for it
become. I am curious whether we're on the verge of seeing a similar
"Moore's Law" applied to the cloud. And if so, will it follow the same
principles? Will we start to see a "Ruv's Law" where the amount of
cloud capacity doubles every 18 months, or will we reach a point
where there is never enough excess capacity to meet the demand?

(Original Post:
http://elasticvapor.com/2008/06/more-with-moore-more-or-less.html)
--

Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.

blog > www.elasticvapor.com

Ian Rae

Jun 22, 2008, 8:11:46 PM
to cloud-c...@googlegroups.com
I think to become a law it needs to stand up to many tests; call it a
hypothesis for now?

I see the cloud as more of an ecosystem, which is likely to defy any
simple, linear rule due to emergent properties. I wouldn't be shocked
if it were as hard to predict as the weather ("5 day forecast calls for
a 200 msec standard deviation in latency with 10% probability of the
jitters"). I'm only slightly joking - my early experiences with sharing
hosted grid computing resources were pretty blustery.

On that note I am much more interested in the phase transitions, critical
junctures where the properties change.

Ian @
http://infreemation.net


Ray Nugent

Jun 22, 2008, 10:32:35 PM
to cloud-c...@googlegroups.com
More likely you'll see "Ray's Law" (sorry, couldn't help myself ;-), where the amount of computing one can do on a given piece of the cloud increases exponentially while the cost for said piece drops by half...

Ray

Paco NATHAN

Jun 23, 2008, 12:13:15 AM
to cloud-c...@googlegroups.com
Seems like there is some raw data available about trend lines for
cloud computing.

Jeff Dean showed some interesting numbers about cluster sizes and job
throughput in this talk:
http://sites.google.com/site/io/underneath-the-covers-at-google-current-systems-and-future-directions

Maybe add a little historical price data for commodity hardware...

Paco


On Sun, Jun 22, 2008 at 9:32 PM, Ray Nugent <rnu...@yahoo.com> wrote:
> More likely you see "Ray's Law" (sorry, couldn't help myself ;-) where the
> amount of computing one can do on a given piece of the cloud increases
> exponentially while the cost for said piece drops by half...
>
> Ray
>
> ----- Original Message ----
> From: Reuven Cohen <r...@enomaly.com>
...
> At Intel's core, they are a company driven by one singular mantra,
> "Moore's Law". According to wikipedia, Moore's law describes an
> important trend in the history of computer hardware: that the number
> of transistors that can be inexpensively placed on an integrated
> circuit is increasing exponentially, doubling approximately every two
> years. The observation was first made by Intel co-founder Gordon E.
> Moore in a 1965 paper.

...

Theodore Omtzigt

Jun 23, 2008, 9:31:46 AM
to cloud-c...@googlegroups.com
Paco:

That is an absolutely brilliant link!

Attributes                        Aug '04   Mar '05   Mar '06   Sep '07
Number of jobs                    29K       72K       171K      2,217K
Average completion time (secs)    634       934       874       395
Machine years used                217       981       2,002     11,081
Input data read (TB)              3,288     12,571    52,254    403,152
Intermediate data (TB)            758       2,756     6,743     34,774
Output data written (TB)          193       941       2,970     14,018
Average worker machines           157       232       268       394

2M jobs per month, 400 PB read, and an average of roughly 400 worker
machines per job. That in a nutshell is why Yahoo and MSN are being
run out of town. That is using computing to add value to your
business. Google has to be the poster child of "Competing on
Analytics".

Of course, it takes serious money to get to that scale and in some way you
can think of this IT scale as the new business reality of distribution
power. Whereas big companies used to own the market because they were able
to distribute physical widgets, now big companies own the market because
they can aggregate information.

Does anyone have practical experience with the actual computes that
take place and how much they would cost using EC2 instances? I would
be interested to hear what the top 5 algorithms run at Google are,
and what the attributes listed above look like for those algorithms.
If we knew that, we could see how the economics map onto other cloud
models (IBM, Sun, AWS).

Theo

Khazret Sapenov

Jun 23, 2008, 9:48:32 AM
to cloud-c...@googlegroups.com
I agree with Theodore, pretty interesting presentation. Part of Dean's story is devoted to the anatomy of a Google query, claiming that 1000s of machines are involved. I wonder how many watts one Google query like "restaurants new york" actually burns, and whether there's any intention to optimize the green side of the infrastructure.
 
cheers,
Khazret Sapenov

Theodore Omtzigt

Jun 23, 2008, 10:07:09 AM
to cloud-c...@googlegroups.com
Khazret:
 
  We should be able to do this from first principles: 0.25s, and let's say they use 4000 servers. I think the size of the index server banks is large since they benefit from having the whole index in memory. Whole index in memory + redundancy + 1000s of queries per second indicates a large cluster for this (I was trying to dig up info on the size of the index but can't find it right now, so it is a bit of a shot from the hip).
 
0.25s on 4000 servers
1 index server is 400 watts, given that these have large memory configs and fastish CPUs
assume the network takes about 1/4 of the power of the cluster it serves
 
so the query consumes on the order of 500 kilojoules: 4000 servers * 400 W * 0.25 s = 400 kJ, plus 25% for the network gives 500 kJ. Let's see if we can put this in perspective: a hair dryer is about 1500 watts, so the query takes as much energy as running the hair dryer for roughly 5.5 minutes (500 kJ / 1500 W is about 333 s). Please double check my math.
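
For anyone who wants to tweak the assumptions, here is the same arithmetic as a quick Python sketch (every constant is my guess from above, not a measured value):

  servers = 4000            # index servers touched per query (assumption)
  watts_per_server = 400.0  # large-memory, fastish-CPU boxes (assumption)
  seconds = 0.25            # query latency
  network_factor = 1.25     # network adds ~1/4 of cluster power (assumption)

  energy_j = servers * watts_per_server * seconds * network_factor
  print(energy_j / 1000.0, "kJ")              # -> 500.0
  print(energy_j / 1500.0 / 60.0, "minutes")  # -> ~5.6 hair-dryer minutes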
 
Theo


From: cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On Behalf Of Khazret Sapenov
Sent: Monday, June 23, 2008 9:49 AM
To: cloud-c...@googlegroups.com
Subject: Re: Top 5 computes at Google?

Reuven Cohen

Jun 23, 2008, 2:46:02 PM
to cloud-c...@googlegroups.com
Paco, thanks for the link. That presentation is very informative; we
should invite Jeff@google to join our group!

I actually just read a Wired article that touched on similar ideas. (I
was hoping to get quoted in the article but it was not to be;
interesting nonetheless.)

http://www.wired.com/science/discoveries/magazine/16-07/pb_intro

reuven

--

Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.

www.enomaly.com :: 416 848 6036 x 1
skype: ruv.net // aol: ruv6

blog > www.elasticvapor.com
-
Get Linked in> http://linkedin.com/pub/0/b72/7b4

Khazret Sapenov

Jun 23, 2008, 8:56:03 PM
to cloud-c...@googlegroups.com
On Mon, Jun 23, 2008 at 10:07 AM, Theodore Omtzigt <th...@stillwater-sc.com> wrote:
Khazret:
 
  We should be able to do this from first principles: 0.25s on 4000 servers ... so the query consumes on the order of 500 kilojoules. ... Please double check my math.
 
Theo
 
If we take your numbers as a starting point, then extrapolation gives us the following energy consumption for September 2007:
 
500 kilojoules is about 138.89 watt-hours, so 2,217,000 jobs/month gives about 307.9 megawatt-hours,
 
which works out to about 205,000 hours of pretty American women standing in front of the mirror with a hair dryer. :)
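
Spelled out in Python (carrying over the assumed 500 kJ per query from above, and treating every MapReduce job as one such query):

  kj_per_job = 500.0                      # assumed, from Theo's estimate
  jobs_per_month = 2217000                # Sep '07 figure from Dean's talk
  wh_per_job = kj_per_job * 1000 / 3600   # 500 kJ ~= 138.89 Wh
  mwh_per_month = wh_per_job * jobs_per_month / 1e6
  hair_dryer_hours = mwh_per_month * 1e6 / 1500
  print(round(wh_per_job, 2), round(mwh_per_month, 2), round(hair_dryer_hours))
  # -> 138.89 Wh/job, 307.92 MWh/month, ~205278 hair-dryer hours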

Greg Pfister

Jun 23, 2008, 9:00:52 PM
to Cloud Computing
Reuven,

I understand what you are saying, but I fear I must respectfully
disagree. In fact, I think you have it precisely backwards -- unless,
perhaps, you are specifically talking about Enomaly, as opposed to
cloud computing in general.

Moore's Law is about larger numbers of transistors per device, per
system. Cloud computing, like clusters, farms, and grids, is about
using multiple systems. A key example is Google. As Paco's excellent
reference to "behind the scenes" at Google pointed out, Google is all
about using low-end systems. LOTS of low-end systems. Not the high-end
systems that are at any time the current best expression of Moore's
Law. "Single machine performance is not interesting" (p. 4). I've seen
no indication they're the slightest bit interested in virtualization.

While cloud computing can *use* virtualization to pack more separate
systems into a smaller volume (with lower power, etc.), it is
virtualization alone that exploits more processors per chip, more bits
per RAM DIP, etc.

Enomaly's implementation may tame the management of virtualization,
making it simpler to use. But taming, or eliminating, or outsourcing
management of many systems is a significant part of cloud computing --
whether those systems are virtual or not.

(Postscript -- other issues: (1) It seems to me that you are
implicitly equating "number of transistors" and "performance." There
used to be a direct relationship between them, but no longer. Big
subject. (2) Intel is far from the only company "driven" by Moore's
Law. Everybody in the computing industry has been, and is.
Clusterers/farmers/cloudies, however, are far less so because of the
multi-system aspect.)

--
Greg Pfister

Reuven Cohen

Jun 23, 2008, 9:35:29 PM
to cloud-c...@googlegroups.com
I guess what I was trying to say was that, as with Moore's Law, there
is an ever-increasing amount of computing capacity needed within
enterprises. I was curious whether this demand for compute capacity
would increase according to a formula similar to Moore's Law. Whether
it's a high- or low-end system, virtualized or PXE-booted, is not
important. The amount of aggregate cloud capacity is the important
metric I was attempting to examine. As for Enomaly, of course I have a
biased point of view; I started the company.

reuven


Geoffrey Fox

Jun 23, 2008, 9:40:16 PM
to cloud-c...@googlegroups.com
On transistors and performance, I note a post I did in May when we discussed "green computing". Namely:
"I often used a slide deck from Peter Kogge from 10 years ago plotting
performance per transistor versus number of transistors. Performance
per transistor peaked at around a million transistors.
See http://www.old-npac.org/users/gcf/PPTKoggepimtalk/seporgimagedir/037.jpg
As chips went from one million to one hundred million transistors, the architecture
improvements did get a factor of 10 or so in performance, but not linear
in the number of transistors."

So for years Intel tried its hardest to support sequential computing (as parallel computing was/is considered too hard) but in a way that was highly inefficient; once you accept parallelism -- as Google does -- you can move back to simpler, slower chips but with better performance per hairdryer-hour, to use the unit introduced earlier.
-- 
:
: Geoffrey Fox  g...@indiana.edu FAX 8128567972 http://www.infomall.org
: Phones Cell 812-219-4643 Home 8123239196 Lab 8128567977
: SkypeIn 812-669-0772 with voicemail, International cell 8123910207

Sassa NF

Jun 27, 2008, 6:58:53 AM
to cloud-c...@googlegroups.com
There is a curious law about chip size and performance.

To solve a problem of complexity N in time T, the chip area A design
should satisfy a relationship A*T^2 >= c*f(N), where c is a constant,
and f(N) is a function of the problem complexity.

The curious thing here is that if you have a chip of size n^2*A that
solves the problem in time T/n, the efficiency of the single-unit
design is A*T^2. The same efficiency can be achieved by making n^2
units of area A solving the problem in time T (i.e. n times slower
than the large unit). However, the system of n^2 units is capable of
splitting the work into n^2 pieces, perhaps ending up finishing the
whole job in T/n^2, i.e. n times faster than the large unit... Get it?
:-)

I wonder if the same goes for grids/clouds.
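
A toy version of the arithmetic in Python, under the (strong) assumption that the work splits into n^2 perfectly independent pieces:

  A, T = 1.0, 8.0  # area and time of one small unit on the whole problem
  n = 4

  big_area, big_time = n**2 * A, T / n        # one large, fast unit
  farm_area, farm_time = n**2 * A, T / n**2   # n^2 small units, perfect split

  print(big_area * big_time**2)   # n^2*A * (T/n)^2 = A*T^2 = 64.0
  print(big_area == farm_area)    # True: same total silicon
  print(big_time / farm_time)     # 4.0: the farm finishes n times faster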


Sassa


Khazret Sapenov

Jun 27, 2008, 9:50:52 AM
to cloud-c...@googlegroups.com
On Fri, Jun 27, 2008 at 6:58 AM, Sassa NF <sassa.nf@gmail.com> wrote:

There is a curious law about chip size and performance.

To solve a problem of complexity N in time T, the chip area A design
should satisfy a relationship A*T^2 >= c*f(N), where c is a constant,
and f(N) is a function of the problem complexity.

...

I wonder if the same goes for grids/clouds.
 
In my opinion, a general approach to this class of problems would be to solve a discrete optimization problem: achieve the best outcome (maximization/minimization) for specified requirements/constraints, represented as linear equations.
 
It might be expressed in canonical form as:
 
Maximize/minimize c^T x        (1)
subject to Ax <= b             (2)
 
where x is the vector of variables, while c and b are vectors of known coefficients and A is a known matrix of coefficients. (1) is the objective function; the equations (2) are the constraints.
 
For example, in the context of cloud/grid, here is what I have encountered in my work:
 
- optimal fit between cloud resources and workloads for known constraints (bandwidth, disk space, power, total cost of ownership, consolidation ratio, etc.);
 
- power and cooling costs for green computing initiatives.
 
As a product of the optimization, we receive a specific plan (e.g. a consolidation plan) that allows us to minimize resource contention and maximize utilization of the cloud infrastructure.
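
As a minimal sketch of that canonical form, here is a made-up consolidation model in Python with SciPy (the watt, core, and RAM figures are invented for illustration; a real consolidation plan would need integer variables, but the shape of the model is the same):

  # Choose how many small/large servers to power on so that a workload's
  # core and RAM demands are met at minimum wattage.
  from scipy.optimize import linprog

  c = [300, 500]       # objective: watts per small / large server
  A_ub = [[-4, -16],   # cores: 4*x1 + 16*x2 >= 64   (negated for <=)
          [-8, -64]]   # RAM GB: 8*x1 + 64*x2 >= 256 (negated for <=)
  b_ub = [-64, -256]
  res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
  print(res.x, res.fun)  # -> [0. 4.] 2000.0: four large servers, 2000 W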
 
cheers,
--
Khaz Sapenov,
Director of Research & Development
Enomaly Labs

US Phone: 212-461-4988 x5
Canada Phone: 416-848-6036 x5
E-mail: kh...@enomaly.com
Get Linked in> http://www.linkedin.com/in/sapenov  

Paco NATHAN

Jun 27, 2008, 7:47:53 PM
to cloud-c...@googlegroups.com
Be careful what you optimize for...

I think those two posts are on the right track, in terms of
optimization problems. The constraints in that system of equations
become interesting -- and in practice go beyond what O(n)
complexity generally considers.

Case in point:

1. Suppose my business has a mission-critical system, which requires 4
servers. That system completes some specific processing in nearly 24
hrs, and it must run each day. Some people would look at that and
pronounce it "Close to full utilization - good." An LP model might
agree, given a superficial set of constraints and cost function.

2. Suppose further that it would be possible to move this same
processing off to some other grid, and run it on 48 servers + 2
controllers (same size servers) within approximately 2 hours. Roughly
speaking, linear scaling. Slightly more expensive, due to the added
controllers.


In practice, however, one must consider other factors beyond the
complexity of the problem "N", the number of nodes "n", and the time
"T" required:

* overall capex vs. opex for the server resources -- particularly
if a large number of servers can be leased at an hourly rate, thereby
reducing capex.

* cost of moving the data -- both the time required and the network cost.

* operational risk in Approach 1 if one of those four servers goes
belly up -- versus the fault tolerance benefits of a cloud-based
framework which can accommodate node failures.

* operational risk in Approach 1 if some aspect of the data or code
involved is incorrect, and therefore "full utilization" implies the
job cannot be rerun within the 24 hour requirement.

* opportunity cost if the problem scales up suddenly, such that
Approach 1 requires either more than 4 servers or more than 24 hours
to run.

Considering those trade-offs in the constraints and cost function
allows an optimization problem to solve for minimized risk.

Managing risk is what will convince the decision makers at enterprise
businesses to adopt cloud computing for mission critical systems. But
not O(N) complexity :)
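
To make that concrete, here is a hypothetical comparison of the two approaches with invented rates (hourly lease price, and a probability that a run must be redone):

  hourly_rate = 0.40  # $/server-hour for leased capacity (assumption)
  p_rerun = 0.05      # probability a run must be redone (assumption)

  for name, servers, hours in [("Approach 1", 4, 24), ("Approach 2", 50, 2)]:
      wall = hours * (1 + p_rerun)  # expected wall-clock, including reruns
      cost = servers * hours * hourly_rate * (1 + p_rerun)
      print(name, round(wall, 1), "h,", round(cost, 2), "USD,",
            "fits the daily window:", wall <= 24)
  # Approach 1: 25.2 h, 40.32 USD -- blows the 24 hr window in expectation.
  # Approach 2:  2.1 h, 42.0 USD -- slightly pricier, absorbs a rerun easily.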

Hope that helps -

Paco

--
Manager, Behavioral Targeting
Adknowledge, Inc.

pna...@adknowledge.com
pac...@cs.stanford.edu



Greg Pfister

Jun 27, 2008, 9:35:23 PM
to Cloud Computing
Right, and power is proportional to frequency cubed, so more, slower
processors always win in power & heat terms. So we should all use 10
million 0.01 MIPS processors and save the planet.
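
The arithmetic behind that quip, taking the cubed relation at face value (only the ratio matters):

  # One core at frequency f vs. n cores each at f/n: same aggregate
  # cycles per second, assuming power ~ f**3 and perfectly parallel work.
  f, n = 3.0e9, 10
  print((n * (f / n)**3) / f**3)  # -> 0.01: in theory, 100x less power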

All such statements are totally, undeniably, true. (Maybe not saving
the planet all by ourselves.)

They also ignore natural functional processing boundaries, how
difficult it is to parallelize code into units smaller than those
boundaries, and how simple it can be when the processing matches them.

Example: When commodity micros got fast enough that each could handle
an entire web request all by itself at reasonable response time, what
happened? The web, that's what. Server farms. Stick a simple
dispatcher in front of a pile of commodity servers with a shared file
system, and away you go.

(Well, OK, each server overlaps disk IO to run other threads in the IO
wait time, but the code the majority of programmers wrote was straight
sequential.)

Think how hard that would be to do if each individual web request had
to somehow be split up into much smaller grains that ran in parallel.

The scaling of most commercial clouds absolutely lives on
relationships like this. It's why most HPC code is so much harder to
parallelize: Most of it does not have lots of independent requests.

There is a long-running repeating pattern in parallel processing where
people propose a gazillion ants to run in parallel to get absolutely
huge performance. On the other hand, Seymour Cray once said "I'd
rather plow a field with four strong horses than with 512 chickens."
It all depends on the application, of course; there are certainly
cases that work well with 512 chickens, or 10E10 bacteria, or
whatever. But most cases work better with the horses.

--
Greg Pfister

On Jun 27, 5:58 am, "Sassa NF" <sassa...@gmail.com> wrote:
> There is a curious law about chip size and performance.
...

Sassa NF

Jun 28, 2008, 6:13:01 AM
to cloud-c...@googlegroups.com
What I thought was that people get 512 horses into a cloud, because
they don't have elephants.


Sassa


Sassa NF

Jun 28, 2008, 6:41:32 AM
to cloud-c...@googlegroups.com
2008/6/28 Greg Pfister <greg.p...@gmail.com>:

> Right, and power is proportional to frequency cubed, so more slower
> processors always wins in power & heat terms. So we should all use 10
> Million 0.01MIPS processors and save the planet.

That's an idea for Khaz ;-)


> All such statements are totally, undeniably, true. (Maybe not saving
> the planet all by ourselves.)
>
> They also ignore natural functional processing boundaries, and how
> difficult it is to parallelize code to units smaller than that, and
> how simple can be to do when the processing matches those boundaries.

That's true, and that's why the AMD guy was politely sarcastic about
planting more cores and CPUs on one board, as mentioned in The
Register article quoted here. The OS simply can't find enough jobs to
be done in parallel...


> Example: When commodity micros got fast enough that each could handle
> an entire web request all by itself at reasonable response time, what
> happened? The web, that's what. Server farms. Stick a simple
> dispatcher in front of a pile of commodity servers with a shared file
> system, and away you go.

Why a simple dispatcher in front of a pile of commodity servers
instead of a single mega-server?

I think the wisdom behind A*T^2 law is that it may be easier to build
smaller slow units (or buy slower computers) than faster units even if
they are larger. And that's where it ends. I don't think it says you
can do everything with that, or that the time T of individual small
units will be acceptable, because in the end to get it working n times
faster than a single unit, you need to parallelise processing among
n^2 units.


> (Well, OK, each server overlaps disk IO to run other threads in the IO
> wait time, but the code the majority of programmers wrote was straight
> sequential.)
>
> Think how hard that would be to do if each individual web request had
> to somehow be split up into much smaller grains that ran in parallel.

I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?

With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM. So the
question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.


> The scaling of most commercial clouds absolutely lives on
> relationships like this. It's why most HPC code is so much harder to
> parallelize: Most of it does not have lots of independent requests.

That's true, no doubt.


> There is a long-running repeating pattern in parallel processing where
> people propose a gazillion ants to run in parallel to get absolutely
> huge performance. On the other hand, Seymour Cray once said "I'd
> rather plow a field with four strong horses than with 512 chickens."
> It all depends on the application, of course; there are certainly
> cases that work well with 512 chickens, or 10E10 bacteria, or
> whatever. But most cases work better with the horses.

But we don't eat with ladles, we still eat with spoons :-) even if it
requires more trips.

Also, if you were to carry shopping of 100 people for 1 mile, you
could do 100 trips, or you could use a train of shopping trolleys and
just walk home.

As you said, it all depends on the job. And it looks like Google found
that using chickens is enough for the field they are about to plough.


Sassa

timothy norman huber

Jun 28, 2008, 12:08:38 PM
to cloud-c...@googlegroups.com
On Jun 28, 2008, at 3:41 AM, Sassa NF wrote:

I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?

With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM.

You've illustrated the memory gap.  Memory doubles in density every 30 months, compared to every 18 months for processors.

Fred Weber, former CTO of AMD (he spearheaded development of the Opteron), and Suresh Rajan, former Director of Marketing at nVidia, hatched MetaRAM to increase memory footprints utilizing existing memory buses and commodity DRAM parts.

So the question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.

For those that might not follow LKML, read Linux kernel hacker Daniel Phillips' paper "Ramback: Faster than a Speeding Bullet" <http://lwn.net/Articles/272534/> - the thread became a 400+ post flame war which ended when Alan Cox conceded Daniel was right.

Here's the intro:
Every little factor of 25 performance increase really helps.

Ramback is a new virtual device with the ability to back a ramdisk
by a real disk, obtaining the performance level of a ramdisk but with
the data durability of a hard disk.  To work this magic, ramback needs
a little help from a UPS.  In a typical test, ramback reduced a 25
second file operation[1] to under one second including sync.  Even
greater gains are possible for seek-intensive applications.

The difference between ramback and an ordinary ramdisk is: when the 
machine powers down the data does not vanish because it is continuously 
saved to backing store.  When line power returns, the backing store
repopulates the ramdisk while allowing application io to proceed 
concurrently.  Once fully populated, a little green light winks on and 
file operations once again run at ramdisk speed.
Daniel's code is rev 0.0, and is buggy like any new code, but the backbone is solid.  Motivated parties should foster development of this open-source gem.   Accelerating durable writes could dramatically accelerate cloud-based transactional systems.

Last bit: x64 PCIe chipsets are just now hitting the market. Big honking multi-processor servers loaded with RAM can now have multiple x4 IB or 10GbE NICs. IB is pretty cool, especially when you consider the RDMA capabilities.

T



Timothy Huber
Strategic Account Development


181 Metro Drive, Suite 400 
San Jose, CA 95110 


Greg Pfister

Jun 28, 2008, 3:30:46 PM
to Cloud Computing
On Jun 28, 5:41 am, "Sassa NF" <sassa...@gmail.com> wrote:
> 2008/6/28 Greg Pfister <greg.pfis...@gmail.com>:
>
> > Right, and power is proportional to frequency cubed, so more slower

Actually, now I'm not sure whether it's cubed or squared, but either
one works in this discussion.

> > processors always wins in power & heat terms. So we should all use 10
> > Million 0.01MIPS processors and save the planet.
>
> That's an idea for Khaz ;-)
>
> > All such statements are totally, undeniably, true. (Maybe not saving
> > the planet all by ourselves.)
>
> > They also ignore natural functional processing boundaries, and how
> > difficult it is to parallelize code to units smaller than that, and
> > how simple can be to do when the processing matches those boundaries.
>
> That's true, and that's why the AMD guy was politely sarcastic about
> planting more cores and CPUs on one board, as mentioned in The
> Register article quoted here. The OS simply can't find enough jobs to
> be done in parallel...
>
> > Example: When commodity micros got fast enough that each could handle
> > an entire web request all by itself at reasonable response time, what
> > happened? The web, that's what. Server farms. Stick a simple
> > dispatcher in front of a pile of commodity servers with a shared file
> > system, and away you go.
>
> Why a simple dispatcher in front of a pile of commodity servers
> instead of a single mega-server?

Because the pile costs a whole lot less. Compared with, say, a
mainframe, something like 100X less per unit of performance. Maybe
it's not less counting continuing management costs, but lots of people
have a hard time seeing past a 100X initial purchase price delta to
TCO over many years.

> I think the wisdom behind A*T^2 law is that it may be easier to build
> smaller slow units (or buy slower computers) than faster units even if
> they are larger. And that's where it ends. I don't think it says you
> can do everything with that,

The problem I've seen is with people, usually dyed-in-the-wool
engineers who have never written a line of code, who don't see why you
*can't* do everything with that.

> or that the time T of individual small
> units will be acceptable, because in the end to get it working n times
> faster than a single unit, you need to parallelise processing among
> n^2 units.

Yup, that's the point.

> > (Well, OK, each server overlaps disk IO to run other threads in the IO
> > wait time, but the code the majority of programmers wrote was straight
> > sequential.)
>
> > Think how hard that would be to do if each individual web request had
> > to somehow be split up into much smaller grains that ran in parallel.
>
> I don't think that's the intention. On the other hand if you had 100
> servers processing 1000 requests a second, would you rather use that
> or 10 servers that work 10 times faster?

That's kind of the point I was making, but it's interesting to note
that actually, you don't have that choice any more. There is no single
processor server that goes 10 times faster. The big iron (like IBM
Power series) gets a bit more MHz thanks to exotic cooling, but mainly
it's a massive (>64-way), fairly-flat, SMP. So you still parallelize,
just using shared memory instead of multiple systems. And guess what
-- lots of the standard infrastructure code doesn't support 64-way
SMP.

> With 100 servers you potentially get 400GB of RAM alone, whereas with
> 10 servers on the same architecture you would get 40GB RAM.

I'd quibble with that. If it's seriously X times faster, it better
have something near X times more RAM or the added speed isn't useful.

> So the
> question is not only about CPU speed.

I won't argue with that. Lots of factors. IO rate, too.

> With parallelism you get an
> option to use local data store where there is no contention. But you
> also get a headache of updating the state across all of them, meaning
> slower writes.

By local data store, do you mean RAM shared among processors? Better
watch for bottlenecks there, too, in bandwidth, and contention
definitely is an issue. If you have a hot lock, it's a hot lock
whether it's in RAM or in an outboard DB. (With the outboard DB,
you'll feel it sooner, but it's still there.)

> > The scaling of most commercial clouds absolutely lives on
> > relationships like this. It's why most HPC code is so much harder to
> > parallelize: Most of it does not have lots of independent requests.
>
> That's true, no doubt.
>
> > There is a long-running repeating pattern in parallel processing where
> > people propose a gazillion ants to run in parallel to get absolutely
> > huge performance. On the other hand, Seymour Cray once said "I'd
> > rather plow a field with four strong horses than with 512 chickens."
> > It all depends on the application, of course; there are certainly
> > cases that work well with 512 chickens, or 10E10 bacteria, or
> whatever. But most cases work better with the horses.
>
> But we don't eat with ladles, we still eat with spoons :-) even if it
> requires more trips.

Soup spoons!

> Also, if you were to carry shopping of 100 people for 1 mile, you
> could do 100 trips, or you could use a train of shopping trolleys and
> just walk home.
>
> As you said, it all depends on the job. And it looks like Google found
> that using chickens is enough for the field they are about to plough.

Yes, but I'd add that those chickens have been fed steroids. Google
couldn't have done it in the mid-80s. Microprocessors weren't fast
enough. Now they're the Chickens That Ate Chicago.

--
Greg Pfister

Sassa NF

Jun 29, 2008, 1:37:58 PM
to cloud-c...@googlegroups.com
It doesn't look like this flame will hit the 400+ mark as I already
agree with your points :-)


2008/6/28 Greg Pfister <greg.p...@gmail.com>:
...


>> Why a simple dispatcher in front of a pile of commodity servers
>> instead of a single mega-server?
>
> Because the pile costs a whole lot less. Compared with, say, a
> mainframe, something like 100X less per unit of performance. Maybe
> it's not less counting continuing management costs, but lots of people
> have a hard time seeing past a 100X initial purchase price delta to
> TCO over many years.

sure, that's what I thought too :-)

...


>> With 100 servers you potentially get 400GB of RAM alone, whereas with
>> 10 servers on the same architecture you would get 40GB RAM.
>
> I'd quibble with that. If it's seriously X times faster, it better
> have something near X times more RAM or the added speed isn't useful.

Well, in general I also think so, but in the special case of serving
multitudes of web requests it is not obviously important, because
serving 100 requests sequentially doesn't require 100 times more
memory than for one request.


>> So the
>> question is not only about CPU speed.
>
> I won't argue with that. Lots of factors. IO rate, too.

Yes


>> With parallelism you get an
>> option to use local data store where there is no contention. But you
>> also get a headache of updating the state across all of them, meaning
>> slower writes.
>
> By local data store, do you mean RAM shared among processors? Better
> watch for bottlenecks there, too, in bandwidth, and contention
> definitely is an issue. If you have a hot lock, it's a hot lock
> whether it's in RAM or in an outboard DB. (With the outboard DB,
> you'll feel it sooner, but it's still there.)

Well, I meant a theoretic local data store. On a multi-processor
system it is the processor cache. On a multi-computer system it is the
RAM of individual computer. There is no contention in the local data
store, because it is local. The contention does appear when you want
to copy the data from/to the local data store to/from the remote store
(shared file/memory/DB).


>> > The scaling of most commercial clouds absolutely lives on
>> > relationships like this. It's why most HPC code is so much harder to
>> > parallelize: Most of it does not have lots of independent requests.
>>
>> That's true, no doubt.
>>
>> > There is a long-running repeating pattern in parallel processing where
>> > people propose a gazillion ants to run in parallel to get absolutely
>> > huge performance. On the other hand, Seymour Cray once said "I'd
>> > rather plow a field with four strong horses than with 512 chickens."
>> > It all depends on the application, of course; there are certainly
>> > cases that work well with 512 chickens, or 10E10 bacteria, or
> > whatever. But most cases work better with the horses.
>>
>> But we don't eat with ladles, we still eat with spoons :-) even if it
>> requires more trips.
>
> Soup spoons!

I was building the analogy of a client computer running a browser - so
to speak, a small appetite requires a little spoon travelling many
times. That's how web requests may be better served by 512 chickens
than by 4 horses.


>> Also, if you were to carry shopping of 100 people for 1 mile, you
>> could do 100 trips, or you could use a train of shopping trolleys and
>> just walk home.
>>
>> As you said, it all depends on the job. And it looks like Google found
>> that using chickens is enough for the field they are about to plough.
>
> Yes, but I'd add that those chickens have been fed steroids. Google
> couldn't have done it in the mid-80s. Microprocessors weren't fast
> enough. Now they're the Chickens That Ate Chicago.

I thought mid-80s were ants :-) and now we have chickens.

Sassa


> --
> Greg Pfister
>
> >
>


Hall, Jacob

Jun 30, 2008, 1:40:06 PM
to cloud-c...@googlegroups.com
I would prefer fewer servers. Once we virtualize everything, the economics of buying equipment will change. We'll almost always buy the faster technology (higher clock rates, faster RAM, faster disks, etc.) because it will provide better performance per watt versus the alternatives. After all, whether you buy a 3GHz or 2GHz processor in a given machine model, the power supply often reserves the same amount of power. Our goal with fabric-based or cloud computing is to optimize the allocation of power.

Density Dynamics (http://www.densitydynamics.com) offers NV-DRAM or volatile DRAM in a 3.5" package. I'm betting that the combination of a high-speed, low-latency interconnect and commodity deployments of compute devices which contain only CPU, RAM, and unified I/O is the way to go.

Thoughts?

Virtualized Everything Presentation - How to Design to Save Automatically
http://www.unifiedfabric.com/imworld


timothy norman huber

Jun 30, 2008, 4:25:23 PM
to cloud-c...@googlegroups.com
Jacob,

what do you think of the following server config comparisons?

1) 4P-128GB vs. 2P-128GB
2) 4P-128GB vs. 4P-256GB

In your opinion, what's optimal for a server config- number of procs,
memory size, chipset (PCIe lanes), etc?

Timothy

tim....@metaram.com
cell 310 795.6599

MetaRAM Inc.

Wes Felter

Jun 30, 2008, 4:35:54 PM
to cloud-c...@googlegroups.com
Hall, Jacob wrote:
> I would prefer fewer servers. Once we virtualize everything, the economics of buying equipment will change. ...

A rule of thumb I've noticed is 2x the sockets for roughly 4x the price,
so your power needs to be quite expensive to justify using large servers.
Also, it's questionable whether larger servers provide better perf/W;
SPECpower results so far have shown that 2S servers use slightly more
than double the power of 1S servers, and I wouldn't be surprised if the
trend continues at 4S and larger. This is not to mention SMP scalability,
which is never perfect. OTOH, larger servers may benefit more from
statistical multiplexing.

I agree that power accounting is essential, but I think the solution is
better accounting, not workarounds. You just can't afford to reserve the
same power capacity for things that are different. We need rack power
capping to drive those branch circuits to 80% utilization.

Wes Felter - wes...@felter.org

Hall, Jacob

Jun 30, 2008, 5:07:06 PM
to cloud-c...@googlegroups.com
* Apologies for the quick message if you find errors. Below are some
thoughts.

The configs for endpoint devices will vary based on workload, but
generally it's nice for the endpoints to remain consistent. The closer
RAM is to the compute, the faster the software will perform
(particularly when RDMA is enabled), so large amounts of RAM have
benefits... this said... I wouldn't necessarily want 256GB at each
endpoint just to ensure my workloads don't overflow (primarily because
the 256GB would be powered up all the time, which is fairly inefficient
given today's motherboard designs). * Motherboards with larger RAM
support are more expensive. Larger machines today also typically
reserve larger power supplies, which means more standby energy waste,
and they reserve more power than is typically used.

No real opinion on the size of RAM, but compute devices should contain
only what will be used, and the power supply should be sized as such.
Not sure about you guys... but reserving 1300 watts * 2 for each 4P
server is a little overkill when less than 60% of the power is
actually used.


Fabric memory (which is scalable like a storage array using 3.5"
disks) is more ideal because it can more readily be repurposed and
shared. Repurposed, it could look like a LUN, JMS, JDBC/ODBC, etc., or
an RDMA cache for NFS / SAN. If the purpose is to have lots of RAM,
deploying RAM in a server is about the most inefficient way to do so
(video, HD controller, lots of PCI-Express slots, larger power
supplies, etc., just to serve RAM).

* Attached is the high level concept for a Processor Array.

Processor Array - Basic Idea v1.3.ppt

timothy norman huber

Jun 30, 2008, 6:18:35 PM
to cloud-c...@googlegroups.com
Jacob,

*no errors, only seeking clarifications

How extensively is RDMA deployed in your applications now? How
latency-sensitive are your cloud apps?

> If the purpose is to have lots of RAM,
> deploying RAM in a server is about the most inefficient way to do so
> (video, hd controller, lots of PCI-Express slots, larger power
> supplies,
> etc... just to serve RAM).

What would you propose as a more efficient mechanism for deploying lots
of RAM than sticking it directly in a server?

regards,

Timothy

> <Processor Array - Basic Idea v1.3.ppt>

timothy norman huber
timoth...@mac.com
310 795-6599

Geoffrey Fox

Jun 30, 2008, 7:22:09 PM
to cloud-c...@googlegroups.com

Paul Harrington

Jun 30, 2008, 9:15:19 PM
to cloud-c...@googlegroups.com, Paul Harrington
Any opinions out there on leveraging iSER and STGT (openfabrics.org) as cloud computing tiered storage?

Realize this is an open-ended question... and more of a storage fishing expedition... anyone successfully fishing in the iSER/STGT pond?

Sniper Ninja

Jul 1, 2008, 9:32:19 AM
to cloud-c...@googlegroups.com
>> So the
>> question is not only about CPU speed.
>
> I won't argue with that. Lots of factors. IO rate, too.

 That's right. We should also not forget that 80% of the assembly code of any application is MOV instructions, i.e. instructions to move data between internal registers and memory. I think it's a very important fact that ought not to be neglected.

     Benmerar.T.Z


Hall, Jacob

Jul 1, 2008, 8:22:18 AM
to cloud-c...@googlegroups.com, Clive G. Cook

My proposal for efficiently deploying lots of RAM is to package it in a
3.5" disk form factor, optimize its power use to deliver the best
possible bandwidth / latency with the lowest power consumption, and
layer on software which allows the RAM to be de-duplicated, thin
provisioned, and able to take on multiple personalities. Read-only
caching is the low-hanging fruit for fabric memory, as this optimizes
use of I/O, improves latency, and accelerates writes, as in the case
of a distributed storage array cache.

In my opinion, latency has an impact on everything which utilizes I/O.
Specifically, I/O wait and latency impacts the efficiency and power
consumption of processors.

Now that we are seeing Virtual Machines run on the Cloud to provide HA
by Default without lots of dedicated standby nodes and the ability to
provision first... there will be a growing dependency on I/O (and hence
low latency). RDMA lowers latency.

Check out RNA Networks (Clive Cook cc'd) in regards to an example and
efficient deployment of Fabric Memory with RDMA software support. Think
of fabric memory not as a replacement for native RAM, but more like what
we see in processors with L1, L2, and sometimes L3 cache. Fabric memory
is essentially an L4 cache.

The best use of RDMA is when you get RDMA for free: SDP, SRP, iSER, MS
Network Direct.

I am unable to share where we utilize RDMA today, but the above comments
summarize the general benefit.

Sassa NF

Jul 1, 2008, 10:55:12 AM
to cloud-c...@googlegroups.com
Oh, and perhaps 80% of the data copied with MOV are pointers :-)


Sassa


Greg Pfister

Jul 2, 2008, 1:41:08 AM
to Cloud Computing
On Jun 29, 12:37 pm, "Sassa NF" <sassa...@gmail.com> wrote:

> It doesn't look like this flame will hit the 400+ mark as I already
> agree with your points :-)

What?! WE WON'T EVEN GET TO THE ALL CAPS STAGE? oh, sorry. :-) But I
even agree with that.

<snip some agreement>

> >> With 100 servers you potentially get 400GB of RAM alone, whereas with
> >> 10 servers on the same architecture you would get 40GB RAM.
>
> > I'd quibble with that. If it's seriously X times faster, it better
> > have something near X times more RAM or the added speed isn't useful.
>
> Well, in general I also think so, but in the special case of serving
> multitudes of web requests it is not obviously important, because
> serving 100 requests sequentially doesn't require 100 times more
> memory than for one request.

Agreed, but only if those requests don't involve IO. Then the IO waits
mean X times as many items in flight, which must all reside in RAM to
maintain response time. Really simple RO web requests might be
satisfied out of a shared file cache, though, with no IO. Maybe a
little more space for the file cache, but not X times as much for that
case.

<snip again>
> >> With parallelism you get an
> >> option to use local data store where there is no contention. But you
> >> also get a headache of updating the state across all of them, meaning
> >> slower writes.
>
> > By local data store, do you mean RAM shared among processors? Better
> > watch for bottlenecks there, too, in bandwidth, and contention
> > definitely is an issue. If you have a hot lock, it's a hot lock
> > whether it's in RAM or in an outboard DB. (With the outboard DB,
> > you'll feel it sooner, but it's still there.)
>
> Well, I meant a theoretic local data store. On a multi-processor
> system it is the processor cache. On a multi-computer system it is the
> RAM of individual computer. There is no contention in the local data
> store, because it is local. The contention does appear when you want
> to copy the data from/to the local data store to/from the remote store
> (shared file/memory/DB).

I still don't see it.

On a multi-processor, there is a known effect that, with the right
program, of course, gives real superlinear speedup sometimes because
adding a processor adds cache: N processors have more memory (in
cache) to work with. But you don't have the same effect with a multi-
computer.

But maybe this is partly an issue of terminology. Say "memory
bandwidth" instead of "contention" and I find something to agree with.
It's hard (not impossible, just expensive) to build a multiprocessor
with memory bandwidth that scales well with the number of processors.
Front-side-bus-based memory starvation is a common issue, and AMD's
scheme is better only with very clever memory allocation. MP caches
don't have that issue, and MCs duplicate the whole memory system,
adding bandwidth in proportion to the number of systems. If those
kinds of memory bandwidth scaling effects are what you mean by
"contention," then, well, we're in boring agreement again.

<snipping more agreement>

> >> But we don't eat with ladles, we still eat with spoons :-) even if it
> >> requires more trips.
>
> > Soup spoons!
>
> I was building the analogy of a client computer running a browser - so
> to speak, small appetite requires a little spoon travelling many
> times. That's how the web requests better be served by 512 chickens
> than 4 horses.

"Better"? Cheaper, yes. But that has to do more with production
volumes than anything else, not anything to do with intrinsic size.
Commodity micros are good enough, and cheap. If soup spoons had the
same cost, we'd use them; less junk to manage.

> > Yes, but I'd add that those chickens have been fed steroids. Google
> > couldn't have done it in the mid-80s. Microprocessors weren't fast
> > enough. Now they're the Chickens That Ate Chicago.
>
> I thought mid-80s were ants :-) and now we have chickens.

Actually, a project of mine (RP3 at IBM Research) in the mid-80s
directly drew the "chickens" comment. So 80s micros are the true
chickens of historical context. There were true ants, too, much
smaller than "general purpose" micros.

Now, everything's a horse. Or everything's a chicken. Or an ant. There
aren't any significant performance differences any more (outside of
embedded computing). Except for special-purpose (arguably) systems
like GPGPUs or crypto engines or the like, and when you look closely
at those, you see... tiny ants. Piles of tiny ants.

--
Greg Pfister

Shane Sigler

Jul 2, 2008, 12:14:10 PM
to cloud-c...@googlegroups.com
SPECpower results are not necessarily a good indicator of reality, as
vendors will skew configurations (most of which are not necessarily
going to be useful) to get better numbers. The best way to test this is
to run your application load on 4 1S servers, 2 2S servers, and 1 4S
server and measure the power draw. There are of course a lot of
assumptions built into this, such as the PSU efficiency for each of
these systems being the same (which it often isn't) and the number of
I/O slots scaling with the number of CPUs (which they tend not to), etc.

Shane

Sassa NF

Jul 2, 2008, 5:20:15 PM
to cloud-c...@googlegroups.com
2008/7/2 Greg Pfister <greg.p...@gmail.com>:

> On Jun 29, 12:37 pm, "Sassa NF" <sassa...@gmail.com> wrote:
>> >> With parallelism you get an
>> >> option to use local data store where there is no contention. But you
>> >> also get a headache of updating the state across all of them, meaning
>> >> slower writes.
>>
>> > By local data store, do you mean RAM shared among processors? Better
>> > watch for bottlenecks there, too, in bandwidth, and contention
>> > definitely is an issue. If you have a hot lock, it's a hot lock
>> > whether it's in RAM or in an outboard DB. (With the outboard DB,
>> > you'll feel it sooner, but it's still there.)
>>
>> Well, I meant a theoretic local data store. On a multi-processor
>> system it is the processor cache. On a multi-computer system it is the
>> RAM of individual computer. There is no contention in the local data
>> store, because it is local. The contention does appear when you want
>> to copy the data from/to the local data store to/from the remote store
>> (shared file/memory/DB).
>
> I still don't see it.

Your "bandwidth" term describes part of what I meant. I meant that if
one component wants to read or modify the "master" copy, they'd need
to lock it for exclusive access. The others would need to wait to get
the lock, hence contention (if I am using the right term), and the
cause of contention is bandwidth.

So RAM of individual computers would be contention-free (on the scale
of the multi-computer system).
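
(A toy illustration of that contention, in Python -- several writers
all needing exclusive access to one "master" copy, so each one's wall
time includes waiting for everyone ahead of it. All names here are
made up:)

import threading
import time

master_copy = {"value": 0}
master_lock = threading.Lock()
waits = []  # per-thread time spent blocked on the lock

def writer(n_updates):
    waited = 0.0
    for _ in range(n_updates):
        t0 = time.perf_counter()
        with master_lock:                 # only one writer at a time
            waited += time.perf_counter() - t0
            master_copy["value"] += 1     # update the shared master copy
    waits.append(waited)

threads = [threading.Thread(target=writer, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"time spent waiting for the master lock: {sum(waits):.3f}s")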


> On a multi-processor, there is a known effect that, with the right
> program, of course, gives real superlinear speedup sometimes because
> adding a processor adds cache: N processors have more memory (in
> cache) to work with. But you don't have the same effect with a multi-
> computer.

I was wondering why you don't have the same effect with a
multi-computer? (And I thought that is how things could work much
faster on lots of cheap computers with loads of RAM.)

Say, hypothetically Google splits its knowledge into a tree or a
hierarchy, each node represented by a cheap computer with loads of
RAM. Searching the net would mean redirecting the request through a
few branches in the tree and the response would be provided at the
speed of search in RAM for frequently sought keywords. On the other
hand, a Big Iron would face problems of a different kind; whereas its
processors might run faster, it might not get the same response time
if it cannot fit the same amount of data into RAM as the chicken
factory can, and may end up trundling across the disk.
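
(A toy sketch of that tree, with purely hypothetical structure and
data: each node is a dict held in RAM, and a query is redirected down
one branch until a leaf answers from its local index, never touching
disk:)

class Node:
    def __init__(self, children=None, index=None):
        self.children = children or {}   # prefix -> child Node
        self.index = index or {}         # keyword -> cached result

    def search(self, keyword):
        # Leaf: answer at RAM speed from the local index.
        if not self.children:
            return self.index.get(keyword, "not found")
        # Interior node: redirect the request down one branch.
        child = self.children.get(keyword[0])
        return child.search(keyword) if child else "not found"

leaf_a = Node(index={"apple": "doc-1"})
leaf_c = Node(index={"cloud": "doc-7"})
root = Node(children={"a": leaf_a, "c": leaf_c})
print(root.search("cloud"))   # -> doc-7, served from a leaf's RAM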

Also, the specifics of Google search are that the data doesn't have to
be updated synchronously in all the replicas of the same node...
because the nature of the search is more of a guess (so if a node
starts guessing a more precise value a few hours later, it may not be
a problem). This means relaxed requirements on replication speed
and accuracy.
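
(A sketch of that relaxed replication, with invented names: updates
reach replicas asynchronously, so a read may return a stale -- or
missing -- value for a while, which is fine for guess-like search
data:)

import queue
import threading
import time

class Replica:
    def __init__(self):
        self.data = {}
        self.inbox = queue.Queue()
        threading.Thread(target=self._apply_loop, daemon=True).start()

    def _apply_loop(self):
        while True:
            key, value = self.inbox.get()
            time.sleep(0.01)            # simulated replication lag
            self.data[key] = value      # applied later, not synchronously

replicas = [Replica() for _ in range(3)]

def write(key, value):
    for r in replicas:                  # fire-and-forget: no lock, no wait
        r.inbox.put((key, value))

write("rank:example.com", 0.42)
print(replicas[0].data.get("rank:example.com"))   # likely None: stale read
time.sleep(0.05)
print(replicas[0].data.get("rank:example.com"))   # 0.42 once applied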


> But maybe this is partly an issue of terminology. Say "memory
> bandwidth" instead of "contention" and I find something to agree with.
> It's hard (not impossible, just expensive) to build a multiprocessor
> with memory bandwidth that scales well with the number of processors.
> Front-side-bus-based memory starvation is a common issue, and AMD's
> scheme is better only with very clever memory allocation. MP caches
> don't have that issue, and MCs duplicate the whole memory system,
> adding proportionally to the number of systems. If those kinds of memory
> bandwidth scaling effects are what you mean by "contention," then,
> well, we're in boring agreement again.

You are right, perhaps I should use the term describing the channel
through which the data is fed in/out of RAM, because the contention is
the attempt to own the channel (to make consistent changes). This is
perhaps called the bus.

So the point I was making is that obviously access to RAM cannot be
contended by other computers in the multi-computer system, whereas if
you have to store the data in a shared store, access to that causes
contention. That's how it may be better to have multiple computers
with loads of RAM. But you also said that it would be the norm for
faster computers to have proportionately more memory. Then why would
Google buy lots of cheaper computers, as someone claimed here?

Greg Pfister

unread,
Jul 3, 2008, 10:58:30 AM7/3/08
to Cloud Computing
On Jul 2, 4:20 pm, "Sassa NF" <sassa...@gmail.com> wrote:
...
> Your "bandwidth" term describes part of what I meant. I meant that if
> one component wants to read or modify the "master" copy, they'd need
> to lock it for exclusive access. The others would need to wait to get
> the lock, hence contention (if I am using the right term), and the
> cause of contention is bandwidth.
>
> So RAM of individual computers would be contention-free (on the scale
> of the multi-computer system).

This is the nub of our disagreement. Yes, local memory access has no
contention for access to that memory, but:

Assume you don't change the core logic of the application. Then if you
needed to acquire a lock in a multiprocessor, you still need to
acquire that same lock on a multicomputer. (If for some reason you do
NOT need the lock on the MC, then duplicate that lock-free logic in
the MP, and its memory contention goes away.) So while you're not
contending for memory, you've traded MP lock time (nsecs. or usecs.)
for MC lock time (millisec.).

Looking at it another way: You didn't eliminate the logical
contention. You can't; it's inherent in the program logic + data
access pattern. You just moved the contention from the MP memory
system into the network / adapter / device driver.
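
(A back-of-envelope model of that trade, using illustrative orders of
magnitude only: the lock acquisitions don't go away, they just get
vastly more expensive when the lock lives across a network:)

MP_LOCK_ACQUIRE_S = 100e-9    # ~100 ns: atomic op in shared memory
MC_LOCK_ACQUIRE_S = 1e-3      # ~1 ms: lock service reached over a network

def run_time(n_critical_sections, lock_cost_s, work_per_section_s=10e-6):
    # Serial cost of the critical sections; the logical lock is the
    # same in both cases, only its price differs.
    return n_critical_sections * (lock_cost_s + work_per_section_s)

n = 100_000
print(f"MP: {run_time(n, MP_LOCK_ACQUIRE_S):8.2f} s of lock+work")
print(f"MC: {run_time(n, MC_LOCK_ACQUIRE_S):8.2f} s of lock+work")
# Same program logic, same contention; it just moved into the network.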

(By the way, this same issue was a flaw in the logic of the classic
paper "On the Future of High-Performance Database Processing" by
DeWitt and Gray in 1992, the paper that basically launched shared-
nothing distributed databases. Unfortunately, it wasn't the only logic
flaw, but nobody recognized that at the time.)

> > On a multi-processor, there is a known effect that, with the right
> > program, of course, gives real superlinear speedup sometimes because
> > adding a processor adds cache: N processors have more memory (in
> > cache) to work with. But you don't have the same effect with a multi-
> > computer.
>
> I was wondering why you don't have the same effect with a
> multi-computer? (And I thought that is how things could work much
> faster on lots of cheap computers with loads of RAM.)
>
> Say, hypothetically Google splits its knowledge into a tree or a
> hierarchy, each node represented by a cheap computer with loads of
> RAM. Searching the net would mean redirecting the request through a
> few branches in the tree and the response would be provided at the
> speed of search in RAM for frequently sought keywords. On the other
> hand, a Big Iron would face problems of a different kind; whereas its
> processors might run faster, it might not get the same response time
> if it cannot fit the same amount of data into RAM as the chicken
> factory can, and may end up trundling across the disk.

Well, yes (ignoring the time spent sending the messages inter-
computer), because you've increased the amount of RAM -- so in a sense
it's not true "superlinear" speedup. Depending on the constraints you
put on what you mean by "superlinear."

The MP superlinear speedup is arguably more subtle because it occurs
when you change nothing except the number of threads/processes used by
the code.

So you have this MP, with N processors and X GB of RAM. You run the
program with number of threads set to, say, 1. It runs, using only one
processor. Now you run it with number of threads = 2, so it uses 2
processors. It might speed up by >2X -- *if* the bottleneck was in
moving data in and out of some very cache-able memory, because along
with that processor you acquired the processor's cache.

(Add here all kinds of caveats about multi-level cache structure,
using real processors and not hardware threads on a single processor,
etc.)
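
(A schematic model of that effect -- all numbers invented for shape:
once the per-thread slice of the working set fits in a processor's
cache, adding the second processor more than doubles throughput:)

CACHE_PER_CPU_MB = 8
WORKING_SET_MB = 12
T_CACHE_HIT, T_RAM = 1.0, 10.0   # relative access costs

def time_per_access(n_cpus):
    # Each CPU works on its own slice of the working set; the hit rate
    # rises as the slice shrinks toward the per-CPU cache size.
    slice_mb = WORKING_SET_MB / n_cpus
    hit_rate = min(1.0, CACHE_PER_CPU_MB / slice_mb)
    return hit_rate * T_CACHE_HIT + (1 - hit_rate) * T_RAM

t1 = time_per_access(1)        # the 12 MB slice misses the 8 MB cache
t2 = time_per_access(2) / 2    # each 6 MB slice now fits: >2x speedup
print(f"speedup with 2 CPUs: {t1 / t2:.1f}x")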

> Also, the specifics of Google search are that the data doesn't have to
> be updated synchronously in all the replicas of the same node...
> because the nature of the search is more of a guess (so if a node
> starts guessing a more precise value a few hours later, it may not be
> a problem). This means relaxed requirements on replication speed
> and accuracy.

Yes, and if you relaxed those update requirements the same way in the
code run on an MP (yes, that would require more MP memory), you would
get less memory contention (but may have a memory bandwidth problem).
That's the kind of thing I meant above when referring to "changing
the core logic of the application."

But without the motivation of scaling much more than any single MP,
you may not be willing to put the effort into changing the
application.

--
Greg Pfister

Sassa NF

unread,
Jul 4, 2008, 3:36:45 AM7/4/08
to cloud-c...@googlegroups.com
2008/7/3 Greg Pfister <greg.p...@gmail.com>:

> On Jul 2, 4:20 pm, "Sassa NF" <sassa...@gmail.com> wrote:
> ...
>> Your "bandwidth" term describes part of what I meant. I meant that if
>> one component wants to read or modify the "master" copy, they'd need
>> to lock it for exclusive access. The others would need to wait to get
>> the lock, hence contention (if I am using the right term), and the
>> cause of contention is bandwidth.
>>
>> So RAM of individual computers would be contention-free (on the scale
>> of the multi-computer system).
>
> This is the nub of our disagreement. Yes, local memory access has no
> contention for access to that memory, but:

I have to agree. :-)

I forgot that in both cases they are just a bunch of von Neumann
machines connected with channels. So if in the MC case no messages
need to be sent and each computer can work autonomously, the same
holds for the MP case.

But I somehow still think that using an MC has its benefits over an
MP, because in an MP sharing resources between von Neumann machines
starts right outside the processor chip (let's say, outside each
processor's cache), whereas in an MC sharing resources between von
Neumann machines starts on the other side of the network card.


> Assume you don't change the core logic of the application. Then if you
> needed to acquire a lock in a multiprocessor, you still need to
> acquire that same lock on a multicomputer. (If for some reason you do
> NOT need the lock on the MC, then duplicate that lock-free logic in
> the MP, and its memory contention goes away.) So while you're not
> contending for memory, you've traded MP lock time (nsecs. or usecs.)
> for MC lock time (millisec.).
>
> Looking at it another way: You didn't eliminate the logical
> contention. You can't; it's inherent in the program logic + data
> access pattern. You just moved the contention from the MP memory
> system into the network / adapter / device driver.
>
> (By the way, this same issue was a flaw in the logic of the classic
> paper "On the Future of High-Performance Database Processing" by
> DeWitt and Gray in 1992, the paper that basically launched shared-
> nothing distributed databases. Unfortunately, it wasn't the only logic
> flaw, but nobody recognized that at the time.)

...only I found the same flaw ...ermmm.... 16 years later :-)


>> > On a multi-processor, there is a known effect that, with the right
>> > program, of course, gives real superlinear speedup sometimes because
>> > adding a processor adds cache: N processors have more memory (in
>> > cache) to work with. But you don't have the same effect with a multi-
>> > computer.
>>
>> I was wondering why you don't have the same effect with a
>> multi-computer? (And I thought that is how things could work much
>> faster on lots of cheap computers with loads of RAM.)
>>
>> Say, hypothetically Google splits its knowledge into a tree or a
>> hierarchy, each node represented by a cheap computer with loads of
>> RAM. Searching the net would mean redirecting the request through a
>> few branches in the tree and the response would be provided at the
>> speed of search in RAM for frequently sought keywords. On the other
>> hand, a Big Iron would face problems of a different kind; whereas its
>> processors might run faster, it might not get the same response time
>> if it cannot fit the same amount of data into RAM as the chicken
>> factory can, and may end up trundling across the disk.
>
> Well, yes (ignoring the time spent sending the messages inter-
> computer), because you've increased the amount of RAM -- so in a sense
> it's not true "superlinear" speedup. Depending on the constraints you
> put on what you mean by "superlinear."

Actually, I wasn't after a superlinear speedup :-) I was after a
speedup that would be more easily achievable on an MC than on an MP.


> The MP superlinear speedup is arguably more subtle because it occurs
> when you change nothing except the number of threads/processes used by
> the code.
>
> So you have this MP, with N processors and X GB of RAM. You run the
> program with number of threads set to, say, 1. It runs, using only one
> processor. Now you run it with number of threads = 2, so it uses 2
> processors. It might speed up by >2X -- *if* the bottleneck was in
> moving data in and out of some very cache-able memory, because along
> with that processor you acquired the processor's cache.

Yep, that's what I think might be the case with Google's search or
similar tasks. The RAM of individual nodes of the MC becomes what an
MP has as CPU cache, the only difference being the size.

I am not saying I am right. This is just what seems to be the case. In
an MP one von Neumann sequential processor is a CPU, and its local
data store is the cache (the rest of memory being a shared resource =
remote memory). In an MC one von Neumann sequential processor is a
single node, and its local data store is the RAM (the rest of storage
being a shared filesystem? = remote memory). They are sharing stuff on
a different scale, so the gains are on a different scale, too.


What's wrong with that? (are we in the region of 10+ messages on the
topic now? and we started from agreeing!)


Perhaps the illusion of a gain is not in contention, as I thought; in
the end it is the IO bandwidth, as you say. In both cases we have
similar numbers of CPUs with proportionate sizes of cache, RAM, disks
and optic fibre. So for an MC to make better use of Joules than an MP,
the gain from IO confined to a single node in the MC should exceed the
loss of IO due to more network communication.
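
(That break-even condition, stated as a trivial check with made-up
Joule figures -- both constants are hypothetical and would have to
come from measurement:)

E_LOCAL_IO_SAVED_J = 40.0    # hypothetical: IO kept in a node's own RAM/disk
E_NETWORK_EXTRA_J = 25.0     # hypothetical: added inter-node messaging

# The MC wins on Joules only if the energy saved by keeping IO local
# exceeds the extra energy spent on network communication.
print("MC more Joule-efficient than MP:",
      E_LOCAL_IO_SAVED_J > E_NETWORK_EXTRA_J)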


> (Add here all kinds of caveats about multi-level cache structure,
> using real processors and not hardware threads on a single processor,
> etc.)
>
>> Also, the specifics of Google search are that the data doesn't have to
>> be updated synchronously in all the replicas of the same node...
>> because the nature of the search is more of a guess (so if a node
>> starts guessing a more precise value a few hours later, it may not be
>> a problem). This means relaxed requirements on replication speed
>> and accuracy.
>
> Yes, and if you relaxed those update requirements the same way in the
> code run on an MP (yes, that would require more MP memory), you would
> get less memory contention (but may have a memory bandwidth problem).
> That's the kind of thing I meant above when referring to "changing
> the core logic of the application."
>
> But without the motivation of scaling much more than any single MP,
> you may not be willing to put the effort into changing the
> application.

Right, that's another point I missed. If the application can scale
much more than a single MP, then perhaps you can achieve the same
results at a similar speed but on a cheaper hardware?


Sassa

> --
> Greg Pfister

Greg Pfister

unread,
Jul 5, 2008, 11:26:29 PM7/5/08
to Cloud Computing
On Jul 4, 2:36 am, "Sassa NF" <sassa...@gmail.com> wrote:
> 2008/7/3 Greg Pfister <greg.pfis...@gmail.com>:
...
> But I somehow still think that using an MC has its benefits over an
> MP, ....

See my comments below.
...

> > (By the way, this same issue was a flaw in the logic of the classic
> > paper "On the Future of High-Performance Database Processing" by
> > DeWitt and Gray in 1992, the paper that basically launched shared-
> > nothing distributed databases. Unfortunately, it wasn't the only logic
> > flaw, but nobody recognized that at the time.)
>
> ...only I found the same flaw ...ermmm.... 16 years later :-)

Well, actually I was accusing you of making the same error -- and
pointing out you were in good company. :-)

But they also concluded that the only disk access was to local disks,
which isn't current practice any more. (SANs hadn't been invented.)

...

> Actually, I wasn't after a superlinear speedup :-) I was after a
> speedup that would be more easily achievable on an MC than on an MP.

After all this time... is that all? Heck, I'll agree that happens
every day. There are lots of clusters / farms / MCs / whatever out
there actively using far more memory and IO bandwidth than any one MP
has ever achieved.

...

> I am not saying I am right. This is just what seems to be the case. In
> an MP one von Neumann sequential processor is a CPU, and its local
> data store is the cache (the rest of memory being a shared resource =
> remote memory). In an MC one von Neumann sequential processor is a
> single node, and its local data store is the RAM (the rest of storage
> being a shared filesystem? = remote memory). They are sharing stuff on
> a different scale, so the gains are on a different scale, too.
>
> What's wrong with that? (are we in the region of 10+ messages on the
> topic now? and we started from agreeing!)
>
> Perhaps the illusion of a gain is not in contention, as I thought; in
> the end it is the IO bandwidth, as you say. In both cases we have
> similar numbers of CPUs with proportionate sizes of cache, RAM, disks
> and optic fibre. So for an MC to make better use of Joules than an MP,
> the gain from IO confined to a single node in the MC should exceed the
> loss of IO due to more network communication.

Well, Joules are another layer of complexity, but leaving that aside
I'll agree. I'll even agree (ahem) that the word "contention" can be
used here -- as long as we agree that the contention is for basic
hardware resources -- data carriers like busses, switches, etc. -- and
not logical contention on locks that depends on the algorithms.

So I'm afraid we've clarified ourselves into agreement.

Dang, didn't get to use Caps Lock. :-)

--
Greg Pfister

Sassa NF

unread,
Jul 6, 2008, 3:00:03 AM7/6/08
to cloud-c...@googlegroups.com
2008/7/6 Greg Pfister <greg.p...@gmail.com>:

> On Jul 4, 2:36 am, "Sassa NF" <sassa...@gmail.com> wrote:
>> 2008/7/3 Greg Pfister <greg.pfis...@gmail.com>:
> ...
>> But I somehow still think that using an MC has its benefits over an
>> MP, ....
>
> See my comments below.
> ...
>
>> > (By the way, this same issue was a flaw in the logic of the classic
>> > paper "On the Future of High-Performance Database Processing" by
>> > DeWitt and Gray in 1992, the paper that basically launched shared-
>> > nothing distributed databases. Unfortunately, it wasn't the only logic
>> > flaw, but nobody recognized that at the time.)
>>
>> ...only I found the same flaw ...ermmm.... 16 years later :-)
>
> Well, actually I was accusing you of making the same error -- and
> pointing out you were in good company. :-)

Yes, I got it :-) But to make an error, you need to find one, in a way...


> But they also concluded that the only disk access was to local disks,
> which isn't current practice any more. (SANs hadn't been invented.)
>
> ...
>
>> Actually, I wasn't after a superlinear speedup :-) I was after a
>> speedup that would be more easily achievable on an MC than on an MP.
>
> After all this time... is that all? Heck, I'll agree that happens
> every day. There are lots of clusters / farms / MCs / whatever out
> there actively using far more memory and IO bandwidth than any one MP
> has ever achieved.

:-) That's good to know, but it doesn't help me understand why/when an
MC of "mid-range" computers is better than an MC of MP computers...

I can see the resource sharing happens on a different scale, but
aren't MPs designed to share resources per CPU more efficiently in the
general case? (Then what are the special cases in which an MC of
"mid-range" computers shares resources between CPUs better than an MP
does?) Or is it just a case of unfairly overpricing the MPs?

Or is it a case of simpler app development? (Kind of: to build an MC
of MPs, you need to implement two kinds of resource sharing - inter-
and intra-computer.)


> ...
>
>> I am not saying I am right. This is just what seems to be the case. In
>> an MP one von Neumann sequential processor is a CPU, and its local
>> data store is the cache (the rest of memory being a shared resource =
>> remote memory). In an MC one von Neumann sequential processor is a
>> single node, and its local data store is the RAM (the rest of storage
>> being a shared filesystem? = remote memory). They are sharing stuff on
>> a different scale, so the gains are on a different scale, too.
>>
>> What's wrong with that? (are we in the region of 10+ messages on the
>> topic now? and we started from agreeing!)
>>
>> Perhaps the illusion of a gain is not in contention, as I thought; in
>> the end it is the IO bandwidth, as you say. In both cases we have
>> similar numbers of CPUs with proportionate sizes of cache, RAM, disks
>> and optic fibre. So for an MC to make better use of Joules than an MP,
>> the gain from IO confined to a single node in the MC should exceed the
>> loss of IO due to more network communication.
>
> Well, Joules are another layer of complexity, but leaving that aside
> I'll agree. I'll even agree (ahem) that the word "contention" can be
> used here -- as long as we agree that the contention is for basic
> hardware resources -- data carriers like busses, switches, etc. -- and
> not logical contention on locks that depends on the algorithms.
>
> So I'm afraid we've clarified ourselves into agreement.
>
> Dang, didn't get to use Caps Lock. :-)

OK


Sassa

> --
> Greg Pfister

Greg Pfister

unread,
Jul 7, 2008, 6:56:10 PM7/7/08
to Cloud Computing
On Jul 6, 2:00 am, "Sassa NF" <sassa...@gmail.com> wrote:
> :-) That's good to know, but it doesn't help me understand why/when an
> MC of "mid-range" computers is better than an MC of MP computers...
>
> I can see the resource sharing happens on a different scale, but
> aren't MPs designed to share resources per CPU more efficiently in the
> general case? (Then what are the special cases in which an MC of
> "mid-range" computers shares resources between CPUs better than an MP
> does?) Or is it just a case of unfairly overpricing the MPs?
>
> Or is it a case of simpler app development? (Kind of: to build an MC
> of MPs, you need to implement two kinds of resource sharing - inter-
> and intra-computer.)

Advantages / disadvantages of MP vs. MC? When to use one vs. the
other?

Um, I wrote the book on this. Literally. Go see
http://www.amazon.com/Search-Clusters-2nd-Gregory-Pfister/dp/0138997098/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1215468747&sr=1-1
.

I'm not going to post the whole book here -- and in any event the book
explains the basic principles; it's not a checklist you can directly
use to pick one or the other -- but here are some things that are
relevant. It can be a very trivial or a very multi-dimensional
decision.

First of all, which is the app written for? MC and MP are very
different programming models. Scaling using an MC of MPs can require
that you come very close to writing it twice, nontrivially (unless you
basically punt the MP; see below). Some apps only scale significantly
on MC (Apache & web serving come to mind) while others give speedup
only on MP (speeding up a single run of NASTRAN is an MP-only case).
If your app is tied to one of the programming models, you don't have
much of a choice.

Virtualization fuzzes this, since it makes an MC out of an MP. (I
don't believe in MPs made out of an MC; that's a third programming
model.)

"Enterprise" programming models, like EJB (or other transaction
monitors) fuzz it again, since they take over the synchronization &
sharing and can at least in theory run the same apps on an MP or an
MC, don't care which, just invoke different code in the monitor. In
theory. How well that scales in practice... I've heard conflicting
reports.

Some forms of NUMA (non-uniform memory access) MPs also fuzz the
decision, but seldom scale well unless the code is so partitioned it
is almost MC-style (or EJB) anyway. (I'm thinking here of Azul in
particular.) However, there are HPC guys who really love their NUMA,
so SGI more-or-less stays in business. NUMA is actually yet another
programming model.

If you do have a choice, the same aggregate performance is probably
cheaper with commodity MC -- for purchase price. But management of
fewer MPs is cheaper over time, even if they're virtualized.

Availability: One MP is not HA. Period. Even if it's NUMA. An MC can
be more available, but you have to be sure the code will fail over
correctly. Some databases do that rather well. Mostly stateless apps
do it really well. But be sure the machines aren't on the same power
feed, under the same sprinkler head, etc., and if you're really
paranoid, not on the same tectonic plate -- which raises issues of
performance, since while you can do two-phase commit across big
geography, it costs in response time.

...and so on.
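
(On the two-phase-commit point above: a back-of-envelope with rough,
illustrative RTTs. 2PC needs two message rounds, gated by the slowest
link, so geography shows up directly in response time:)

def two_phase_commit_latency(rtt_s, rounds=2):
    # prepare/vote round + commit/ack round, paced by the slowest link
    return rounds * rtt_s

for site, rtt in [("same rack", 0.0002), ("same metro", 0.002),
                  ("cross-country", 0.07), ("other tectonic plate", 0.25)]:
    print(f"{site:>20}: ~{two_phase_commit_latency(rtt) * 1000:.1f} ms per commit")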

Lots of this, particularly issues of programming model and generic
characteristics of apps that scale on one or the other, is in the
book; the penultimate chapter is a summary comparison.

For the most part, since the book was written (2000, but it's still
selling and people are still referred to it) the market has spoken and
MCs are "it" for almost all cases: they're the target of the vast
majority of the programming going on. This whole generation of
programmers thinks scaling = MC. That we're now all going to have MCs
made of MPs (multimanycore) hasn't yet been well digested. For now,
the major use of MPMCs is to cut up each MP into what amounts to an
MC, either by virtualization or by various forms of software, like
multiple MPI nodes or MapReduce nodes on a single MP. This may not be
an efficient use of the MPs, depending on the app, but at least it
uses them.
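
(The "cut each MP up into an MC" pattern, sketched with Python's
multiprocessing in place of MPI or MapReduce: several single-threaded
worker processes on one box, each with private memory, communicating
only by message passing:)

from multiprocessing import Process, Queue

def node(node_id, inbox, outbox):
    # Each process is its own "computer": no memory shared with siblings.
    work = inbox.get()
    outbox.put((node_id, sum(work)))

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    procs = [Process(target=node, args=(i, inbox, outbox)) for i in range(4)]
    for p in procs:
        p.start()
    for i in range(4):
        # Which worker gets which chunk is up to the queue -- exactly
        # the MC flavor of scheduling, even though it's all one MP box.
        inbox.put(list(range(i * 100, (i + 1) * 100)))
    results = [outbox.get() for _ in range(4)]
    for p in procs:
        p.join()
    print(sorted(results))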

--
Greg Pfister

Sassa NF

unread,
Jul 8, 2008, 4:21:07 AM7/8/08
to cloud-c...@googlegroups.com
Hey! I'm glad I asked!

I've seen your book, now I'll have to buy it :-)

Thanks, Greg!


Sassa

2008/7/7 Greg Pfister <greg.p...@gmail.com>:
...


> Advantages / disadvantages of MP vs. MC? When to use one vs. the
> other?
>
> Um, I wrote the book on this. Literally. Go see
> http://www.amazon.com/Search-Clusters-2nd-Gregory-Pfister/dp/0138997098/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1215468747&sr=1-1
> .
>
> I'm not going to post the whole book here -- and in any event the book
> explains the basic principles; it's not a checklist you can directly
> use to pick one or the other -- but here are some things that are
> relevant. It can be a very trivial or a very multi-dimensional
> decision.

...
> --
> Greg Pfister
