At its core, Intel is a company driven by one singular mantra:
"Moore's Law". According to Wikipedia, Moore's law describes an
important trend in the history of computer hardware: that the number
of transistors that can be inexpensively placed on an integrated
circuit is increasing exponentially, doubling approximately every two
years. The observation was first made by Intel co-founder Gordon E.
Moore in a 1965 paper.
Over the last couple of years we have been working very closely with
Intel, specifically in the area of virtualization. During this time
we have learned a lot about how they think and what drives them as an
organization. In one of my early pitches we described our approach to
virtualization as "Doing more with Moore", a play on the common
phrase "doing more with less" combined with some of the ideas behind
"Moore's Law", which is all about growth and greater efficiencies. They
loved the idea: for the first time someone was looking at
virtualization not purely as a way to consolidate a data center but as
a way to more effectively scale your overall capacity.
What is interesting about Moore's law with regard to cloud computing is
that it is no longer just about how many transistors you can get on a
single CPU, but more about how effectively you spread your compute
capacity across more than one CPU, be it multi-core chips, or among
hundreds, or even thousands of connected servers. Historically, the
faster the CPU gets, the more demanding the applications built for it
become. I am curious whether we're on the verge of seeing a similar
"Moore's Law" applied to the cloud. And if so, will it follow the same
principles? Will we start to see a "Ruv's Law" where every 18 months
the amount of cloud capacity will double, or will we reach a point
where there is never enough excess capacity to meet the demand?
(Original Post:
http://elasticvapor.com/2008/06/more-with-moore-more-or-less.html)
--
Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.
blog > www.elasticvapor.com
I see the cloud as more of an ecosystem, which is likely to defy any
simple, linear rule due to emergent properties. I wouldn't be shocked
if it was as hard to predict as the weather ("5 day forecast calls for a
200 msec standard deviation in latency with 10% probability of the
jitters"). I'm only slightly joking - my early experiences with sharing
hosted grid computing resources were pretty blustery.
On that note I am much more interested in the phase transitions, critical
junctures where the properties change.
--
Jeff Dean showed some interesting numbers about cluster sizes and job
throughput in this talk:
http://sites.google.com/site/io/underneath-the-covers-at-google-current-systems-and-future-directions
Maybe add a little historical price data for commodity hardware...
Paco
On Sun, Jun 22, 2008 at 9:32 PM, Ray Nugent <rnu...@yahoo.com> wrote:
> More likely you see "Ray's Law" (sorry, couldn't help myself ;-) where the
> amount of computing one can do on a given piece of the cloud increases
> exponentially while the cost for said piece drops by half...
>
> Ray
>
> ----- Original Message ----
> From: Reuven Cohen <r...@enomaly.com>
...
> At its core, Intel is a company driven by one singular mantra:
> "Moore's Law". According to Wikipedia, Moore's law describes an
> important trend in the history of computer hardware: that the number
> of transistors that can be inexpensively placed on an integrated
> circuit is increasing exponentially, doubling approximately every two
> years. The observation was first made by Intel co-founder Gordon E.
> Moore in a 1965 paper.
...
That is an absolutely brilliant link!
Attributes                       Aug '04   Mar '05   Mar '06   Sep '07
Number of jobs                       29K       72K      171K    2,217K
Average completion time (secs)       634       934       874       395
Machine years used                   217       981     2,002    11,081
Input data read (TB)               3,288    12,571    52,254   403,152
Intermediate data (TB)               758     2,756     6,743    34,774
Output data written (TB)             193       941     2,970    14,018
Average worker machines              157       232       268       394
2M jobs per month, 400 PB read, and an average of roughly 400 worker
machines per job. That in a nutshell is why Yahoo and MSN are being run
out of town. That is using computing to add value to your business.
Google has to be the poster child of "Competing on Analytics".
Of course, it takes serious money to get to that scale, and in some ways
you can think of this IT scale as the new business reality of
distribution power. Whereas big companies used to own the market because
they were able to distribute physical widgets, now big companies own the
market because they can aggregate information.
Does anyone have practical experience with the actual computations that
take place and how much they would cost using EC2 instances? I would be
interested to hear what the top 5 algorithms are that are run at Google and
what the attributes, as mentioned above, are for these 5 algorithms. If we
know that we could see how the economics map onto other cloud models (IBM,
Sun, AWS).
Theo
I actually just read a Wired article that touched on similar ideas. (I
was hoping to get quoted in the article but it was not to be;
interesting nonetheless.)
http://www.wired.com/science/discoveries/magazine/16-07/pb_intro
reuven
--
Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.
www.enomaly.com :: 416 848 6036 x 1
skype: ruv.net // aol: ruv6
blog > www.elasticvapor.com
-
Get Linked in> http://linkedin.com/pub/0/b72/7b4
Khazret: We should be able to do this from first principles: 0.25 s, and
let's say they use 4,000 servers. I think the size of the index server
banks is large, since they benefit from having the whole index in memory.
Whole index in memory + redundancy + thousands of queries per second
indicates a large cluster for this (I was trying to dig up info on the
size of the index but can't find it right now, so it is a bit of a shot
from the hip).
0.25 s on 4,000 servers; one index server is 400 watts, given that these
have large memory configs and fastish CPUs; assume the network takes
about 1/4 of the power of the cluster it serves; so the query consumes on
the order of 500 kilojoules. Let's see if we can put this in perspective:
a hair dryer is about 1,500 watts, so the query takes as much energy as
running the hair dryer for 5.5 minutes. Please double-check my math.
Theo
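For what it's worth, the arithmetic checks out. A quick sketch of the
same estimate in Python (the 4,000-server and 400-watt figures are Theo's
assumptions, not published numbers):

# Back-of-envelope energy cost of one query, following Theo's assumptions.
servers = 4_000            # assumed size of the index cluster
watts_per_server = 400     # assumed draw for a big-memory, fast-CPU box
network_overhead = 1.25    # network adds ~1/4 of the cluster's power
query_seconds = 0.25

cluster_watts = servers * watts_per_server * network_overhead  # 2.0 MW
energy_joules = cluster_watts * query_seconds                  # 500 kJ

hair_dryer_watts = 1_500
print(f"{energy_joules / 1e3:.0f} kJ per query")
print(f"= a hair dryer for {energy_joules / hair_dryer_watts / 60:.1f} minutes")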
reuven
--
: : Geoffrey Fox g...@indiana.edu FAX 8128567972 http://www.infomall.org
: : Phones Cell 812-219-4643 Home 8123239196 Lab 8128567977
: : SkypeIn 812-669-0772 with voicemail, International cell 8123910207
2008/6/24 Geoffrey Fox <g...@grids.ucs.indiana.edu>:
There is a curious law about the chip size and the performance.
To solve a problem of complexity N in time T, the chip area A design
should satisfy a relationship A*T^2 >= c*f(N), where c is a constant,
and f(N) is a function of the problem complexity.
The curious thing here is that if you have a chip of size n^2*A that
solves the problem in time T/n, the efficiency of the single-unit
design is A*T^2. The same efficiency can be achieved by making n^2
units of area A solving the problem in time T (i.e. n times slower
than the large unit). However, the system of n^2 units is capable of
splitting the work into n^2 pieces, perhaps ending up finishing the
whole job in T/n^2, i.e. n times faster than the large unit... Get it?
:-)
I wonder if the same goes for grids/clouds.
Sassa
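A quick numeric illustration of that argument, with made-up values for A,
T and n:

# Toy check of the A*T^2 trade-off described above (illustrative numbers).
A, T = 1.0, 8.0   # area and time of the small unit
n = 4

big_unit    = {"area": n**2 * A, "time": T / n}     # one large chip
small_units = {"area": n**2 * A, "time": T / n**2}  # n^2 small chips, work split n^2 ways

for name, d in (("one large unit", big_unit), ("n^2 small units", small_units)):
    print(f"{name}: area {d['area']}, time {d['time']}")
# Same silicon budget (n^2 * A) and the same A*T^2 efficiency per unit, but
# the farm of small units finishes n times sooner -- provided the job really
# does split into n^2 independent pieces.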
I think those two posts are on the right track, in terms of
optimization problems. The constraints in that system of equations
become interesting -- and in practice go beyond what O(n)
complexity generally considers.
Case in point:
1. Suppose my business has a mission-critical system, which requires 4
servers. That system completes some specific processing in nearly 24
hrs, and it must run each day. Some people would look at that and
pronounce it "Close to full utilization - good." An LP model might
agree, given a superficial set of constraints and cost function.
2. Suppose further that it would be possible to move this same
processing off to some other grid, and run it on 48 servers + 2
controls (same size servers) within approximately 2 hours. Roughly
speaking, linear scaling. Slightly more expensive, due to the added
controllers.
In practice, however, one must consider other factors beyond the
complexity of the problem "N", the number of nodes "n", and the time
"T" required:
* overall capex vs. opex for the server resources -- particularly
if a large number of servers can be leased at an hourly rate, thereby
reducing capex.
* cost of moving the data -- both the time required and the network cost.
* operational risk in Approach 1 if one of those four servers goes
belly up -- versus the fault tolerance benefits of a cloud-based
framework which can accommodate node failures.
* operational risk in Approach 1 if some aspect of the data or code
involved is incorrect, and therefore "full utilization" implies the
job cannot be rerun within the 24 hour requirement.
* opportunity cost if the problem scales up suddenly, such that
Approach 1 requires either more than 4 servers or more than 24 hours
to run.
Considering those trade-offs in the constraints and cost function
allows an optimization problem to solve for minimized risk.
Managing risk is what will convince the decision makers at enterprise
businesses to adopt cloud computing for mission critical systems. But
not O(N) complexity :)
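To make that concrete, here is a minimal sketch of the server-hour
comparison; the hourly rate is a placeholder, not a real price:

# Rough cost comparison of the two approaches above (price is hypothetical).
HOURLY_RATE = 0.10                       # assumed cost per server-hour

approach_1 = 4 * 24 * HOURLY_RATE        # 4 servers running for 24 hrs
approach_2 = (48 + 2) * 2 * HOURLY_RATE  # 48 workers + 2 controllers for 2 hrs

print(f"Approach 1: {4 * 24} server-hours, ${approach_1:.2f}")
print(f"Approach 2: {(48 + 2) * 2} server-hours, ${approach_2:.2f}")
# 96 vs. 100 server-hours: nearly the same raw cost, but Approach 2 leaves
# a 22-hour window for reruns after a node failure or bad data.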
Hope that helps -
Paco
--
Manager, Behavioral Targeting
Adknowledge, Inc.
pna...@adknowledge.com
pac...@cs.stanford.edu
On Fri, Jun 27, 2008 at 8:50 AM, Khazret Sapenov <sap...@gmail.com> wrote:
Sassa
2008/6/28 Greg Pfister <greg.p...@gmail.com>:
That's an idea for Khaz ;-)
> All such statements are totally, undeniably, true. (Maybe not saving
> the planet all by ourselves.)
>
> They also ignore natural functional processing boundaries, and how
> difficult it is to parallelize code to units smaller than that, and
> how simple can be to do when the processing matches those boundaries.
That's true, and that's why the AMD guy was politely sarcastic about
planting more cores and CPUs on one board, as mentioned in the "The
Register" article quoted here. The OS simply can't find enough jobs to
be done in parallel...
> Example: When commodity micros got fast enough that each could handle
> an entire web request all by itself at reasonable response time, what
> happened? The web, that's what. Server farms. Stick a simple
> dispatcher in front of a pile of commodity servers with a shared file
> system, and away you go.
Why a simple dispatcher in front of a pile of commodity servers
instead of a single mega-server?
I think the wisdom behind the A*T^2 law is that it may be easier to build
smaller, slower units (or buy slower computers) than faster units, even if
the faster units are larger. And that's where it ends. I don't think it says you
can do everything with that, or that the time T of individual small
units will be acceptable, because in the end to get it working n times
faster than a single unit, you need to parallelise processing among
n^2 units.
> (Well, OK, each server overlaps disk IO to run other threads in the IO
> wait time, but the code the majority of programmers wrote was straight
> sequential.)
>
> Think how hard that would be to do if each individual web request had
> to somehow be split up into much smaller grains that ran in parallel.
I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?
With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM. So the
question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.
> The scaling of most commercial clouds absolutely lives on
> relationships like this. It's why most HPC code is so much harder to
> parallelize: Most of it does not have lots of independent requests.
That's true, no doubt.
> There is a long-running repeating pattern in parallel processing where
> people propose a gazillion ants to run in parallel to get absolutely
> huge performance. On the other hand, Seymour Cray once said "I'd
> rather plow a field with four strong horses than with 512 chickens."
> It all depends on the application, of course; there are certainly
> cases that work well with 512 chickens, or 10E10 bacteria, or
> whatever. But most cases work better with the horses.
But we don't eat with ladles, we still eat with spoons :-) even if it
requires more trips.
Also, if you were to carry shopping of 100 people for 1 mile, you
could do 100 trips, or you could use a train of shopping trolleys and
just walk home.
As you said, it all depends on the job. And it looks like Google found
that using chickens is enough for the field they are about to plough.
Sassa
2008/6/28 Greg Pfister <greg.p...@gmail.com>:
...
>> Why a simple dispatcher in front of a pile of commodity servers
>> instead of a single mega-server?
>
> Because the pile costs a whole lot less. Compared with, say, a
> mainframe, something like 100X less per unit of performance. Maybe
> it's not less counting continuing management costs, but lots of people
> have a hard time seeing past a 100X initial purchase price delta to
> TCO over many years.
sure, that's what I thought too :-)
...
>> With 100 servers you potentially get 400GB of RAM alone, whereas with
>> 10 servers on the same architecture you would get 40GB RAM.
>
> I'd quibble with that. If it's seriously X times faster, it better
> have something near X times more RAM or the added speed isn't useful.
Well, in general I also think so, but in the special case of serving
multitudes of web requests it is not obviously important, because
serving 100 requests sequentially doesn't require 100 times more
memory than for one request.
>> So the
>> question is not only about CPU speed.
>
> I won't argue with that. Lots of factors. IO rate, too.
Yes
>> With parallelism you get an
>> option to use local data store where there is no contention. But you
>> also get a headache of updating the state across all of them, meaning
>> slower writes.
>
> By local data store, do you mean RAM shared among processors? Better
> watch for bottlenecks there, too, in bandwidth, and contention
> definitely is an issue. If you have a hot lock, it's a hot lock
> whether it's in RAM or in an outboard DB. (With the outboard DB,
> you'll feel it sooner, but it's still there.)
Well, I meant a theoretical local data store. On a multi-processor
system it is the processor cache. On a multi-computer system it is the
RAM of individual computer. There is no contention in the local data
store, because it is local. The contention does appear when you want
to copy the data from/to the local data store to/from the remote store
(shared file/memory/DB).
>> > The scaling of most commercial clouds absolutely lives on
>> > relationships like this. It's why most HPC code is so much harder to
>> > parallelize: Most of it does not have lots of independent requests.
>>
>> That's true, no doubt.
>>
>> > There is a long-running repeating pattern in parallel processing where
>> > people propose a gazillion ants to run in parallel to get absolutely
>> > huge performance. On the other hand, Seymour Cray once said "I'd
>> > rather plow a field with four strong horses than with 512 chickens."
>> > It all depends on the application, of course; there are certainly
>> > cases that work well with 512 chickens, or 10E10 bacteria, or
>> > whatever. But most cases work better with the horses.
>>
>> But we don't eat with ladles, we still eat with spoons :-) even if it
>> requires more trips.
>
> Soup spoons!
I was building the analogy of a client computer running a browser - so
to speak, a small appetite requires a little spoon travelling many
times. That's why web requests are better served by 512 chickens
than 4 horses.
>> Also, if you were to carry shopping of 100 people for 1 mile, you
>> could do 100 trips, or you could use a train of shopping trolleys and
>> just walk home.
>>
>> As you said, it all depends on the job. And it looks like Google found
>> that using chickens is enough for the field they are about to plough.
>
> Yes, but I'd add that those chickens have been fed steroids. Google
> couldn't have done it in the mid-80s. Microprocessors weren't fast
> enough. Now they're the Chickens That Ate Chicago.
I thought mid-80s were ants :-) and now we have chickens.
Sassa
> --
> Greg Pfister
On Jun 28, 2008, at 3:41 AM, Sassa NF wrote:
I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?
With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM.
You've illustrated the memory gap. Memory doubles in density every 30 months, compared to 18 months for processors (over five years, that is roughly 4x for memory versus 10x for processors).
Fred Weber, former CTO of AMD (who spearheaded development of the Opteron), and Suresh Rajan, former Director of Marketing at nVidia, hatched MetaRAM to increase memory footprints utilizing existing memory buses and commodity DRAM parts.
So the question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.
For those who might not follow LKML, read Linux kernel hacker Daniel Phillips' paper "Ramback: Faster than a Speeding Bullet" <http://lwn.net/Articles/272534/> - the thread became a 400+ post flame war which ended when Alan Cox conceded Daniel was right.
Here's the intro:
Every little factor of 25 performance increase really helps.
Ramback is a new virtual device with the ability to back a ramdisk
by a real disk, obtaining the performance level of a ramdisk but with
the data durability of a hard disk. To work this magic, ramback needs
a little help from a UPS. In a typical test, ramback reduced a 25
second file operation[1] to under one second including sync. Even
greater gains are possible for seek-intensive applications.
The difference between ramback and an ordinary ramdisk is: when the
machine powers down the data does not vanish because it is continuously
saved to backing store. When line power returns, the backing store
repopulates the ramdisk while allowing application io to proceed
concurrently. Once fully populated, a little green light winks on and
file operations once again run at ramdisk speed.
Daniel's code is rev 0.0, and is buggy like any new code, but the backbone is solid. Motivated parties should foster development of this open-source gem. Accelerating durable writes could dramatically accelerate cloud-based transactional systems.
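For readers who want the flavor of the idea without reading kernel code,
here is a toy user-space sketch of the concept (mine, not Daniel's
implementation):

# Toy illustration of the ramback idea: serve reads and writes from RAM and
# continuously flush dirty blocks to disk in the background. The real
# ramback is a kernel block device; this just shows the write-back shape.
import threading, time

class RambackToy:
    def __init__(self, path):
        self.path, self.ram, self.dirty = path, {}, set()
        self.lock = threading.Lock()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, block, data):   # fast path: touches only RAM
        with self.lock:
            self.ram[block] = data
            self.dirty.add(block)

    def read(self, block):          # fast path: RAM lookup
        with self.lock:
            return self.ram.get(block)

    def _flusher(self):             # slow path runs concurrently
        while True:
            with self.lock:
                batch = {b: self.ram[b] for b in self.dirty}
                self.dirty.clear()
            for block, data in batch.items():
                with open(f"{self.path}.{block}", "w") as f:
                    f.write(data)   # durable copy trails the RAM copy
            time.sleep(0.1)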
Last bit- x64 PCIe chipsets are just now hitting the market. Big Honking multi-processor servers loaded with RAM can now have multiple x4 IB or 10GbE NICs. IB is pretty cool, especially when you consider the RDMA capabilities.
T
Timothy Huber
Strategic Account Development
tim....@metaram.com
cell 310 795.6599
MetaRAM Inc. <http://www.metaram.com/>
what do you think of the following server config comparisons?
1) 4P-128GB vs. 2P-128GB
2) 4P-128GB vs. 4P-256GB
In your opinion, what's optimal for a server config- number of procs,
memory size, chipset (PCIe lanes), etc?
Timothy
tim....@metaram.com
cell 310 795.6599
MetaRAM Inc.
A rule of thumb I noticed is 2x the sockets for roughly 4x the price, so
your power would need to be quite expensive to justify using large servers.
Also, it's questionable that larger servers provide better perf/W;
SPECpower results so far have shown that 2S servers use slightly more
than double the power of 1S servers; I wouldn't be surprised if the
trend continues at 4S and larger. This is not to mention SMP scalability
which is never perfect. OTOH, larger servers may benefit more from
statistical multiplexing.
I agree that power accounting is essential but I think the solution is
better accounting, not workarounds. You just can't afford to reserve the
same power capacity for things that are different. We need rack power
capping to drive those branch circuits to 80% utilization.
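A back-of-envelope version of that break-even argument, with assumed
prices and wattages rather than measured figures:

# How expensive must power be before a big server pays off? (numbers assumed)
ONE_S_PRICE, ONE_S_WATTS = 2_000, 200  # hypothetical 1-socket server
TWO_S_PRICE, TWO_S_WATTS = 8_000, 380  # ~4x the price; generously, <2x power
HOURS = 3 * 365 * 24                   # assumed 3-year service life

price_gap = TWO_S_PRICE - 2 * ONE_S_PRICE             # extra cost of the 2S box
watts_saved = 2 * ONE_S_WATTS - TWO_S_WATTS           # its power saving, if any
breakeven = price_gap / (watts_saved / 1000 * HOURS)  # $/kWh to recoup the gap
print(f"electricity must cost ${breakeven:.2f}/kWh to break even")  # ~$7.61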
Wes Felter - wes...@felter.org
The configs for end point devices will vary based on workload but
generally it's nice for the endpoints to remain consistent. The closer
RAM is to the compute, the faster the software will perform
(particularly when RDMA is enabled) so large amounts of RAM have
benefits... this said... I wouldn't necessarily want 256GB at each end
point just to ensure my workloads don't overflow (primarily because
256GB would be powered up all the time and is fairly inefficient based
on today's motherboard designs). * Motherboards with larger RAM support
are more expensive. Larger machines today also typically have larger
power supplies, which means more standby energy waste and more reserved
power than is typically used.
No real opinion on size of RAM, but compute devices should only contain
what will be used, and the power supply should be sized as such. Not
sure about you guys... but reserving 1300 watts x 2 for each 4P server
is a little overkill when less than 60% of the power is actually used.
Fabric Memory (which is scalable like a storage array using 3.5" disks)
is more ideal because it can more readily be repurposed and shared. By
repurposed it could look like a LUN, JMS, JDBC/ODBC, etc... or as an
RDMA cache for NFS / SAN. If the purpose is to have lots of RAM,
deploying RAM in a server is about the most inefficient way to do so
(video, hd controller, lots of PCI-Express slots, larger power supplies,
etc... just to serve RAM).
* Attached is the high level concept for a Processor Array.
*no errors, only seeking clarifications
How extensively is RDMA deployed in your applications now? How
latency sensitive are your cloud apps?
> If the purpose is to have lots of RAM,
> deploying RAM in a server is about the most inefficient way to do so
> (video, hd controller, lots of PCI-Express slots, larger power
> supplies,
> etc... just to serve RAM).
What would you propose as a more efficient mechanism to deploying lots
of RAM than sticking it directly in a server?
regards,
Timothy
> <Processor Array - Basic Idea v1.3.ppt>
timothy norman huber
timoth...@mac.com
310 795-6599
Realize this is an open-ended question....and more of a storage fishing expedition...anyone successfully fishing in the iSER/STGT pond?
In my opinion, latency has an impact on everything which utilizes I/O.
Specifically, I/O wait and latency impacts the efficiency and power
consumption of processors.
Now that we are seeing virtual machines run on the cloud to provide HA
by default without lots of dedicated standby nodes, plus the ability to
provision first... there will be a growing dependency on I/O (and hence
low latency). RDMA lowers latency.
Check out RNA Networks (Clive Cook cc'd) for an example of an efficient
deployment of Fabric Memory with RDMA software support. Think
of fabric memory not as a replacement for native RAM, but more like what
we see in processors with L1, L2, and sometimes L3 cache. Fabric memory
is essentially an L4 cache.
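A rough illustration of that tiering (the names and timings below are
hypothetical, not RNA's API):

# Sketch of "fabric memory as L4 cache": check node-local RAM first, then
# remote fabric memory, then disk. All stores faked with dicts here.
local_ram = {}                                 # L1..L3 analogue: local RAM
fabric_memory = {"user:42": "alice"}           # "L4": pooled RAM over RDMA
disk = {"user:42": "alice", "user:7": "bob"}   # durable tier

def lookup(key):
    if key in local_ram:             # hit: no network hop at all
        return local_ram[key]
    value = fabric_memory.get(key)   # one RDMA round trip (microseconds)
    if value is None:
        value = disk.get(key)        # miss everywhere: disk (milliseconds)
    local_ram[key] = value           # promote into the local tier
    return value

print(lookup("user:7"))   # falls through to disk, then is cached locally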
The best use of RDMA is when you get RDMA for free: SDP, SRP, iSER, MS
Network Direct.
I am unable to share where we utilize RDMA today, but the above comments
summarize the general benefit.
Sassa
2008/7/1 Sniper Ninja <sniper...@gmail.com>:
Your "bandwidth" term describes part of what I meant. I meant that if
one component wants to read or modify the "master" copy, they'd need
to lock it for exclusive access. The others would need to wait to get
the lock, hence contention (if I am using the right term), and the
cause of contention is bandwidth.
So RAM of individual computers would be contention-free (on the scale
of the multi-computer system).
> On a multi-processor, there is a known effect that, with the right
> program, of course, gives real superlinear speedup sometimes because
> adding a processor adds cache: N processors have more memory (in
> cache) to work with. But you don't have the same effect with a multi-
> computer.
I was wondering why you don't have the same effect with a
multi-computer? (And I thought that is how things could work much
faster on lots of cheap computers with loads of RAM.)
Say, hypothetically Google splits its knowledge into a tree or a
hierarchy, each node represented by a cheap computer with loads of
RAM. Searching the net would mean redirecting the request through a
few branches in the tree and the response would be provided at the
speed of search in RAM for frequently sought keywords. On the other
hand, a Big Iron would face problems of a different kind; whereas its
processors might run faster, it might not get the same response time,
if it cannot fit the same amount of data into RAM, as the chicken
factory, and end up trundling across the... disk.
Also, the specifics of Google search are that the data doesn't have to
be updated synchronously in all the replicas of the same node...
Because the nature of the search is more of a guess (so if a node
starts guessing a more precise value a few hours later it may not be a
problem?). This means relaxed requirements towards replication speed
and accuracy.
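A toy sketch of that hypothetical design - the keyword space split across
many cheap nodes, each shard held entirely in RAM (not Google's actual
architecture):

# Route each keyword to the node that holds its shard in RAM.
NODES = [dict() for _ in range(16)]  # each dict stands for one node's RAM

def node_for(keyword):
    return NODES[hash(keyword) % len(NODES)]   # the "tree" routing step

def index(keyword, url):
    node_for(keyword).setdefault(keyword, []).append(url)

def search(keyword):
    return node_for(keyword).get(keyword, [])  # answered at RAM speed

index("cloud", "elasticvapor.com")
index("cloud", "enomaly.com")
print(search("cloud"))   # ['elasticvapor.com', 'enomaly.com']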
> But maybe this is partly an issue of terminology. Say "memory
> bandwidth" instead of "contention" and I find something to agree with.
> It's hard (not impossible, just expensive) to build a multiprocessor
> with memory bandwidth that scales well with the number of processors.
> Front-side-bus - based memory starvation is a common issue, and AMD's
> scheme is better only with very clever memory allocation. MP caches
> don't have that issue, and MCs duplicate the whole memory system,
> adding proportional to the number of systems. If those kinds of memory
> bandwidth scaling effects are what you mean by "contention," then,
> well, we're in boring agreement again.
You are right, perhaps I should use the term describing the channel
through which the data is fed in/out of RAM, because the contention is
the attempt to own the channel (to make consistent changes). This is
perhaps called the bus.
So the point I was making is that obviously access to RAM cannot be
contended by other computers in the multi-computer system, whereas if
you have to store the data in a shared store, access to that causes
contention. That's how it may be better to have multiple computers
with loads of RAM. But you also said that it would be the norm for
faster computers to have proportionately more memory. Then why would
Google buy lots of cheaper computers, as someone claimed here?
I have to agree. :-)
I forgot that in both cases they are just a bunch of von Neumann
machines connected with channels. So if the MC case doesn't need to
send any messages and each computer can work autonomously, so does the
MP case.
But I somehow still think that using an MC has its benefits against an
MP, because in MP sharing resources between von Neumann machines
starts right outside the processor chip (let's say, outside each
processor's cache), whereas in MC sharing resources between von
Neumann machines starts on the other side of the network card.
> Assume you don't change the core logic of the application. Then if you
> needed a to acquire a lock in a multiprocessor, you still need to
> acquire that same lock on a multicomputer. (If for some reason you do
> NOT need the lock on the MC, then duplicate that lock-free logic in
> the MP, and its memory contention goes away.) So while you're not
> contending for memory, you've traded MP lock time (nsecs. or usecs.)
> for MC lock time (millisec.).
>
> Looking at it another way: You didn't eliminate the logical
> contention. You can't; it's inherent in the program logic + data
> access pattern. You just moved the contention from the MP memory
> system into the network / adapter / device driver.
>
> (By the way, this same issue was a flaw in the logic of the classic
> paper "On the Future of High-Performance Database Processing" by
> DeWitt and Gray in 1992, the paper that basically launched shared-
> nothing distributed databases. Unfortunately, it wasn't the only logic
> flaw, but nobody recognized that at the time.)
...only I found the same flaw ...ermmm.... 16 years later :-)
>> > On a multi-processor, there is a known effect that, with the right
>> > program, of course, gives real superlinear speedup sometimes because
>> > adding a processor adds cache: N processors have more memory (in
>> > cache) to work with. But you don't have the same effect with a multi-
>> > computer.
>>
>> I was wondering why you don't have the same effect with a
>> multi-computer? (And I thought that is how things could work much
>> faster on lots of cheap computers with loads of RAM.)
>>
>> Say, hypothetically Google splits its knowledge into a tree or a
>> hierarchy, each node represented by a cheap computer with loads of
>> RAM. Searching the net would mean redirecting the request through a
>> few branches in the tree and the response would be provided at the
>> speed of search in RAM for frequently sought keywords. On the other
>> hand, a Big Iron would face problems of a different kind; whereas its
>> processors might run faster, it might not get the same response time,
>> if it cannot fit the same amount of data into RAM, as the chicken
>> factory, and end up trundling across the ...disk?.
>
> Well, yes (ignoring the time spent sending the messages inter-
> computer), because you've increased the amount of RAM -- so in a sense
> it's not true "superlinear" speedup. Depending on the constraints you
> put on what you mean by "superlinear."
Actually, I wasn't after a superlinear speedup :-) I was after a
speedup that would be easier achievable on an MC than MP.
> The MP superlinear speedup is arguably more subtle because it occurs
> when you change nothing except the number of threads/processes used by
> the code.
>
> So you have this MP, with N processors and X GB of RAM. You run the
> program with number of threads set to, say, 1. It runs, using only one
> processor. Now you run it with number of threads = 2, so it uses 2
> processors. It might speed up by >2X -- *if* the bottleneck was in
> moving data in and out of some very cache-able memory, because along
> with that processor you acquired the processor's cache.
Yep, that's what I think might be the case with Google's search or
similar tasks. The RAM of individual nodes of the MC becomes what an
MP has as CPU cache, the only difference being the size.
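A toy model of that cache effect, with made-up hit rates and access costs:

# Superlinear speedup from added cache: the second processor brings its own
# cache, raising the hit rate. All numbers invented for illustration.
T_CACHE, T_MEM = 1, 100   # access cost, arbitrary units

def avg_access_time(hit_rate):
    return hit_rate * T_CACHE + (1 - hit_rate) * T_MEM

one_thread = avg_access_time(0.90)        # one processor, one cache
two_threads = avg_access_time(0.95) / 2   # twice the cache -> higher hit rate
print(f"speedup: {one_thread / two_threads:.2f}x")   # ~3.66x, i.e. > 2x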
I am not saying I am right. This is just what seems to be the case. In
an MP one von Neumann sequential processor is a CPU, and its local
data store is the cache (the rest of memory being a shared resource =
remote memory). In an MC one von Neumann sequential processor is a
single node, and its local data store is the RAM (the rest of storage
being a shared filesystem? = remote memory). They are sharing stuff on
different scale, so the gains are of different scale, too.
What's wrong with that? (are we in the region of 10+ messages on the
topic now? and we started from agreeing!)
Perhaps, the illusion of a gain is not in contention as I thought, but
in the end it is the IO bandwidth as you say. In the end in both cases
we have similar numbers of CPUs with proportionate sizes of cache,
RAM, disks and optic fibre. So for MC to make better use of Joules
than an MP the gain from IO confined to a single node in MC should
exceed the loss of IO due to more network communication.
> (Add here all kinds of caveats about multi-level cache structure,
> using real processors and not hardware threads on a single processor,
> etc.)
>
>> Also, the specifics of Google search are that the data doesn't have to
>> be updated synchronously in all the replicas of the same node...
>> Because the nature of the search is more of a guess (so if a node
>> starts guessing a more precise value a few hours later it may not be a
>> problem?). This means relaxed requirements towards replication speed
>> and accuracy.
>
> Yes, and if you relaxed those update requirements the same way in the
> code run on an MP (yes, that would require more MP memory), you would
> get less memory contention (but may have a memory bandwidth problem).
> That's the kind of thing I meant above when referring to "changing the
> the core logic of the application."
>
> But without the motivation of scaling much more than any single MP,
> you may not be willing to put the effort into changing the
> application.
Right, that's another point I missed. If the application can scale
much more than a single MP, then perhaps you can achieve the same
results at a similar speed but on cheaper hardware?
Sassa
> --
> Greg Pfister
Yes, I got it :-) But to make an error, you need to find one, in a way...
> But they also concluded that the only disk access was to local disks,
> which isn't current practice any more. (SANs hadn't been invented.)
>
> ...
>
>> Actually, I wasn't after a superlinear speedup :-) I was after a
>> speedup that would be easier achievable on an MC than MP.
>
> After all this time... is that all? Heck, I'll agree that happens
> every day. There are lots of clusters / farms / MCs / whatever out
> there actively using far more memory and IO bandwidth than any one MP
> has ever achieved.
:-) That's good to know, but it doesn't help me understand why/when an
MC of "mid-range" computers is better than an MC of MP computers...
I can see the resource sharing happens on a different scale, but
aren't MPs designed to share the resources per CPU more efficiently in
the general case? (Then what are the special cases when an MC of
"mid-range" computers shares the resources between the CPUs better than
the MPs do?) Or is it just a case of unfairly overpricing the MPs?
Or is it a case of simpler app development (kind of: to build an MC
of MPs, you need to implement two kinds of resource sharing, inter-
and intra-computer)?
> ...
>
>> I am not saying I am right. This is just what seems to be the case. In
>> an MP one von Neumann sequential processor is a CPU, and its local
>> data store is the cache (the rest of memory being a shared resource =
>> remote memory). In an MC one von Neumann sequential processor is a
>> single node, and its local data store is the RAM (the rest of storage
>> being a shared filesystem? = remote memory). They are sharing stuff on
>> different scale, so the gains are of different scale, too.
>>
>> What's wrong with that? (are we in the region of 10+ messages on the
>> topic now? and we started from agreeing!)
>>
>> Perhaps, the illusion of a gain is not in contention as I thought, but
>> in the end it is the IO bandwidth as you say. In the end in both cases
>> we have similar numbers of CPUs with proportionate sizes of cache,
>> RAM, disks and optic fibre. So for MC to make better use of Joules
>> than an MP the gain from IO confined to a single node in MC should
>> exceed the loss of IO due to more network communication.
>
> Well, Joules are another layer of complexity, but leaving that aside
> I'll agree. I'll even agree (ahem) that the word "contention" can be
> used here -- as long as we agree that the contention is for basic
> hardware resources -- data carriers like busses, switches, etc. -- and
> not logical contention on locks that depends on the algorithms.
>
> So I'm afraid we've clarified ourselves into agreement.
>
> Dang, didn't get to use Caps Lock. :-)
OK
Sassa
> --
> Greg Pfister
I've seen your book, now I'll have to buy it :-)
Thanks, Greg!
Sassa
2008/7/7 Greg Pfister <greg.p...@gmail.com>:
...
> Advantages / disadvantages of MP vs. MC? When to use one vs. the
> other?
>
> Um, I wrote the book on this. Literally. Go see
> http://www.amazon.com/Search-Clusters-2nd-Gregory-Pfister/dp/0138997098/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1215468747&sr=1-1
> .
>
> I'm not going to post the whole book here -- and in any event the book
> explains the basic principles; it's not a checklist you can directly
> use to pick one or the other -- but here are some things that are
> relevant. It can be a very trivial or a very multi-dimensional
> decision.
...
> --
> Greg Pfister