At its core, Intel is a company driven by a single mantra: "Moore's
Law". According to Wikipedia, Moore's Law describes an important
trend in the history of computer hardware: the number of transistors
that can be inexpensively placed on an integrated circuit increases
exponentially, doubling approximately every two years. The observation
was first made by Intel co-founder Gordon E. Moore in a 1965 paper.
Over the last couple of years we have been working very closely with
Intel, specifically in the area of virtualization. During this time
we have learned a lot about how they think and what drives them as an
organization. In one of my early pitches we described our approach to
virtualization as "Doing more with Moore", a play on the common phrase
"doing more with less" combined with some of the ideas behind Moore's
Law, which is all about growth and greater efficiency. They loved the
idea: for the first time someone was looking at virtualization not
purely as a way to consolidate a data center but as a way to more
effectively scale your overall capacity.
What is interesting about Moore's Law in regard to cloud computing is
that it is no longer just about how many transistors you can get on a
single CPU, but about how effectively you spread your compute capacity
across more than one CPU, be it multi-core chips or hundreds, even
thousands, of connected servers. Historically, the faster the CPU
gets, the more demanding the applications built for it become. I am
curious whether we're on the verge of seeing a similar "Moore's Law"
applied to the cloud. And if so, will it follow the same principles?
Will we start to see a "Ruv's Law" where every 18 months the amount of
cloud capacity doubles, or will we reach a point where there is never
enough excess capacity to meet the demand?
(Original Post:
http://elasticvapor.com/2008/06/more-with-moore-more-or-less.html)
--
--
Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.
blog > www.elasticvapor.com
I see the cloud as more of an ecosystem, which is likely to defy any
simple, linear rule due to emergent properties. I wouldn't be shocked
if it were as hard to predict as the weather ("5-day forecast calls for
a 200 msec standard deviation in latency with a 10% probability of the
jitters"). I'm only slightly joking - my early experiences with sharing
hosted grid computing resources were pretty blustery.
On that note, I am much more interested in the phase transitions, the
critical junctures where the properties change.
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Jeff Dean showed some interesting numbers about cluster sizes and job
throughput in this talk:
http://sites.google.com/site/io/underneath-the-covers-at-google-current-systems-and-future-directions
Maybe add a little historical price data for commodity hardware...
Paco
On Sun, Jun 22, 2008 at 9:32 PM, Ray Nugent <rnu...@yahoo.com> wrote:
> More likely you see "Ray's Law" (sorry, couldn't help myself ;-) where the
> amount of computing one can do on a given piece of the cloud increases
> exponentially while the cost for said piece drops by half...
>
> Ray
>
> ----- Original Message ----
> From: Reuven Cohen <r...@enomaly.com>
...
That is an absolutely brilliant link!
Attributes                       Aug '04    Mar '05    Mar '06    Sep '07
Number of jobs                       29K        72K       171K     2,217K
Average completion time (secs)       634        934        874        395
Machine years used                   217        981      2,002     11,081
Input data read (TB)               3,288     12,571     52,254    403,152
Intermediate data (TB)               758      2,756      6,743     34,774
Output data written (TB)             193        941      2,970     14,018
Average worker machines              157        232        268        394
2M jobs per month, 400 PB read, and an average of roughly 400 worker
machines per job. That, in a nutshell, is why Yahoo and MSN are being
run out of town. That is using computing to add value to your business.
Google has to be the poster child of "Competing on Analytics".
Of course, it takes serious money to get to that scale, and in some
ways you can think of IT scale as the new business reality of
distribution power. Whereas big companies used to own the market
because they were able to distribute physical widgets, now big
companies own the market because they can aggregate information.
Does anyone have practical experience with the actual computations that
take place and how much they would cost using EC2 instances? I would be
interested to hear what the top 5 algorithms run at Google are and what
the attributes, as listed above, are for those 5 algorithms. If we knew
that, we could see how the economics map onto other cloud models (IBM,
Sun, AWS).
Theo
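As a rough, hedged stab at Theo's question: take the Sep '07 column
above, assume the figures are monthly (as the "2M jobs per month"
reading implies), and assume 2008-era EC2 on-demand pricing of roughly
$0.10 per small-instance hour. Neither assumption comes from the
thread, and a Google worker machine is obviously not an m1.small, but
the order of magnitude is interesting:

# Back-of-envelope: what would the Sep '07 MapReduce workload cost on EC2?
# Assumptions (not from the thread): $0.10/hour per instance (2008 m1.small
# on-demand rate) and that one Google worker ~ one EC2 instance, which
# understates the real hardware considerably.

HOURS_PER_YEAR = 365.25 * 24

jobs_per_month = 2_217_000          # "Number of jobs", Sep '07
machine_years = 11_081              # "Machine years used", Sep '07
input_read_tb = 403_152             # "Input data read (TB)", Sep '07
price_per_instance_hour = 0.10      # assumed 2008 EC2 m1.small rate, USD

machine_hours = machine_years * HOURS_PER_YEAR
compute_cost = machine_hours * price_per_instance_hour

print(f"Machine hours: {machine_hours:,.0f}")
print(f"Compute-only cost at $0.10/hr: ${compute_cost:,.0f}")
print(f"Average input per job: {input_read_tb / jobs_per_month * 1024:,.0f} GB")

That works out to on the order of $10M per month for the compute
alone, before storage and data transfer, which is one way to see why
the economics question matters.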
I actually just read a Wired article that touched on similar ideas. (I
was hoping to get quoted in the article but it was not to be;
interesting nonetheless.)
http://www.wired.com/science/discoveries/magazine/16-07/pb_intro
reuven
--
--
Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.
www.enomaly.com :: 416 848 6036 x 1
skype: ruv.net // aol: ruv6
blog > www.elasticvapor.com
-
Get Linked in> http://linkedin.com/pub/0/b72/7b4
Khazret:

We should be able to do this from first principles: 0.25s, and let's
say they use 4000 servers. I think the size of the index server banks
is large since they benefit from having the whole index in memory.
Whole index in memory + redundancy + 1000s of queries per second
indicates a large cluster for this (I was trying to dig up info on the
size of the index but can't find it right now, so it is a bit of a
shot from the hip).

0.25s on 4000 servers
1 index server is 400 Watts, given that these have large memory
configs and fastish CPUs
assume the network takes about 1/4 of the power of the cluster it
serves
so approximately the query consumes on the order of 500 kJoules.

Let's see if we can put this in perspective: a hair dryer is about
1500 Watts, so the query takes as much energy as running the hair
dryer for 5.5 minutes. Please double-check my math.

Theo
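A minimal sketch of that back-of-envelope calculation, keeping Theo's
figures as the stated assumptions:

# Theo's per-query energy estimate, spelled out.
# Assumptions (from the post above): 4000 servers touched per query,
# 400 W per server, the network adds ~1/4 of the cluster's power,
# and the query occupies the machines for 0.25 s.

servers = 4000
watts_per_server = 400
network_overhead = 0.25        # network draws 1/4 of cluster power
query_time_s = 0.25
hair_dryer_watts = 1500

cluster_power_w = servers * watts_per_server * (1 + network_overhead)
energy_joules = cluster_power_w * query_time_s

print(f"Cluster power: {cluster_power_w/1e6:.1f} MW")
print(f"Energy per query: {energy_joules/1e3:.0f} kJ")
print(f"Equivalent hair-dryer time: {energy_joules/hair_dryer_watts/60:.1f} minutes")
# -> ~2.0 MW, ~500 kJ, ~5.6 minutes, matching the estimate above.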
reuven
--
--
--
: : Geoffrey Fox  g...@indiana.edu  FAX 8128567972  http://www.infomall.org
: : Phones: Cell 812-219-4643  Home 8123239196  Lab 8128567977
: : SkypeIn 812-669-0772 with voicemail, International cell 8123910207
2008/6/24 Geoffrey Fox <g...@grids.ucs.indiana.edu>:
There is a curious law about the chip size and the performance.
To solve a problem of complexity N in time T, the chip area A design
should satisfy a relationship A*T^2 >= c*f(N), where c is a constant,
and f(N) is a function of the problem complexity.
The curious thing here is that if you have a chip of size n^2*A that
solves the problem in time T/n, the efficiency of the single-unit
design is A*T^2. The same efficiency can be achieved by making n^2
units of area A solving the problem in time T (i.e. n times slower
than the large unit). However, the system of n^2 units is capable of
splitting the work into n^2 pieces, perhaps ending up finishing the
whole job in T/n^2, i.e. n times faster than the large unit... Get it?
:-)
I wonder if the same goes for grids/clouds.
Sassa
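A quick numeric check of Sassa's A*T^2 argument; the values below are
arbitrary and only illustrate the scaling:

# Illustrating the A*T^2 tradeoff Sassa describes.
# Pick arbitrary units: a "small" chip of area A solves the problem in time T.

A, T, n = 1.0, 1.0, 4

# One big chip: n^2 times the area, n times faster on a single problem.
big_area, big_time = n**2 * A, T / n
print("big chip  A*T^2 =", big_area * big_time**2)   # n^2*A * (T/n)^2 = A*T^2

# n^2 small chips: same total silicon, each n times slower than the big chip.
print("small unit A*T^2 =", A * T**2)                # also A*T^2 per unit

# But if the job splits into n^2 independent pieces, the farm finishes in
# T/n^2 -- n times sooner than the big chip, for the same silicon budget.
print("big chip finishes in   ", big_time)           # T/n
print("small farm finishes in ", T / n**2)           # T/n^2

The catch, as the rest of the thread points out, is the "if the job
splits into independent pieces" clause.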
I think those two posts are on the right track, in terms of
optimization problems. The constraints in that system of equations
become interesting -- and in practice go beyond what O(n)
complexity generally considers.
Case in point:
1. Suppose my business has a mission-critical system, which requires 4
servers. That system completes some specific processing in nearly 24
hrs, and it must run each day. Some people would look at that and
pronounce it "Close to full utilization - good." An LP model might
agree, given a superficial set of constraints and cost function.
2. Suppose further that it would be possible to move this same
processing off to some other grid, and run it on 48 servers + 2
controls (same size servers) within approximately 2 hours. Roughly
speaking, linear scaling. Slightly more expensive, due to the added
controllers.
In practice, however, one must consider other factors beyond the
complexity of the problem "N", the number of nodes "n", and the time
"T" required:
* overall capex vs. opex for the server resources -- particularly
if a large number of servers can be leased at an hourly rate, thereby
reducing capex.
* cost of moving the data -- both the time required and the network cost.
* operational risk in Approach 1 if one of those four servers goes
belly up -- versus the fault tolerance benefits of a cloud-based
framework which can accommodate node failures.
* operational risk in Approach 1 if some aspect of the data or code
involved is incorrect, and therefore "full utilization" implies the
job cannot be rerun within the 24 hour requirement.
* opportunity cost if the problem scales up suddenly, such that
Approach 1 requires either more than 4 servers or more than 24 hours
to run.
Considering those trade-offs in the constraints and cost function
allows an optimization problem to solve for minimized risk.
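One hedged way to see it: put invented numbers on the two approaches
and watch where the difference actually shows up. The hourly rates and
rerun probability below are illustrative assumptions, not figures from
the thread:

# Comparing Paco's two approaches with illustrative (invented) numbers.
# Approach 1: 4 owned servers, ~24 h per run.
# Approach 2: 48 workers + 2 controllers leased hourly, ~2 h per run.

LEASE_RATE = 0.40        # assumed $/server-hour for leased capacity
OWNED_RATE = 0.25        # assumed effective $/server-hour for owned gear
RERUN_PROB = 0.05        # chance a run must be redone (bad data, node loss)

def expected_cost(servers, hours, rate, rerun_prob, deadline_h=24):
    base = servers * hours * rate
    # A rerun roughly doubles the cost; it also blows the 24 h deadline
    # if two back-to-back runs can't fit inside it.
    expected = base * (1 + rerun_prob)
    misses_deadline = 2 * hours > deadline_h
    return expected, misses_deadline

for name, servers, hours, rate in [("Approach 1", 4, 24, OWNED_RATE),
                                   ("Approach 2", 50, 2, LEASE_RATE)]:
    cost, late = expected_cost(servers, hours, rate, RERUN_PROB)
    print(f"{name}: expected cost ${cost:.2f}, rerun misses deadline: {late}")

Even with toy numbers, the cloud approach wins on the deadline column
rather than the cost column, which is the risk argument in miniature.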
Managing risk is what will convince the decision makers at enterprise
businesses to adopt cloud computing for mission critical systems. But
not O(N) complexity :)
Hope that helps -
Paco
--
Manager, Behavioral Targeting
Adknowledge, Inc.
pna...@adknowledge.com
pac...@cs.stanford.edu
On Fri, Jun 27, 2008 at 8:50 AM, Khazret Sapenov <sap...@gmail.com> wrote:
Sassa
2008/6/28 Greg Pfister <greg.p...@gmail.com>:
That's an idea for Khaz ;-)
> All such statements are totally, undeniably, true. (Maybe not saving
> the planet all by ourselves.)
>
> They also ignore natural functional processing boundaries, and how
> difficult it is to parallelize code to units smaller than that, and
> how simple it can be to do when the processing matches those boundaries.
That's true, and that's why the AMD guy was politely sarcastic about
planting more cores and CPUs on one board, as mentioned in the "The
Register" article quoted here. The OS simply can't find enough jobs to
be done in parallel...
> Example: When commodity micros got fast enough that each could handle
> an entire web request all by itself at reasonable response time, what
> happened? The web, that's what. Server farms. Stick a simple
> dispatcher in front of a pile of commodity servers with a shared file
> system, and away you go.
Why a simple dispatcher in front of a pile of commodity servers
instead of a single mega-server?
I think the wisdom behind the A*T^2 law is that it may be easier to
build smaller, slower units (or buy slower computers) than faster
units, even if the faster ones are larger. And that's where it ends. I
don't think it says you can do everything with that, or that the time T
of the individual small units will be acceptable, because in the end,
to get it working n times faster than a single unit, you need to
parallelise processing among n^2 units.
> (Well, OK, each server overlaps disk IO to run other threads in the IO
> wait time, but the code the majority of programmers wrote was straight
> sequential.)
>
> Think how hard that would be to do if each individual web request had
> to somehow be split up into much smaller grains that ran in parallel.
I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?
With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM. So the
question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.
> The scaling of most commercial clouds absolutely lives on
> relationships like this. It's why most HPC code is so much harder to
> parallelize: Most of it does not have lots of independent requests.
That's true, no doubt.
> There is a long-running repeating pattern in parallel processing where
> people propose a gazillion ants to run in parallel to get absolutely
> huge performance. On the other hand, Seymour Cray once said "I'd
> rather plow a field with four strong horses than with 512 chickens."
> It all depends on the application, of course; there are certainly
> cases that work well with 512 chickens, or 10E10 bacteria, or
> whatever. But most cases work better with the horses.
But we don't eat with ladles, we still eat with spoons :-) even if it
requires more trips.
Also, if you were to carry the shopping of 100 people for 1 mile, you
could do 100 trips, or you could use a train of shopping trolleys and
just walk home.
As you said, it all depends on the job. And it looks like Google found
that using chickens is enough for the field they are about to plough.
Sassa
2008/6/28 Greg Pfister <greg.p...@gmail.com>:
...
>> Why a simple dispatcher in front of a pile of commodity servers
>> instead of a single mega-server?
>
> Because the pile costs a whole lot less. Compared with, say, a
> mainframe, something like 100X less per unit of performance. Maybe
> it's not less counting continuing management costs, but lots of people
> have a hard time seeing past a 100X initial purchase price delta to
> TCO over many years.
sure, that's what I thought too :-)
...
>> With 100 servers you potentially get 400GB of RAM alone, whereas with
>> 10 servers on the same architecture you would get 40GB RAM.
>
> I'd quibble with that. If it's seriously X times faster, it better
> have something near X times more RAM or the added speed isn't useful.
Well, in general I also think so, but in the special case of serving
multitudes of web requests it is not obviously important, because
serving 100 requests sequentially doesn't require 100 times more
memory than for one request.
>> So the
>> question is not only about CPU speed.
>
> I won't argue with that. Lots of factors. IO rate, too.
Yes
>> With parallelism you get an
>> option to use local data store where there is no contention. But you
>> also get a headache of updating the state across all of them, meaning
>> slower writes.
>
> By local data store, do you mean RAM shared among processors? Better
> watch for bottlenecks there, too, in bandwidth, and contention
> definitely is an issue. If you have a hot lock, it's a hot lock
> whether it's in RAM or in an outboard DB. (With the outboard DB,
> you'll feel it sooner, but it's still there.)
Well, I meant a theoretical local data store. On a multi-processor
system it is the processor cache. On a multi-computer system it is the
RAM of an individual computer. There is no contention in the local data
store, because it is local. The contention appears when you want to
copy data from/to the local data store to/from the remote store
(shared file/memory/DB).
>> > The scaling of most commercial clouds absolutely lives on
>> > relationships like this. It's why most HPC code is so much harder to
>> > parallelize: Most of it does not have lots of independent requests.
>>
>> That's true, no doubt.
>>
>> > There is a long-running repeating pattern in parallel processing where
>> > people propose a gazillion ants to run in parallel to get absolutely
>> > huge performance. On the other hand, Seymour Cray once said "I'd
>> > rather plow a field with four strong horses than with 512 chickens."
>> > It all depends on the application, of course; there are certainly
>> > cases that work well with 512 chickens, or 10E10 bacteria, or
>> > whatever. But most cases work better with the horses.
>>
>> But we don't eat with ladles, we still eat with spoons :-) even if it
>> requires more trips.
>
> Soup spoons!
I was building the analogy of a client computer running a browser - so
to speak, a small appetite requires a little spoon travelling many
times. That's why web requests are better served by 512 chickens than
by 4 horses.
>> Also, if you were to carry shopping of 100 people for 1 mile, you
>> could do 100 trips, or you could use a train of shopping trolleys and
>> just walk home.
>>
>> As you said, it all depends on the job. And it looks like Google found
>> that using chickens is enough for the field they are about to plough.
>
> Yes, but I'd add that those chickens have been fed steroids. Google
> couldn't have done it in the mid-80s. Microprocessors weren't fast
> enough. Now they're the Chickens That Ate Chicago.
I thought mid-80s were ants :-) and now we have chickens.
Sassa
> --
> Greg Pfister
On Jun 28, 2008, at 3:41 AM, Sassa NF wrote:
I don't think that's the intention. On the other hand if you had 100
servers processing 1000 requests a second, would you rather use that
or 10 servers that work 10 times faster?
With 100 servers you potentially get 400GB of RAM alone, whereas with
10 servers on the same architecture you would get 40GB RAM.
You've illustrated the memory gap. Memory doubles in density every 30 months, compared to 18 months for processors.
Fred Weber, former CTO of AMD (he spearheaded development of the Opteron), and Suresh Rajan, former Director of Marketing at nVidia, hatched MetaRAM to increase memory footprints using existing memory buses and commodity DRAM parts.
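Taking those doubling periods at face value, a small sketch of how
quickly the gap compounds (the 18- and 30-month figures are Timothy's;
everything else is just arithmetic):

# How the "memory gap" compounds, assuming the doubling periods quoted
# above: processor density doubles every 18 months, DRAM density every
# 30 months. Densities are normalized to 1 at year zero.

def growth(doubling_months, years):
    return 2 ** (years * 12 / doubling_months)

for years in (3, 5, 10):
    cpu = growth(18, years)
    ram = growth(30, years)
    print(f"{years:2d} years: CPU x{cpu:6.1f}, RAM x{ram:5.1f}, "
          f"gap x{cpu / ram:.1f}")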
So the question is not only about CPU speed. With parallelism you get an
option to use local data store where there is no contention. But you
also get a headache of updating the state across all of them, meaning
slower writes.
For those who might not follow LKML, read Linux kernel hacker Daniel Phillips' paper "Ramback: Faster than a Speeding Bullet" <http://lwn.net/Articles/272534/>. The thread became a 400+ post flame war which ended when Alan Cox conceded Daniel was right.
Here's the intro:
Every little factor of 25 performance increase really helps.
Ramback is a new virtual device with the ability to back a ramdisk
by a real disk, obtaining the performance level of a ramdisk but with
the data durability of a hard disk. To work this magic, ramback needs
a little help from a UPS. In a typical test, ramback reduced a 25
second file operation[1] to under one second including sync. Even
greater gains are possible for seek-intensive applications.
The difference between ramback and an ordinary ramdisk is: when the
machine powers down the data does not vanish because it is continuously
saved to backing store. When line power returns, the backing store
repopulates the ramdisk while allowing application io to proceed
concurrently. Once fully populated, a little green light winks on and
file operations once again run at ramdisk speed.
Daniel's code is rev 0.0, and is buggy like any new code, but the backbone is solid. Motivated parties should foster development of this open-source gem. Accelerating durable writes could dramatically accelerate cloud-based transactional systems.
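The mechanism is easy to sketch. Here is a toy write-behind store in
the spirit of ramback, purely illustrative and nothing like Daniel's
actual in-kernel block-layer driver:

# Toy illustration of the ramback idea: serve reads and writes from RAM,
# and flush dirty blocks to disk in the background so the data survives a
# (clean, UPS-protected) shutdown. This is a sketch, not the real driver.
import threading, time

class WriteBehindStore:
    def __init__(self, backing_path, flush_interval=1.0):
        self.ram = {}                 # block_id -> bytes, the "ramdisk"
        self.dirty = set()            # blocks not yet on disk
        self.path = backing_path
        self.lock = threading.Lock()
        t = threading.Thread(target=self._flusher, args=(flush_interval,),
                             daemon=True)
        t.start()

    def write(self, block_id, data):
        with self.lock:               # fast path: touches RAM only
            self.ram[block_id] = data
            self.dirty.add(block_id)

    def read(self, block_id):
        with self.lock:
            return self.ram.get(block_id)

    def _flusher(self, interval):
        while True:                   # slow path: persist dirty blocks
            time.sleep(interval)
            with self.lock:
                to_flush = {b: self.ram[b] for b in self.dirty}
                self.dirty.clear()
            for block_id, data in to_flush.items():
                with open(f"{self.path}.{block_id}", "wb") as f:
                    f.write(data)

The real ramback also repopulates RAM from the backing store on
power-up and leans on a UPS to guarantee the final flush, which is
roughly where the LKML flame war centered.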
Last bit: x64 PCIe chipsets are just now hitting the market. Big, honking multi-processor servers loaded with RAM can now have multiple x4 IB or 10GbE NICs. IB is pretty cool, especially when you consider the RDMA capabilities.
T
Timothy Huber
Strategic Account Development
tim....@metaram.com
cell 310 795.6599
MetaRAM Inc. <http://www.metaram.com/>
What do you think of the following server config comparisons?
1) 4P-128GB vs. 2P-128GB
2) 4P-128GB vs. 4P-256GB
In your opinion, what's optimal for a server config: number of procs,
memory size, chipset (PCIe lanes), etc.?
Timothy
tim....@metaram.com
cell 310 795.6599
MetaRAM Inc.
A rule of thumb I've noticed is 2x the sockets for roughly 4x the
price, so your power needs to be quite expensive to justify using large
servers. Also, it's questionable whether larger servers provide better
perf/W; SPECpower results so far have shown that 2S servers use
slightly more than double the power of 1S servers, and I wouldn't be
surprised if the trend continues at 4S and larger. This is not to
mention SMP scalability, which is never perfect. OTOH, larger servers
may benefit more from statistical multiplexing.
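A rough sketch of what that rule of thumb implies for the consolidation
break-even; the prices and wattages below are invented for
illustration, not SPECpower numbers:

# When does consolidating two 1S boxes into one 2S box pay off on power?
# Illustrative assumptions (not measured): a 1S server costs $2,000 and
# draws 250 W; per the rule of thumb, a 2S server costs ~4x ($8,000) and,
# per the SPECpower observation, draws slightly more than 2x (550 W).

HOURS_PER_YEAR = 8766
price_1s, watts_1s = 2000, 250
price_2s, watts_2s = 8000, 550

capex_penalty = price_2s - 2 * price_1s     # extra purchase cost
watts_saved = 2 * watts_1s - watts_2s       # negative here: 2S uses MORE power

for usd_per_kwh in (0.05, 0.10, 0.50):
    yearly_savings = watts_saved / 1000 * HOURS_PER_YEAR * usd_per_kwh
    print(f"${usd_per_kwh:.2f}/kWh: power 'savings' per year = "
          f"${yearly_savings:8.2f} vs ${capex_penalty} extra capex")

With these figures the bigger box never pays for itself on power, which
is the point above; the math only flips if the 2S machine draws well
under 2x the power, or if space and cooling are the binding
constraints.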
I agree that power accounting is essential but I think the solution is
better accounting, not workarounds. You just can't afford to reserve the
same power capacity for things that are different. We need rack power
capping to drive those branch circuits to 80% utilization.
Wes Felter - wes...@felter.org
The configs for end point devices will vary based on workload, but
generally it's nice for the endpoints to remain consistent. The closer
RAM is to the compute, the faster the software will perform
(particularly when RDMA is enabled), so large amounts of RAM have
benefits... this said... I wouldn't necessarily want 256GB at each end
point just to ensure my workloads don't overflow (primarily because
256GB would be powered up all the time and is fairly inefficient based
on today's motherboard designs).
* Motherboards with larger RAM support are more expensive. Larger
machines today also typically come with larger power supplies, which
means more standby energy waste and more power reserved than is
typically used.
No real opinion on the size of RAM, but compute devices should only
contain what will be used, and the power supply should be sized as
such. Not sure about you guys... but reserving 1300 Watts * 2 for each
4P server is a little overkill when less than 60% of the power is
actually used.
Fabric Memory (which is scalable like a storage array using 3.5" disks)
is more ideal because it can more readily be repurposed and shared. By
repurposed, I mean it could look like a LUN, JMS, JDBC/ODBC, etc., or
act as an RDMA cache for NFS / SAN. If the purpose is to have lots of
RAM, deploying RAM in a server is about the most inefficient way to do
so (video, HD controller, lots of PCI-Express slots, larger power
supplies, etc., just to serve RAM).
* Attached is the high level concept for a Processor Array.
*no errors, only seeking clarifications
How extensively is RDMA deployed in your applications now? How
latency sensitive are your cloud apps?
> If the purpose is to have lots of RAM,
> deploying RAM in a server is about the most inefficient way to do so
> (video, hd controller, lots of PCI-Express slots, larger power
> supplies,
> etc... just to serve RAM).
What would you propose as a more efficient mechanism for deploying lots
of RAM than sticking it directly in a server?
regards,
Timothy
> <Processor Array - Basic Idea v1.3.ppt>
timothy norman huber
timoth...@mac.com
310 795-6599
I realize this is an open-ended question... and more of a storage fishing expedition... has anyone successfully fished in the iSER/STGT pond?