A quick question:
How much memory per processor (i.e. per core) should an optimal compute
node have?
***
The question is generic, so here are some specifics of my problem:
I am thinking of putting 8GB per node on a dual-processor quad-core
machine (1GB per core).
This number seems to be a trend out there, but I don't really know why,
as not long ago it used to be 512MB per core (back when each core was
an actual CPU).
Is 1GB per core right?
Too much?
Too little?
We do climate model number crunching with MPI here.
The programs (a.k.a. "models", in the parlance of climate and oceans folks)
use domain decomposition, finite-differences, some FFTs, etc.
The "models" are "memory intensive",
with big 3D arrays being read from and written to memory all the time,
and not so big 2D arrays (sub domain boundary values) being passed
across processes through MPI
at every time step of the simulation.
I/O happens at a slower pace, typically every ~100 time steps or more,
and can be either funneled through a master process, or distributed
across all processes.
There is nothing new in this scenario, which I think is also typical of
many applications in
Engineering, Physics, Astronomy, Chemistry, etc.
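For concreteness, the per-time-step boundary exchange looks roughly like the
toy sketch below: plain C (our real code is Fortran), a 1-D slab
decomposition, and made-up array sizes, so treat it as an illustration of
the pattern rather than our actual model.

/* Toy halo exchange: a 3D field decomposed in slabs along k, with each
 * rank swapping one 2D face (ghost plane) per neighbour every time step
 * via MPI_Sendrecv.  Sizes and the 1-D decomposition are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define NX 64                  /* local grid size per rank */
#define NY 64
#define NZ 32                  /* plus 2 ghost planes in k */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* field[k][j][i], with k = 0 and k = NZ+1 used as ghost planes */
    double *field = calloc((size_t)(NZ + 2) * NY * NX, sizeof(double));
    int below = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int above = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
    int plane = NY * NX;                 /* one 2D boundary face */

    for (int step = 0; step < 100; step++) {
        /* send top interior plane up, receive bottom ghost plane */
        MPI_Sendrecv(field + NZ * plane, plane, MPI_DOUBLE, above, 0,
                     field,              plane, MPI_DOUBLE, below, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send bottom interior plane down, receive top ghost plane */
        MPI_Sendrecv(field + plane,            plane, MPI_DOUBLE, below, 1,
                     field + (NZ + 1) * plane, plane, MPI_DOUBLE, above, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... finite-difference update of the interior would go here ... */
    }

    free(field);
    MPI_Finalize();
    return 0;
}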
Our old dual-processor single-core production cluster has 1GB per node
(512MB per "core").
Most of our models fit this configuration.
The larger problems use up to 70-80% of the RAM, but safely avoid memory
paging, process switching, etc.
We're planning to run larger, higher resolution models.
Hence, after some scaling calculations and budget considerations,
I thought of increasing the RAM-per-core ratio to 1GB,
to be able to run a bit larger models on less (and now faster) processors.
And to my surprise I arrived at the same 1GB per core that most people
seem to be using.
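To show the kind of back-of-the-envelope scaling I mean (the numbers here
are purely illustrative, not from an actual model): doubling the horizontal
resolution roughly quadruples the grid points, and hence the total memory,
while the new cluster has roughly twice as many cores, so

\[
\frac{\text{RAM}}{\text{core}} \;\longrightarrow\;
\frac{4 \times \text{total RAM}}{2 \times \text{cores}}
  \;=\; 2 \times \frac{\text{RAM}}{\text{core}}
  \;\approx\; 2 \times 512\ \text{MB} \;=\; 1\ \text{GB}.
\]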
However, on multicore machines there are other issues to consider:
memory bandwidth, cache size vs. RAM size, NUMA, cache eviction, etc, etc.
(I saw some heated discussions of memory bandwidth and the like on other
mailing lists, so, please stay cool.)
In any case, at this point it seems to me that
"get as much RAM as your money can buy and your motherboard can fit" may
not be a wise choice.
(Is there anybody using 32GB per node?)
I wonder if there is an optimal choice of RAM-per-core.
What is your rule of thumb?
Does it depend?
On what?
Many thanks,
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
we still buy at 1G/core unless we have defined a clear need for something
different.
-- michael
Then again, if the new cores are each 4 times faster (for example),
and your algorithm scales as N^3, where N is the total number of grid
points (it does for us, but in your case it probably _doesn't_, so you
should find out what _your users'_ scaling really is), you can't afford
to double the linear size L: that would push the number of operations up
by a factor of (2^3)^3 = 2^9 = 512, a net overall slowdown of 128 even
on the faster cores.
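Spelled out (assuming a 3D grid, so the point count scales as the cube of
the linear size L; that assumption is mine, so check it against your own
models):

\[
N \propto L^{3}, \quad \text{ops} \propto N^{3}
\;\Longrightarrow\;
\frac{\text{ops}(2L)}{\text{ops}(L)} = \left(2^{3}\right)^{3} = 2^{9} = 512,
\qquad
\text{net slowdown} \approx \frac{512}{4} = 128 .
\]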
Noam
P.S. On the memory bandwidth issue, try to benchmark your proposed nodes.
If you're really sweeping through large arrays in memory (little cache
reuse), limited memory bandwidth could be an issue, and you may find that
dual cores are no slower than quad cores.
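A quick way to check is a STREAM-style triad loop run with 1, 2, 4, ...
threads. Here is a rough sketch, not the official STREAM benchmark; the
array size and the gcc/OpenMP build line are just illustrative assumptions.

/* Rough memory-bandwidth probe in the spirit of the STREAM triad.
 * Build:  gcc -O2 -fopenmp triad.c -o triad   (assuming gcc with OpenMP)
 * Run:    OMP_NUM_THREADS=1 ./triad, then 2, 4, ... */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N      (20 * 1000 * 1000)   /* ~160 MB per array: far bigger than cache */
#define NTIMES 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* first-touch initialization in parallel so pages land on the
     * memory bus (NUMA node) of the thread that will stream them */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    for (int k = 0; k < NTIMES; k++) {
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* triad: 2 loads + 1 store */
    }
    double t1 = omp_get_wtime();

    /* 3 arrays x 8 bytes per element per pass (ignoring write-allocate) */
    double gbytes = 3.0 * sizeof(double) * (double)N * NTIMES / 1.0e9;
    printf("threads=%d  approx bandwidth = %.1f GB/s\n",
           omp_get_max_threads(), gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}

If the reported GB/s stops growing as you add threads, the node is
bandwidth-bound for streaming access, and extra cores won't buy much for
this kind of code.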
If you asked this question on the Beowulf mailing list, RGB would probably
write you a 10 page response :)
That being said, I've got WRF users who somehow manage to use 8GB/core and
ocean modelers who run code that fits in 100MB/core. It's all over the
map.
If your users say they plan on running larger models, then I would
probably go with at least 8 and maybe 16GB. They may find they run into
memory (or even interconnect) bandwidth problems when running on the 8
core machines, but that is another topic altogether.
Never skimp on RAM :)
Tim
How much RAM depends a great deal on your workload. We've got encoding
applications (video stream data) that use 1 GB per core, while our 2D/3D
rendering systems use at least 2 GB per core (more now, since the advent of
64-bit). There's unfortunately no set answer, so profile your typical use
cases and then budget 1.5x the maximum usage to give yourself some breathing
room for future growth. I'm assuming you can't swap your gear out for new
equipment in less than two years. :-)
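One low-tech way to do that profiling on Linux is to have each process
print its own peak memory at the end of a run. A minimal sketch: it reads
VmHWM (peak resident set size) from /proc/self/status, so it is
Linux-specific, and the helper name is made up for this example.

/* Report this process's peak resident set size (Linux only). */
#include <stdio.h>
#include <string.h>

/* Returns peak RSS in kB, or -1 if it cannot be read. */
long peak_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmHWM:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

int main(void)
{
    /* ... the real workload would run here ... */
    printf("peak RSS = %.1f MB\n", peak_rss_kb() / 1024.0);
    return 0;
}

For an MPI job, call it on every rank and size the nodes from the largest
value you see, times that 1.5x headroom.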
cheers,
Klaus
As others have noted, this is a complex question with no real simple
answers ... the obvious one (a non-answer) is that it is a function of
your application(s). It's a great deal more complicated than that ...
>
> ***
[...]
> We do climate model number crunching with MPI here.
> The programs (a.k.a. "models", in the parlance of climate and oceans folks)
> use domain decomposition, finite-differences, some FFTs, etc.
> The "models" are "memory intensive",
I bet it is memory-bandwidth intensive as well, which suggests a few
design guidelines. Specifically, while trying to maximize the amount of
RAM per core, also try to maximize the number of memory buses per core.
Memory-bandwidth-bound code won't scale well if you simply throw more
CPUs onto the same memory bus.
I have heard from some of the WRF users about scalability measurements
that suggest maximizing the number of memory buses is one of the most
important systemic optimizations.
Since you can put two complete compute nodes (with IB if you wish) per
1U system, this suggests a nice way to increase the ratio of memory
buses to cores. This can be done with blade systems as well (the right
blade systems anyway).
> with big 3D arrays being read from and written to memory all the time,
> and not so big 2D arrays (sub domain boundary values) being passed
> across processes through MPI
> at every time step of the simulation. I/O happens at a slower pace,
> typically every ~100 time steps or more,
> and can be either funneled through a master process, or distributed
> across all processes.
> There is nothing new in this scenario, which I think is also typical of
> many applications in
> Engineering, Physics, Astronomy, Chemistry, etc.
Yeah... you need (guessing on importance ordering)
1) maximize the number of memory buses
2) maximize the speed of each memory bus
3) maximize the speed of the cores attached to the memory
4) maximize the speed of the network fabric between them
This is of course a constrained optimization problem, as you likely do
not have infinite money to maximize all these elements.
>
> Our old dual-processor single-core production cluster has 1GB per node
> (512MB per "core").
> Most of our models fit this configuration.
> The larger problems use up to 70-80% RAM, but safely avoid memory
> paging, process switching, etc.
>
> We're planning to run larger, higher resolution models.
> Hence, after some scaling calculations and budget considerations,
> I thought of increasing the RAM-per-core ratio to 1GB,
> to be able to run a bit larger models on less (and now faster) processors.
> And to my surprise I've got to the same 1GB per core that most people
> seem to be using.
We were advising 1 GB/core several years ago. More recently (last 2
years) it has been 2 GB/core, and as of late, many customers are opting
for 4+ GB/core.
> However, on multicore machines there are other issues to consider:
> memory bandwidth, cache size vs. RAM size, NUMA, cache eviction, etc, etc.
> (I saw some heated discussions of memory bandwidth and the like on other
> mailing lists, so, please stay cool.)
:)
>
> In any case, at this point it seems to me that
> "get as much RAM as your money can buy and your motherboard can fit" may
> not be a wise choice.
> (Is there anybody using 32GB per node?)
Yes, we have a few customers doing this, and higher.
>
> I wonder if there is an optimal choice of RAM-per-core.
> What is your rule of thumb?
> Does it depend?
> On what?
At the end of the day, it's a battle between what you need, what you
want, and your budget. The budget always wins, so you have to
compromise on what you want and make sure you fill your needs.
Today, memory is cheap. There aren't many problems that wouldn't
benefit from more RAM (well, Monte Carlo with no I/O and a small
working set is one, but apart from that ...).
Try to make sure your memory is distributed among as many memory buses
as you can, regardless of 1GB/core, 2GB/core, or more.
> Many thanks,
> Gus Correa
>
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: lan...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
Tim Carlson wrote:
> On Wed, 16 Jul 2008, Gus Correa wrote:
>
> If you asked this question on the Beowulf mailing list, RGB would
> probably write you a 10 page response :)
>
And I thought I was prolix ...
Anyway, by popular demand, I will ask the same question on the Beowulf list.
I think it will be of interest there too.
Let's see what comes out.
> That being said, I've got WRF users who somehow manage to use 8GB/core
> and Ocean modelers that run code which fits in 100MB/core. It's all
> over the map.
>
Ocean models that run in 100MB/core are possible, either as small models
(coarse grids, limited regions)
or as simplified models (e.g., "slab" oceans, with only one or two vertical
layers).
That is not necessarily a waste, as they tend to run fast and take modest
resources.
However, I wonder if it is efficient to give 8GB of data arrays (I
presume most of it is data, not code)
to a single core to manage.
And this is maybe the main point of my original question: how much RAM
per core is ideal?
Is it wise to have each processor manipulate array sections that are
much larger than the cache,
while competing for that cache with its sibling processors (which have
equally big arrays of their own to handle)?
Wouldn't it be better to split the computation across more processors,
with smaller array sections?
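To put rough numbers on this (purely illustrative, not measurements, and
the cache size is a guess rather than a spec): a double-precision 3D field
of 512^3 points is about 1GB, so an 8GB core could be juggling several such
fields, while a per-core cache is only a few MB:

\[
512^{3} \times 8\ \text{bytes} \approx 1\ \text{GiB},
\qquad
\frac{\text{working set per core}}{\text{cache per core}}
  \approx \frac{8\ \text{GiB}}{4\ \text{MiB}} \approx 2000 .
\]

Whether that ratio matters presumably depends on how much reuse the stencil
gets out of the cache, which I guess is exactly the question.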
I am not knowledgeable about these things (my background is not in
computer science), but I wonder
whether these questions make sense.
When the users/scientists here want to buy a desktop and ask me,
I tell them to buy as much memory as their money can buy and their
motherboard can fit.
This is because their main tool is Matlab, which is memory-greedy
and will hog whatever memory is available when processing the multi-GB
datasets typical of climate and ocean work.
A lot of memory avoids, or at least mitigates, the frustration of a
frozen system, a slow browser, etc.,
and lets them experiment with ever-growing datasets.
However, Matlab is an interactive tool, tailored for desktops and
workstations, not an HPC tool.
For an HPC node, I am not yet convinced that "bigger is better".
For instance, last time I looked at this, two years ago or so, the IBM
BlueGene had 512MB per core.
The same RAM-to-core ratio seems to be used by the SiCortex machines.
Of course these machines have a different architecture, but they still
use NUMA (I guess).
This suggests to me that there should be a cutoff point in efficiency,
beyond which there may be no point in increasing the RAM-per-core ratio.
Well, maybe IBM, SiCortex, Cray, etc., have other reasons to limit the
RAM-per-core ratio too:
heat dissipation, integration with other system components, etc.
Anyway, please correct me if my perception is wrong.
Well, maybe I am only trying to find a technical reason for a decision
to buy a modest amount of RAM,
a decision I have to make due to budget constraints (which, as Joe Landman
pointed out on this thread,
are the ultimate reason behind all technical decisions).
Therefore, don't pay too much attention to my unrest about this.
> If your users say they plan on running larger models, then I would
> probably go with at least 8 and maybe 16GB. They may find they run
> into memory (or even interconnect) bandwidth problems when running on
> the 8 core machines, but that is another topic altogether.
>
Yes, but maybe not a totally separate topic.
This topic, or rather the associated topics of memory bandwidth, task
localization, how to exploit NUMA,
and which software component should take care of balancing the use of
resources
(the user script, mpiexec, PBS/Torque, SGE), triggered heated and
inconclusive discussions on
other mailing lists (MPICH and Beowulf at least, but there may be more
out there).
> Never skimp on RAM :)
For desktops running Matlab, sure!
Even my mother-in-law, in her 80s, uses Skype these days,
and we had to upgrade her computer with more memory so she could
chat live with us
once she got bored with email and instant messaging.
So the demand for RAM is strong and widespread.
But for HPC nodes I remain skeptical ...
Cheers,
Gus
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
>
I have done some simple studies on climate modeling using the (GPU)
Nvidia GeForce 8800 graphics card as a co-processor, before CUDA. CUDA
now makes this much easier. I was able to outperform our 40-node
cluster on FFTs and 3D modeling using a single GPU in a laptop. I
would recommend learning the NVIDIA CUDA libraries and incorporating
newer GPU hardware into your cluster. You will get more bang for your
buck and be using some bleeding-edge technology. The Nvidia 8800 is now
around $400, which is about the same price as the additional RAM, and
much less than the cost of additional CPUs. There is one other weather
modeling center, in China, using GPU technology, so there are examples
out there. If you are interested in trying something like this: I have
built 4 clusters using the GPU as the main processing core, and there
are special steps that need to be taken to get it working. ROCKS at the
time didn't play well with cutting-edge graphics cards, and I am not
sure it does yet. The clusters I installed for that purpose used Ubuntu
Linux with custom management scripts.
I would love the opportunity to work on a project like yours, as I have
always found weather and climate studies interesting. If you are
interested in speaking more about this topic, please feel free to
contact me directly.
That being said, I would spend the money on GPU hardware over CPUs and
memory.
Thanks,
Scott
> Gus,
>
> I have done some simple studies on climate modeling using the (GPU)
> Nvidia GeForce 8800 graphics card as a co-processor, before CUDA.
[...]
> ROCKS at the time didn't play well with cutting-edge graphics cards,
> and I am not sure it does yet. The clusters I installed for that
> purpose used Ubuntu Linux with custom management scripts.
While I haven't played with it personally, Nvidia supports CUDA as a Rocks
4.3 roll.
http://forums.nvidia.com/lofiversion/index.php?t49255.html
-P
--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628
Scott, thank you very much for the suggestion.
However, I think that what we are looking for now is
just a stable (traditional, if you wish) computing platform.
Some difficulties that come to mind in using GPUs and CUDA to
run our type of production code are the huge code base written in
Fortran (only a tiny fraction of the basic routines are in C) and the
restriction to 32-bit.
I have been hoping to test GPUs and CUDA with a small investment,
in a gamer PC of sorts, with some toy code.
However, given the science budget restrictions we're facing these days,
even that hasn't been possible.
Anyway, thank you again for the interesting suggestion.
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------