PowerPC or PARISC?


mail...@earnshaw.com.au

Sep 2, 2006, 9:23:30 AM
At work I am being pushed by a senior person to replace our four current
Alpha/Tru64 servers with PARISC/HPUX servers. I don't quite see his point
given

1. The position of the PARISC on its roadmap - no better than Alphas
2. I have to rework our bespoke applications
3. PARISC replacement is Itanium, which is not widely accepted?

If I must use HPC systems, why wouldn't we consider a PowerPC? The SPEC
for the new JS21 looks impressive.

Nick Maclaren

Sep 2, 2006, 9:42:59 AM

In article <44f985d2$1...@quokka.wn.com.au>, mail...@earnshaw.com.au writes:
|> At work I am being pushed by a senior person to replace our four current
|> Alpha/Tru64 servers with PARISC/HPUX servers. I don't quite see his point
|> given
|>
|> 1. The position of the PARISC on its roadmap - no better than Alphas
|> 2. I have to rework our bespoke applications
|> 3. PARISC replacement is Itanium, which is not widely accepted?

You are making the mistake of thinking rationally. He is clearly senior
enough to be under no such constraint. I recommend updating your CV and
looking for another job.

|> If I must use HPC systems, why wouldn't we consider a PowerPC? The SPEC
|> for the new JS21 looks impressive.

Maybe, but the Opteron and Woodcrest are better value unless you need
large SMP systems.


Regards,
Nick Maclaren.

John Dallman

Sep 2, 2006, 10:36:00 AM
In article <44f985d2$1...@quokka.wn.com.au>, mail...@earnshaw.com.au ()
wrote:

> At work I am being pushed by a senior person to replace our four
> current Alpha/Tru64 servers with PARISC/HPUX servers.

PA-RISC, and you're perfectly right that this doesn't make any sense
nowadays. He seems to be working off HP's plan of some years ago.

> 2. I have to rework our bespoke applications

And therefore, presumably, you want them to last for a while in the new
form?

> 3. PA-RISC replacement is Itanium, which is not widely accepted?

HP keep pushing, but SGI have given in. HP made an awful mistake, IMHO,
when they sold their Itanium development teams and intellectual property
to Intel. If Intel gets fed up with losing money on Itanium, HP aren't
in a good position to carry on by themselves.

> If I must use HPC systems, why wouldn't we consider a PowerPC? The
> SPEC for the new JS21 looks impressive.

Look seriously at Linux systems with Opteron or Woodcrest. The
price-performance is good, and the absolute costs lower. If you can say
more about what your apps do, and the scale of your current servers, you
may be able to get better advice.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Piotr Wyderski

Sep 2, 2006, 1:12:34 PM
mail...@earnshaw.com.au wrote:

PA-RISC is almost dead, there is no reason to buy it. This
architecture does not support any inter-processor communication
mechanism except a simple atomic test-and-change operation to
implement a spinlock (and even that is quite broken, as it is
test-and-clear, not test-and-set, and requires 16-byte alignment).
It's almost impossible to build a high-performance SMP system
out of such crippled CPUs. You are totally right, PowerPC is
a much, much better and more efficient design. But be careful: PPCs,
or to be more specific, POWERs, are installed in the mainframes
that run under AIX, which is an extremely low-quality operating
environment, almost as bad as HP-UX is (another reason not to
buy a PA-RISC...). It breaks my heart when I see a $600k
machine equipped with 16 ingenious POWER5 cores (if there
is a designer of that chip here: chapeau bas, sir!) running AIX.
So, buy a good POWER (PC)-based machine and replace
its software with something normal, e.g. Debian.

Best regards
Piotr Wyderski

mas

Sep 2, 2006, 2:15:34 PM

Now might be a good time to get off the proprietary road and, if you
need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
Opteron server (x4600) which you could run with Solaris, Linux or
Windows. The 2007 AMD quad-core should fit in that server, making it
32-core. IBM also do up to 32-socket versions of Xeon MPs as well. You
are right: Power 5+ is the most powerful large-way SMP Unix/Linux
server right now, and that will only continue with the highly clocked
Power 6 in 2007. If you wish to stay with HP as a vendor, they do 4-way
8-core AMD/Intel x86-64 boxes. However, their designated successor to
Alpha is Itanium, which has just been released in dual-core form (the
9000 series) and has the leading 2-socket 4-core TPC-C score, and they
have a specific program for this called Alpha RetainTrust.

http://www.physorg.com/news72453549.html
http://www.hp.com/products1/evolution/alpha_retaintrust/
http://www.hp.com/products1/evolution/alpha_retaintrust/services_fee.html?jumpid=reg_R1002_USEN

Nick Maclaren

Sep 2, 2006, 2:48:26 PM

In article <1157220934.0...@i42g2000cwa.googlegroups.com>,

"mas" <mas...@hotmail.com> writes:
|>
|> Now might be a good time to get off the proprietary road and if you
|> need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
|> Opteron server (x4600) which you could run with solaris, linux or
|> windows. The 2007 AMD quad-core should fit in that server making it
|> 32-core.

Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
8, the memory architecture isn't man enough for more than 2 and 4,
respectively. And, in both cases, the same remarks apply to plugging
in the forthcoming quad-core chips into those motherboards - the sane
approach is to wait for the next generation of motherboards.

|> IBM also do upto 32-socket versions of Xeon MPs as well.

Do they just? Do you have a reference? I keep hearing reports of this,
but haven't tracked down any real products. There are a LOT of problems
scaling Xeons up to that point.


Regards,
Nick Maclaren.

mas

Sep 2, 2006, 3:07:57 PM

http://www-03.ibm.com/systems/x/scalable/x3950/index.html
http://www-03.ibm.com/servers/eserver/xseries/x460.html

Although due to the current 36-bit memory (64GB) limitation of EM64T,
what you are in effect getting is 8 * 4-socket 64GB partitions in one
server, with each partition's processors only accessing local memory.
Unisys do the same thing with the ES7000

http://www.unisys.com/products/enterprise__servers/high_d_end__servers/models/es7000_s_one__servers.htm

Bill Todd

Sep 2, 2006, 3:21:04 PM
Nick Maclaren wrote:
> In article <1157220934.0...@i42g2000cwa.googlegroups.com>,
> "mas" <mas...@hotmail.com> writes:
> |>
> |> Now might be a good time to get off the proprietary road and if you
> |> need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
> |> Opteron server (x4600) which you could run with solaris, linux or
> |> windows. The 2007 AMD quad-core should fit in that server making it
> |> 32-core.
>
> Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
> 8, the memory architecture isn't man enough for more than 2 and 4,
> respectively.

As usual, Nick's confidence exceeds his accuracy. While an 8-socket
Opteron system does not scale *linearly* up from a 4-socket system in
performance, it does scale significantly (around 1.5x in the benchmarks
I've seen). The problem appears to be congestion in the Hypertransport
links due to cache-coherence broadcasts (interestingly, from the
topology one might expect something like a 6-socket system to scale
noticeably more linearly, though I haven't worked that out in enough
detail to be sure).

And my impression is that Intel's *current* next-generation products are
limited to dual-socket use, period (i.e., not simply because a 4-socket
system wouldn't scale well - though indeed that would very likely be the
case, given the shared-bus implementation).

> And, in both cases, the same remarks apply to plugging
> in the forthcoming quad-core chips into those motherboards

Quad-core Opterons should work just fine in current quad-socket boards,
though any added Hypertransport traffic would make the 8-socket
situation even worse (and the same is true for shared-bus congestion on
current Intel boards).

> - the sane
> approach is to wait for the next generation of motherboards.

The next generation of boards (and in Intel's case supporting chipsets)
should help a lot, with both shared-bus bandwidth limitations in Intel's
case and with 8-socket Opteron configurations (the next Opteron
generation reportedly has 4 much-higher-bandwidth HT links per chip,
each of which is splittable, providing full direct connectivity and
drastically reducing cache-coherence congestion as a result).

>
> |> IBM also do upto 32-socket versions of Xeon MPs as well.
>
> Do they just? Do you have a reference? I keep hearing reports of this,
> but haven't tracked down any real products.

Try looking at the TPC-C and SAP SD 2-tier results.

> There are a LOT of problems
> scaling Xeons up to that point.

Which not only IBM but others (such as Unisys) solved years ago.
Scaling for the IBM systems is mediocre as one moves beyond 4 sockets
(they cable multiple quad-socket boards together, and hide some of the
resulting latency by using large board-level caches to minimize
off-board references); Unisys's is better.

- bill

mas

Sep 2, 2006, 3:57:29 PM

On a personal note, the company I work for uses a 32-way 1 GHz 32MB
PA-8800 Superdome split into 3 8-way and 2 4-way partitions, running
commercial ERP/WMS packages and a bespoke data warehouse application,
all under Oracle on HP-UX. It generally acquits itself well in
daytime OLTP usage and night batch work due to the nice fat caches and
high-IPC, albeit low-clocked, cores. However, this CPU is one that has
fallen behind the compute performance curve and is very unforgiving if,
say, Oracle in its infinite CBO wisdom decides to give you a bad, slow
plan to execute. If you are going to get it, most likely for the sheer
number of HP-UX applications which IPF can't quite match yet, make sure
you get the very last one made on the latest chipsets, i.e. the 1.1 GHz
64MB PA-8900 on the zx2/sx2000 chipsets, to minimise the relatively poor
compute performance. Here is an up-to-16-core SPEC CPU2000 int rate
comparison to illustrate what I mean; notice how Woodcrest on 4 cores
nearly matches 16 PA-8800s!

http://www.aceshardware.com/SPECmine/index.jsp?b=1&s=1&v=1&if=0&ncf=1&nct=16&cpcf=1&cpct=2&mf=200&mt=3800&o=0&o=1&start=20

Anton Ertl

Sep 2, 2006, 3:39:19 PM
Bill Todd <bill...@metrocast.net> writes:

>Nick Maclaren wrote:
> And, in both cases, the same remarks apply to plugging
>> in the forthcoming quad-core chips into those motherboards
>
>Quad-core Opterons should work just fine in current quad-socket boards,
>though any added Hypertransport traffic would make the 8-socket
>situation even worse

You definitely want a new board for the quad-core Opterons to get best
performance out of the CPUs: they have 4 16-bit HT v3 links (2.6GHz),
providing full single-hop connections in a 4-way system; these links
can be split into 8 8-bit links for full single-hop connectivity in an
8-way system. So a board designed for these new chips will be much
different from the old ones.
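
[A quick sketch, not part of the original post: the per-link arithmetic
implied by those figures, using the usual HyperTransport convention that a
link is double-pumped, i.e. bytes/s per direction = width in bytes * clock *
2 transfers per clock. The 2.6 GHz clock and the 16-bit/8-bit split are from
the post above; the 4 GB/s result for the current links matches the figure
quoted later in the thread.]

def ht_link_bw_gb_s(width_bits, clock_ghz):
    # Per-direction bandwidth of a double-pumped HyperTransport link, in GB/s.
    return (width_bits / 8) * clock_ghz * 2

print(ht_link_bw_gb_s(16, 1.0))  # current 16-bit 1 GHz links: 4.0 GB/s
print(ht_link_bw_gb_s(16, 2.6))  # a 16-bit HT v3 link at 2.6 GHz: 10.4 GB/s
print(ht_link_bw_gb_s(8, 2.6))   # one half of a split 8-bit link: 5.2 GB/s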

Of course, if these new chips are socket-compatible with the current
ones, it can still make sense to upgrade an existing machine with new
CPUs rather than buying a new box; the performance of the upgraded
machine will not be as good as that of a new box, but still better
than the original machine.

BTW, this information comes from a presentation that someone from AMD
gave at Euro-Par on Wednesday. Another interesting thing he mentioned
was that AMD has no plans to support fully-buffered DIMMs, but will
support DDR2 and DDR3 memory (probably registered); the reasons given
were energy consumption and latency. Hmm, the market for server RAMs
is small already, I hate seeing it split between registered and
fully-buffered RAM:-(.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Nick Maclaren

Sep 2, 2006, 4:03:33 PM

In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,

Bill Todd <bill...@metrocast.net> writes:
|>
|> As usual, Nick's confidence exceeds his accuracy. While an 8-socket
|> Opteron system does not scale *linearly* up from a 4-socket system in
|> performance, it does scale significantly (around 1.5x in the benchmarks
|> I've seen).

And it scales negatively in the ones that a colleague of mine ran.
That could well be because Tyan made a complete pig's ear of either
the design or the system, so let's not make too much of that.

All of the benchmarks that I have seen quoted have been low memory use
or low communication, or both, and are completely unrepresentative of
what most real customers want large SMPs for. Note that the ONLY
interesting benchmarks represent uses where using two 4-socket Opterons
would be impractical.

|> The problem appears to be congestion in the Hypertransport
|> links due to cache-coherence broadcasts (interestingly, from the
|> topology one might expect something like a 6-socket system to scale
|> noticeably more linearly, though I haven't worked that out in enough
|> detail to be sure).

"The" problem? Oh, my dear sir! You haven't looked hard enough. Yes,
that is perhaps the main one, but there are a number of related issues.

|> And my impression is that Intel's *current* next-generation products are
|> limited to dual-socket use, period (i.e., not simply because a 4-socket
|> system wouldn't scale well - though indeed that would very likely be the
|> case, given the shared-bus implementation).

Your impression is wrong. At least according to the Intel technical
people I have spoken to.

|> And, in both cases, the same remarks apply to plugging
|> > in the forthcoming quad-core chips into those motherboards
|>
|> Quad-core Opterons should work just fine in current quad-socket boards,
|> though any added Hypertransport traffic would make the 8-socket
|> situation even worse (and the same is true for shared-bus congestion on
|> current Intel boards).

And they would also be very short on memory bandwidth, which is "a" (not
"the") bottleneck for many applications.

|> The next generation of boards (and in Intel's case supporting chipsets)
|> should help a lot, with both shared-bus bandwidth limitations in Intel's
|> case and with 8-socket Opteron configurations (the next Opteron
|> generation reportedly has 4 much-higher-bandwidth HT links per chip,
|> each of which is splittable, providing full direct connectivity and
|> drastically reducing cache-coherence congestion as a result).

I could say more, but I can't remember what is public and what is NDA,
but you are right in principle even if not in detail.

|> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
|> >
|> > Do they just? Do you have a reference? I keep hearing reports of this,
|> > but haven't tracked down any real products.
|>
|> Try looking at the TPC-C and SAP SD 2-tier results.

I am not interested in such issues; I want to see some description of
the architecture, what it supports, and what it doesn't. You may feel
that is unreasonable of me, but most readers of comp.arch won't.

Gluing multiple servers together and running lots of things in parallel
is easy - THAT never was a problem.


Regards,
Nick Maclaren.

mas

Sep 2, 2006, 4:26:55 PM
Nick Maclaren wrote:
> In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
> Bill Todd <bill...@metrocast.net> writes:
> |>
> |> As usual, Nick's confidence exceeds his accuracy. While an 8-socket
> |> Opteron system does not scale *linearly* up from a 4-socket system in
> |> performance, it does scale significantly (around 1.5x in the benchmarks
> |> I've seen).
>
> And it scales negatively in the ones that a colleague of mine ran.
> That could well be because Tyan made a complete pig's ear of either
> the design or the system, so let's not make too much of that.
>

The Tyan K8QW (2 * 2-hops and 6 * 3-hops) has the best 8-way topology
design after Sun's x4600 (4 * 2-hops and 4 * 3-hops) with Fabric and
Iwill having 4-hop nodes in their 8-way designs.

http://www.aceshardware.com/forums/read_post.jsp?id=120061497&forumid=1
http://www.aceshardware.com/forums/read_post.jsp?id=120061524&forumid=1


> All of the benchmarks that I have seen quoted have been low memory use
> or low communication, or both, and are completely unrepresentative of
> what most real customers want large SMPs for. Note that the ONLY
> interesting benchmarks represent uses where using two 4-socket Opterons
> would be impractical.
>
> |> The problem appears to be congestion in the Hypertransport
> |> links due to cache-coherence broadcasts (interestingly, from the
> |> topology one might expect something like a 6-socket system to scale
> |> noticeably more linearly, though I haven't worked that out in enough
> |> detail to be sure).
>
> "The" problem? Oh, my dear sir! You haven't looked hard enough. Yes,
> that is perhaps the main one, but there are a number of related issues.
>

[....]

>
> Regards,
> Nick Maclaren.

Nick Maclaren

Sep 2, 2006, 5:00:36 PM

In article <1157224077.6...@m73g2000cwd.googlegroups.com>,

"mas" <mas...@hotmail.com> writes:
|> >
|> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
|> >
|> > Do they just? Do you have a reference? I keep hearing reports of this,
|> > but haven't tracked down any real products. There are a LOT of problems
|> > scaling Xeons up to that point.
|>
|> http://www-03.ibm.com/systems/x/scalable/x3950/index.html
|> http://www-03.ibm.com/servers/eserver/xseries/x460.html

Thanks, but those don't really say anything. What I am trying to find
out is whether they support single SMP system images and SMP applications
that scale beyond 4 sockets and, if they do, how. Your next paragraph
implies that they don't.

I agree that systems that have a single management infrastructure for
multiple SMP systems, or even single systems that can manage several
SMP sub-systems (with no application spanning a small SMP) are good
ideas, but they aren't what most people expect.

|> although due to the current 36-bit memory (64GB) limitation of EM64T
|> what you are in effect getting is 8 * 4-socket 64GB partitions in one
|> server with each partition's processors only accessing local memory.
|> Unisys do the same thing with the ES7000
|>
|> http://www.unisys.com/products/enterprise__servers/high_d_end__servers/models/es7000_s_one__servers.htm

Ah. Well, I am not really very interested in that. While such systems
may be very useful, they can really only support uses that would be
perfectly all right on 8 separate 4-socket SMPs. That isn't what most
people mean when they refer to architectural limitations on the number
of CPUs.


Regards,
Nick Maclaren.

Niels Jørgen Kruse

Sep 2, 2006, 5:05:51 PM
<mail...@earnshaw.com.au> wrote:

You don't mention how large these Alpha/Tru64 servers are. If they can
be replaced with blade servers, I suppose they are not all that large.

It is surprisingly common for businesses to get all their IT equipment
from a single vendor; that is likely the reason for the weird choice.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Thor Lancelot Simon

Sep 2, 2006, 6:13:04 PM
In article <edcjlq$7oj$1...@gemini.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>
>In article <1157220934.0...@i42g2000cwa.googlegroups.com>,
>"mas" <mas...@hotmail.com> writes:
>|>
>|> Now might be a good time to get off the proprietary road and if you
>|> need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
>|> Opteron server (x4600) which you could run with solaris, linux or
>|> windows. The 2007 AMD quad-core should fit in that server making it
>|> 32-core.
>
>Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
>8, the memory architecture isn't man enough for more than 2 and 4,
>respectively.

I'd appreciate it if you'd elucidate on this, specifically with regard
to Opteron. Adding an Opteron, after all, adds a memory controller; in
fact, there are clearly applications where an 8-socket Opteron with
single-core chips should handily outperform a 4-socket one with dual-cores.

--
Thor Lancelot Simon t...@rek.tjls.com

"We cannot usually in social life pursue a single value or a single moral
aim, untroubled by the need to compromise with others." - H.L.A. Hart

Nick Maclaren

Sep 2, 2006, 6:52:16 PM

In article <1157228815....@p79g2000cwp.googlegroups.com>,

"mas" <mas...@hotmail.com> writes:
|> >
|> > And it scales negatively in the ones that a colleague of mine ran.
|> > That could well be because Tyan made a complete pig's ear of either
|> > the design or the system, so let's not make too much of that.
|>
|> The Tyan K8QW (2 * 2-hops and 6 * 3-hops) has the best 8-way topology
|> design after Sun's x4600 (4 * 2-hops and 4 * 3-hops) with Fabric and
|> Iwill having 4-hop nodes in their 8-way designs.
|>
|> http://www.aceshardware.com/forums/read_post.jsp?id=120061497&forumid=1
|> http://www.aceshardware.com/forums/read_post.jsp?id=120061524&forumid=1

In theory, yes. I can't tell you why the system behaved so badly, and
my colleague (who is EXTREMELY competent) failed to find out and has
got no response from Tyan.


Regards,
Nick Maclaren.

Bill Todd

Sep 2, 2006, 11:44:02 PM
Anton Ertl wrote:
> Bill Todd <bill...@metrocast.net> writes:
>> Nick Maclaren wrote:
>> And, in both cases, the same remarks apply to plugging
>>> in the forthcoming quad-core chips into those motherboards
>> Quad-core Opterons should work just fine in current quad-socket boards,
>> though any added Hypertransport traffic would make the 8-socket
>> situation even worse
>
> You definitely want a new board for the quad-core Opterons to get best
> performance out of the CPUs

That, of course, depends on what you may be doing with them. For
NUMA-optimizable workloads it's not clear that the new board will buy
you much, since the only HT traffic will be cache-coherence traffic.
The existing HT configuration in quad-socket boards doesn't appear to
have any trouble in this regard, and since HT cache-coherence loading in
today's 8-socket boards is around 3x that in the quad-socket boards, the
latter may well handle double the number of cores per socket just fine.

Though having no two-hop cache-coherence *latencies* to deal with might
be worth something for two of the four sockets - I forget whether AMD
manages to hide most of that latency behind the local RAM access or not.

Even for non-NUMA-optimized workloads, only two of the four sockets on
today's boards ever make two-hop remote accesses, and on average only
1/4 of the time (assuming that their RAM targets are evenly distributed
around the 4 sockets). So from the standpoint of average latency there
shouldn't be much difference with the new, fully-connected quad-socket
board: only if today's HT links can't handle the quad-core
remote-access *bandwidth* (itself of course variable by workload) are
you likely to see any significant difference.

...

> Another interesting thing he mentioned
> was that AMD has no plans to support fully-buffered DIMMs,

Not this week, anyway: the amount of waffling they've done on this
issue suggests that they really don't know one way or the other and
probably won't decide for a while yet while the situation w.r.t.
FB-DIMMs develops.

> but will
> support DDR2 and DDR3 memory (probably registered); the reasons given
> were energy consumption and latency. Hmm, the market for server RAMs
> is small already, I hate seeing it split between registered and
> fully-buffered RAM:-(.

ISTR seeing something recently about a new variant that promised to
replace both, but details were too sketchy to form any opinion about.

- bill

Bill Todd

Sep 3, 2006, 2:51:39 AM
Nick Maclaren wrote:
> In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
> Bill Todd <bill...@metrocast.net> writes:

...

> |> The problem appears to be congestion in the Hypertransport
> |> links due to cache-coherence broadcasts (interestingly, from the
> |> topology one might expect something like a 6-socket system to scale
> |> noticeably more linearly, though I haven't worked that out in enough
> |> detail to be sure).
>
> "The" problem? Oh, my dear sir! You haven't looked hard enough. Yes,
> that is perhaps the main one, but there are a number of related issues.

Yawn. This kind of empty bloviation is far too much a habit with you,
Nick: be specific if you want anyone to pay attention.

...

> |> And, in both cases, the same remarks apply to plugging
> |> > in the forthcoming quad-core chips into those motherboards
> |>
> |> Quad-core Opterons should work just fine in current quad-socket boards,
> |> though any added Hypertransport traffic would make the 8-socket
> |> situation even worse (and the same is true for shared-bus congestion on
> |> current Intel boards).
>
> And they would also be very short on memory bandwidth, which is "a" (not
> "the") bottleneck for many applications.

For exactly what values of 'very short'? With today's HT bandwidth of 4
GB/s in each direction for each of the two HT links per socket on a
quad-socket board, one might observe that you've got to be doing rather
vigorous remote memory access (decidedly non-NUMA-optimized) before that
situation arises: true, local memory offers 6.4 GB/s using 400 MHz DDR
or over 10 GB/s using 667 MHz DDR2, but as long as even just a simple
majority of accesses go to a local rather than to a remote location (or
accesses just aren't taxing bandwidth limits that much anyway) today's
quad-socket boards should not be at a disadvantage compared with tomorrow's.
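
[A back-of-envelope sketch, not part of the original post, of the claim just
made, using only the figures above. Assumptions: each socket streams at its
local-memory rate, remote traffic splits evenly across the socket's two
coherent HT links, and coherence/pass-through overhead is ignored.]

HT_LINK_GB_S = 4.0      # per direction, per link (figure from the post)
COHERENT_LINKS = 2      # coherent HT links per socket on today's quad boards

def remote_load_per_link(local_mem_gb_s, local_fraction):
    # GB/s of remote traffic each link carries if the socket runs flat out
    # at its local-memory rate and local_fraction of accesses stay local.
    remote = local_mem_gb_s * (1.0 - local_fraction)
    return remote / COHERENT_LINKS

for mem, label in ((6.4, "DDR-400"), (10.7, "DDR2-667")):
    for f in (0.5, 0.75):
        load = remote_load_per_link(mem, f)
        verdict = "fits" if load <= HT_LINK_GB_S else "exceeds"
        print(f"{label}, {f:.0%} local: {load:.1f} GB/s per link "
              f"({verdict} the {HT_LINK_GB_S} GB/s link)")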

>
> |> The next generation of boards (and in Intel's case supporting chipsets)
> |> should help a lot, with both shared-bus bandwidth limitations in Intel's
> |> case and with 8-socket Opteron configurations (the next Opteron
> |> generation reportedly has 4 much-higher-bandwidth HT links per chip,
> |> each of which is splittable, providing full direct connectivity and
> |> drastically reducing cache-coherence congestion as a result).
>
> I could say more, but I can't remember what is public and what is NDA,
> but you are right in principle even if not in detail.

Yawn again, Nick: I've heard you claim uncitable sources to back up
utter garbage too often to be impressed - put up, or shut up.

>
> |> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
> |> >
> |> > Do they just? Do you have a reference? I keep hearing reports of this,
> |> > but haven't tracked down any real products.
> |>
> |> Try looking at the TPC-C and SAP SD 2-tier results.
>
> I am not interested in such issues; I want to see some description of
> the architecture, what it supports, and what it doesn't. You may feel
> that is unreasonable of me, but most readers of comp.arch won't.

I don't feel that it's unreasonable, it's just not at all what you
originally asked for (still quoted above), which was, necessarily, what
I responded to: you asked for a reference indicating the existence of
'real products', and I gave you one.

>
> Gluing multiple servers together and running lots of things in parallel
> is easy - THAT never was a problem.

Perhaps you should actually look at the references that I provided
before babbling any further: the systems involved are not clustered
(you can also ascertain their nature by their software pricing).
Exactly how IBM goes about allowing its Xeons to access more than 64 GB
of physical RAM I don't know (unless they're correct in their assertion
that Potomac - and, presumably, newer - Xeons support 40 bits of
physical addressing rather than 36 bits, which seems reasonable), but
the fact that they do is indisputable: not only do they characterize
their 32-socket systems as being 'SMP' in their product presentation,
but a while ago they had a TPC-C submission that used a quad-Xeon board
(I think with the previous generation of their 'x architecture', which
also supported larger-than-64-GB systems) plus an additional chassis
that contained no processors at all, just to expand the available RAM to
128 GB.
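
[Not part of the original post, just the arithmetic behind the addressing
limits under discussion: 2**bits bytes of physical address space.]

for bits in (36, 40):
    print(bits, "bits ->", 2 ** bits // 2 ** 30, "GB of physical address space")
# 36 bits -> 64 GB (the EM64T limit mentioned earlier); 40 bits -> 1024 GB (1 TB)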

Now, IBM's current 8-socket system isn't all that much better at TPC-C
than that augmented quad-socket system I just mentioned, and their
scaling falls off a lot more at 16 sockets: clearly, off-board
processor interactions suffer significantly. By contrast, the Unisys
systems (which also emphasize their SMP nature - "ES7000/one Server can
support a scale-up, single application instance") start less
impressively but scale up quite well, handily beating the best that
Itanic can offer in TPC-C scores at both the 16- and 32-core system sizes.

- bill

Bill Todd

Sep 3, 2006, 3:01:12 AM
Bill Todd wrote:

..

> By contrast, the Unisys
> systems (which also emphasize their SMP nature - "ES7000/one Server can
> support a scale-up, single application instance") start less
> impressively but scale up quite well, handily beating the best that
> Itanic can offer in TPC-C scores at both the 16- and 32-core system sizes.

My bad - I forgot about a recent 32-core Fujitsu Itanic submission that
just edges out the Unisys 32-core Xeon, so it's more accurate to say
that Xeon and Itanic are neck-and-neck right now in large-system TPC-C
performance (and Itanic might actually pull ahead if HP gets its act
together and submits a new-generation Superdome Montecito system, though
the fact that they have not done so may be significant in itself).

- bill

mail...@earnshaw.com.au

Sep 3, 2006, 3:51:56 AM
We run two ES40 (Commercial: Oracle Financials, EMPAC and Technical: WENCO
+ custom apps), then two DS20 (Datamart and a Test/Dev). The machines are
not really overused, with periodic spikes of at most 60% ... the biggest
problem I have seen is a poor SQL script on Commercial. They are now 5
years old so we are replacing them.

As the local person responsible, I don't want to go to HP because of
very poor support in the past 5 years. The cost I am being quoted for the
PARISC is also steep compared to the Power. Unfortunately Linux is out,
global corporate policy. Another factor is our location, surrounded by
desert and 600km between drinks, two flights a day. The blade option seems
to be a good fit for local maintenance.

If I have certification for our mission-critical applications, either
Oracle EBS or Mincom Ellipse (replacing EMPAC), on "Tier-1 UNIX", am I
really missing anything going to Power? I am after a highly available
option rather than "compute capabilities" - if that makes sense.

Thanks

ME


Rob Warnock

Sep 3, 2006, 4:07:31 AM
[Apologies if I've messed up the attributions... the trail was messy...]

Bill Todd <bill...@metrocast.net> wrote:
+---------------


| Nick Maclaren wrote:
| > Bill Todd <bill...@metrocast.net> writes:

| > |> Quad-core Opterons should work just fine in current quad-socket boards,
| > |> though any added Hypertransport traffic would make the 8-socket
| > |> situation even worse (and the same is true for shared-bus congestion
| > |> on current Intel boards).
| >
| > And they would also be very short on memory bandwidth, which is "a"
| > (not "the") bottleneck for many applications.
|
| For exactly what values of 'very short'? With today's HT bandwidth of 4
| GB/s in each direction for each of the two HT links per socket on a
| quad-socket board, one might observe that you've got to be doing rather
| vigorous remote memory access (decidedly non-NUMA-optimized) before that
| situation arises: true, local memory offers 6.4 GB/s using 400 MHz DDR
| or over 10 GB/s using 667 MHz DDR2, but as long as even just a simple
| majority of accesses go to a local rather than to a remote location (or
| accesses just aren't taxing bandwidth limits that much anyway) today's
| quad-socket boards should not be at a disadvantage compared with tomorrow's.

+---------------

Don't forget that with Opteron even *local* memory accesses require
getting snoop responses back from *all* of the other CPUs. The HT
coherency traffic goes up with the number of CPU cores, not just
the number of sockets. And even if *all* the CPUs are hitting *only*
local memory [perfect NUMA placement], there will be HT coherency
traffic proportional to the product of the cache miss rate and the
number of CPU cores.

[Directory-based caches start to win really big on high-CPU-count
systems....]


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Bill Todd

Sep 3, 2006, 5:00:36 AM
Rob Warnock wrote:
> [Apologies if I've messed up the attributions... the trail was messy...]
>
> Bill Todd <bill...@metrocast.net> wrote:
> +---------------
> | Nick Maclaren wrote:
> | > Bill Todd <bill...@metrocast.net> writes:
> | > |> Quad-core Opterons should work just fine in current quad-socket boards,
> | > |> though any added Hypertransport traffic would make the 8-socket
> | > |> situation even worse (and the same is true for shared-bus congestion
> | > |> on current Intel boards).
> | >
> | > And they would also be very short on memory bandwidth, which is "a"
> | > (not "the") bottleneck for many applications.
> |
> | For exactly what values of 'very short'? With today's HT bandwidth of 4
> | GB/s in each direction for each of the two HT links per socket on a
> | quad-socket board, one might observe that you've got to be doing rather
> | vigorous remote memory access (decidedly non-NUMA-optimized) before that
> | situation arises: true, local memory offers 6.4 GB/s using 400 MHz DDR
> | or over 10 GB/s using 667 MHz DDR2, but as long as even just a simple
> | majority of accesses go to a local rather than to a remote location (or
> | accesses just aren't taxing bandwidth limits that much anyway) today's
> | quad-socket boards should not be at a disadvantage compared with tomorrow's.
> +---------------
>
> Don't forget that with Opteron even *local* memory accesses require
> getting snoop responses back from *all* of the other CPUs.

Since I addressed this elsewhere, it's not likely that I forgot it here.
My observation on that point was that an 8-socket configuration
generates around 3x the coherency traffic *per HT link* that a
quad-socket configuration does, suggesting that the quad-socket
configuration may have rather a lot of bandwidth to spare (given that
the 8-socket configuration manages to function at all).

> The HT
> coherency traffic goes up with the number of CPU cores, not just
> the number of sockets.

AMD's presentations are not clear on this point, but they at least
suggest that coherency traffic on the HT links does *not* increase with
the number of cores, just with the number of sockets (which is certainly
at least possible, given the architecture).

> And even if *all* the CPUs are hitting *only*
> local memory [perfect NUMA placement], there will be HT coherency
> traffic proportional to the product of the cache miss rate and the
> number of CPU cores.

Only up to the point where the local memory bandwidth is saturated:
once that point is reached, it doesn't matter how many more local cores
you add - there won't be any more coherency traffic, because there won't
be any more local accesses.

So if the links can support the coherency traffic generated by
local-only accesses sufficient to saturate the local memory bandwidth on
all sockets (are there STREAMS results for quad-socket Opteron systems
that could shed light on that?), the only question is how much link
bandwidth is left over to satisfy some percentage of remote accesses.
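
[A rough sketch, not part of the original post, of the bound being argued
here. The cache-line size is the usual 64 bytes; the probe and response
packet sizes are assumptions for illustration, not AMD figures.]

CACHE_LINE = 64        # bytes fetched per miss
PROBE_BYTES = 8        # assumed size of a snoop probe on the link
RESPONSE_BYTES = 8     # assumed size of a snoop response

def snoop_gb_s_at_saturation(local_mem_gb_s, sockets):
    # Upper bound on the snoop traffic one socket injects (GB/s, spread over
    # its links) when its miss rate is capped by local memory bandwidth and
    # every miss probes all the other sockets.
    misses_per_s = local_mem_gb_s * 1e9 / CACHE_LINE
    bytes_per_miss = (sockets - 1) * (PROBE_BYTES + RESPONSE_BYTES)
    return misses_per_s * bytes_per_miss / 1e9

print(snoop_gb_s_at_saturation(6.4, 4))    # ~4.8 GB/s with DDR-400 sockets
print(snoop_gb_s_at_saturation(10.7, 4))   # ~8.0 GB/s with DDR2-667 sockets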

- bill

Nick Maclaren

Sep 3, 2006, 6:08:09 AM

In article <edcvlg$s0q$1...@reader2.panix.com>,

t...@panix.com (Thor Lancelot Simon) writes:
|> >
|> >Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
|> >8, the memory architecture isn't man enough for more than 2 and 4,
|> >respectively.
|>
|> I'd appreciate it if you'd elucidate on this, specifically with regard
|> to Opteron. Adding an Opteron, after all, adds a memory controller; in
|> fact, there are clearly applications where an 8-socket Opteron with
|> single-core chips should handily outperform a 4-socket one with dual-cores.

Certainly.

Let's start with Intel. The latest chipsets have two higher-speed FSBs,
and so overtake Opteron, but they can't be extended, and there are some
quite serious latency problems (which I find bizarre, given that the MCH
is a single chip). Everything works well for dual-socket Woodcrest, but
is definitely underpowered for quad-socket, dual-core or dual-socket,
quad-core. But Intel are working on that.

The Opteron situation is very different. Purely local memory accesses
are purely local, but an access by CPU A to CPU B's memory or a HT packet
from CPU A to CPU C that needs to pass through CPU B needs dealing with
by the memory part of CPU B. Also, all cache coherence queries (even for
local memory) need to go everywhere, and I believe that Rob Warnock is
right that the number goes up with the number of cores.

But there is more. In a normal 4-socket design, each CPU has to handle
at most one simplex pass-through path, and only the two CPUs not supporting
I/O have to do even that. I haven't worked out the precise numbers for
the various 8-socket topologies, but I am pretty sure that most of the CPUs
have to support at least 3 simplex pass-through routes. Even when the HT
links are not overloaded, that is not good for latency.

The coherence protocol and memory controller design are not published,
but my understanding is that they are synchronous and serial (to keep
the overheads minimal). And remember that cross-CPU memory references
start being critical as soon as you do much of any true SMP work,
process migration, shared file/memory/FIFOs and so on.

Yes, there are applications where this doesn't matter, but experience with
large SMP systems is that even apparently independent applications cause
more inter-CPU traffic when run in a complete system than the naive would
expect. And it isn't always for the reasons that even the experts expect.


Regards,
Nick Maclaren.

Nick Maclaren

Sep 3, 2006, 6:30:03 AM

Oops. Sorry. I forgot to mention two things.

Bisection bandwidth per CPU. That is, of course, half the value for an
8-socket system as for a 4-socket, so problems with HT bandwidth start
to occur much earlier. And system loadings that are pushing that will
almost certainly start to impact on coherence latency.

And, as others have said, AMD are working on all of this, and are doing
their usual information leaking - a dribble here and a dribble there.
But it seems that the next generation of Opteron will reduce a lot of
these problems.

All the commentators who say that AMD/Intel has just scored a knockout
blow are just bullshitting - in 2005/6 AMD led, in 2007 Intel will, in
2008 who knows?


Regards,
Nick Maclaren.

Paradox

Sep 3, 2006, 6:47:43 AM
Nick Maclaren wrote:

The list price for a 4-way 1.65 GHz POWER 5+ with 4 GB RAM and 4
hot-swappable SCSI disk bays (two 73.4 GB 10K drives included) is
$5,830. (The p5 510Q, prices from IBM's website. Yes, this price does
not include AIX, which will cost you around a grand more. But you can
also run linux on the box.)

So at least in some configurations I guess the POWER 5+ is actually
more-or-less the same value as Opteron or Woodcrest.

Nick Maclaren

Sep 3, 2006, 7:00:17 AM

In article <1157280463....@m73g2000cwd.googlegroups.com>,

"Paradox" <parad...@gmail.com> writes:
|>
|> The list price for a 4-way 1.65 GHz POWER 5+ with 4 GB RAM and 4
|> hot-swappable SCSI disk bays (two 73.4 GB 10K drives included) is
|> $5,830. (The p5 510Q, prices from IBM's website. Yes, this price does
|> not include AIX, which will cost you around a grand more. But you can
|> also run linux on the box.)

Interesting. That is definitely competitive for the hardware.

In my experience, the main cost of AIX is not the money you pay for it.
I have no experience with Linux on those systems, but have good reason to
believe that it isn't as much of an improvement over AIX as it might be.

|> So at least in some configurations I guess the POWER 5+ is actually
|> more-or-less the same value as Opteron or Woodcrest.

Hmm. Yes and no. In terms of TCO, I would say it is double the cost.
The problem is support effort and finding technical staff and expertise.

I may be biased, looking at it from an HPC viewpoint, and this may not
be the case for a commercial Oracle shop, but I have reason to believe
that many of the same issues apply.


Regards,
Nick Maclaren.

Paradox

Sep 3, 2006, 7:30:51 AM
Nick Maclaren wrote:

Yes, the initial hardware price does not tell the whole story. But I
just found the pricing of the new entry-level pSeries machines from IBM
interesting. It has been true for a long time that x86 gave the most
"bang for the buck", often by a very large margin. But now one can get
a POWER 5+ for the same amount as a Woodcrest/Opteron. And download a
free version of e.g. SUSE Linux for the pSeries on the net (this is of
course nothing new).

Nick Maclaren

Sep 3, 2006, 9:29:44 AM

In article <1157283051.1...@i3g2000cwc.googlegroups.com>,

"Paradox" <parad...@gmail.com> writes:
|>
|> Yes, the initial hardware price does not tell the whole story. But I
|> just found the pricing of the new entry pSeries machines from IBM
|> interesting. It has been true for a long time that x86 gave the most
|> "bang for the buck", often by a very large margin. But now one can get
|> a POWER 5+ for the same amount as a Woodcrest/Opteron. And download a
|> free version of e.g. SUSE linux for the pSeries on the net (this is of
|> course nothing new.)

So do I. Ever since the very early days of the PowerPC, IBM have had
many opportunities to reshape the industry, and have thrown away every
one. It isn't yet out of the question that they will achieve it but,
if their record is anything to go by, it will be by accident, not design.


Regards,
Nick Maclaren.

Torbjorn Lindgren

Sep 3, 2006, 10:30:42 AM
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
>Bill Todd <bill...@metrocast.net> writes:
>|> And my impression is that Intel's *current* next-generation products are
>|> limited to dual-socket use, period (i.e., not simply because a 4-socket
>|> system wouldn't scale well - though indeed that would very likely be the
>|> case, given the shared-bus implementation).
>
>Your impression is wrong. At least according to the Intel technical
>people I have spoken to.

So where IS Intel's Core (Core 2) microarchitecture quad-socket CPU
then? This is clearly Intel's "current next-generation products"...

Yes, there's a new Xeon MP, the 7100 series, named to make it look like
it's related to the Woodcrest (5100)... However, Intel's own pages
make it fairly clear that it's a NetBurst-based part with a BIG L3
cache (4/8/16 MB) and more bandwidth (quite possibly multiple FSBs,
like the 5000 (NetBurst) and 5100 (Woodcrest)).

There are a couple of lower-clocked models with less cache (and
lower power usage) which they then compare to the previous top models
to get impressive power reductions, but the top-end ones use quite a
bit of power, even if they're not quite as bad as the previous ones.

Yes, it's a significant improvement on what they had before in the 4P
space (though much of it probably comes from the giant cache), so some
chest-beating from Intel is to be expected, but Bill Todd's statement
appears to be correct (your statement clearly targets Todd's
"impression", i.e. that specific statement).

http://www.intel.com/products/processor/xeon/index.htm

Most of the semitechnical 7100 information is under the 7000 header,
but tagged with 7100; following the actual 7100 links gives mostly
marketing fluff.

Nick Maclaren

Sep 3, 2006, 11:50:13 AM

In article <mGBKg.57218$Df2....@fe05.news.easynews.com>,

Torbjorn Lindgren <t...@none.invalid> writes:
|> Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
|> >In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
|> >Bill Todd <bill...@metrocast.net> writes:
|> >|> And my impression is that Intel's *current* next-generation products are
|> >|> limited to dual-socket use, period (i.e., not simply because a 4-socket
|> >|> system wouldn't scale well - though indeed that would very likely be the
|> >|> case, given the shared-bus implementation).
|> >
|> >Your impression is wrong. At least according to the Intel technical
|> >people I have spoken to.
|>
|> So where IS Intel's Core (Core 2) microarchitecture quad-socket CPU
|> then? This is clearly Intels "current next-generation products"...

Quite. I am almost certain that it exists, works and is usable, and the
only question is whether it is worth bothering with. My understanding is
that it isn't, at least for most people, and a hell of a lot of OEMs seem
to agree with me (as they do about the 8-socket Opteron).


Regards,
Nick Maclaren.


Piotr Wyderski

Sep 3, 2006, 12:03:33 PM
Nick Maclaren wrote:

> I have no experience with Linux on those systems, but have good reason to
> believe that it isn't as much of an improvement over AIX as it might be.

There are several reasons to choose Linux:

1. The installer of Oracle will not hang your pSeries machine. :-)

2. There is a working and fully tested C++ compiler, namely GCC.
On AIX there is XL (non-industrial quality because of the number
of bugs it contains -- I was able to find four per week) and a broken
port of GCC (has many problems with shared libraries).

3. The support is much better and faster. I wanted to report to IBM
the bugs I found in the newest version of XL, but the website claims
that I need to buy a support package to be able to report them. :-)))

BTW, just to be clear: I am not a fan of Linux; in case you wish to stick
a label on me, I am a Windows NT-series zealot [;-)], but from my
personal experience I may say for sure that it is much better than AIX
and HP-UX. One of my professional duties is porting business software
into various Unix systems, but only these two make my knees shake...

Best regards
Piotr Wyderski

John Dallman

Sep 3, 2006, 12:43:00 PM
In article <44fa...@quokka.wn.com.au>, mail...@earnshaw.com.au ()
wrote:

> Unfortunately Linux is out, global corporate policy.

Is there a rationale for that that you can quote? I quite understand
that they aren't going to change it on the basis of what people say on
Usenet. I'm just interested in the reasoning.

Are HP saying that the replacement for a PA-RISC system will have to be
Itanium, or do they expect to be able to offer you PA-RISC again in 3-5
years? If your management are going to want to go HP, going straight to
Itanium now makes a certain amount of sense, so that you can get the
disaster over with while the Alphas are still running. Management often
worry if someone who's opposed a plan switches to supporting it more
than them; it might make them think.

But if the software you need isn't available on Itanium now, I'd be
/very/ cautious about believing that it will be in a few years time. I'd
hold out for a contract where HP paid all my transition costs to a
platform of my choice if my needs couldn't be met by Itanium, because
the software wasn't available and couldn't be met by PA-RISC because the
hardware was no longer available. Of course, HP will not sign such a
contract, but their reasons for not doing so may well make an impact on
your management.

If the software you need is available now, your "only worry" is the
suppliers dropping Itanium.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Niels Jørgen Kruse

Sep 3, 2006, 3:44:38 PM
John Dallman <j...@cix.co.uk> wrote:

> But if the software you need isn't available on Itanium now, I'd be
> /very/ cautious about believing that it will be in a few years time. I'd
> hold out for a contract where HP paid all my transition costs to a
> platform of my choice if my needs couldn't be met by Itanium, because
> the software wasn't available and couldn't be met by PA-RISC because the
> hardware was no longer available. Of course, HP will not sign such a
> contract, but their reasons for not doing so may well make an impact on
> your management.

Doesn't HP have emulation software to run any PA-RISC binaries on
Itanium?

John Dallman

Sep 3, 2006, 4:36:00 PM
In article <1hl3yxg.i759trs41kmN%nos...@ab-katrinedal.dk>,
nos...@ab-katrinedal.dk (Niels Jørgen Kruse) wrote:

> John Dallman <j...@cix.co.uk> wrote:
> > But if the software you need isn't available on Itanium now, I'd be
> > /very/ cautious about believing that it will be in a few years

> > time. ...


> Doesn't HP have emulation software to run any PA-RISC binaries on
> Itanium?

Good point. They do; it isn't superbly fast but it works well. One would
want to benchmark before starting to rely on it for major workloads.
The question of Itanium's long-term future still applies.

Rick Jones

Sep 3, 2006, 6:30:34 PM
Niels Jørgen Kruse <nos...@ab-katrinedal.dk> wrote:
> Doesn't HP have emulation software to run any PA-RISC binaries on
> Itanium?

Yes. It is called Aries. Included in HP-UX 11iv2 (aka 11.23).

rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

davew...@gmail.com

Sep 3, 2006, 8:19:10 PM
Bill Todd wrote:
> Rob Warnock wrote:
> > [Apologies if I've messed up the attributions... the trail was messy...]
> >
> > Don't forget that with Opteron even *local* memory accesses require
> > getting snoop responses back from *all* of the other CPUs.
>
> Since I addressed this elsewhere, it's not likely that I forgot it here.
> My observation on that point was that an 8-socket configuration
> generates around 3x the coherency traffic *per HT link* that a
> quad-socket configuration does, suggesting that the quad-socket
> configuration may have rather a lot of bandwidth to spare (given that
> the 8-socket configuration manages to function at all).

Three and a half years ago, I wrote that with snoop-broadcast, going
from 4 to 8 sockets, Hammer would not scale well without "additional
support".
At the time, you referred to that as sounding like hot air that lacked
quantitative analysis for support.

http://www.realworldtech.com/forums/index.cfm?action=detail&id=14858&threadid=14827&roomid=11

> > The HT coherency traffic goes up with the number of CPU cores, not just
> > the number of sockets.
>
> AMD's presentations are not clear on this point, but they at least
> suggest that coherency traffic on the HT links does *not* increase with
> the number of cores, just with the number of sockets (which is certainly
> at least possible, given the architecture).

If you actually have something useful running on each one of those
cores, each threaded context will generate independent memory requests
that will likely have to leave the socket - unless it hits on a cache
somewhere within the socket. The coherency traffic scales relative to
the number of independent outstanding misses, not to the socket or cpu
per se.

Bill Todd

Sep 4, 2006, 12:35:31 AM
davew...@gmail.com wrote:
> Bill Todd wrote:
>> Rob Warnock wrote:
>>> [Apologies if I've messed up the attributions... the trail was messy...]
>>>
>>> Don't forget that with Opteron even *local* memory accesses require
>>> getting snoop responses back from *all* of the other CPUs.
>> Since I addressed this elsewhere, it's not likely that I forgot it here.
>> My observation on that point was that an 8-socket configuration
>> generates around 3x the coherency traffic *per HT link* that a
>> quad-socket configuration does, suggesting that the quad-socket
>> configuration may have rather a lot of bandwidth to spare (given that
>> the 8-socket configuration manages to function at all).
>
> Three and a half years ago, I wrote that with snoop-broadcast, going
> from 4 to 8 sockets, Hammer would not scale well without "additional
> support".
> At the time, you referred to that as sounding like hot air that lacked
> quantitative analysis for support.

As indeed it was: just stating that an 8-socket configuration required
a lot more snooping was meaningless without also establishing that the
HT links did not have the capacity to *support* that additional snooping.

Furthermore, I stated in the interchange which you cite below that

"the degree to which the additional snooping activity will compromise
scalability (by exceeding the rather substantial bandwidth capacity of
the HT interconnect) will vary according to the nature of the load"

i.e., that even if the additional snooping significantly compromised
scalability for memory-bandwidth-intensive workloads, other kinds of
workloads would not be similarly encumbered.

>
> http://www.realworldtech.com/forums/index.cfm?action=detail&id=14858&threadid=14827&roomid=11

The interchange to which you refer was this:

[quote]

> just wanted to make it clear that without additional
> support, going up above 4 isn't going to be easy for *hammer, and
> scalability will be poor.

Lacking a quantitative analysis proving that 8-processor Hammer systems
will scale poorly, I'm afraid your argument sounds a bit like hot air.

[end quote]

Which, of course, is precisely the observation I made again today, above.

Since you never provided any such quantitative analysis (in fact, David
Kanter was still babbling incompetently in a similar vein at RWT about
next-year's 8-socket configuration changes a few months ago, until I
spelled things out for him quantitatively - now that more quantitative
data *is* available), that statement hardly seems unreasonable.

>
>>> The HT coherency traffic goes up with the number of CPU cores, not just
>>> the number of sockets.
>> AMD's presentations are not clear on this point, but they at least
>> suggest that coherency traffic on the HT links does *not* increase with
>> the number of cores, just with the number of sockets (which is certainly
>> at least possible, given the architecture).
>
> If you actually have something useful running on each one of those
> cores, each threaded context will generate independent memory requests
> that will likely have to leave the socket - unless it hits on a cache
> somewhere within the socket. The coherency traffic scales relative to
> the number of independent outstanding misses, not to the socket or cpu
> per se.

If you had bothered to look at the context in which Rob made the
statement to which I replied, you would have found that it was that of
snoop *responses* from CPUs. My point was that (if I understand AMD's
presentations correctly) each socket gives a single response to a snoop
request regardless of the number of cores present there.

And had you finished reading my post (well, you also would have had to
have understood it, I guess) before writing your response, you would
have noticed that I fully understand that more cores may *generate* more
snoops - but only (at least for the NUMA-optimized access being
discussed there) up to the point where they have saturated local memory
bandwidth.

>
>> And even if *all* the CPUs are hitting *only*
>>> local memory [perfect NUMA placement], there will be HT coherency
>>> traffic proportional to the product of the cache miss rate and the
>>> number of CPU cores.
>> Only up to the point where the local memory bandwidth is saturated:
>> once that point is reached, it doesn't matter how many more local cores
>> you add - there won't be any more coherency traffic, because there won't
>> be any more local accesses.
>
>> So if the links can support the coherency traffic generated by
>> local-only accesses sufficient to saturate the local memory bandwidth on
>> all sockets (are there STREAMS results for quad-socket Opteron systems
>> that could shed light on that?), the only question is how much link
>> bandwidth is left over to satisfy some percentage of remote accesses.

A friendly sparring partner just pointed out via email that Sun's
8-socket/dual-core 2.6 GHz Opterons scale up from its presumably very
similar 4-socket/dual-core configuration at 1.73x for SPECint_rate_base
and 1.79x for SPECint_rate_peak: hardly stellar, but indicative that
for reasonably computationally-intense workloads scaling to 8 sockets
(even using dual-core processors) can be quite useful. By contrast, in
the far more bandwidth-intensive SPECfp_rate scores the scaling is far
worse: only 1.14x base and 1.27x peak (that's why I asked about STREAMS
results above, though the issue there was whether today's *quad*-socket
systems could satisfy the demands of quad-core processors for
NUMA-optimized workloads, which it now looks as if they may). Given the
roughly 3:1 increase in per-link snoop activity in the 8-socket system
for a given level of per-socket uncached memory access activity (though
this may vary noticeably with its topology), it appears that
a 4-socket system may currently have something close to twice the HT
bandwidth headroom that it needs even for memory-intense workloads,
while the 8-socket system varies from fairly acceptable to very
disappointing indeed depending on the intensity of memory accesses
(though still not going negative, as Nick suggested - unless he was
referring to per-processor rather than system throughput, in which case
*all* systems that did not scale perfectly linearly would exhibit that
behavior, just some a lot more than others).
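
To put those scaling ratios in per-socket terms, here is a trivial
sketch; the factors are the ones quoted above, and the only assumption
is that perfectly linear scaling from 4 to 8 sockets would be 2x:

# Per-socket scaling efficiency implied by the quoted 4->8 socket ratios.
scaling = {
    "SPECint_rate_base": 1.73,
    "SPECint_rate_peak": 1.79,
    "SPECfp_rate_base":  1.14,
    "SPECfp_rate_peak":  1.27,
}

for metric, factor in scaling.items():
    efficiency = factor / 2.0   # fraction of the ideal 2x retained
    print(f"{metric}: {factor:.2f}x -> {efficiency:.0%} of linear scaling")

That works out to roughly 86%/90% efficiency for the integer rates but
only about 57%/64% for the floating-point rates, which is what "far
worse" means in concrete terms.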

- bill

Nick Maclaren

unread,
Sep 4, 2006, 5:07:08 AM9/4/06
to

In article <1157329150.8...@p79g2000cwp.googlegroups.com>,
davew...@gmail.com writes:

|> Bill Todd wrote:
|>
|> Three-and-half years ago, I wrote that doing snoop-broadcast and going
|> from 4 to 8 sockets, Hammer would not scale well without "additional
|> support".
|> At the time, you referred to it as sounding like hot air that lacks
|> quantitative analysis for support.

Yes, he's fond of saying that - he's not quite so good on providing any
evidence for his claims, though.

|> If you actually have something useful running on each one of those
|> cores, each threaded context will generate independent memory requests
|> that will likely have to leave the socket - unless it hits on a cache
|> somewhere within the socket. The coherency traffic scales relative to
|> the number of independent outstanding misses, not to the socket or cpu
|> per se.

Not quite. Because the number of hops goes up with the number of sockets,
the coherence traffic is the number of outstanding misses (possibly just
the independent ones, as you say) multiplied by the average hop length.

My estimate is that going from 2->4 and from 4-8 with a uniform, n-way
workload increases the coherence traffic by a factor of about 6 in each
case. What I don't have is any decent information on what the constant
factor is.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Sep 4, 2006, 6:37:49 AM9/4/06
to

In article <edgqbs$k3o$1...@gemini.csx.cam.ac.uk>,

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
|>
|> Not quite. Because the number of hops goes up with the number of sockets,
|> the coherence traffic is the number of outstanding misses (possibly just
|> the independent ones, as you say) multiplied by the average hop length.
|>
|> My estimate is that going from 2->4 and from 4-8 with a uniform, n-way
|> workload increases the coherence traffic by a factor of about 6 in each
|> case. What I don't have is any decent information on what the constant
|> factor is.

To focus the forthcoming flames, going from 2->4, the number of links
goes from 1 to 5, the bisection bandwidth from 1 to 2/3, the number of
hops needed for coherence traffic (with single-core Opterons) from
2 to 14, and the number of pass-through packets from 0 to 4.

Going from 4->8, the number of links goes from 5 to 11, and everything
else depends a bit on the topology. With a twisted ladder, the bisection
bandwidth from 2/3 to 4, the number of hops needed for coherence traffic
from 14 to 96, and the number of pass-through packets from 4 to 40.

At least if I have counted right :-)


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Sep 4, 2006, 6:43:21 AM9/4/06
to

In article <edgvlt$222$1...@gemini.csx.cam.ac.uk>,
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
|>
|> Going from 4->8, the number of links goes from 5 to 11, ...

Sigh. 5 to 10 for a twisted ladder.

Bill Todd

unread,
Sep 4, 2006, 10:25:22 AM9/4/06
to
Nick Maclaren wrote:
> In article <1157329150.8...@p79g2000cwp.googlegroups.com>,
> davew...@gmail.com writes:
> |> Bill Todd wrote:
> |>
> |> Three-and-half years ago, I wrote that doing snoop-broadcast and going
> |> from 4 to 8 sockets, Hammer would not scale well without "additional
> |> support".
> |> At the time, you referred to it as sounding like hot air that lacks
> |> quantitative analysis for support.
>
> Yes, he's fond of saying that - he's not quite so good on providing any
> evidence for his claims, though.

Yet another empty claim with nothing whatsoever in the way of evidence
to back it up, Nick. Hardly surprising, given how difficult coming up
with concrete examples might be.

>
> |> If you actually have something useful running on each one of those
> |> cores, each threaded context will generate independent memory requests
> |> that will likely have to leave the socket - unless it hits on a cache
> |> somewhere within the socket. The coherency traffic scales relative to
> |> the number of independent outstanding misses, not to the socket or cpu
> |> per se.
>
> Not quite. Because the number of hops goes up with the number of sockets,
> the coherence traffic is the number of outstanding misses (possibly just
> the independent ones, as you say) multiplied by the average hop length.

By George, you got that one right: total number of misses (or
misses-per-socket*number-of-sockets, regardless of how many cores there
are per socket)*average path length. To get the actual number of
messages rather than just a proportion, double that to include the
responses.

>
> My estimate is that going from 2->4 and from 4-8 with a uniform, n-way
> workload increases the coherence traffic by a factor of about 6 in each
> case.

Why estimate? Just count:

In the 2-socket system the average path length is 1, so for each cache
miss on each socket two messages (request + response) are generated over
that single link, or 4 messages total over the single link for each
miss-per-socket (call that a per-miss-per-socket per-link load of 4).

In the 4-socket system, which is a simple square (does anyone actually
sell the square-plus-diagonal configuration that AMD's early
presentations described?), each miss on each socket generates a message
to each adjoining node plus one message to the far (2-hop-away) node,
for a total of 4 request messages and 4 response messages (counting both
sections of the 2-hop path). Multiply by the 4 sockets, spread out
across the 4 links, and you get a per-miss-per-socket per-link load of
8, exactly double that of the dual-socket case (the total number of
messages is 8x as great, but it's spread out over 4 links rather than 1).

If the square-plus-diagonal topology actually exists in the wild, then
half the cache misses replace the 2-hop path with a direct path,
reducing their total message count to 6 and thus making the average
message count 7 rather than 8. Multiply by 4 sockets, divide by the 5
links, and the average per-miss-per-socket per-link load drops to 5.6 -
but the worst-case per-link loads (over the links forming the sides of
the square) seem to be (if I'm counting correctly) 6 - still a
significant improvement over 8.

(Just to be complete, if the square were fully-connected, leaving no
links for external I/O until the 4-link Opterons appear next year, then
there are 6 messages per miss, x 4 sockets, spread evenly across 6
links, reducing the per-link loading for a given per-socket miss rate to
exactly what it was for the dual-socket configuration - which is hardly
surprising given that both topologies are directly-connected for every
path.)

For the 8-socket configuration, it's easiest to start with a simple
cube, though (as with the fully-connected square above) this leaves no
links for external I/O. For any given miss, there are 3 single-hop
requests, 3 double-hop requests, and 1 triple-hop request, for a total
of 12 hops or 24 messages. Multiply by 8 sockets, spread across 12
links, and the per-link load per-miss-per-socket rate is 16: twice that
of the simple square, about 3x that of the square-plus-diagonal, and 4x
that of the fully-connected square.
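
For anyone who wants to check or extend this counting, here is a
minimal Python sketch that mechanizes it. The assumptions are the ones
used above: one snoop request plus one response to every other socket
per miss, each message charged one unit per link it crosses along a
shortest path, one miss per socket, and the total divided by the number
of links. The adjacency lists are idealized versions of the topologies
discussed, not any vendor's documented board layout or routing.

from collections import deque
from itertools import combinations

def per_link_load(links, n):
    """Average per-link load for one miss per socket, with a broadcast
    snoop (request + response) to every other socket and shortest-path
    routing."""
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    total_hops = 0
    for src in range(n):                 # one miss per socket
        dist = {src: 0}
        q = deque([src])
        while q:                         # BFS shortest-path distances
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        # request + response to each of the other n-1 sockets
        total_hops += 2 * sum(dist[v] for v in range(n) if v != src)
    return total_hops / len(links)

topologies = {
    "2-socket":                 ([(0, 1)], 2),
    "4-socket square":          ([(0, 1), (1, 2), (2, 3), (3, 0)], 4),
    "4-socket square+diagonal": ([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)], 4),
    "4-socket fully connected": ([(a, b) for a, b in combinations(range(4), 2)], 4),
    "8-socket cube":            ([(a, b) for a in range(8) for b in range(a + 1, 8)
                                  if bin(a ^ b).count("1") == 1], 8),
}

for name, (links, n) in topologies.items():
    print(f"{name}: per-miss-per-socket per-link load = {per_link_load(links, n):.1f}")

This reproduces the 4, 8, 5.6, 4 and 16 figures above; deleting one edge
from the cube's link list models the broken-link case discussed next,
though this simple average says nothing about the worst-case asymmetry
that the missing link introduces.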

Break one of the cube's links to allow for external I/O (I suspect the
best possible case until more links arrive next year) and the message
count for a miss on each of the two affected nodes rises from 24 to 32,
while message traffic from other nodes that would have traveled over the
now-missing link must be diverted and thus increases loading on some of
the remaining links even more. Even without allowing for the increase
in worst-case loading due to the resulting asymmetry, the average
message count per miss rises to 26, x 8 sockets, spread across 11 links,
for an *average* per-socket-per-miss per-link load of 18.9 (2.36x that
of the simple square or 3.375x that of the average load in the
square-plus-diagonal, the comparable best-achievable case for the
quad-socket board).

I'll leave it to someone with more time on their hands to figure out
(even just on average, though worst-case would be even better) exactly
how the best *existing* 8-socket topology compares with the best
*existing* quad-socket topology in this regard, but will suggest on the
basis of the numbers above that it's likely fairly close to the factor
of 3x higher per-link loading that I originally put forward. For
non-NUMA-optimized loads you of course must add another message (x the
average hop length) for the initial request and the final response, but
that doesn't change the totals all that much in the larger configurations.

- bill

Bill Todd

unread,
Sep 4, 2006, 10:40:17 AM9/4/06
to
Bill Todd wrote:

...

> Break one of the cube's links to allow for external I/O (I suspect the
> best possible case until more links arrive next year) and the message
> count for a miss on each of the two affected nodes rises from 24 to 32

An extra factor of 2 sneaked in there while I was still waking up: it
only rises from 24 to 28, so the other numbers change accordingly.

- bill

already...@yahoo.com

unread,
Sep 5, 2006, 5:23:46 AM9/5/06
to

Nick Maclaren wrote:
> In article <1157224077.6...@m73g2000cwd.googlegroups.com>,

> "mas" <mas...@hotmail.com> writes:
> |> >
> |> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
> |> >
> |> > Do they just? Do you have a reference? I keep hearing reports of this,
> |> > but haven't tracked down any real products. There are a LOT of problems
> |> > scaling Xeons up to that point.
> |>
> |> http://www-03.ibm.com/systems/x/scalable/x3950/index.html
> |> http://www-03.ibm.com/servers/eserver/xseries/x460.html
>
> Thanks, but those don't really say anything. What I am trying to find
> out is whether they support single SMP system images and SMP applications
> that scale beyond 4 sockets and, if they do, how. Your next paragraph
> implies that they don't.
>
>

His next paragraph is pile of bullsh**t.
XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.

IA-32/EM64T Intel Architecture Software Developer's Manual Volume 1:
Basic Architecture
ftp://download.intel.com/design/Pentium4/manuals/25366520.pdf
See 3.2.1. and 3.3.7.1.

1st generation 64-bit single-core XeonMP (Potomac)
ftp://download.intel.com/design/Xeon/datashts/30675401.pdf
1st generation 64-bit dual-core XeonMP (PaxvilleMP).
ftp://download.intel.com/design/Xeon/datashts/30962601.pdf
2nd generation 64-bit dual-core XeonMP (Tulsa).
ftp://download.intel.com/design/Xeon/datashts/31455301.pdf

Unisys 512GB single-image system without partitioning runs TPC-C
benchmark.
http://www.tpc.org/results/individual_results/Unisys/Unisys_es7000-one_16x-749K_es.pdf
Of course, the memory access latency is non-uniform but it's not
different from any other 16-way system shipping today.

already...@yahoo.com

unread,
Sep 5, 2006, 5:58:36 AM9/5/06
to

Torbjorn Lindgren wrote:
>
> Yes, there's a new Xeon MP, the 7100 series named to make it look like
> they're related to the Woodcrest (5100)... However, Intel's own pages
> makes it fairly clear that it's a Netburst based part with a BIG L3
> cache (4/8/16 MB) and more bandwidth (quite possibly multiple FSB's
> like 5000 (Netburst) and 5100 (Woodcrest)).
>

Yes, big shared L3 cache.
No, no more bandwidth. Xeon 7100 has the same two bus variants as the
previous generation Xeon 7000 - 667MT/s and 800MT/s.
Another improvement over 7000 is a higher clock rate.

Nick Maclaren

unread,
Sep 5, 2006, 6:14:34 AM9/5/06
to

In article <1157448226....@e3g2000cwe.googlegroups.com>,

already...@yahoo.com writes:
|>
|> His next paragraph is pile of bullsh**t.
|> XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.

No, it isn't. He may have got that figure wrong, but there are a fair
number of reasons why nobody currently builds large Xeon-based cache
coherent SMP systems (above 4 sockets). This may change, but I am not
holding my breath.

|> Unisys 512GB single-image system without partitioning runs TPC-C
|> benchmark.

The TPC-C benchmark also runs on distributed memory clusters - it isn't
much evidence of SMP capability.

My understanding is that the Unisys system has a coherent shared cache
for up to 4 sockets, and can link up to 8 such nodes together with some
kind of shared memory (incoherent?) That is actually a very good model
for many purposes, but it DOESN'T allow the running of large SMP
applications as most people think of them.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Sep 5, 2006, 7:48:16 AM9/5/06
to

Nick Maclaren wrote:
> In article <1157448226....@e3g2000cwe.googlegroups.com>,
> already...@yahoo.com writes:
> |>
> |> His next paragraph is pile of bullsh**t.
> |> XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.
>
> No, it isn't. He may have got that figure wrong, but there are a fair
> number of reasons why nobody currently builds large Xeon-based cache
> coherent SMP systems (above 4 sockets). This may change, but I am not
> holding my breath.
>

If IBM and Unisys fall under your definition of "nobody" then you are
right. Otherwise...

> |> Unisys 512GB single-image system without partitioning runs TPC-C
> |> benchmark.
>
> The TPC-C benchmark also runs on distributed memory clusters - it isn't
> much evidence of SMP capability.
>

When TPC-C benchmark runs on cluster it has to be explicitly marked as
such. That's the part of TPC rules. The Unisys submission is not marked
as a cluster.
For example, compare with that:
http://www.tpc.org/results/individual_results/HP/HP%20Integrity%20rx5670%20Cluster%2064P_ES.pdf


> My understanding is that the Unisys system has a coherent shared cache
> for up to 4 sockets, and can link up to 8 such nodes together with some
> kind of shared memory (incoherent?) That is actually a very good model
> for many purposes, but it DOESN'T allow the running of large SMP
> applications as most people think of them.
>

Your understanding is wrong. Unisys system has a coherent shared cache
(both on-chip L1/L2+ optional L3 and external L4) over all (1 to 8)
4-way cells. The same applies to IBM x460 (recently renamed to x3950).

Nick Maclaren

unread,
Sep 5, 2006, 8:07:31 AM9/5/06
to

In article <1157456895.9...@i42g2000cwa.googlegroups.com>,

already...@yahoo.com writes:
|>
|> > No, it isn't. He may have got that figure wrong, but there are a fair
|> > number of reasons why nobody currently builds large Xeon-based cache
|> > coherent SMP systems (above 4 sockets). This may change, but I am not
|> > holding my breath.
|>
|> If IBM and Unisys fall under your definition of "nobody" then you are
|> right. Otherwise...

Well, some evidence for your statement would be nice. I have looked at
what I can find of the specifications of those, and damned if I can find
anything that describes what they do in any detail, and what I did find
supports my statement.

|> When TPC-C benchmark runs on cluster it has to be explicitly marked as
|> such. That's the part of TPC rules. The Unisys submission is not marked
|> as a cluster.

That is irrelevant. The fact that it can do so means that it doesn't need
the attributes of an SMP system that 'true' SMP applications do. The
Hitachi SR2201 ran a single Unix image and had distributed memory - if it
had run TPC-C, it could perfectly well have said that it wasn't a cluster.

|> > My understanding is that the Unisys system has a coherent shared cache
|> > for up to 4 sockets, and can link up to 8 such nodes together with some
|> > kind of shared memory (incoherent?) That is actually a very good model
|> > for many purposes, but it DOESN'T allow the running of large SMP
|> > applications as most people think of them.
|>
|> Your understanding is wrong. Unisys system has a coherent shared cache
|> (both on-chip L1/L2+ optional L3 and external L4) over all (1 to 8)
|> 4-way cells. The same applies to IBM x460 (recently renamed to x3950).

Some references would be nice. What I have found on Unisys's and IBM's
Web pages implies that YOUR understanding is wrong, but I failed to find
a precise and detailed architectural reference.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Sep 5, 2006, 1:32:55 PM9/5/06
to

The fact that both systems run single-image Windows is sufficient
evidence by itself. Windows HAL doesn't support "partial coherence" of
the sort that you described in your previous post.

If that's not enough, read this redbook:
http://www.redbooks.ibm.com/redbooks/pdfs/sg246797.pdf

Unfortunately, Unisys documentation is not as nice as IBM's, but it's
crystal clear that Unisys also supports system-wide HW-enforced cache
coherence.

Sander Vesik

unread,
Sep 5, 2006, 2:10:19 PM9/5/06
to
already...@yahoo.com wrote:
>
> Nick Maclaren wrote:
> > In article <1157448226....@e3g2000cwe.googlegroups.com>,
> > already...@yahoo.com writes:
> > |>
> > |> His next paragraph is pile of bullsh**t.
> > |> XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.
> >
> > No, it isn't. He may have got that figure wrong, but there are a fair
> > number of reasons why nobody currently builds large Xeon-based cache
> > coherent SMP systems (above 4 sockets). This may change, but I am not
> > holding my breath.
> >
>
> If IBM and Unisys fall under your definition of "nobody" then you are
> right. Otherwise...
>

IBM and Unisys do in fact fall under 'nobody' in this case if their
customers are not running large workloads that way on these machines.
A machine having a maximum capacity of X x86 CPUs has in the past
seldom meant that large unpartitioned workloads are run on it.

Want to offer evidence that these machines are used in production
in those configurations?

--
Sander

+++ Out of cheese error +++

already...@yahoo.com

unread,
Sep 5, 2006, 2:35:50 PM9/5/06
to

The question of actual use is outside the scope of this discussion. We are
discussing what IBM and Unisys machines are capable of rather than how
they are used in practice.

Are they capable of running 32 sockets and up to 512GB in a single
partition?
Does it mean that mas's original post contained a good portion of
high-quality genuine BS?
The answer is "yes" to both questions.

Nick Maclaren

unread,
Sep 5, 2006, 3:22:37 PM9/5/06
to

In article <1157477575....@i3g2000cwc.googlegroups.com>,

already...@yahoo.com writes:
|>
|> > Some references would be nice. What I have found on Unisys's and IBM's
|> > Web pages implies that YOUR understanding is wrong, but I failed to find
|> > a precise and detailed architectural reference.
|>
|> The fact that both systems run single-image Windows is sufficient
|> evidence by itself. Windows HAL doesn't support "partial coherence" of
|> the sort that you described in your previous post.

Sigh. No, it doesn't. That would be the case only if there was evidence
that a single, threaded application (OpenMP, pthreads or similar) could
span multiple nodes, or the kernel itself could. I have so far seen no
evidence of that.

|> If that's not enough, read this redbook:
|> http://www.redbooks.ibm.com/redbooks/pdfs/sg246797.pdf

That contains one cryptic reference to maintaining cache coherence.
You may be aware of only one meaning to that, but there are a lot more.
I saw that document before, but I was asking for TECHNICAL documentation,
and that is only SEMI-technical.

|> Unfortunately, Unisys documentation is not as nice as IBM's, but it's
|> crystal clear that Unisys also supports system-wide HW-enforced cache
|> coherence.

I think that you are reading it through your rose-tinted glasses.

|> The question of actual use is outside the scope of this discussion. We are
|> discussing what IBM and Unisys machines are capable of rather than how
|> they are used in practice.

That was the sense in which I meant it, but you have still missed the point. It
is whether the cache is coherent enough to support things like OpenMP
code that spans multiple nodes. You don't need full coherence just to
run a single system image.

|> Are they capable of running 32 sockets and up to 512GB in a single
|> partition?
|> Does it mean that mas's original post contained a good portion of
|> high-quality genuine BS?
|> The answer is "yes" to both questions.

That is twaddle. The Hitachi SR2201 could run 2048 sockets and 512 GB
in a single partition, and it was a distributed memory machine with one
CPU/socket per node.


Regards,
Nick Maclaren.

David Kanter

unread,
Sep 6, 2006, 4:43:46 AM9/6/06
to
> Sigh. No, it doesn't. That would be the case only if there was evidence
> that a single, threaded application (OpenMP, pthreads or similar) could
> span multiple nodes, or the kernel itself could. I have so far seen no
> evidence of that.

RTFM

> |> If that's not enough, read this redbook:
> |> http://www.redbooks.ibm.com/redbooks/pdfs/sg246797.pdf
>
> That contains one cryptic reference to maintaining cache coherence.
> You may be aware of only one meaning to that, but there are a lot more.
> I saw that document before, but I was asking for TECHNICAL documentation,
> and that is only SEMI-technical.

> I think that you are reading it through your rose-tinted glasses.

No, you really just don't have a clue here. Both Unisys and IBM have
large ccNUMA systems that run Linux and Windows. I've spoken rather
extensively with the folks at IBM, including the lead architect for the
chipset...it's most certainly ccNUMA.

> That was the sense I meant it, but you have still missed the point. It
> is whether the cache is coherent enough to support things like OpenMP
> code that spans multiple nodes. You don't need full coherence just to
> run a single system image.

Of course it can run OMP code across multiple nodes. Did you even look
at those TPCC submissions?

DK

Nick Maclaren

unread,
Sep 6, 2006, 5:21:46 AM9/6/06
to

In article <1157532226....@i3g2000cwc.googlegroups.com>,

"David Kanter" <dka...@gmail.com> writes:
|> > Sigh. No, it doesn't. That would be the case only if there was evidence
|> > that a single, threaded application (OpenMP, pthreads or similar) could
|> > span multiple nodes, or the kernel itself could. I have so far seen no
|> > evidence of that.
|>
|> RTFM

Try not to be more of a clot than you can help. I have asked repeatedly
in many quarters (including earlier in this thread, and to people in IBM)
for a pointer to the "Fucking Manual" and have so far drawn a complete
blank. The best that there seems to be is a semi-technical Red Book,
that waves its hand vigorously.

|> > I think that you are reading it through your rose-tinted glasses.
|>
|> No, you really just don't have a clue here. Both Unisys and IBM have
|> large ccNUMA systems that run Linux and Windows. I've spoken rather
|> extensively with the folks at IBM, including the lead architect for the
|> chipset...it's most certainly ccNUMA.

Fine. So can you point me at a reference that says what that means in
technical terms? Or to someone who is capable of and prepared to
explain? That is what I have been asking for for several years now.
And, if you claim that book does so, provide the chapter and verse;
if you are right, I will take back my statement.

Considerable experience over many decades shows that people who start off
with argument by abuse (as you are doing) are typically the ones that
don't have a clue. Why should I assume that you weren't being taken in
by a load of bullshit?

|> Of course it can run OMP code across multiple nodes. Did you even look
|> at those TPCC submissions?

Yes. More hand-waving. You clearly don't understand just how much more
mechanism is needed to support general use of OpenMP than TPC-C.


Regards,
Nick Maclaren.

Bill Todd

unread,
Sep 6, 2006, 5:36:56 AM9/6/06
to
Nick Maclaren wrote:
> In article <1157477575....@i3g2000cwc.googlegroups.com>,
> already...@yahoo.com writes:
> |>
> |> > Some references would be nice. What I have found on Unisys's and IBM's
> |> > Web pages implies that YOUR understanding is wrong, but I failed to find
> |> > a precise and detailed architectural reference.
> |>
> |> The fact that both systems run single-image Windows is sufficient
> |> evidence by itself. Windows HAL doesn't support "partial coherence" of
> |> the sort that you described in your previous post.
>
> Sigh. No, it doesn't. That would be the case only if there was evidence
> that a single, threaded application (OpenMP, pthreads or similar) could
> span multiple nodes, or the kernel itself could. I have so far seen no
> evidence of that.

Are you senile, Nick? Exactly what do you think that the statement
"ES7000/one Server can support a scale-up, single application instance"
meant? Exactly what do you think IBM means when it characterizes its
32-socket Xeon systems as "SMP"? How do you explain the single-instance
software charges for Windows and databases on such machines in the TPC-C
submissions? Or the IBM submission that used a *memory-only* (no
processors at all) node to expand a quad-socket configuration to handle
128 GB of RAM?

I pointed all those out to you three days ago.

...

you have still missed the point. It
> is whether the cache is coherent enough to support things like OpenMP
> code that spans multiple nodes. You don't need full coherence just to
> run a single system image.

But you damn well do to run a single system *instance*, at least if it's
Windows (as the systems in question are).

>
> |> Are they capable to run 32 sockets and up to 512GB in a single
> |> partition?
> |> Does it mean that mas's original post contained a good portion of
> |> high-quality genuine BS?
> |> The answer is "yes" to both questions.
>
> That is twaddle. The Hitachi SR2201 could run 2048 sockets and 512 GB
> in a single partition, and it was a distributed memory machine with one
> CPU/socket per node.

And one 'microkernel' instance per node as well: whatever kind of
'single partition' they might have been running, it was nothing
resembling a system partition as supported in Windows.

- bill

already...@yahoo.com

unread,
Sep 6, 2006, 12:16:24 PM9/6/06
to

Please show me "the evidence" that the following systems have
hardware-enforced cache coherence of the sort you are looking for in
the respective vendors' manuals or datasheets:
1. IBM p5 595
2. Sun Fire E25K
3. HP Integrity Superdome
4. SGI Altix 4700

You have a chance for case #4 because 512-socket ccNUMA is something
special. But for cases 1, 2 and 3 I wish you good luck.

See my point? If not read here:
http://www.memorable-quotes.com/charles+anderson+dana,a131.html

mas

unread,
Sep 6, 2006, 1:58:54 PM9/6/06
to

Thanks for the correction. The fact that all EM64T Xeon DPs have 36-bit
memory address limits led me to assume that the MPs had similar
constraints, but of course I should have checked.

Del Cecchi

unread,
Sep 6, 2006, 2:42:32 PM9/6/06
to
Nick Maclaren wrote:
> In article <1157224077.6...@m73g2000cwd.googlegroups.com>,
> "mas" <mas...@hotmail.com> writes:
> |> >
> |> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
> |> >
> |> > Do they just? Do you have a reference? I keep hearing reports of this,
> |> > but haven't tracked down any real products. There are a LOT of problems
> |> > scaling Xeons up to that point.
> |>
> |> http://www-03.ibm.com/systems/x/scalable/x3950/index.html
> |> http://www-03.ibm.com/servers/eserver/xseries/x460.html
>
> Thanks, but those don't really say anything. What I am trying to find
> out is whether they support single SMP system images and SMP applications
> that scale beyond 4 sockets and, if they do, how. Your next paragraph
> implies that they don't.
>
> I agree that systems that have a single management infrastructure for
> multiple SMP systems, or even single systems that can manage several
> SMP sub-systems (with no application spanning a small SMP) are good
> ideas, but they aren't what most people expect.
>
> |> although due to the current 36-bit memory (64GB) limitation of EM64T
> |> what you are in effect getting is 8 * 4-socket 64GB partitions in one
> |> server with each partition's processors only accessing local memory.
> |> Unisys do the same thing with the ES7000
> |>
> |> http://www.unisys.com/products/enterprise__servers/high_d_end__servers/models/es7000_s_one__servers.htm
>
> Ah. Well, I am not really very interested in that. While such systems
> may be very useful, they can really only support uses that would be
> perfectly all right on 8 separate 4-socket SMPs. That isn't what most
> people mean when they refer to architectural limitations on the number
> of CPUs.
>
>
> Regards,
> Nick Maclaren.

Nick, I believe the X460 is a NUMA box with 4-socket nodes. So I
believe it does support a single system image for all 32 sockets.
However there is a performance penalty for non-local access to memory.

There is probably a Redbook that discusses these issues, and some
relatively high level overview of the X3 architecture.

del

--
Del Cecchi
"This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions.”

Del Cecchi

unread,
Sep 6, 2006, 2:45:02 PM9/6/06