PowerPC or PARISC?


mail...@earnshaw.com.au

Sep 2, 2006, 9:23:30 AM
At work I am being pushed by a senior person to replace our four current
Alpha/Tru64 servers with PARISC/HPUX servers. I don't quite see his point
given

1. The position of the PARISC on its roadmap - no better than Alphas
2. I have to rework our bespoke applications
3. PARISC replacement is Itanium, which is not widely accepted?

If I must use HPC systems, why wouldn't we consider a PowerPC? The SPEC
for the new JS21 looks impressive.

Nick Maclaren

Sep 2, 2006, 9:42:59 AM

In article <44f985d2$1...@quokka.wn.com.au>, mail...@earnshaw.com.au writes:
|> At work I am being pushed by a senior person to replace our four current
|> Alpha/Tru64 servers with PARISC/HPUX servers. I don't quite see his point
|> given
|>
|> 1. The position of the PARISC on its roadmap - no better than Alphas
|> 2. I have to rework our bespoke applications
|> 3. PARISC replacement is Itanium, which is not widely accepted?

You are making the mistake of thinking rationally. He is clearly senior
enough to be under no such constraint. I recommend updating your CV and
looking for another job.

|> If I must use HPC systems, why wouldn't we consider a PowerPC? The SPEC
|> for the new JS21 looks impressive.

Maybe, but the Opteron and Woodcrest are better value unless you need
large SMP systems.


Regards,
Nick Maclaren.

John Dallman

Sep 2, 2006, 10:36:00 AM
In article <44f985d2$1...@quokka.wn.com.au>, mail...@earnshaw.com.au ()
wrote:

> At work I am being pushed by a senior person to replace our four
> current Alpha/Tru64 servers with PARISC/HPUX servers.

PA-RISC, and you're perfectly right that this doesn't make any sense
nowadays. He seems to be working off HP's plan of some years ago.

> 2. I have to rework our bespoke applications

And therefore, presumably, you want them to last for a while in the new
form?

> 3. PA-RISC replacement is Itanium, which is not widely accepted?

HP keep pushing, but SGI have given in. HP made an awful mistake, IMHO,
when they sold their Itanium development teams and intellectual property
to Intel. If Intel gets fed up with losing money on Itanium, HP aren't
in a good position to carry on by themselves.

> If I must use HPC systems, why wouldn't we consider a PowerPC? The
> SPEC for the new JS21 looks impressive.

Look seriously at Linux systems with Opteron or Woodcrest. The
price-performance is good, and the absolute costs lower. If you can say
more about what your apps do, and the scale of your current servers, you
may be able to get better advice.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Piotr Wyderski

Sep 2, 2006, 1:12:34 PM
mail...@earnshaw.com.au wrote:

PA-RISC is almost dead, there is no reason to buy it. This
architecture does not support any inter-processor communication
mechanism except a simple atomic test-and-change operation to
implement a spinlock (and even that is quite broken, as it is
test-and-clear, not test-and-set, and requires 16-byte alignment).
It's almost impossible to build a high-performance SMP system
out of such crippled CPUs. You are totally right, PowerPC is
a much, much better and more efficient design. But be careful: PPCs,
or to be more specific, POWERs, are installed in the mainframes
that run under AIX, which is an extremely low-quality operating
environment, almost as bad as HP-UX is (another reason not to
buy a PA-RISC...). It breaks my heart when I see a $600k
machine equipped with 16 ingenious POWER5 cores (if there
is a designer of that chip here: chapeau bas, sir!) running AIX.
So, buy a good POWER (PC)-based machine and replace
its software with something normal, e.g. Debian.

Best regards
Piotr Wyderski

mas

Sep 2, 2006, 2:15:34 PM

Now might be a good time to get off the proprietary road and, if you
need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
Opteron server (x4600) which you could run with Solaris, Linux or
Windows. The 2007 AMD quad-core should fit in that server, making it
32-core. IBM also do up to 32-socket versions of Xeon MPs as well. You
are right: Power 5+ is the most powerful large-way SMP Unix/Linux
server right now, and that will only continue with the highly clocked
Power 6 in 2007. If you wish to stay with HP as a vendor, they do 4-way
8-core AMD/Intel x86-64 boxes. However, their designated successor to
Alpha is Itanium, which has just been released in dual-core form (the
9000 series) and has the leading 2-socket 4-core TPC-C score, and they
have a specific program for this called Alpha RetainTrust.

http://www.physorg.com/news72453549.html
http://www.hp.com/products1/evolution/alpha_retaintrust/
http://www.hp.com/products1/evolution/alpha_retaintrust/services_fee.html?jumpid=reg_R1002_USEN

Nick Maclaren

Sep 2, 2006, 2:48:26 PM

In article <1157220934.0...@i42g2000cwa.googlegroups.com>,

"mas" <mas...@hotmail.com> writes:
|>
|> Now might be a good time to get off the proprietary road and if you
|> need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
|> Opteron server (x4600) which you could run with solaris, linux or
|> windows. The 2007 AMD quad-core should fit in that server making it
|> 32-core.

Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
8, the memory architecture isn't man enough for more than 2 and 4,
respectively. And, in both cases, the same remarks apply to plugging
in the forthcoming quad-core chips into those motherboards - the sane
approach is to wait for the next generation of motherboards.

|> IBM also do upto 32-socket versions of Xeon MPs as well.

Do they just? Do you have a reference? I keep hearing reports of this,
but haven't tracked down any real products. There are a LOT of problems
scaling Xeons up to that point.


Regards,
Nick Maclaren.

mas

Sep 2, 2006, 3:07:57 PM

http://www-03.ibm.com/systems/x/scalable/x3950/index.html
http://www-03.ibm.com/servers/eserver/xseries/x460.html

Although due to the current 36-bit memory (64GB) limitation of EM64T,
what you are in effect getting is 8 * 4-socket 64GB partitions in one
server, with each partition's processors only accessing local memory.
Unisys do the same thing with the ES7000

http://www.unisys.com/products/enterprise__servers/high_d_end__servers/models/es7000_s_one__servers.htm

Bill Todd

Sep 2, 2006, 3:21:04 PM
Nick Maclaren wrote:
> In article <1157220934.0...@i42g2000cwa.googlegroups.com>,
> "mas" <mas...@hotmail.com> writes:
> |>
> |> Now might be a good time to get off the proprietary road and if you
> |> need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
> |> Opteron server (x4600) which you could run with solaris, linux or
> |> windows. The 2007 AMD quad-core should fit in that server making it
> |> 32-core.
>
> Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
> 8, the memory architecture isn't man enough for more than 2 and 4,
> respectively.

As usual, Nick's confidence exceeds his accuracy. While an 8-socket
Opteron system does not scale *linearly* up from a 4-socket system in
performance, it does scale significantly (around 1.5x in the benchmarks
I've seen). The problem appears to be congestion in the Hypertransport
links due to cache-coherence broadcasts (interestingly, from the
topology one might expect something like a 6-socket system to scale
noticeably more linearly, though I haven't worked that out in enough
detail to be sure).

And my impression is that Intel's *current* next-generation products are
limited to dual-socket use, period (i.e., not simply because a 4-socket
system wouldn't scale well - though indeed that would very likely be the
case, given the shared-bus implementation).

> And, in both cases, the same remarks apply to plugging
> in the forthcoming quad-core chips into those motherboards

Quad-core Opterons should work just fine in current quad-socket boards,
though any added Hypertransport traffic would make the 8-socket
situation even worse (and the same is true for shared-bus congestion on
current Intel boards).

> - the sane
> approach is to wait for the next generation of motherboards.

The next generation of boards (and in Intel's case supporting chipsets)
should help a lot, with both shared-bus bandwidth limitations in Intel's
case and with 8-socket Opteron configurations (the next Opteron
generation reportedly has 4 much-higher-bandwidth HT links per chip,
each of which is splittable, providing full direct connectivity and
drastically reducing cache-coherence congestion as a result).

>
> |> IBM also do upto 32-socket versions of Xeon MPs as well.
>
> Do they just? Do you have a reference? I keep hearing reports of this,
> but haven't tracked down any real products.

Try looking at the TPC-C and SAP SD 2-tier results.

> There are a LOT of problems
> scaling Xeons up to that point.

Which not only IBM but others (such as Unisys) solved years ago.
Scaling for the IBM systems is mediocre as one moves beyond 4 sockets
(they cable multiple quad-socket boards together, and hide some of the
resulting latency by using large board-level caches to minimize
off-board references); Unisys's is better.

- bill

mas

Sep 2, 2006, 3:57:29 PM

On a personal note, the company I work for uses a 32-way 1 GHz 32MB
PA-8800 Superdome split into 3 8-way and 2 4-way partitions, running
commercial ERP/WMS packages and a bespoke data warehouse application,
all under Oracle on HP-UX. It generally acquits itself well in
daytime OLTP usage and night batch work due to the nice fat caches and
high-IPC, albeit low-clocked, cores. However, this CPU is one that has
fallen behind the compute performance curve and is very unforgiving if,
say, Oracle in its infinite CBO wisdom decides to give you a bad, slow
plan to execute. If you are going to get it, most likely for the sheer
number of HP-UX applications which IPF can't quite match yet, make sure
you get the very last one made on the latest chipsets, i.e. the 1.1 GHz
64MB PA-8900 on the zx2/sx2000 chipsets, to minimise the relatively poor
compute performance. Here is an up-to-16-core SPEC CPU2000 int rate
comparison to illustrate what I mean; notice how Woodcrest on 4 cores
nearly matches 16 PA-8800s!

http://www.aceshardware.com/SPECmine/index.jsp?b=1&s=1&v=1&if=0&ncf=1&nct=16&cpcf=1&cpct=2&mf=200&mt=3800&o=0&o=1&start=20

Anton Ertl

Sep 2, 2006, 3:39:19 PM
Bill Todd <bill...@metrocast.net> writes:

>Nick Maclaren wrote:
> And, in both cases, the same remarks apply to plugging
>> in the forthcoming quad-core chips into those motherboards
>
>Quad-core Opterons should work just fine in current quad-socket boards,
>though any added Hypertransport traffic would make the 8-socket
>situation even worse

You definitely want a new board for the quad-core Opterons to get best
performance out of the CPUs: they have 4 16-bit HT v3 links (2.6GHz),
providing full single-hop connections in a 4-way system; these links
can be split into 8 8-bit links for full single-hop connectivity in an
8-way system. So a board designed for these new chips will be much
different from the old ones.
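
[A quick sketch, not part of the original post: the per-link arithmetic
implied by those figures, using the usual HyperTransport convention that a
link is double-pumped, i.e. bytes/s per direction = width in bytes * clock *
2 transfers per clock. The 2.6 GHz clock and the 16-bit/8-bit split are from
the post above; the 4 GB/s result for the current links matches the figure
quoted later in the thread.]

def ht_link_bw_gb_s(width_bits, clock_ghz):
    # Per-direction bandwidth of a double-pumped HyperTransport link, in GB/s.
    return (width_bits / 8) * clock_ghz * 2

print(ht_link_bw_gb_s(16, 1.0))  # current 16-bit 1 GHz links: 4.0 GB/s
print(ht_link_bw_gb_s(16, 2.6))  # a 16-bit HT v3 link at 2.6 GHz: 10.4 GB/s
print(ht_link_bw_gb_s(8, 2.6))   # one half of a split 8-bit link: 5.2 GB/s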

Of course, if these new chips are socket-compatible with the current
ones, it can still make sense to upgrade an existing machine with new
CPUs rather than buying a new box; the performance of the upgraded
machine will not be as good as that of a new box, but still better
than the original machine.

BTW, this information comes from a presentation that someone from AMD
gave at Euro-Par on Wednesday. Another interesting thing he mentioned
was that AMD has no plans to support fully-buffered DIMMs, but will
support DDR2 and DDR3 memory (probably registered); the reasons given
were energy consumption and latency. Hmm, the market for server RAMs
is small already, I hate seeing it split between registered and
fully-buffered RAM:-(.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Nick Maclaren

Sep 2, 2006, 4:03:33 PM

In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,

Bill Todd <bill...@metrocast.net> writes:
|>
|> As usual, Nick's confidence exceeds his accuracy. While an 8-socket
|> Opteron system does not scale *linearly* up from a 4-socket system in
|> performance, it does scale significantly (around 1.5x in the benchmarks
|> I've seen).

And it scales negatively in the ones that a colleague of mine ran.
That could well be because Tyan made a complete pig's ear of either
the design or the system, so let's not make too much of that.

All of the benchmarks that I have seen quoted have been low memory use
or low communication, or both, and are completely unrepresentative of
what most real customers want large SMPs for. Note that the ONLY
interesting benchmarks represent uses where using two 4-socket Opterons
would be impractical.

|> The problem appears to be congestion in the Hypertransport
|> links due to cache-coherence broadcasts (interestingly, from the
|> topology one might expect something like a 6-socket system to scale
|> noticeably more linearly, though I haven't worked that out in enough
|> detail to be sure).

"The" problem? Oh, my dear sir! You haven't looked hard enough. Yes,
that is perhaps the main one, but there are a number of related issues.

|> And my impression is that Intel's *current* next-generation products are
|> limited to dual-socket use, period (i.e., not simply because a 4-socket
|> system wouldn't scale well - though indeed that would very likely be the
|> case, given the shared-bus implementation).

Your impression is wrong. At least according to the Intel technical
people I have spoken to.

|> And, in both cases, the same remarks apply to plugging
|> > in the forthcoming quad-core chips into those motherboards
|>
|> Quad-core Opterons should work just fine in current quad-socket boards,
|> though any added Hypertransport traffic would make the 8-socket
|> situation even worse (and the same is true for shared-bus congestion on
|> current Intel boards).

And they would also be very short on memory bandwidth, which is "a" (not
"the") bottleneck for many applications.

|> The next generation of boards (and in Intel's case supporting chipsets)
|> should help a lot, with both shared-bus bandwidth limitations in Intel's
|> case and with 8-socket Opteron configurations (the next Opteron
|> generation reportedly has 4 much-higher-bandwidth HT links per chip,
|> each of which is splittable, providing full direct connectivity and
|> drastically reducing cache-coherence congestion as a result).

I could say more, but I can't remember what is public and what is NDA,
but you are right in principle even if not in detail.

|> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
|> >
|> > Do they just? Do you have a reference? I keep hearing reports of this,
|> > but haven't tracked down any real products.
|>
|> Try looking at the TPC-C and SAP SD 2-tier results.

I am not interested in such issues; I want to see some description of
the architecture, what it supports, and what it doesn't. You may feel
that is unreasonable of me, but most readers of comp.arch won't.

Gluing multiple servers together and running lots of things in parallel
is easy - THAT never was a problem.


Regards,
Nick Maclaren.

mas

Sep 2, 2006, 4:26:55 PM
Nick Maclaren wrote:
> In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
> Bill Todd <bill...@metrocast.net> writes:
> |>
> |> As usual, Nick's confidence exceeds his accuracy. While an 8-socket
> |> Opteron system does not scale *linearly* up from a 4-socket system in
> |> performance, it does scale significantly (around 1.5x in the benchmarks
> |> I've seen).
>
> And it scales negatively in the ones that a colleague of mine ran.
> That could well be because Tyan made a complete pig's ear of either
> the design or the system, so let's not make too much of that.
>

The Tyan K8QW (2 * 2-hops and 6 * 3-hops) has the best 8-way topology
design after Sun's x4600 (4 * 2-hops and 4 * 3-hops) with Fabric and
Iwill having 4-hop nodes in their 8-way designs.

http://www.aceshardware.com/forums/read_post.jsp?id=120061497&forumid=1
http://www.aceshardware.com/forums/read_post.jsp?id=120061524&forumid=1


> All of the benchmarks that I have seen quoted have been low memory use
> or low communication, or both, and are completely unrepresentative of
> what most real customers want large SMPs for. Note that the ONLY
> interesting benchmarks represent uses where using two 4-socket Opterons
> would be impractical.
>
> |> The problem appears to be congestion in the Hypertransport
> |> links due to cache-coherence broadcasts (interestingly, from the
> |> topology one might expect something like a 6-socket system to scale
> |> noticeably more linearly, though I haven't worked that out in enough
> |> detail to be sure).
>
> "The" problem? Oh, my dear sir! You haven't looked hard enough. Yes,
> that is perhaps the main one, but there are a number of related issues.
>

[....]

>
> Regards,
> Nick Maclaren.

Nick Maclaren

Sep 2, 2006, 5:00:36 PM

In article <1157224077.6...@m73g2000cwd.googlegroups.com>,

"mas" <mas...@hotmail.com> writes:
|> >
|> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
|> >
|> > Do they just? Do you have a reference? I keep hearing reports of this,
|> > but haven't tracked down any real products. There are a LOT of problems
|> > scaling Xeons up to that point.
|>
|> http://www-03.ibm.com/systems/x/scalable/x3950/index.html
|> http://www-03.ibm.com/servers/eserver/xseries/x460.html

Thanks, but those don't really say anything. What I am trying to find
out is whether they support single SMP system images and SMP applications
that scale beyond 4 sockets and, if they do, how. Your next paragraph
implies that they don't.

I agree that systems that have a single management infrastructure for
multiple SMP systems, or even single systems that can manage several
SMP sub-systems (with no application spanning a small SMP) are good
ideas, but they aren't what most people expect.

|> although due to the current 36-bit memory (64GB) limitation of EM64T
|> what you are in effect getting is 8 * 4-socket 64GB partitions in one
|> server with each partition's processors only accessing local memory.
|> Unisys do the same thing with the ES7000
|>
|> http://www.unisys.com/products/enterprise__servers/high_d_end__servers/models/es7000_s_one__servers.htm

Ah. Well, I am not really very interested in that. While such systems
may be very useful, they can really only support uses that would be
perfectly all right on 8 separate 4-socket SMPs. That isn't what most
people mean when they refer to architectural limitations on the number
of CPUs.


Regards,
Nick Maclaren.

Niels Jørgen Kruse

Sep 2, 2006, 5:05:51 PM
<mail...@earnshaw.com.au> wrote:

You don't mention how large these Alpha/Tru64 servers are. If they can
be replaced with blade servers, I suppose they are not all that large.

It is surprisingly common for businesses to get all their IT equipment
from a single vendor; that is likely the reason for the weird choice.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Thor Lancelot Simon

Sep 2, 2006, 6:13:04 PM
In article <edcjlq$7oj$1...@gemini.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>
>In article <1157220934.0...@i42g2000cwa.googlegroups.com>,
>"mas" <mas...@hotmail.com> writes:
>|>
>|> Now might be a good time to get off the proprietary road and if you
>|> need largish SMP from a Tier-1 OEM, Sun make an 8-socket 16-core
>|> Opteron server (x4600) which you could run with solaris, linux or
>|> windows. The 2007 AMD quad-core should fit in that server making it
>|> 32-core.
>
>Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
>8, the memory architecture isn't man enough for more than 2 and 4,
>respectively.

I'd appreciate it if you'd elucidate on this, specifically with regard
to Opteron. Adding an Opteron, after all, adds a memory controller; in
fact, there are clearly applications where an 8-socket Opteron with
single-core chips should handily outperform a 4-socket one with dual-cores.

--
Thor Lancelot Simon t...@rek.tjls.com

"We cannot usually in social life pursue a single value or a single moral
aim, untroubled by the need to compromise with others." - H.L.A. Hart

Nick Maclaren

Sep 2, 2006, 6:52:16 PM

In article <1157228815....@p79g2000cwp.googlegroups.com>,

"mas" <mas...@hotmail.com> writes:
|> >
|> > And it scales negatively in the ones that a colleague of mine ran.
|> > That could well be because Tyan made a complete pig's ear of either
|> > the design or the system, so let's not make too much of that.
|>
|> The Tyan K8QW (2 * 2-hops and 6 * 3-hops) has the best 8-way topology
|> design after Sun's x4600 (4 * 2-hops and 4 * 3-hops) with Fabric and
|> Iwill having 4-hop nodes in their 8-way designs.
|>
|> http://www.aceshardware.com/forums/read_post.jsp?id=120061497&forumid=1
|> http://www.aceshardware.com/forums/read_post.jsp?id=120061524&forumid=1

In theory, yes. I can't tell you why the system behaved so badly, and
my colleague (who is EXTREMELY competent) failed to find out and has
got no response from Tyan.


Regards,
Nick Maclaren.

Bill Todd

Sep 2, 2006, 11:44:02 PM
Anton Ertl wrote:
> Bill Todd <bill...@metrocast.net> writes:
>> Nick Maclaren wrote:
>> And, in both cases, the same remarks apply to plugging
>>> in the forthcoming quad-core chips into those motherboards
>> Quad-core Opterons should work just fine in current quad-socket boards,
>> though any added Hypertransport traffic would make the 8-socket
>> situation even worse
>
> You definitely want a new board for the quad-core Opterons to get best
> performance out of the CPUs

That, of course, depends on what you may be doing with them. For
NUMA-optimizable workloads it's not clear that the new board will buy
you much, since the only HT traffic will be cache-coherence traffic.
The existing HT configuration in quad-socket boards doesn't appear to
have any trouble in this regard, and since HT cache-coherence loading in
today's 8-socket boards is around 3x that in the quad-socket boards, the
latter may well handle double the number of cores per socket just fine.

Though having no two-hop cache-coherence *latencies* to deal with might
be worth something for two of the four sockets - I forget whether AMD
manages to hide most of that latency behind the local RAM access or not.

Even for non-NUMA-optimized workloads, only two of the four sockets on
today's boards ever make two-hop remote accesses, and on average only
1/4 of the time (assuming that their RAM targets are evenly distributed
around the 4 sockets). So from the standpoint of average latency there
shouldn't be much difference with the new, fully-connected quad-socket
board: only if today's HT links can't handle the quad-core
remote-access *bandwidth* (itself of course variable by workload) are
you likely to see any significant difference.

...

> Another interesting thing he mentioned
> was that AMD has no plans to support fully-buffered DIMMs,

Not this week, anyway: the amount of waffling they've done on this
issue suggests that they really don't know one way or the other and
probably won't decide for a while yet while the situation w.r.t.
FB-DIMMs develops.

> but will
> support DDR2 and DDR3 memory (probably registered); the reasons given
> were energy consumption and latency. Hmm, the market for server RAMs
> is small already, I hate seeing it split between registered and
> fully-buffered RAM:-(.

ISTR seeing something recently about a new variant that promised to
replace both, but details were too sketchy to form any opinion about.

- bill

Bill Todd

Sep 3, 2006, 2:51:39 AM
Nick Maclaren wrote:
> In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
> Bill Todd <bill...@metrocast.net> writes:

...

> |> The problem appears to be congestion in the Hypertransport
> |> links due to cache-coherence broadcasts (interestingly, from the
> |> topology one might expect something like a 6-socket system to scale
> |> noticeably more linearly, though I haven't worked that out in enough
> |> detail to be sure).
>
> "The" problem? Oh, my dear sir! You haven't looked hard enough. Yes,
> that is perhaps the main one, but there are a number of related issues.

Yawn. This kind of empty bloviation is far too much a habit with you,
Nick: be specific if you want anyone to pay attention.

...

> |> And, in both cases, the same remarks apply to plugging
> |> > in the forthcoming quad-core chips into those motherboards
> |>
> |> Quad-core Opterons should work just fine in current quad-socket boards,
> |> though any added Hypertransport traffic would make the 8-socket
> |> situation even worse (and the same is true for shared-bus congestion on
> |> current Intel boards).
>
> And they would also be very short on memory bandwidth, which is "a" (not
> "the") bottleneck for many applications.

For exactly what values of 'very short'? With today's HT bandwidth of 4
GB/s in each direction for each of the two HT links per socket on a
quad-socket board, one might observe that you've got to be doing rather
vigorous remote memory access (decidedly non-NUMA-optimized) before that
situation arises: true, local memory offers 6.4 GB/s using 400 MHz DDR
or over 10 GB/s using 667 MHz DDR2, but as long as even just a simple
majority of accesses go to a local rather than to a remote location (or
accesses just aren't taxing bandwidth limits that much anyway) today's
quad-socket boards should not be at a disadvantage compared with tomorrow's.
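
[A back-of-envelope sketch, not part of the original post, of the claim just
made, using only the figures above. Assumptions: each socket streams at its
local-memory rate, remote traffic splits evenly across the socket's two
coherent HT links, and coherence/pass-through overhead is ignored.]

HT_LINK_GB_S = 4.0      # per direction, per link (figure from the post)
COHERENT_LINKS = 2      # coherent HT links per socket on today's quad boards

def remote_load_per_link(local_mem_gb_s, local_fraction):
    # GB/s of remote traffic each link carries if the socket runs flat out
    # at its local-memory rate and local_fraction of accesses stay local.
    remote = local_mem_gb_s * (1.0 - local_fraction)
    return remote / COHERENT_LINKS

for mem, label in ((6.4, "DDR-400"), (10.7, "DDR2-667")):
    for f in (0.5, 0.75):
        load = remote_load_per_link(mem, f)
        verdict = "fits" if load <= HT_LINK_GB_S else "exceeds"
        print(f"{label}, {f:.0%} local: {load:.1f} GB/s per link "
              f"({verdict} the {HT_LINK_GB_S} GB/s link)")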

>
> |> The next generation of boards (and in Intel's case supporting chipsets)
> |> should help a lot, with both shared-bus bandwidth limitations in Intel's
> |> case and with 8-socket Opteron configurations (the next Opteron
> |> generation reportedly has 4 much-higher-bandwidth HT links per chip,
> |> each of which is splittable, providing full direct connectivity and
> |> drastically reducing cache-coherence congestion as a result).
>
> I could say more, but I can't remember what is public and what is NDA,
> but you are right in principle even if not in detail.

Yawn again, Nick: I've heard you claim uncitable sources to back up
utter garbage too often to be impressed - put up, or shut up.

>
> |> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
> |> >
> |> > Do they just? Do you have a reference? I keep hearing reports of this,
> |> > but haven't tracked down any real products.
> |>
> |> Try looking at the TPC-C and SAP SD 2-tier results.
>
> I am not interested in such issues; I want to see some description of
> the architecture, what it supports, and what it doesn't. You may feel
> that is unreasonable of me, but most readers of comp.arch won't.

I don't feel that it's unreasonable, it's just not at all what you
originally asked for (still quoted above), which was, necessarily, what
I responded to: you asked for a reference indicating the existence of
'real products', and I gave you one.

>
> Gluing multiple servers together and running lots of things in parallel
> is easy - THAT never was a problem.

Perhaps you should actually look at the references that I provided
before babbling any further: the systems involved are not clustered
(you can also ascertain their nature by their software pricing).
Exactly how IBM goes about allowing its Xeons to access more than 64 GB
of physical RAM I don't know (unless they're correct in their assertion
that Potomac - and, presumably, newer - Xeons support 40 bits of
physical addressing rather than 36 bits, which seems reasonable), but
the fact that they do is indisputable: not only do they characterize
their 32-socket systems as being 'SMP' in their product presentation,
but a while ago they had a TPC-C submission that used a quad-Xeon board
(I think with the previous generation of their 'x architecture', which
also supported larger-than-64-GB systems) plus an additional chassis
that contained no processors at all, just to expand the available RAM to
128 GB.
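
[Not part of the original post, just the arithmetic behind the addressing
limits under discussion: 2**bits bytes of physical address space.]

for bits in (36, 40):
    print(bits, "bits ->", 2 ** bits // 2 ** 30, "GB of physical address space")
# 36 bits -> 64 GB (the EM64T limit mentioned earlier); 40 bits -> 1024 GB (1 TB)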

Now, IBM's current 8-socket system isn't all that much better at TPC-C
than that augmented quad-socket system I just mentioned, and their
scaling falls off a lot more at 16 sockets: clearly, off-board
processor interactions suffer significantly. By contrast, the Unisys
systems (which also emphasize their SMP nature - "ES7000/one Server can
support a scale-up, single application instance") start less
impressively but scale up quite well, handily beating the best that
Itanic can offer in TPC-C scores at both the 16- and 32-core system sizes.

- bill

Bill Todd

Sep 3, 2006, 3:01:12 AM
Bill Todd wrote:

..

> By contrast, the Unisys
> systems (which also emphasize their SMP nature - "ES7000/one Server can
> support a scale-up, single application instance") start less
> impressively but scale up quite well, handily beating the best that
> Itanic can offer in TPC-C scores at both the 16- and 32-core system sizes.

My bad - I forgot about a recent 32-core Fujitsu Itanic submission that
just edges out the Unisys 32-core Xeon, so it's more accurate to say
that Xeon and Itanic are neck-and-neck right now in large-system TPC-C
performance (and Itanic might actually pull ahead if HP gets its act
together and submits a new-generation Superdome Montecito system, though
the fact that they have not done so may be significant in itself).

- bill

mail...@earnshaw.com.au

Sep 3, 2006, 3:51:56 AM
We run two ES40 (Commercial: Oracle Financials, EMPAC and Technical: WENCO
+ custom apps), then two DS20 (Datamart and a Test/Dev). The machines are
not really overused, with periodic spikes of at most 60% ... the biggest
problem I have seen is a poor SQL script on Commercial. They are now 5
years old so we are replacing them.

As the local person responsible, I don't want to go to HP because of
very poor support in the past 5 years. The cost I am being quoted for the
PARISC is also steep compared to the Power. Unfortunately Linux is out,
global corporate policy. Another factor is our location, surrounded by
desert and 600km between drinks, two flights a day. The blade option seems
to be a good fit for local maintenance.

If I have certification for our mission-critical applications, either
Oracle EBS or Mincom Ellipse (replacing EMPAC), on "Tier-1 UNIX", am I
really missing anything going to Power? I am after a highly available
option rather than "compute capabilities" - if that makes sense.

Thanks

ME


Rob Warnock

Sep 3, 2006, 4:07:31 AM
[Apologies if I've messed up the attributions... the trail was messy...]

Bill Todd <bill...@metrocast.net> wrote:
+---------------


| Nick Maclaren wrote:
| > Bill Todd <bill...@metrocast.net> writes:

| > |> Quad-core Opterons should work just fine in current quad-socket boards,
| > |> though any added Hypertransport traffic would make the 8-socket
| > |> situation even worse (and the same is true for shared-bus congestion
| > |> on current Intel boards).
| >
| > And they would also be very short on memory bandwidth, which is "a"
| > (not "the") bottleneck for many applications.
|
| For exactly what values of 'very short'? With today's HT bandwidth of 4
| GB/s in each direction for each of the two HT links per socket on a
| quad-socket board, one might observe that you've got to be doing rather
| vigorous remote memory access (decidedly non-NUMA-optimized) before that
| situation arises: true, local memory offers 6.4 GB/s using 400 MHz DDR
| or over 10 GB/s using 667 MHz DDR2, but as long as even just a simple
| majority of accesses go to a local rather than to a remote location (or
| accesses just aren't taxing bandwidth limits that much anyway) today's
| quad-socket boards should not be at a disadvantage compared with tomorrow's.

+---------------

Don't forget that with Opteron even *local* memory accesses require
getting snoop responses back from *all* of the other CPUs. The HT
coherency traffic goes up with the number of CPU cores, not just
the number of sockets. And even if *all* the CPUs are hitting *only*
local memory [perfect NUMA placement], there will be HT coherency
traffic proportional to the product of the cache miss rate and the
number of CPU cores.

[Directory-based caches start to win really big on high-CPU-count
systems....]


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Bill Todd

Sep 3, 2006, 5:00:36 AM
Rob Warnock wrote:
> [Apologies if I've messed up the attributions... the trail was messy...]
>
> Bill Todd <bill...@metrocast.net> wrote:
> +---------------
> | Nick Maclaren wrote:
> | > Bill Todd <bill...@metrocast.net> writes:
> | > |> Quad-core Opterons should work just fine in current quad-socket boards,
> | > |> though any added Hypertransport traffic would make the 8-socket
> | > |> situation even worse (and the same is true for shared-bus congestion
> | > |> on current Intel boards).
> | >
> | > And they would also be very short on memory bandwidth, which is "a"
> | > (not "the") bottleneck for many applications.
> |
> | For exactly what values of 'very short'? With today's HT bandwidth of 4
> | GB/s in each direction for each of the two HT links per socket on a
> | quad-socket board, one might observe that you've got to be doing rather
> | vigorous remote memory access (decidedly non-NUMA-optimized) before that
> | situation arises: true, local memory offers 6.4 GB/s using 400 MHz DDR
> | or over 10 GB/s using 667 MHz DDR2, but as long as even just a simple
> | majority of accesses go to a local rather than to a remote location (or
> | accesses just aren't taxing bandwidth limits that much anyway) today's
> | quad-socket boards should not be at a disadvantage compared with tomorrow's.
> +---------------
>
> Don't forget that with Opteron even *local* memory accesses require
> getting snoop responses back from *all* of the other CPUs.

Since I addressed this elsewhere, it's not likely that I forgot it here.
My observation on that point was that an 8-socket configuration
generates around 3x the coherency traffic *per HT link* that a
quad-socket configuration does, suggesting that the quad-socket
configuration may have rather a lot of bandwidth to spare (given that
the 8-socket configuration manages to function at all).

> The HT
> coherency traffic goes up with the number of CPU cores, not just
> the number of sockets.

AMD's presentations are not clear on this point, but they at least
suggest that coherency traffic on the HT links does *not* increase with
the number of cores, just with the number of sockets (which is certainly
at least possible, given the architecture).

> And even if *all* the CPUs are hitting *only*
> local memory [perfect NUMA placement], there will be HT coherency
> traffic proportional to the product of the cache miss rate and the
> number of CPU cores.

Only up to the point where the local memory bandwidth is saturated:
once that point is reached, it doesn't matter how many more local cores
you add - there won't be any more coherency traffic, because there won't
be any more local accesses.

So if the links can support the coherency traffic generated by
local-only accesses sufficient to saturate the local memory bandwidth on
all sockets (are there STREAMS results for quad-socket Opteron systems
that could shed light on that?), the only question is how much link
bandwidth is left over to satisfy some percentage of remote accesses.
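
[A rough sketch, not part of the original post, of the bound being argued
here. The cache-line size is the usual 64 bytes; the probe and response
packet sizes are assumptions for illustration, not AMD figures.]

CACHE_LINE = 64        # bytes fetched per miss
PROBE_BYTES = 8        # assumed size of a snoop probe on the link
RESPONSE_BYTES = 8     # assumed size of a snoop response

def snoop_gb_s_at_saturation(local_mem_gb_s, sockets):
    # Upper bound on the snoop traffic one socket injects (GB/s, spread over
    # its links) when its miss rate is capped by local memory bandwidth and
    # every miss probes all the other sockets.
    misses_per_s = local_mem_gb_s * 1e9 / CACHE_LINE
    bytes_per_miss = (sockets - 1) * (PROBE_BYTES + RESPONSE_BYTES)
    return misses_per_s * bytes_per_miss / 1e9

print(snoop_gb_s_at_saturation(6.4, 4))    # ~4.8 GB/s with DDR-400 sockets
print(snoop_gb_s_at_saturation(10.7, 4))   # ~8.0 GB/s with DDR2-667 sockets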

- bill

Nick Maclaren

Sep 3, 2006, 6:08:09 AM

In article <edcvlg$s0q$1...@reader2.panix.com>,

t...@panix.com (Thor Lancelot Simon) writes:
|> >
|> >Don't bother. While Intel NGMA supports 4 sockets and Opteron supports
|> >8, the memory architecture isn't man enough for more than 2 and 4,
|> >respectively.
|>
|> I'd appreciate it if you'd elucidate on this, specifically with regard
|> to Opteron. Adding an Opteron, after all, adds a memory controller; in
|> fact, there are clearly applications where an 8-socket Opteron with
|> single-core chips should handily outperform a 4-socket one with dual-cores.

Certainly.

Let's start with Intel. The latest chipsets have two higher-speed FSBs,
and so overtake Opteron, but they can't be extended, and there are some
quite serious latency problems (which I find bizarre, given that the MCH
is a single chip). Everything works well for dual-socket Woodcrest, but
is definitely underpowered for quad-socket, dual-core or dual-socket,
quad-core. But Intel are working on that.

The Opteron situation is very different. Purely local memory accesses
are purely local, but an access by CPU A to CPU B's memory or a HT packet
from CPU A to CPU C that needs to pass through CPU B needs dealing with
by the memory part of CPU B. Also, all cache coherence queries (even for
local memory) need to go everywhere, and I believe that Rob Warnock is
right that the number goes up with the number of cores.

But there is more. In a normal 4-socket design, each CPU has to handle
at most one simplex pass-through path, and only the two CPUs not supporting
I/O have to do even that. I haven't worked out the precise numbers for
the various 8-socket topologies, but I am pretty sure that most of the CPUs
have to support at least 3 simplex pass-through routes. Even when the HT
links are not overloaded, that is not good for latency.

The coherence protocol and memory controller design are not published,
but my understanding is that they are synchronous and serial (to keep
the overheads minimal). And remember that cross-CPU memory references
start being critical as soon as you do much of any true SMP work,
process migration, shared file/memory/FIFOs and so on.

Yes, there are applications where this doesn't matter, but experience with
large SMP systems is that even apparently independent applications cause
more inter-CPU traffic when run in a complete system than the naive would
expect. And it isn't always for the reasons that even the experts expect.


Regards,
Nick Maclaren.

Nick Maclaren

Sep 3, 2006, 6:30:03 AM

Oops. Sorry. I forgot to mention two things.

Bisection bandwidth per CPU. That is, of course, half the value for an
8-socket system as for a 4-socket, so problems with HT bandwidth start
to occur much earlier. And system loadings that are pushing that will
almost certainly start to impact on coherence latency.

And, as others have said, AMD are working on all of this, and are doing
their usual information leaking - a dribble here and a dribble there.
But it seems that the next generation of Opteron will reduce a lot of
these problems.

All the commentators who say that AMD/Intel has just scored a knockout
blow are just bullshitting - in 2005/6 AMD led, in 2007 Intel will, in
2008 who knows?


Regards,
Nick Maclaren.

Paradox

Sep 3, 2006, 6:47:43 AM
Nick Maclaren wrote:

The list price for a 4-way 1.65 GHz POWER 5+ with 4 GB RAM and 4
hot-swappable SCSI disk bays (two 73.4 GB 10K drives included) is
$5,830. (The p5 510Q, prices from IBM's website. Yes, this price does
not include AIX, which will cost you around a grand more. But you can
also run linux on the box.)

So at least in some configurations I guess the POWER 5+ is actually
more-or-less the same value as Opteron or Woodcrest.

Nick Maclaren

Sep 3, 2006, 7:00:17 AM

In article <1157280463....@m73g2000cwd.googlegroups.com>,

"Paradox" <parad...@gmail.com> writes:
|>
|> The list price for a 4-way 1.65 GHz POWER 5+ with 4 GB RAM and 4
|> hot-swappable SCSI disk bays (two 73.4 GB 10K drives included) is
|> $5,830. (The p5 510Q, prices from IBM's website. Yes, this price does
|> not include AIX, which will cost you around a grand more. But you can
|> also run linux on the box.)

Interesting. That is definitely competitive for the hardware.

In my experience, the main cost of AIX is not the money you pay for it.
I have no experience with Linux on those systems, but have good reason to
believe that it isn't as much of an improvement over AIX as it might be.

|> So at least in some configurations I guess the POWER 5+ is actually
|> more-or-less the same value as Opteron or Woodcrest.

Hmm. Yes and no. In terms of TCO, I would say it is double the cost.
The problem is support effort and finding technical staff and expertise.

I may be biased, looking at it from an HPC viewpoint, and this may not
be the case for a commercial Oracle shop, but I have reason to believe
that many of the same issues apply.


Regards,
Nick Maclaren.

Paradox

Sep 3, 2006, 7:30:51 AM
Nick Maclaren wrote:

Yes, the initial hardware price does not tell the whole story. But I
just found the pricing of the new entry-level pSeries machines from IBM
interesting. It has been true for a long time that x86 gave the most
"bang for the buck", often by a very large margin. But now one can get
a POWER 5+ for the same amount as a Woodcrest/Opteron. And download a
free version of e.g. SUSE Linux for the pSeries on the net (this is of
course nothing new).

Nick Maclaren

Sep 3, 2006, 9:29:44 AM

In article <1157283051.1...@i3g2000cwc.googlegroups.com>,

"Paradox" <parad...@gmail.com> writes:
|>
|> Yes, the initial hardware price does not tell the whole story. But I
|> just found the pricing of the new entry pSeries machines from IBM
|> interesting. It has been true for a long time that x86 gave the most
|> "bang for the buck", often by a very large margin. But now one can get
|> a POWER 5+ for the same amount as a Woodcrest/Opteron. And download a
|> free version of e.g. SUSE linux for the pSeries on the net (this is of
|> course nothing new.)

So do I. Ever since the very early days of the PowerPC, IBM have had
many opportunities to reshape the industry, and have thrown away every
one. It isn't yet out of the question that they will achieve it but,
if their record is anything to go by, it will be by accident, not design.


Regards,
Nick Maclaren.

Torbjorn Lindgren

Sep 3, 2006, 10:30:42 AM
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
>Bill Todd <bill...@metrocast.net> writes:
>|> And my impression is that Intel's *current* next-generation products are
>|> limited to dual-socket use, period (i.e., not simply because a 4-socket
>|> system wouldn't scale well - though indeed that would very likely be the
>|> case, given the shared-bus implementation).
>
>Your impression is wrong. At least according to the Intel technical
>people I have spoken to.

So where IS Intel's Core (Core 2) microarchitecture quad-socket CPU
then? This is clearly Intel's "current next-generation products"...

Yes, there's a new Xeon MP, the 7100 series, named to make it look like
it's related to the Woodcrest (5100)... However, Intel's own pages
make it fairly clear that it's a NetBurst-based part with a BIG L3
cache (4/8/16 MB) and more bandwidth (quite possibly multiple FSBs,
like the 5000 (NetBurst) and 5100 (Woodcrest)).

There are a couple of lower-clocked models with less cache (and
lower power usage) which they then compare to the previous top models
to get impressive power reductions, but the top-end ones use quite a
bit of power, even if they're not quite as bad as the previous ones.

Yes, it's a significant improvement on what they had before in the 4P
space (though much of it probably comes from the giant cache), so some
chest-beating from Intel is to be expected, but Bill Todd's statement
appears to be correct (your statement clearly targets Todd's
"impression", i.e. that specific statement).

http://www.intel.com/products/processor/xeon/index.htm

Most of the semitechnical 7100 information is under the 7000 header,
but tagged with 7100; following the actual 7100 links gives mostly
marketing fluff.

Nick Maclaren

Sep 3, 2006, 11:50:13 AM

In article <mGBKg.57218$Df2....@fe05.news.easynews.com>,

Torbjorn Lindgren <t...@none.invalid> writes:
|> Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
|> >In article <Y5adnRfBIdk8RGTZ...@metrocastcablevision.com>,
|> >Bill Todd <bill...@metrocast.net> writes:
|> >|> And my impression is that Intel's *current* next-generation products are
|> >|> limited to dual-socket use, period (i.e., not simply because a 4-socket
|> >|> system wouldn't scale well - though indeed that would very likely be the
|> >|> case, given the shared-bus implementation).
|> >
|> >Your impression is wrong. At least according to the Intel technical
|> >people I have spoken to.
|>
|> So where IS Intel's Core (Core 2) microarchitecture quad-socket CPU
|> then? This is clearly Intels "current next-generation products"...

Quite. I am almost certain that it exists, works and is usable, and the
only question is whether it is worth bothering with. My understanding is
that it isn't, at least for most people, and a hell of a lot of OEMs seem
to agree with me (as they do about the 8-socket Opteron).


Regards,
Nick Maclaren.


Piotr Wyderski

Sep 3, 2006, 12:03:33 PM
Nick Maclaren wrote:

> I have no experience with Linux on those systems, but have good reason to
> believe that it isn't as much of an improvement over AIX as it might be.

There are several reasons to choose Linux:

1. The installer of Oracle will not hang your pSeries machine. :-)

2. There is a working and fully tested C++ compiler, namely GCC.
On AIX there is XL (non-industrial quality because of the number
of bugs it contains -- I was able to find four per week) and a broken
port of GCC (has many problems with shared libraries).

3. The support is much better and faster. I wanted to report to IBM
the bugs I found in the newest version of XL, but the website claims
that I need to buy a support package to be able to report them. :-)))

BTW, just to be clear: I am not a fan of Linux; in case you wish to stick
a label on me, I am a Windows NT-series zealot [;-)], but from my
personal experience I may say for sure that it is much better than AIX
and HP-UX. One of my professional duties is porting business software
into various Unix systems, but only these two make my knees shake...

Best regards
Piotr Wyderski

John Dallman

Sep 3, 2006, 12:43:00 PM
In article <44fa...@quokka.wn.com.au>, mail...@earnshaw.com.au ()
wrote:

> Unfortunately Linux is out, global corporate policy.

Is there a rationale for that that you can quote? I quite understand
that they aren't going to change it on the basis of what people say on
Usenet. I'm just interested in the reasoning.

Are HP saying that the replacement for a PA-RISC system will have to be
Itanium, or do they expect to be able to offer you PA-RISC again in 3-5
years? If your management are going to want to go HP, going straight to
Itanium now makes a certain amount of sense, so that you can get the
disaster over with while the Alphas are still running. Management often
worry if someone who's opposed a plan switches to supporting it more
than them; it might make them think.

But if the software you need isn't available on Itanium now, I'd be
/very/ cautious about believing that it will be in a few years time. I'd
hold out for a contract where HP paid all my transition costs to a
platform of my choice if my needs couldn't be met by Itanium, because
the software wasn't available and couldn't be met by PA-RISC because the
hardware was no longer available. Of course, HP will not sign such a
contract, but their reasons for not doing so may well make an impact on
your management.

If the software you need is available now, your "only worry" is the
suppliers dropping Itanium.

---
John Dallman j...@cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

Niels Jørgen Kruse

Sep 3, 2006, 3:44:38 PM
John Dallman <j...@cix.co.uk> wrote:

> But if the software you need isn't available on Itanium now, I'd be
> /very/ cautious about believing that it will be in a few years time. I'd
> hold out for a contract where HP paid all my transition costs to a
> platform of my choice if my needs couldn't be met by Itanium, because
> the software wasn't available and couldn't be met by PA-RISC because the
> hardware was no longer available. Of course, HP will not sign such a
> contract, but their reasons for not doing so may well make an impact on
> your management.

Doesn't HP have emulation software to run any PA-RISC binaries on
Itanium?

John Dallman

Sep 3, 2006, 4:36:00 PM
In article <1hl3yxg.i759trs41kmN%nos...@ab-katrinedal.dk>,
nos...@ab-katrinedal.dk (Niels Jørgen Kruse) wrote:

> John Dallman <j...@cix.co.uk> wrote:
> > But if the software you need isn't available on Itanium now, I'd be
> > /very/ cautious about believing that it will be in a few years

> > time. ...


> Doesn't HP have emulation software to run any PA-RISC binaries on
> Itanium?

Good point. They do; it isn't superbly fast but it works well. One would
want to benchmark before starting to rely on it for major workloads.
The question of Itanium's long-term future still applies.

Rick Jones

Sep 3, 2006, 6:30:34 PM
Niels Jørgen Kruse <nos...@ab-katrinedal.dk> wrote:
> Doesn't HP have emulation software to run any PA-RISC binaries on
> Itanium?

Yes. It is called Aries. Included in HP-UX 11iv2 (aka 11.23).

rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

davew...@gmail.com

Sep 3, 2006, 8:19:10 PM
Bill Todd wrote:
> Rob Warnock wrote:
> > [Apologies if I've messed up the attributions... the trail was messy...]
> >
> > Don't forget that with Opteron even *local* memory accesses require
> > getting snoop responses back from *all* of the other CPUs.
>
> Since I addressed this elsewhere, it's not likely that I forgot it here.
> My observation on that point was that an 8-socket configuration
> generates around 3x the coherency traffic *per HT link* that a
> quad-socket configuration does, suggesting that the quad-socket
> configuration may have rather a lot of bandwidth to spare (given that
> the 8-socket configuration manages to function at all).

Three and a half years ago, I wrote that with snoop-broadcast, going
from 4 to 8 sockets, Hammer would not scale well without "additional
support".
At the time, you referred to that as sounding like hot air that lacked
quantitative analysis for support.

http://www.realworldtech.com/forums/index.cfm?action=detail&id=14858&threadid=14827&roomid=11

> > The HT coherency traffic goes up with the number of CPU cores, not just
> > the number of sockets.
>
> AMD's presentations are not clear on this point, but they at least
> suggest that coherency traffic on the HT links does *not* increase with
> the number of cores, just with the number of sockets (which is certainly
> at least possible, given the architecture).

If you actually have something useful running on each one of those
cores, each threaded context will generate independent memory requests
that will likely have to leave the socket - unless it hits on a cache
somewhere within the socket. The coherency traffic scales relative to
the number of independent outstanding misses, not to the socket or cpu
per se.

Bill Todd

Sep 4, 2006, 12:35:31 AM
davew...@gmail.com wrote:
> Bill Todd wrote:
>> Rob Warnock wrote:
>>> [Apologies if I've messed up the attributions... the trail was messy...]
>>>
>>> Don't forget that with Opteron even *local* memory accesses require
>>> getting snoop responses back from *all* of the other CPUs.
>> Since I addressed this elsewhere, it's not likely that I forgot it here.
>> My observation on that point was that an 8-socket configuration
>> generates around 3x the coherency traffic *per HT link* that a
>> quad-socket configuration does, suggesting that the quad-socket
>> configuration may have rather a lot of bandwidth to spare (given that
>> the 8-socket configuration manages to function at all).
>
> Three and a half years ago, I wrote that with snoop-broadcast, going
> from 4 to 8 sockets, Hammer would not scale well without "additional
> support".
> At the time, you referred to that as sounding like hot air that lacked
> quantitative analysis for support.

As indeed it was: just stating that an 8-socket configuration required
a lot more snooping was meaningless without also establishing that the
HT links did not have the capacity to *support* that additional snooping.

Furthermore, I stated in the interchange which you cite below that

"the degree to which the additional snooping activity will compromise
scalability (by exceeding the rather substantial bandwidth capacity of
the HT interconnect) will vary according to the nature of the load"

i.e., that even if the additional snooping significantly compromised
scalability for memory-bandwidth-intensive workloads, other kinds of
workloads would not be similarly encumbered.

>
> http://www.realworldtech.com/forums/index.cfm?action=detail&id=14858&threadid=14827&roomid=11

The interchange to which you refer was this:

[quote]

> just wanted to make it clear that without additional
> support, going up above 4 isn't going to be easy for *hammer, and
> scalability will be poor.

Lacking a quantitative analysis proving that 8-processor Hammer systems
will scale poorly, I'm afraid your argument sounds a bit like hot air.

[end quote]

Which, of course, is precisely the observation I made again today, above.

Since you never provided any such quantitative analysis (in fact, David
Kanter was still babbling incompetently in a similar vein at RWT about
next-year's 8-socket configuration changes a few months ago, until I
spelled things out for him quantitatively - now that more quantitative
data *is* available), that statement hardly seems unreasonable.

>
>>> The HT coherency traffic goes up with the number of CPU cores, not just
>>> the number of sockets.
>> AMD's presentations are not clear on this point, but they at least
>> suggest that coherency traffic on the HT links does *not* increase with
>> the number of cores, just with the number of sockets (which is certainly
>> at least possible, given the architecture).
>
> If you actually have something useful running on each one of those
> cores, each threaded context will generate independent memory requests
> that will likely have to leave the socket - unless it hits on a cache
> somewhere within the socket. The coherency traffic scales relative to
> the number of independent outstanding misses, not to the socket or cpu
> per se.

If you had bothered to look at the context in which Rob made the
statement to which I replied, you would have found that it was that of
snoop *responses* from CPUs. My point was that (if I understand AMD's
presentations correctly) each socket gives a single response to a snoop
request regardless of the number of cores present there.

And had you finished reading my post (well, you also would have had to
have understood it, I guess) before writing your response, you would
have noticed that I fully understand that more cores may *generate* more
snoops - but only (at least for the NUMA-optimized access being
discussed there) up to the point where they have saturated local memory
bandwidth.

>
>> And even if *all* the CPUs are hitting *only*
>>> local memory [perfect NUMA placement], there will be HT coherency
>>> traffic proportional to the product of the cache miss rate and the
>>> number of CPU cores.
>> Only up to the point where the local memory bandwidth is saturated:
>> once that point is reached, it doesn't matter how many more local cores
>> you add - there won't be any more coherency traffic, because there won't
>> be any more local accesses.
>
>> So if the links can support the coherency traffic generated by
>> local-only accesses sufficient to saturate the local memory bandwidth on
>> all sockets (are there STREAMS results for quad-socket Opteron systems
>> that could shed light on that?), the only question is how much link
>> bandwidth is left over to satisfy some percentage of remote accesses.

A friendly sparring partner just pointed out via email that Sun's
8-socket/dual-core 2.6 GHz Opterons scale up from its presumably very
similar 4-socket/dual-core configuration at 1.73x for SPECint_rate_base
and 1.79x for SPECint_rate_peak: hardly stellar, but indicative that
for reasonably computationally-intense workloads scaling to 8 sockets
(even using dual-core processors) can be quite useful. By contrast, in
the far more bandwidth-intensive SPECfp_rate scores the scaling is far
worse: only 1.14x base and 1.27x peak (that's why I asked about STREAMS
results above, though the issue there was whether today's *quad*-socket
systems could satisfy the demands of quad-core processors for
NUMA-optimized workloads, which it now looks as if they may). Given the
roughly 3:1 increase in per-link snoop activity in the 8-socket system
for a given level of per-socket uncached memory access activity (though
this may vary noticeably with its topology), it appears that
a 4-socket system may currently have something close to twice the HT
bandwidth headroom that it needs even for memory-intense workloads,
while the 8-socket system varies from fairly acceptable to very
disappointing indeed depending on the intensity of memory accesses
(though still not going negative, as Nick suggested - unless he was
referring to per-processor rather than system throughput, in which case
*all* systems that did not scale perfectly linearly would exhibit that
behavior, just some a lot more than others).
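
To put those scaling ratios in per-socket terms, here is a trivial
sketch; the factors are the ones quoted above, and the only assumption
is that perfectly linear scaling from 4 to 8 sockets would be 2x:

# Per-socket scaling efficiency implied by the quoted 4->8 socket ratios.
scaling = {
    "SPECint_rate_base": 1.73,
    "SPECint_rate_peak": 1.79,
    "SPECfp_rate_base":  1.14,
    "SPECfp_rate_peak":  1.27,
}

for metric, factor in scaling.items():
    efficiency = factor / 2.0   # fraction of the ideal 2x retained
    print(f"{metric}: {factor:.2f}x -> {efficiency:.0%} of linear scaling")

That works out to roughly 86%/90% efficiency for the integer rates but
only about 57%/64% for the floating-point rates, which is what "far
worse" means in concrete terms.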

- bill

Nick Maclaren

unread,
Sep 4, 2006, 5:07:08 AM9/4/06
to

In article <1157329150.8...@p79g2000cwp.googlegroups.com>,
davew...@gmail.com writes:

|> Bill Todd wrote:
|>
|> Three-and-half years ago, I wrote that doing snoop-broadcast and going
|> from 4 to 8 sockets, Hammer would not scale well without "additional
|> support".
|> At the time, you referred to it as sounding like hot air that lacks
|> quantitative analysis for support.

Yes, he's fond of saying that - he's not quite so good on providing any
evidence for his claims, though.

|> If you actually have something useful running on each one of those
|> cores, each threaded context will generate independent memory requests
|> that will likely have to leave the socket - unless it hits on a cache
|> somewhere within the socket. The coherency traffic scales relative to
|> the number of independent outstanding misses, not to the socket or cpu
|> per se.

Not quite. Because the number of hops goes up with the number of sockets,
the coherence traffic is the number of outstanding misses (possibly just
the independent ones, as you say) multiplied by the average hop length.

My estimate is that going from 2->4 and from 4-8 with a uniform, n-way
workload increases the coherence traffic by a factor of about 6 in each
case. What I don't have is any decent information on what the constant
factor is.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Sep 4, 2006, 6:37:49 AM9/4/06
to

In article <edgqbs$k3o$1...@gemini.csx.cam.ac.uk>,

nm...@cus.cam.ac.uk (Nick Maclaren) writes:
|>
|> Not quite. Because the number of hops goes up with the number of sockets,
|> the coherence traffic is the number of outstanding misses (possibly just
|> the independent ones, as you say) multiplied by the average hop length.
|>
|> My estimate is that going from 2->4 and from 4-8 with a uniform, n-way
|> workload increases the coherence traffic by a factor of about 6 in each
|> case. What I don't have is any decent information on what the constant
|> factor is.

To focus the forthcoming flames, going from 2->4, the number of links
goes from 1 to 5, the bisection bandwidth from 1 to 2/3, the number of
hops needed for coherence traffic (with single-core Opterons) from
2 to 14, and the number of pass-through packets from 0 to 4.

Going from 4->8, the number of links goes from 5 to 11, and everything
else depends a bit on the topology. With a twisted ladder, the bisection
bandwidth from 2/3 to 4, the number of hops needed for coherence traffic
from 14 to 96, and the number of pass-through packets from 4 to 40.

At least if I have counted right :-)


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Sep 4, 2006, 6:43:21 AM9/4/06
to

In article <edgvlt$222$1...@gemini.csx.cam.ac.uk>,
nm...@cus.cam.ac.uk (Nick Maclaren) writes:
|>
|> Going from 4->8, the number of links goes from 5 to 11, ...

Sigh. 5 to 10 for a twisted ladder.

Bill Todd

unread,
Sep 4, 2006, 10:25:22 AM9/4/06
to
Nick Maclaren wrote:
> In article <1157329150.8...@p79g2000cwp.googlegroups.com>,
> davew...@gmail.com writes:
> |> Bill Todd wrote:
> |>
> |> Three-and-half years ago, I wrote that doing snoop-broadcast and going
> |> from 4 to 8 sockets, Hammer would not scale well without "additional
> |> support".
> |> At the time, you referred to it as sounding like hot air that lacks
> |> quantitative analysis for support.
>
> Yes, he's fond of saying that - he's not quite so good on providing any
> evidence for his claims, though.

Yet another empty claim with nothing whatsoever in the way of evidence
to back it up, Nick. Hardly surprising, given how difficult coming up
with concrete examples might be.

>
> |> If you actually have something useful running on each one of those
> |> cores, each threaded context will generate independent memory requests
> |> that will likely have to leave the socket - unless it hits on a cache
> |> somewhere within the socket. The coherency traffic scales relative to
> |> the number of independent outstanding misses, not to the socket or cpu
> |> per se.
>
> Not quite. Because the number of hops goes up with the number of sockets,
> the coherence traffic is the number of outstanding misses (possibly just
> the independent ones, as you say) multiplied by the average hop length.

By George, you got that one right: total number of misses (or
misses-per-socket*number-of-sockets, regardless of how many cores there
are per socket)*average path length. To get the actual number of
messages rather than just a proportion, double that to include the
responses.

>
> My estimate is that going from 2->4 and from 4-8 with a uniform, n-way
> workload increases the coherence traffic by a factor of about 6 in each
> case.

Why estimate? Just count:

In the 2-socket system the average path length is 1, so for each cache
miss on each socket two messages (request + response) are generated over
that single link, or 4 messages total over the single link for each
miss-per-socket (call that a per-miss-per-socket per-link load of 4).

In the 4-socket system, which is a simple square (does anyone actually
sell the square-plus-diagonal configuration that AMD's early
presentations described?), each miss on each socket generates a message
to each adjoining node plus one message to the far (2-hop-away) node,
for a total of 4 request messages and 4 response messages (counting both
sections of the 2-hop path). Multiply by the 4 sockets, spread out
across the 4 links, and you get a per-miss-per-socket per-link load of
8, exactly double that of the dual-socket case (the total number of
messages is 8x as great, but it's spread out over 4 links rather than 1).

If the square-plus-diagonal topology actually exists in the wild, then
half the cache misses replace the 2-hop path with a direct path,
reducing their total message count to 6 and thus making the average
message count 7 rather than 8. Multiply by 4 sockets, divide by the 5
links, and the average per-miss-per-socket per-link load drops to 5.6 -
but the worst-case per-link loads (over the links forming the sides of
the square) seem to be (if I'm counting correctly) 6 - still a
significant improvement over 8.

(Just to be complete, if the square were fully-connected, leaving no
links for external I/O until the 4-link Opterons appear next year, then
there are 6 messages per miss, x 4 sockets, spread evenly across 6
links, reducing the per-link loading for a given per-socket miss rate to
exactly what it was for the dual-socket configuration - which is hardly
surprising given that both topologies are directly-connected for every
path.)

For the 8-socket configuration, it's easiest to start with a simple
cube, though (as with the fully-connected square above) this leaves no
links for external I/O. For any given miss, there are 3 single-hop
requests, 3 double-hop requests, and 1 triple-hop request, for a total
of 12 hops or 24 messages. Multiply by 8 sockets, spread across 12
links, and the per-link load per-miss-per-socket rate is 16: twice that
of the simple square, about 3x that of the square-plus-diagonal, and 4x
that of the fully-connected square.
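
For anyone who wants to check or extend this counting, here is a
minimal Python sketch that mechanizes it. The assumptions are the ones
used above: one snoop request plus one response to every other socket
per miss, each message charged one unit per link it crosses along a
shortest path, one miss per socket, and the total divided by the number
of links. The adjacency lists are idealized versions of the topologies
discussed, not any vendor's documented board layout or routing.

from collections import deque
from itertools import combinations

def per_link_load(links, n):
    """Average per-link load for one miss per socket, with a broadcast
    snoop (request + response) to every other socket and shortest-path
    routing."""
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    total_hops = 0
    for src in range(n):                 # one miss per socket
        dist = {src: 0}
        q = deque([src])
        while q:                         # BFS shortest-path distances
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        # request + response to each of the other n-1 sockets
        total_hops += 2 * sum(dist[v] for v in range(n) if v != src)
    return total_hops / len(links)

topologies = {
    "2-socket":                 ([(0, 1)], 2),
    "4-socket square":          ([(0, 1), (1, 2), (2, 3), (3, 0)], 4),
    "4-socket square+diagonal": ([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)], 4),
    "4-socket fully connected": ([(a, b) for a, b in combinations(range(4), 2)], 4),
    "8-socket cube":            ([(a, b) for a in range(8) for b in range(a + 1, 8)
                                  if bin(a ^ b).count("1") == 1], 8),
}

for name, (links, n) in topologies.items():
    print(f"{name}: per-miss-per-socket per-link load = {per_link_load(links, n):.1f}")

This reproduces the 4, 8, 5.6, 4 and 16 figures above; deleting one edge
from the cube's link list models the broken-link case discussed next,
though this simple average says nothing about the worst-case asymmetry
that the missing link introduces.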

Break one of the cube's links to allow for external I/O (I suspect the
best possible case until more links arrive next year) and the message
count for a miss on each of the two affected nodes rises from 24 to 32,
while message traffic from other nodes that would have traveled over the
now-missing link must be diverted and thus increases loading on some of
the remaining links even more. Even without allowing for the increase
in worst-case loading due to the resulting asymmetry, the average
message count per miss rises to 26, x 8 sockets, spread across 11 links,
for an *average* per-socket-per-miss per-link load of 18.9 (2.36x that
of the simple square or 3.375x that of the average load in the
square-plus-diagonal, the comparable best-achievable case for the
quad-socket board).

I'll leave it to someone with more time on their hands to figure out
(even just on average, though worst-case would be even better) exactly
how the best *existing* 8-socket topology compares with the best
*existing* quad-socket topology in this regard, but will suggest on the
basis of the numbers above that it's likely fairly close to the factor
of 3x higher per-link loading that I originally put forward. For
non-NUMA-optimized loads you of course must add another message (x the
average hop length) for the initial request and the final response, but
that doesn't change the totals all that much in the larger configurations.

- bill

Bill Todd

unread,
Sep 4, 2006, 10:40:17 AM9/4/06
to
Bill Todd wrote:

...

> Break one of the cube's links to allow for external I/O (I suspect the
> best possible case until more links arrive next year) and the message
> count for a miss on each of the two affected nodes rises from 24 to 32

An extra factor of 2 sneaked in there while I was still waking up: it
only rises from 24 to 28, so the other numbers change accordingly.

- bill

already...@yahoo.com

unread,
Sep 5, 2006, 5:23:46 AM9/5/06
to

Nick Maclaren wrote:
> In article <1157224077.6...@m73g2000cwd.googlegroups.com>,

> "mas" <mas...@hotmail.com> writes:
> |> >
> |> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
> |> >
> |> > Do they just? Do you have a reference? I keep hearing reports of this,
> |> > but haven't tracked down any real products. There are a LOT of problems
> |> > scaling Xeons up to that point.
> |>
> |> http://www-03.ibm.com/systems/x/scalable/x3950/index.html
> |> http://www-03.ibm.com/servers/eserver/xseries/x460.html
>
> Thanks, but those don't really say anything. What I am trying to find
> out is whether they support single SMP system images and SMP applications
> that scale beyond 4 sockets and, if they do, how. Your next paragraph
> implies that they don't.
>
>

His next paragraph is pile of bullsh**t.
XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.

IA-32/EM64T Intel Architecture Software Developer's Manual Volume 1:
Basic Architecture
ftp://download.intel.com/design/Pentium4/manuals/25366520.pdf
See 3.2.1. and 3.3.7.1.

1st generation 64-bit single-core XeonMP (Potomac)
ftp://download.intel.com/design/Xeon/datashts/30675401.pdf
1st generation 64-bit dual-core XeonMP (PaxvilleMP).
ftp://download.intel.com/design/Xeon/datashts/30962601.pdf
2nd generation 64-bit dual-core XeonMP (Tulsa).
ftp://download.intel.com/design/Xeon/datashts/31455301.pdf

Unisys 512GB single-image system without partitioning runs TPC-C
benchmark.
http://www.tpc.org/results/individual_results/Unisys/Unisys_es7000-one_16x-749K_es.pdf
Of course, the memory access latency is non-uniform but it's not
different from any other 16-way system shipping today.

already...@yahoo.com

unread,
Sep 5, 2006, 5:58:36 AM9/5/06
to

Torbjorn Lindgren wrote:
>
> Yes, there's a new Xeon MP, the 7100 series named to make it look like
> they're related to the Woodcrest (5100)... However, Intel's own pages
> makes it fairly clear that it's a Netburst based part with a BIG L3
> cache (4/8/16 MB) and more bandwidth (quite possibly multiple FSB's
> like 5000 (Netburst) and 5100 (Woodcrest)).
>

Yes, big shared L3 cache.
No, no more bandwidth. Xeon 7100 has the same two bus variants as the
previous generation Xeon 7000 - 667MT/s and 800MT/s.
Another improvement over 7000 is a higher clock rate.

Nick Maclaren

unread,
Sep 5, 2006, 6:14:34 AM9/5/06
to

In article <1157448226....@e3g2000cwe.googlegroups.com>,

already...@yahoo.com writes:
|>
|> His next paragraph is pile of bullsh**t.
|> XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.

No, it isn't. He may have got that figure wrong, but there are a fair
number of reasons why nobody currently builds large Xeon-based cache
coherent SMP systems (above 4 sockets). This may change, but I am not
holding my breath.

|> Unisys 512GB single-image system without partitioning runs TPC-C
|> benchmark.

The TPC-C benchmark also runs on distributed memory clusters - it isn't
much evidence of SMP capability.

My understanding is that the Unisys system has a coherent shared cache
for up to 4 sockets, and can link up to 8 such nodes together with some
kind of shared memory (incoherent?) That is actually a very good model
for many purposes, but it DOESN'T allow the running of large SMP
applications as most people think of them.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Sep 5, 2006, 7:48:16 AM9/5/06
to

Nick Maclaren wrote:
> In article <1157448226....@e3g2000cwe.googlegroups.com>,
> already...@yahoo.com writes:
> |>
> |> His next paragraph is pile of bullsh**t.
> |> XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.
>
> No, it isn't. He may have got that figure wrong, but there are a fair
> number of reasons why nobody currently builds large Xeon-based cache
> coherent SMP systems (above 4 sockets). This may change, but I am not
> holding my breath.
>

If IBM and Unisys fall under your definition of "nobody" then you are
right. Otherwise...

> |> Unisys 512GB single-image system without partitioning runs TPC-C
> |> benchmark.
>
> The TPC-C benchmark also runs on distributed memory clusters - it isn't
> much evidence of SMP capability.
>

When TPC-C benchmark runs on cluster it has to be explicitly marked as
such. That's the part of TPC rules. The Unisys submission is not marked
as a cluster.
For example, compare with that:
http://www.tpc.org/results/individual_results/HP/HP%20Integrity%20rx5670%20Cluster%2064P_ES.pdf


> My understanding is that the Unisys system has a coherent shared cache
> for up to 4 sockets, and can link up to 8 such nodes together with some
> kind of shared memory (incoherent?) That is actually a very good model
> for many purposes, but it DOESN'T allow the running of large SMP
> applications as most people think of them.
>

Your understanding is wrong. Unisys system has a coherent shared cache
(both on-chip L1/L2+ optional L3 and external L4) over all (1 to 8)
4-way cells. The same applies to IBM x460 (recently renamed to x3950).

Nick Maclaren

unread,
Sep 5, 2006, 8:07:31 AM9/5/06
to

In article <1157456895.9...@i42g2000cwa.googlegroups.com>,

already...@yahoo.com writes:
|>
|> > No, it isn't. He may have got that figure wrong, but there are a fair
|> > number of reasons why nobody currently builds large Xeon-based cache
|> > coherent SMP systems (above 4 sockets). This may change, but I am not
|> > holding my breath.
|>
|> If IBM and Unisys fall under your definition of "nobody" then you are
|> right. Otherwise...

Well, some evidence for your statement would be nice. I have looked at
what I can find of the specifications of those, and damned if I can find
anything that describes what they do in any detail, and what I did find
supports my statement.

|> When TPC-C benchmark runs on cluster it has to be explicitly marked as
|> such. That's the part of TPC rules. The Unisys submission is not marked
|> as a cluster.

That is irrelevant. The fact that it can do so means that it doesn't need
the attributes of an SMP system that 'true' SMP applications do. The
Hitachi SR2201 ran a single Unix image and had distributed memory - if it
had run TPC-C, it could perfectly well have said that it wasn't a cluster.

|> > My understanding is that the Unisys system has a coherent shared cache
|> > for up to 4 sockets, and can link up to 8 such nodes together with some
|> > kind of shared memory (incoherent?) That is actually a very good model
|> > for many purposes, but it DOESN'T allow the running of large SMP
|> > applications as most people think of them.
|>
|> Your understanding is wrong. Unisys system has a coherent shared cache
|> (both on-chip L1/L2+ optional L3 and external L4) over all (1 to 8)
|> 4-way cells. The same applies to IBM x460 (recently renamed to x3950).

Some references would be nice. What I have found on Unisys's and IBM's
Web pages implies that YOUR understanding is wrong, but I failed to find
a precise and detailed architectural reference.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Sep 5, 2006, 1:32:55 PM9/5/06
to

The fact that both systems run single-image Windows is sufficient
evidence by itself. Windows HAL doesn't support "partial coherence" of
the sort that you described in your previous post.

If that's not enough, read this redbook:
http://www.redbooks.ibm.com/redbooks/pdfs/sg246797.pdf

Unfortunately, Unisys documentation is not as nice as IBM's, but it's
crystal clear that Unisys also supports system-wide HW-enforced cache
coherence.

Sander Vesik

unread,
Sep 5, 2006, 2:10:19 PM9/5/06
to
already...@yahoo.com wrote:
>
> Nick Maclaren wrote:
> > In article <1157448226....@e3g2000cwe.googlegroups.com>,
> > already...@yahoo.com writes:
> > |>
> > |> His next paragraph is pile of bullsh**t.
> > |> XeonMPs have 48-bit virtual and 40-bit physical address spaces. RTFM.
> >
> > No, it isn't. He may have got that figure wrong, but there are a fair
> > number of reasons why nobody currently builds large Xeon-based cache
> > coherent SMP systems (above 4 sockets). This may change, but I am not
> > holding my breath.
> >
>
> If IBM and Unisys fall under your definition of "nobody" then you are
> right. Otherwise...
>

IBM and Unisys do in fact fall under 'nobody' in this case if their
customers are not running large workloads that way on these machines.
A machine having a maximum capacity of X x86 CPUs has in the past
seldom meant that large unpartitioned workloads are run on it.

Want to offer evidence that these machines are used in production
in those configurations?

--
Sander

+++ Out of cheese error +++

already...@yahoo.com

unread,
Sep 5, 2006, 2:35:50 PM9/5/06
to

The question of actual use is outside the scope of this discussion. We are
discussing what IBM and Unisys machines are capable of rather than how
they are used in practice.

Are they capable of running 32 sockets and up to 512GB in a single
partition?
Does it mean that mas's original post contained a good portion of
high-quality genuine BS?
The answer is "yes" to both questions.

Nick Maclaren

unread,
Sep 5, 2006, 3:22:37 PM9/5/06
to

In article <1157477575....@i3g2000cwc.googlegroups.com>,

already...@yahoo.com writes:
|>
|> > Some references would be nice. What I have found on Unisys's and IBM's
|> > Web pages implies that YOUR understanding is wrong, but I failed to find
|> > a precise and detailed architectural reference.
|>
|> The fact that both systems run single-image Windows is sufficient
|> evidence by itself. Windows HAL doesn't support "partial coherence" of
|> the sort that you described in your previous post.

Sigh. No, it doesn't. That would be the case only if there was evidence
that a single, threaded application (OpenMP, pthreads or similar) could
span multiple nodes, or the kernel itself could. I have so far seen no
evidence of that.

|> If that's not enough, read this redbook:
|> http://www.redbooks.ibm.com/redbooks/pdfs/sg246797.pdf

That contains one cryptic reference to maintaining cache coherence.
You may be aware of only one meaning to that, but there are a lot more.
I saw that document before, but I was asking for TECHNICAL documentation,
and that is only SEMI-technical.

|> Unfortunately, Unisys documentation is not as nice as IBM's, but it's
|> crystal clear that Unisys also supports system-wide HW-enforced cache
|> coherence.

I think that you are reading it through your rose-tinted glasses.

|> The question of actual use is outside the scope of this discussion. We are
|> discussing what IBM and Unisys machines are capable of rather than how
|> they are used in practice.

That was the sense in which I meant it, but you have still missed the point. It
is whether the cache is coherent enough to support things like OpenMP
code that spans multiple nodes. You don't need full coherence just to
run a single system image.

|> Are they capable of running 32 sockets and up to 512GB in a single
|> partition?
|> Does it mean that mas's original post contained a good portion of
|> high-quality genuine BS?
|> The answer is "yes" to both questions.

That is twaddle. The Hitachi SR2201 could run 2048 sockets and 512 GB
in a single partition, and it was a distributed memory machine with one
CPU/socket per node.


Regards,
Nick Maclaren.

David Kanter

unread,
Sep 6, 2006, 4:43:46 AM9/6/06
to
> Sigh. No, it doesn't. That would be the case only if there was evidence
> that a single, threaded application (OpenMP, pthreads or similar) could
> span multiple nodes, or the kernel itself could. I have so far seen no
> evidence of that.

RTFM

> |> If that's not enough, read this redbook:
> |> http://www.redbooks.ibm.com/redbooks/pdfs/sg246797.pdf
>
> That contains one cryptic reference to maintaining cache coherence.
> You may be aware of only one meaning to that, but there are a lot more.
> I saw that document before, but I was asking for TECHNICAL documentation,
> and that is only SEMI-technical.

> I think that you are reading it through your rose-tinted glasses.

No, you really just don't have a clue here. Both Unisys and IBM have
large ccNUMA systems that run Linux and Windows. I've spoken rather
extensively with the folks at IBM, including the lead architect for the
chipset...it's most certainly ccNUMA.

> That was the sense I meant it, but you have still missed the point. It
> is whether the cache is coherent enough to support things like OpenMP
> code that spans multiple nodes. You don't need full coherence just to
> run a single system image.

Of course it can run OMP code across multiple nodes. Did you even look
at those TPCC submissions?

DK

Nick Maclaren

unread,
Sep 6, 2006, 5:21:46 AM9/6/06
to

In article <1157532226....@i3g2000cwc.googlegroups.com>,

"David Kanter" <dka...@gmail.com> writes:
|> > Sigh. No, it doesn't. That would be the case only if there was evidence
|> > that a single, threaded application (OpenMP, pthreads or similar) could
|> > span multiple nodes, or the kernel itself could. I have so far seen no
|> > evidence of that.
|>
|> RTFM

Try not to be more of a clot than you can help. I have asked repeatedly
in many quarters (including earlier in this thread, and to people in IBM)
for a pointer to the "Fucking Manual" and have so far drawn a complete
blank. The best that there seems to be is a semi-technical Red Book,
that waves its hand vigorously.

|> > I think that you are reading it through your rose-tinted glasses.
|>
|> No, you really just don't have a clue here. Both Unisys and IBM have
|> large ccNUMA systems that run Linux and Windows. I've spoken rather
|> extensively with the folks at IBM, including the lead architect for the
|> chipset...it's most certainly ccNUMA.

Fine. So can you point me at a reference that says what that means in
technical terms? Or to someone who is capable of and prepared to
explain? That is what I have been asking for for several years now.
And, if you claim that book does so, provide the chapter and verse;
if you are right, I will take back my statement.

Considerable experience over many decades shows that people who start off
with argument by abuse (as you are doing) are typically the ones that
don't have a clue. Why should I assume that you weren't being taken in
by a load of bullshit?

|> Of course it can run OMP code across multiple nodes. Did you even look
|> at those TPCC submissions?

Yes. More hand-waving. You clearly don't understand just how much more
mechanism is needed to support general use of OpenMP than TPC-C.


Regards,
Nick Maclaren.

Bill Todd

unread,
Sep 6, 2006, 5:36:56 AM9/6/06
to
Nick Maclaren wrote:
> In article <1157477575....@i3g2000cwc.googlegroups.com>,
> already...@yahoo.com writes:
> |>
> |> > Some references would be nice. What I have found on Unisys's and IBM's
> |> > Web pages implies that YOUR understanding is wrong, but I failed to find
> |> > a precise and detailed architectural reference.
> |>
> |> The fact that both systems run single-image Windows is sufficient
> |> evidence by itself. Windows HAL doesn't support "partial coherence" of
> |> the sort that you described in your previous post.
>
> Sigh. No, it doesn't. That would be the case only if there was evidence
> that a single, threaded application (OpenMP, pthreads or similar) could
> span multiple nodes, or the kernel itself could. I have so far seen no
> evidence of that.

Are you senile, Nick? Exactly what do you think that the statement
"ES7000/one Server can support a scale-up, single application instance"
meant? Exactly what do you think IBM means when it characterizes its
32-socket Xeon systems as "SMP"? How do you explain the single-instance
software charges for Windows and databases on such machines in the TPC-C
submissions? Or the IBM submission that used a *memory-only* (no
processors at all) node to expand a quad-socket configuration to handle
128 GB of RAM?

I pointed all those out to you three days ago.

...

you have still missed the point. It
> is whether the cache is coherent enough to support things like OpenMP
> code that spans multiple nodes. You don't need full coherence just to
> run a single system image.

But you damn well do to run a single system *instance*, at least if it's
Windows (as the systems in question are).

>
> |> Are they capable to run 32 sockets and up to 512GB in a single
> |> partition?
> |> Does it mean that mas's original post contained a good portion of
> |> high-quality genuine BS?
> |> The answer is "yes" to both questions.
>
> That is twaddle. The Hitachi SR2201 could run 2048 sockets and 512 GB
> in a single partition, and it was a distributed memory machine with one
> CPU/socket per node.

And one 'microkernel' instance per node as well: whatever kind of
'single partition' they might have been running, it was nothing
resembling a system partition as supported in Windows.

- bill

already...@yahoo.com

unread,
Sep 6, 2006, 12:16:24 PM9/6/06
to

Please show me "the evidence" that the following systems have
hardware-enforced cache coherence of the sort you are looking for in
the respective vendors' manuals or datasheets:
1. IBM p5 595
2. Sun Fire E25K
3. HP Integrity Superdome
4. SGI Altix 4700

You have a chance for case #4 because 512-socket ccNUMA is something
special. But for cases 1, 2 and 3 I wish you good luck.

See my point? If not read here:
http://www.memorable-quotes.com/charles+anderson+dana,a131.html

mas

unread,
Sep 6, 2006, 1:58:54 PM9/6/06
to

Thanks for the correction. The fact that all EM64T Xeon DPs have 36-bit
memory address limits led me to assume that the MPs had similar
constraints, but of course I should have checked.

Del Cecchi

unread,
Sep 6, 2006, 2:42:32 PM9/6/06
to
Nick Maclaren wrote:
> In article <1157224077.6...@m73g2000cwd.googlegroups.com>,
> "mas" <mas...@hotmail.com> writes:
> |> >
> |> > |> IBM also do upto 32-socket versions of Xeon MPs as well.
> |> >
> |> > Do they just? Do you have a reference? I keep hearing reports of this,
> |> > but haven't tracked down any real products. There are a LOT of problems
> |> > scaling Xeons up to that point.
> |>
> |> http://www-03.ibm.com/systems/x/scalable/x3950/index.html
> |> http://www-03.ibm.com/servers/eserver/xseries/x460.html
>
> Thanks, but those don't really say anything. What I am trying to find
> out is whether they support single SMP system images and SMP applications
> that scale beyond 4 sockets and, if they do, how. Your next paragraph
> implies that they don't.
>
> I agree that systems that have a single management infrastructure for
> multiple SMP systems, or even single systems that can manage several
> SMP sub-systems (with no application spanning a small SMP) are good
> ideas, but they aren't what most people expect.
>
> |> although due to the current 36-bit memory (64GB) limitation of EM64T
> |> what you are in effect getting is 8 * 4-socket 64GB partitions in one
> |> server with each partition's processors only accessing local memory.
> |> Unisys do the same thing with the ES7000
> |>
> |> http://www.unisys.com/products/enterprise__servers/high_d_end__servers/models/es7000_s_one__servers.htm
>
> Ah. Well, I am not really very interested in that. While such systems
> may be very useful, they can really only support uses that would be
> perfectly all right on 8 separate 4-socket SMPs. That isn't what most
> people mean when they refer to architectural limitations on the number
> of CPUs.
>
>
> Regards,
> Nick Maclaren.

Nick, I believe the X460 is a NUMA box with 4-socket nodes. So I
believe it does support a single system image for all 32 sockets.
However there is a performance penalty for non-local access to memory.

There is probably a Redbook that discusses these issues, and some
relatively high level overview of the X3 architecture.

del

--
Del Cecchi
"This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions.”

Del Cecchi

unread,
Sep 6, 2006, 2:45:02 PM9/6/06