Opteron 8-way glueless limit?

Iain McClatchie

Aug 16, 2004, 11:09:30 PM

What's the cause of the 8-way glueless limit for Opteron?

I ask because I found a 10-node graph with at most two hops
between nodes and three arcs per node. A 10-CPU motherboard
would be... better than 8! :).
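
A graph with these properties (10 nodes, three links per node, at most two hops between any pair) can be checked mechanically. The sketch below assumes the graph in question is the Petersen graph; the post does not name it, so that identification and the helper names are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

def petersen():
    """Petersen graph: vertices are the 2-element subsets of {0..4};
    two vertices are adjacent when their subsets are disjoint."""
    verts = [frozenset(p) for p in combinations(range(5), 2)]
    return {v: [w for w in verts if not (v & w)] for v in verts}

def eccentricity(adj, start):
    """Largest BFS hop count from start (assumes a connected graph)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Worst-case hop count over all node pairs."""
    return max(eccentricity(adj, v) for v in adj)

g = petersen()
print(len(g), {len(n) for n in g.values()}, diameter(g))
```

Every node uses all three links for CPU-to-CPU traffic in this layout, which is the catch raised later in the thread.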

Also, Del, since you seem to know enough about HT to snipe
at it, would you care to speculate on whether HT could get
across two PC board connectors, assuming I can keep the link
length down to something like 16 inches?

I'm imagining 10 little boards, each with an Opteron, 4 DIMMs,
and a little I/O, plugged into a motherboard/backplane.

Andrew Reilly

Aug 16, 2004, 11:25:17 PM

Iain McClatchie wrote:
> I'm imagining 10 little boards, each with an Opteron, 4 DIMMs,
> and a little I/O, plugged into a motherboard/backplane.

Why muck about with a motherboard? I've always wanted to build a
multiprocessor with little self-contained boards like that, but
all standing on edge in a circle, hooked together along the back
edge with short runs of point-to-point cable (of some sort). A
tiny, weeny Cray-1 :-)

Cheers,

--
Andrew

Hank Oredson

Aug 16, 2004, 11:47:31 PM

"Andrew Reilly" <areilly-...@areilly.bpc-users.org> wrote in message
news:2odc2cF...@uni-berlin.de...

Think ceramic substrate, contacts top and bottom.
Stack 'em.
Technology available for a long time.
One little problem about heat.
Gotta keep the block from running too hot.

Think copper with ceramic top and bottom.
Holes through copper, more ceramic in holes.
Edges of copper into clamps on "backplane".
Take out heat.
Now you get some interesting density with very
short interconnect, maybe can get heat out.

I was involved with some research into structures
like this long ago. Moved on to other things and
lost track of what became of the ideas. The goal
was in fact to do something like a Cray-1, and have
it fit into a standard avionics rack. Application
was radar signal processing.

--

... Hank

http://horedson.home.att.net
http://w0rli.home.att.net

Aaron Spink

Aug 17, 2004, 1:10:25 AM

"Iain McClatchie" <iai...@truecircuits.com> wrote in message
news:45022fc8.04081...@posting.google.com...

> What's the cause of the 8-way glueless limit for Opteron?
>

Probably a combination of chip design and coherence method.

Chip design is easy, if they only track up to 8 sharers, the limit is right
there.

The coherence method used in Opteron is broadcast snoopy. This means that
the per cacheline overhead increases O(NxN). In a two processor system, you
only need to send 1 snoop and get 1 snoop response, in a 4 proc system you
need to send 3 snoops and get 3 snoop responses, in an 8 proc system this
increases to 7 snoops and 7 snoop responses. Factor in fanout and fanin
over multiple links and the link*message count increases even more.
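
The counts above follow from a simple model in which every miss probes each of the other CPUs exactly once. A minimal sketch of that model (the function names are hypothetical, and link fan-out is ignored):

```python
def snoop_messages(n_cpus):
    """Snoops sent and responses received for one cache-line miss
    when every other CPU is probed exactly once."""
    probes = n_cpus - 1
    return probes, probes

def total_if_all_miss(n_cpus):
    """Messages generated if every CPU issues one miss: N * 2(N-1)."""
    sent, received = snoop_messages(n_cpus)
    return n_cpus * (sent + received)

for n in (2, 4, 8):
    print(n, snoop_messages(n), total_if_all_miss(n))
```

The per-miss cost grows linearly in N; the quadratic growth appears once all N processors are generating misses at the same time.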

Aaron Spink
speaking for myself inc.

Peter Dickerson

Aug 17, 2004, 3:43:54 AM

"Iain McClatchie" <iai...@truecircuits.com> wrote in message
news:45022fc8.04081...@posting.google.com...

Well how many HT links do you think an Opteron has? The 8-cpu configuration
uses some of the links for I/O. If you have a 10-node tree of links then
there are six links available for I/O but that is four hops with the
central, root, node taking a lot of traffic.

Peter

Nick Maclaren

Aug 17, 2004, 4:49:06 AM

In article <5rgUc.3596$3O3....@newsread2.news.pas.earthlink.net>,

While I believe that you are correct, (a) there was no public
specification of the coherence protocol or usage the last time
I looked and (b) that limit is stated to be the limit for glueless
connexion even for incoherent memory.

I have been unable to find any reason for the latter.

Regards,
Nick Maclaren.

David Kanter

Aug 17, 2004, 12:32:27 PM

iai...@truecircuits.com (Iain McClatchie) wrote in message news:<45022fc8.04081...@posting.google.com>...

> What's the cause of the 8-way glueless limit for Opteron?

AMD only allocated 8 bits for the processor ID. This decision was
based on the fact that most servers are < 8P and the uncertainty of
market acceptance in bigger systems.

David

Mitch Alsup

Aug 17, 2004, 12:53:35 PM

"Aaron Spink" <aaron...@earthlink.net> wrote in message news:<5rgUc.3596$3O3....@newsread2.news.pas.earthlink.net>...

Not quite:

It only takes O(log3(n)) rounded up broadcasts of snoop requests to
traverse the fabric. For a flat topology 8 P system this is 7 snoop
broadcast messages (messages * links) of 4 beats each.

Then it does take (n) responses back to the originator to complete
the snoop. This takes 15 (messages * links) of 2 beats each (unless
one of the remote caches contains the requested data).
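
For a rough feel of the link occupancy these figures imply, the beats can be totalled directly. This sketch takes the message counts and beat lengths from the post as given and reads "log3(n) rounded up" as the smallest k with 3^k >= n; that reading is an assumption.

```python
def broadcast_rounds(n_nodes, fanout=3):
    """Smallest k with fanout**k >= n_nodes, i.e. ceil(log_fanout(n))."""
    rounds, reach = 0, 1
    while reach < n_nodes:
        reach *= fanout
        rounds += 1
    return rounds

# Figures quoted in the post for a flat 8P system.
SNOOP_LINK_MESSAGES, SNOOP_BEATS = 7, 4
RESPONSE_LINK_MESSAGES, RESPONSE_BEATS = 15, 2

snoop_beats = SNOOP_LINK_MESSAGES * SNOOP_BEATS
response_beats = RESPONSE_LINK_MESSAGES * RESPONSE_BEATS
print(broadcast_rounds(8), snoop_beats, response_beats,
      snoop_beats + response_beats)
```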

>
> Aaron Spink
> speaking for myself inc.

Mitch

Iain McClatchie

Aug 17, 2004, 2:28:37 PM

Iain> I ask because I found a 10-node graph with at most two hops
Iain> between nodes and three arcs per node.

Peter> Well how many HT links do you think an Opteron has?

Ummm... 3. All of which are used for CPU<->CPU in the 10-node
2-hop graph, leaving... none for I/O.

Oops.

Nevermind.

David Kanter

Aug 17, 2004, 5:16:39 PM

dka...@onebox.com (David Kanter) wrote in message news:<4e56972b.04081...@posting.google.com>...

Let me rephrase that. AMD only allocated 3 bits for the PID, allowing
8 different CPUs in a domain. I believe this was one reason, the
other being the need for I/O as Peter mentioned.

David

Nick Maclaren

Aug 17, 2004, 5:32:16 PM

In article <4e56972b.04081...@posting.google.com>,

David Kanter <dka...@onebox.com> wrote:
>dka...@onebox.com (David Kanter) wrote in message news:<4e56972b.04081...@posting.google.com>...
>> iai...@truecircuits.com (Iain McClatchie) wrote in message news:<45022fc8.04081...@posting.google.com>...
>> > What's the cause of the 8-way glueless limit for Opteron?
>>
>> AMD only allocated 8 bits for the processor ID. This decision was
>> based on the fact that most servers are < 8P and the uncertainty of
>> market acceptance in bigger systems.
>

>Let me rephrase that. AMD only allocated 3 bits for the PID, allowing
>8 different CPUs in a domain. I believe this was one reason, the
>other being the need for I/O as Peter mentioned.

Er, I am more than a little dubious about the hopping. The lack of
bits (because of the marketing) makes sense, but the rest doesn't.

If you allow a single connexion on one node for I/O, then it is
trivial to show that you can't get everywhere in 2 hops on 8 nodes.
So the limit to 8 is definitely NOT because of that.

Regards,
Nick Maclaren.

Nathan J. Williams

Aug 17, 2004, 7:48:24 PM

nm...@cus.cam.ac.uk (Nick Maclaren) writes:

> Er, I am more than a little dubious about the hopping. The lack of
> bits (because of the marketing) makes sense, but the rest doesn't.
>
> If you allow a single connexion on one node for I/O, then it is
> trivial to show that you can't get everywhere in 2 hops on 8 nodes.
> So the limit to 8 is definitely NOT because of that.

Two hops isn't a fundamental limit, no. The limits I know of are in
the implementation; things like the HT routing table and the memory
directory are only set up with room for 8 entries. So I believe that
the 8-processor limit is an implementation limit, but probably one
that was not invented too cavalierly.

- Nathan

Iain McClatchie

Aug 18, 2004, 2:26:33 AM

Nick> If you allow a single connexion on one node for I/O, then it is
Nick> trivial to show that you can't get everywhere in 2 hops on 8 nodes.

Wow. How trivial?

You can get really close, though.

8 CPUs, *two* I/O connections:

A: B C D
B: A F <->I/O
C: A G H
D: A I <->I/O
F: B G H
G: C F I
H: C F I
I: D G H

Paths D-x-y-F and B-x-y-I are the only three-hop links.

I quite like the symmetry of this next one. 7 CPUs, three
I/O connections, maximum two hops between CPUs:

A: B C D
B: A E <->I/O
C: A G <->I/O
D: A I <->I/O
E: B G I
G: C E I
I: D E G
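
Both adjacency lists can be checked with a breadth-first search. The sketch below transcribes the two tables as written (I/O links omitted, since they carry no CPU-to-CPU traffic) and lists the CPU pairs at a given hop distance; the helper names are hypothetical.

```python
from collections import deque

EIGHT_CPU = {
    "A": "BCD", "B": "AF", "C": "AGH", "D": "AI",
    "F": "BGH", "G": "CFI", "H": "CFI", "I": "DGH",
}
SEVEN_CPU = {
    "A": "BCD", "B": "AE", "C": "AG", "D": "AI",
    "E": "BGI", "G": "CEI", "I": "DEG",
}

def hops(adj, start):
    """Shortest hop count from start to every reachable node (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def pairs_at(adj, k):
    """Unordered CPU pairs whose shortest path is exactly k hops."""
    return sorted(
        a + b for a in adj for b, d in hops(adj, a).items() if a < b and d == k
    )

print(pairs_at(EIGHT_CPU, 3))
print(pairs_at(SEVEN_CPU, 3))
```

The first call reports the D-F and B-I pairs named above; the second finds no pair more than two hops apart.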

John Savard

Aug 18, 2004, 2:45:26 AM

On Tue, 17 Aug 2004 03:47:31 GMT, "Hank Oredson" <hore...@att.net> wrote, in
part:

>I was involved with some research into structures
>like this long ago. Moved on to other things and
>lost track of what became of the ideas. The goal
>was in fact to do something like a Cray-1, and have
>it fit into a standard avionics rack. Application
>was radar signal processing.

By *now*, I think you could probably fit a Cray-1 on about sixteen chips. But you
would need an awful lot of pins on each chip. So we may not see the like of a
Cray-1 again until it can fit on *one* chip.

Which it probably could, if you were willing to live with really low yield.

In fact, that may be the kind of chip used in some current Cray products.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html

Nick Maclaren

Aug 18, 2004, 4:32:43 AM

In article <45022fc8.04081...@posting.google.com>,

Dead trivial. The I/O node can connect to only two others, which
can each connect to two more - obviously a limit. That is 7.
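
The counting argument is a small instance of the usual degree/diameter bound. As a sketch (the function name is hypothetical, and three links per CPU is assumed throughout):

```python
def max_nodes_within_two_hops(root_links, other_links=3):
    """Upper bound on nodes within 2 hops of a root: the root itself,
    its neighbours, and each neighbour's remaining links."""
    neighbours = root_links
    second_ring = neighbours * (other_links - 1)
    return 1 + neighbours + second_ring

# A node that gave up one link to I/O has two CPU links left.
print(max_nodes_within_two_hops(root_links=2))
# A node with all three links on CPUs.
print(max_nodes_within_two_hops(root_links=3))
```

With two CPU links at the I/O node the bound is 7; with all three links available it is 10, matching the 10-node graph from the start of the thread.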

Regards,
Nick Maclaren.

Nick Maclaren

Aug 18, 2004, 4:52:59 AM

In article <mtuy8kd...@contents-vnder-pressvre.mit.edu>,

nat...@mit.edu (Nathan J. Williams) writes:
|>
|> Two hops isn't a fundamental limit, no. The limits I know of are in
|> the implementation; things like the HT routing table and the memory
|> directory are only set up with room for 8 entries. So I believe that
|> the 8-processor limit is an implementation limit, but probably one
|> that was not invented too cavalierly.

Oh, it wasn't cavalier at all, but I don't believe that it had
anything to do with routing. I believe that the decision to
target 1-8 CPUs was made first, and the routing designed around
that.

Regards,
Nick Maclaren.

Brian Hurt

Aug 18, 2004, 10:51:18 AM

nm...@cus.cam.ac.uk (Nick Maclaren) wrote in message news:<cftth0$fhb$1...@pegasus.csx.cam.ac.uk>...

> In article <4e56972b.04081...@posting.google.com>,
> David Kanter <dka...@onebox.com> wrote:
> >dka...@onebox.com (David Kanter) wrote in message news:<4e56972b.04081...@posting.google.com>...
> >> iai...@truecircuits.com (Iain McClatchie) wrote in message news:<45022fc8.04081...@posting.google.com>...
> >> > What's the cause of the 8-way glueless limit for Opteron?
> >>
> >> AMD only allocated 8 bits for the processor ID. This decision was
> >> based on the fact that most servers are < 8P and the uncertainty of
> >> market acceptance in bigger systems.
> >
> >Let me rephrase that. AMD only allocated 3 bits for the PID, allowing
> >8 different CPUs in a domain. I believe this was one reason, the
> >other being the need for I/O as Peter mentioned.
>
> Er, I am more than a little dubious about the hopping. The lack of
> bits (because of the marketing) makes sense, but the rest doesn't.

These are the technical reasons. The real question is, I think, why
do you need more than eight?

Eight CPUs, glueless, allows AMD to play in the low end of the
mid-range server space, but come in a lot cheaper than an equivalent 8
CPU box from anyone else. My back-of-the-envelope guesstimates put the
cost of such a box at $15,000. Consider- 8 CPUs, $800 e.,
$6,400. 8x memory @ $500 e., $4,000. Motherboard cost $200. I/O
Chips, 4x @ $200e., $800. Case, misc., $1,000- total, $12,400.
Giving me another $2,600 of slop for things I've forgotten. Add a 40%
profit margin on top of that $15K, and you get a $25K list price.
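
The estimate is easy to re-add. In this sketch the part prices come from the post, while reading the 40% as a margin on the list price (rather than a markup on cost) is an assumption chosen because it reproduces the $25K figure.

```python
PARTS = {  # name: (quantity, unit price in USD), figures from the post
    "CPU": (8, 800),
    "memory": (8, 500),
    "motherboard": (1, 200),
    "I/O chip": (4, 200),
    "case and misc": (1, 1000),
}
BUDGET = 15_000
MARGIN_PERCENT = 40  # assumed to be margin on list price, not markup on cost

parts_total = sum(qty * price for qty, price in PARTS.values())
slop = BUDGET - parts_total
# Integer arithmetic avoids floating-point noise in the division.
list_price = BUDGET * 100 // (100 - MARGIN_PERCENT)
print(parts_total, slop, list_price)
```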

The equivalent Sun would be, I'd think, a V880. Which comes in at
$86,000:
http://store.sun.com/CMTemplate/CEServlet?process=SunStore&cmdViewProduct_CP&catid=112785&

I'm mildly surprised more whitebox manufacturers haven't gotten into
this- selling 8-way AMD boxes + Linux into Unix shops.

For higher numbers of CPUs, glueless becomes less of an issue.
Especially considering what you need isn't glue logic, but HT-HT
bridge/router chips. I'd be inclined to connect the cpus in "nodes"
of 4 cpus each, in a classic 2D hypercube (aka a square). Use two of
the spare HT connections (from diagonal CPUs) to connect to IO, the
other two to connect to the router/bridge chip. The bridge chip has
5 HT connects- 2 to the node, 3 to other bridge chips. This allows me
to connect eight nodes of four cpus up into a hypercube, for a total
of 32 CPUs. Hello, high end server space. Probably again at a
fraction of the cost of the big boys.
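
The node count follows from the three bridge-to-bridge links: a hypercube with three links per vertex has 2^3 = 8 vertices. A minimal sketch, with hypothetical names, that builds that topology and reports its size and worst-case bridge-to-bridge hop count:

```python
def hypercube(dim):
    """Adjacency of a dim-dimensional hypercube; each node label differs
    from each of its neighbours in exactly one bit."""
    return {v: [v ^ (1 << b) for b in range(dim)] for v in range(1 << dim)}

def worst_case_hops(dim):
    """Hypercube diameter: the most bits two node labels can differ in."""
    size = 1 << dim
    return max(bin(a ^ b).count("1") for a in range(size) for b in range(size))

CPUS_PER_NODE = 4
EXTERNAL_LINKS_PER_BRIDGE = 3

nodes = hypercube(EXTERNAL_LINKS_PER_BRIDGE)
print(len(nodes), len(nodes) * CPUS_PER_NODE,
      worst_case_hops(EXTERNAL_LINKS_PER_BRIDGE))
```

The hop count here covers only the bridge fabric; hops inside each 4-CPU square add to it.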

Does this mean death to Sun, HP, IBM, etc? No. It simply means a
shift in what boxes get sold into what markets. The $10-50K server
space is currently populated with 2 CPU boxes- it'll become 8-16 CPU
boxes. The $100-500K space is currently populated with 8-16CPU boxes,
instead it'll become populated with 24-64 CPU boxes. The $1-5M market
will go higher yet- 2048 CPU boxes? More?

A PC of today has way more memory, way more computational power, way
more IO, way more storage space, than the largest mainframes of the
1950s. Does that mean the mainframe market is dead? Heck no.
Because people don't go "we need this much computational power- how
cheap can we buy it?" They go "we have this much money- how much
computational power can we buy with it?"

Brian

Erik Magnuson

Aug 19, 2004, 1:16:30 AM

bh...@spnz.org (Brian Hurt) wrote in message

>
> These are the technical reasons. The real question is, I think, why
> do you need more than eight?
>
> Eight CPUs, glueless, allows AMD to play in the low end of the
> mid-range server space, but come in a lot cheaper than an equivalent 8
> CPU box from anyone else. My back-of-the-envelope guesstimates put the
> cost of such a box at $15,000. Consider- 8 CPUs, $800 e.,
> $6,400. 8x memory @ $500 e., $4,000. Motherboard cost $200. I/O
> Chips, 4x @ $200e., $800. Case, misc., $1,000- total, $12,400.
> Giving me another $2,600 of slop for things I've forgotten. Add a 40%
> profit margin on top of that $15K, and you get a $25K list price.
>
> The equivalent Sun would be, I'd think, a V880. Which comes in at
> $86,000:
> http://store.sun.com/CMTemplate/CEServlet?process=SunStore&cmdViewProduct_CP&catid=112785&
>
> I'm mildly surprised more whitebox manufacturers haven't gotten into
> this- selling 8-way AMD boxes + Linux into Unix shops.

You're forgetting that the availability of application software
drives the hardware purchases and not the other way around. Eventually
we will be seeing low cost 8-way AMD boxes running what the v880's (and
equivalent machines) are running now. Things should get really exciting
when 2 and 4 GB DIMM's sell for the same price per GB as the 512 MB parts.

FWIW, Sun's US-3i does support low cost 4-way systems, however the J-Bus isn't
as flexible as HT. The upcoming US-3i+ are supposed to have performance
similar to the current high end Opterons, though the Opterons will probably
have moved ahead by the time the US-3i+ boxes ship.

-- Erik

Stefan Monnier

Aug 19, 2004, 10:13:04 AM

> equivalent machines) are running now. Things should get really exciting
> when 2 and 4 GB DIMM's sell for the same price per GB as the 512 MB parts.

By the time this happens, those things will be far from "exciting".

Stefan

del cecchi

Aug 20, 2004, 10:33:35 PM

"Iain McClatchie" <iai...@truecircuits.com> wrote in message
news:45022fc8.04081...@posting.google.com...

Back from the great white north....

What speed are you interested in? At 1.6 Gbit/second, the (this is from
memory, so maybe not correct) CAD valid time at the transmitter output
pins with respect to the clock is 166 ps (before and after) and at the
Receiver the Setup and Hold times of CAD relative to clock is 110 ps.
So if you compute algebraically rather than statistically you have 56 ps
for the interconnect skew and ISI. You get somewhat more by playing
statistics of course.

16 inches of wire is 40 cm is 2.8 ns. I am told by our card guys that
one might get 5% mistracking if one were lucky. If you use the whole 56
ps for mistracking with no ISI, you need about 3%. That's not counting
the ps (5?, 10?) that the connector throws in. Or ISI, depending on
wire thickness, width, laminate material.

It would be a stretch, big time. Now if you take those same parts and
back off to 1200, you have another 100 ps to play with, because the
increase in UI/2 goes right to the interconnect although the spec
doesn't talk about it in those terms.
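
The budget can be re-derived from the quoted figures. The sketch below uses only the numbers in this post (themselves given from memory) and assumes "1200" means 1.2 Gbit/s per pin; the names are hypothetical.

```python
TX_VALID_PS = 166        # CAD valid time at the transmitter, from the post
RX_SETUP_HOLD_PS = 110   # receiver setup/hold requirement, from the post
FLIGHT_TIME_PS = 2800    # 16 inches of trace, from the post

def half_ui_ps(gbit_per_s):
    """Half of one unit interval, in picoseconds."""
    return 1000.0 / gbit_per_s / 2

budget_ps = TX_VALID_PS - RX_SETUP_HOLD_PS
mistracking_fraction = budget_ps / FLIGHT_TIME_PS
extra_at_1200_ps = half_ui_ps(1.2) - half_ui_ps(1.6)
print(budget_ps, round(100 * mistracking_fraction, 1),
      round(extra_at_1200_ps, 1))
```

On these numbers the 56 ps budget is about 2% of the flight time, a little tighter than the "about 3%" quoted, and dropping to 1.2 Gbit/s frees roughly 104 ps, in line with the "another 100 ps" estimate.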

del

Erik Magnuson

Aug 22, 2004, 1:31:38 AM

"del cecchi" <dcecchi...@att.net> wrote

>
> 16 inches of wire is 40 cm is 2.8 ns. I am told by our card guys that
> one might get 5% mistracking if one were lucky. If you use the whole 56
> ps for mistracking with no ISI, you need about 3%. That's not counting
> the ps (5?, 10?) that the connector throws in. Or ISI, depending on
> wire thickness, width, laminate material.

40 cm in 2.8 ns - works out to a velocity factor of 48% - I'm assuming
that's for traces on FR-4. The guys working on RF instrumentation can
get pretty persnickety about matching phase and attenuation in cable
assemblies. They could probably make a cable that would meet your specs. The
catch is that the cable assemblies would end up costing at least as much as
the processor and would be very sensitive to mishandling.
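
The velocity-factor figure can be reproduced, along with the effective dielectric constant it implies. This sketch assumes the simple TEM-line relation vf = 1/sqrt(er_eff); the function names are hypothetical.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum

def velocity_factor(length_m, delay_s):
    """Propagation speed along the trace as a fraction of c."""
    return (length_m / delay_s) / C_M_PER_S

def effective_dielectric_constant(vf):
    """Invert vf = 1 / sqrt(er_eff) for a TEM transmission line."""
    return 1.0 / vf ** 2

vf = velocity_factor(0.40, 2.8e-9)
print(round(vf, 3), round(effective_dielectric_constant(vf), 2))
```

The implied dielectric constant comes out near 4.4, which is in the range commonly quoted for FR-4 and so supports the guess about the board material.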

-Erik

Del Cecchi

Aug 23, 2004, 2:57:00 PM

"John Savard" <jsa...@excxn.aNOSPAMb.cdn.invalid> wrote in message
news:4122fa73...@news.ecn.ab.ca...

OK, how much stuff in a Cray1? Wasn't it made out of ECL gate arrays with
relatively few gates/chip? No Cache.
A little google turned up a hardware manual.

Cycle time 12.5ns
Chips were either 16 registers 1bit wide or a 4w nand and a 5w nand.
Just to be on the safe side let's pretend that they averaged 50 cmos gate
equivalents.

There were 1662 Modules with up to 288 IC/module.

That's only 24 Million Gates.

The WC gate delay was 1 ns. Even allowing for an increase in stages due to
ECL to CMOS, no problem.

It only had 8 MB of memory max (1M words).

Conclusion: Cray1 on a chip quite feasible and at several times the
performance.
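
The gate and memory figures multiply out as follows. The module, IC and gate counts are the ones quoted above; 64-bit words are assumed for the memory conversion.

```python
MODULES = 1662
MAX_ICS_PER_MODULE = 288
GATES_PER_IC = 50        # deliberately generous, per the post
WORDS = 2 ** 20          # 1M words
BYTES_PER_WORD = 8       # assuming 64-bit words

max_ics = MODULES * MAX_ICS_PER_MODULE
max_gates = max_ics * GATES_PER_IC
memory_mib = WORDS * BYTES_PER_WORD // 2 ** 20
print(max_ics, max_gates, memory_mib)
```

Since "up to 288" is a per-module maximum, the result is an upper bound on the gate count.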

del cecchi

Del Cecchi

Aug 23, 2004, 2:58:56 PM

"Erik Magnuson" <er...@tfb.com> wrote in message
news:25cf6e12.04082...@posting.google.com...

I said "card" which is IBMese for Printed wiring board. And that is what he
asked about.

Yes, can buy matched cables for those high speed scopes and parberts too.

del

Erik Magnuson

Aug 24, 2004, 12:44:52 AM

"Del Cecchi" <cecchi...@us.ibm.com> wrote in message news:<2out3hF...@uni-berlin.de>...

My bad on that one - the image in my head was flexible interconnects.

> Yes, can buy matched cables for those high speed scopes and parberts too.

On a related note - many microwave designers start turning their noses
up at FR-4 when frequencies go above 1 to 1.5 GHz. Main reason is that
the dielectric constant isn't very constant for FR-4 and the loss factor
is pretty significant and variable. I wonder if it would be worthwhile
using microwave grade material (e.g. Rogers Duroid) instead of FR-4 for
the backplane. The materials typically have a lower dielectric constant
(higher velocity factor) and more importantly, designed to be much more
consistent than FR-4.

OTOH, I would like to see the look on the PWB guys' faces when someone
suggests using PTFE (Teflon) as a PWB substrate - probably the same as
being told lunch is going to be Haggis and Lutefisk.

- Erik

Chris Morgan

Aug 25, 2004, 3:50:06 PM

"Del Cecchi" <cecchi...@us.ibm.com> writes:

> Conclusion: Cray1 on a chip quite feasible and at several times the
> performance.

For extra credit - how long until Cray1 performance in a Cray1
software emulator running on a Wal-Mart PC? I'm not being sarcastic,
it would tickle me if anyone has enough info for a useful
estimate. Personally I don't have a grasp of even the orders of
magnitude - is a Cray 1 like an 800 MHz PC, 8 GHz or what? I would like
to believe the Wal-Mart PC can't quite do it yet, but I'm not sure.

Not that it perhaps matters that much - IIRC when Seymour Cray was
told that Apple used a Cray in designing a Mac, he responded that that
was fine as he used a Mac to design the Cray.

Chris
--
Chris Morgan
"Post posting of policy changes by the boss will result in
real rule revisions that are irreversible"

- anonymous correspondent

Robert Myers

Aug 25, 2004, 8:44:40 PM

Chris Morgan wrote:

> "Del Cecchi" <cecchi...@us.ibm.com> writes:
>
>
>>Conclusion: Cray1 on a chip quite feasible and at several times the
>>performance.
>
>
> For extra credit - how long until Cray1 performance in a Cray1
> software emulator running on a Wal-Mart PC? I'm not being sarcastic,
> it would tickle me if anyone has enough info for a useful
> estimate. Personally I don't have a grasp of even the orders of
> magnitude - is a Cray 1 like an 800 MHz PC, 8 GHz or what? I would like
> to believe the Wal-Mart PC can't quite do it yet, but I'm not sure.
>

You can do the numbers yourself. 12.5ns cycle time = 80MHz. One
floating point add and one floating point multiply per clock = 160
megaflops.

One read and one store per clock is 160 megawords/second memory
bandwidth or 1280 megabytes/second, or just a tad more than PC133 SDRAM.
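
Written out, the numbers are as follows. The Cray figures come from the cycle time quoted earlier in the thread; the PC133 figure assumes the nominal peak of a 133 MHz, 64-bit bus.

```python
CYCLE_NS = 12.5
BYTES_PER_WORD = 8

clock_mhz = 1000 / CYCLE_NS                  # cycle time -> clock rate
peak_mflops = clock_mhz * 2                  # one add + one multiply per clock
mem_mwords_per_s = clock_mhz * 2             # one read + one store per clock
mem_mbytes_per_s = mem_mwords_per_s * BYTES_PER_WORD
pc133_mbytes_per_s = 133 * 8                 # nominal peak, 64-bit bus at 133 MHz
print(clock_mhz, peak_mflops, mem_mbytes_per_s, pc133_mbytes_per_s)
```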

The Walmart PC with SDRAM won't quite cut it on memory bandwidth, but it
is just about there in every other respect, with superior performance on
everything but the kind of memory-bound, vectorized code that was the
Cray 1's long suit. At that, the memory-bound problem had best not be
very big, because you only had a megaword of memory on the Cray 1.

I'll leave a comparison of the I/O capabilities to someone else.

> Not that it perhaps matters that much - IIRC when Seymour Cray was
> told that Apple used a Cray in designing a Mac, he responded that that
> was fine as he used a Mac to design the Cray.
>

The way that the Cray 1 has stuck in people's minds for so long is
striking. Perhaps it is just a case of imprinting: we don't change
archetypes very often, and the Cray 1 got imprinted as "The Supercomputer."

RM

del cecchi

Aug 25, 2004, 11:32:08 PM

"Robert Myers" <rmyer...@comcast.net> wrote in message
news:XnaXc.235444$eM2.137259@attbi_s51...

And it is pretty amazing when you consider it was done with 2 gates per
16 pin package, having 500 ps equivalent gate delay. And no high level
simulation and not much in the way of CAD programs.

Sort of like the software guys that did operating systems like OS/360 in
assembler with not much in the way of libraries, source control systems,
or symbolic debuggers. Instead, card readers and core dumps.

It really makes one think, sort of like visiting Embarrass, MN.

del cecchi

Jakob Engblom

Aug 26, 2004, 2:49:00 PM

> For higher numbers of CPUs, glueless becomes less of an issue.
> Especially considering what you need isn't glue logic, but HT-HT
> bridge/router chips. I'd be inclined to connect the cpus in "nodes"
> of 4 cpus each, in a classic 2D hypercube (aka a square). Use two of
> the spare HT connections (from diagonal CPUs) to connect to IO, the
> other two to connect to the router/bridge chip. The bridge chip has
> 5 HT connects- 2 to the node, 3 to other bridge chips. This allows me
> to connect eight nodes of four cpus up into a hypercube, for a total
> of 32 CPUs. Hello, high end server space. Probably again at a
> fraction of the cost of the big boys.

I think that this was precisely what Newisys just announced at
HotChips: their "Horus" chip connects up 4 Opterons to a small node,
and then routes between such fat nodes to form up to 32-way Opteron
systems.

http://zdnet.com.com/2100-1103_2-5323996.html

Rich Oehler

Aug 31, 2004, 11:18:55 AM

ja...@virtutech.com (Jakob Engblom) wrote in message news:<80c81770.04082...@posting.google.com>...

As one of the authors of the Horus 2004 Hot Chips Presentation and the
CTO of Newisys, I would like to add that our Horus chip has 4 local
HT links, one to each of the Opteron sockets in a quad and 3 external
links over which we use a modified HT protocol. We modified the HT
protocol to make it robust over a cable. Our protocol offers
guaranteed exactly once delivery. The following errors are
recoverable: Parity, Out of Band, LOS, FIFO overflow, CRC mismatch,
Loss of packet, Seq ID mismatch, and Illegal packet.

We also offer re-initialization of remote links without box reset,
which is useful for hot add and removal of quads as well as
partitioning.

Our protocol raises the 3 bit node_id limit of HT. Horus does this by
acting as a surrogate for all the resources that are connected via its
external links. Thus it behaves as the fifth Opteron in the local quad
maintaining the local HT protocol, while using our modified protocol
with its external resources. This is repeated for each of the local
quads by their Horus. A Horus in one quad is directly connected via
the external links to up to three other Horus in their quads.
Architecturally, Horus can address up to 32 sockets (5 bit node_id),
but when configured that way, there are additional hops and additional
latency introduced both at the Horus level and within the quads.
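
The addressing limits quoted here are plain powers of two. A minimal sketch (the function name is hypothetical; the bit widths are the ones stated above):

```python
def max_nodes(id_bits):
    """Distinct node IDs addressable with id_bits bits."""
    return 1 << id_bits

SOCKETS_PER_QUAD = 4

# 3-bit HT node_id, 5-bit Horus node_id, and quads needed for 32 sockets.
print(max_nodes(3), max_nodes(5), max_nodes(5) // SOCKETS_PER_QUAD)
```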

We strongly believe in providing significant I/O bandwidth
within each quad. Our current 4p model uses 2 of the 12 HT links for
I/O. (We have measured over 5.3 GB/s sustained I/O transfers on this 4p
box). The implication of using 2 HT links for I/O is that there is one
less cHT path available and therefore there is a 'cut through' path in
Horus. This path is optimized to minimize latency.

Our 'sweet spot' is somewhere between 8 and 16 sockets (2 to 4 quads).
For now, I would add that this matches the high end of the sweet spot
of the industry standard OSs (Windows and Linux) that would run on
boxes built with Horus. We went to 32 sockets to provide a platform on
which the OS developers can take the next steps.

Our design goal for local remote latency (comparing the access time of
a L2 miss to remote memory in a one Horus hop remote quad to local
memory on the L2 missing Opteron) is less than 3 to 1. We have also
augmented Horus with both an up to 64MB remote data cache (often
called a NUMA cache in the literature) and a remote directory. Both
of these additional resources will significantly reduce the overall
average memory latency. Our simulations of various TPC/C traces show
very good scaling to at least 16 sockets with both the single and dual
core Opterons.

We expect to have silicon later this year and prototype systems early
next year. Given this schedule, we will have proof of concept vehicles
mid next year.

Our business model is to sell to OEMs. OEMs will take our products to
market.
(We have had some success with this model, you should ask your
favorite OEM about their 2p and 4p Opteron offerings).

Rich Oehler


David Wang

Aug 31, 2004, 7:10:11 PM

Andi Kleen <fre...@alancoxonachip.com> wrote:

> rich....@newisys.com (Rich Oehler) writes:
> >
> > As one of the authors of the Horus 2004 Hot Chips Presentation and the

> [...]

> Thanks for the interesting mail.

> Is there any chance that you could put the slides and/or the paper
> online somewhere?

> Thanks,

> -Andi

Rich and his friends at Newisys have agreed to post the slides on
www.realworldtech.com. I'm working with the webmaster and he's
working with realworldtech's ISP to make sure that the site can
handle the extra bandwidth demands. The slides in pdf format are
quite large, about 3.3 MB (~2 MB in zipped format). If the file
is only downloaded 1000 times, that wouldn't be a problem, but
if it gets downloaded 10,000 times, it would be a problem. The
slides should be online later today or tomorrow.

Some (one?) of the architects may also answer questions about
Horus on comp.arch and/or realworldtech, subject to availability
of time, the depth and sensitivity of the questions asked. It's
not a commitment to do a Q&A session, but they would like to
help raise awareness of the chip and generate interest.

They designed a pretty large chip with a fairly small team....

--
davewang202(at)yahoo(dot)com

Bernd Paysan

Sep 1, 2004, 5:08:57 AM

David Wang wrote:

> The slides in pdf format are
> quite large, about 3.3 MB (~2 MB in zipped format).

Can you convince the PowerPoint user of this file to drop the background
image? PowerPoint slides with plain background give very small PDFs, while
a background image (even if it's the same for all pages) is stored as
separate object for each page (due to the way PowerPoint to PDF conversion
works - by printing).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

David Wang

Sep 1, 2004, 9:23:33 AM

Bernd Paysan <bernd....@gmx.de> wrote:
> David Wang wrote:

> > The slides in pdf format are
> > quite large, about 3.3 MB (~2 MB in zipped format).

> Can you convince the PowerPoint user of this file to drop the background
> image? PowerPoint slides with plain background give very small PDFs, while
> a background image (even if it's the same for all pages) is stored as
> separate object for each page (due to the way PowerPoint to PDF conversion
> works - by printing).

Sounds like a good idea. I'll make the request, but it'll have to go
through the chain of approval again. :)

--
davewang202(at)yahoo(dot)com

David Wang

Sep 1, 2004, 9:37:17 AM

On second thought, it probably won't help much at all. There is no
"background". It's just white. There are two pictures in the slides;
one picture is the test rig setup and the second picture is the
Horus team group photo. Also, lots of nice graphics. That's where
the size comes from, so 3.3/2.0 MB is what it will stay... I'll go
check where we are in getting the slides posted.

--
davewang202(at)yahoo(dot)com