Another place where there's alot of money

Robert Myers

Apr 9, 2004, 11:40:24 PM

And, I might add, probably a perpetually insatiable appetite for
throughput.

http://www.nytimes.com/2004/04/10/technology/10GAME.html?pagewanted=1&hp

<quote>

Computer games represent one of the fastest-growing, most profitable
entertainment businesses. Making movies, by contrast, is getting
tougher and more expensive, now costing, with marketing fees, an
average of $103 million a film. That is one reason, among others, that
those with power in Hollywood are avidly seeking to get into the game
business while also reshaping standard movie contracts so they can
grab a personal share of game rights.

<snip>

Ridley Scott, best known for science fiction fantasies like "Blade
Runner" and "Alien" as well as the historical epic "Gladiator," has
been meeting with video game company executives, too, arguing that
games offer greater creative opportunities these days because they are
less expensive to make and not constrained by the roughly two-hour
time frame of a conventional movie.

"The idea that a world, the characters that inhabit it, and the
stories those characters share can evolve with the audience's
participation and, perhaps, exist in a perpetual universe is indeed
very exciting to me," said Mr. Scott, who is seeking a video game
maker to form a partnership with him and his brother Tony.

</quote>

The article moves on to make some cautionary comments, likening the
current world of game authors to that of comic book writers.

The talent will go where the money is. If computer games attract the
best creative talent (apologies to those who believe that comic books
are high art), the money will go there, too. We ain't see nuttin'
yet.

RM

john jakson

Apr 10, 2004, 11:03:47 AM

Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...

Now this is where MTA could really take off. I wouldn't be surprised
if it's already in there somewhere.

regards

johnjakson_usa_com

Robert Myers

Apr 10, 2004, 5:27:30 PM

On 10 Apr 2004 08:03:47 -0700, johnj...@yahoo.com (john jakson)
wrote:

>Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...

<snip>


>>
>> The talent will go where the money is. If computer games attract the
>> best creative talent (apologies to those who believe that comic books
>> are high art), the money will go there, too. We ain't see nuttin'
>> yet.
>>

>


>Now this is where MTA could really take off, I wouldn't be surprised
>if its already in there somewhere.
>

I think Ian McClatchie has explained to us that indeed it is:

>Ian>I think some of the research you are talking about is happening
>for the gaming market, right now.
>
>Robert> Hardware and software support for lightweight threads.
>
>Ian>Graphics chips spawn multiple "threads" PER CYCLE.
>
>Robert> Hardware and software support for streaming computation.
>
>Ian>These threads coordinate access to memories.
>
>Robert> Tools, strategies, and infrastructure to make specialized hardware
>Robert> like ASIC's more easily available for use by science.
>
>Ian>Not here quite yet, since the gaming guys do single precision and
>you folks want DP mostly. But check out www.gpgpu.org. I honestly
>think that there is a real chance the graphics folks are going to
>sneak up on and overwhelm the CPU guys for physics codes, maybe
>unintentionally.

That's why I can't take IBM's entries into the supercomputer market
altogether seriously. Their best technologies are going into games.
The golden rule, you know.

Servers? Supercomputers? x86? B-o-o-o-ring.

RM

Rupert Pigott

Apr 10, 2004, 7:04:33 PM

The other day I was grinding my teeth at the minimal level of
commitment to open source from the 3D GFX card makers, because
the bloody drivers locked up when I dragged a window AND I
wanted to run them on something other than x86 Linux.

Then it occurred to me : You could probably get some mileage
out of running WireGL/Chromium on a small BG/L style system on
PCI-Express card. So there's one cool app that is games related
and could harness BG/L type nodes. Anything to stick two fingers
up at the 3D hardware vendors would be good right now. Unhappy
customer. :/

Cheers,
Rupert

Robert Myers

Apr 10, 2004, 8:09:35 PM

One way to read that, of course, is that they think they've really got
something worth hiding.

I just (like, last night, or, really, this morning) got my Promise
SATA RAID array running on a Promise-supplied driver compiled
from source (against an Enterprise Server SMP kernel, no less), and
(as you may or may not know) the history of Promise controllers and
Linux has not been a smooth one. It is, on the other hand, x86 Linux,
and I haven't yet booted from the RAID array.

The fact that Promise has completely caved and that there will be a
GPL driver native to the 2.6 Kernel fills me with hope.

I don't generally suffer from a shortage of self-confidence, but
graphics drivers is where I draw the line. I don't know what it is
_other_ than x86 Linux you are using, but, by the time you've gotten
to the desktop in a Linux system running X, you've got so many layers
and configuration files that serially redefine what the previous layer
defined that you're lucky if you've still got your sanity in hand.

I figure that using a GPU to do physics should, by comparison, be a
walk in the park.

>Then it occurred to me : You could probably get some mileage
>out of running WireGL/Chromium on a small BG/L style system on
>PCI-Express card. So there's one cool app that is games related
>and could harness BG/L type nodes. Anything to stick two fingers
>up at the 3D hardware vendors would be good right now. Unhappy
>customer. :/
>

It will never happen, of course, but as a way of throwing some
real-world, er, light ;-) on the whole issue, I'll bet you'd get top
billing on Slashdot for such a project. That way, we could get the
discussion down to some meaningful benchmarks, like frame rate in
Quake.

One day of fame and a swamped server in return for a hefty slice of
your sanity? You just didn't seem quite _that_ far gone. ;-).

On the _other_ hand, a small BG/L system on a PCI-Express card sounds
like it should be a day and a half's work for Del's shop, and I'll bet
IBM would sell a few.

RM

john jakson

Apr 11, 2004, 12:54:34 AM

Robert Myers <rmy...@rustuck.com> wrote in message news:<51pg70t6bevebrvb2...@4ax.com>...


I have been speculating that MTA will replace RISC just as RISC replaced
CISC; now that's worth some flame.

I had this horrible idea that might not be so horrible.

Since x86 went from plain CISC to riding along with RISC (and beat all
of them in the process), I wonder if it could jump again, onto MTA. I
see no reason why the actual instructions executed by the barrel
engine in the MTA could not be x86 codes, or at least the RISC subset
that most compilers emit.

In my design I already allow for some variable-length codes (16b x 1-4),
so that's not a problem. My branches cost near 0 cycles (1/8 cycle for
fetch-ahead), taken or not. The execution unit can bind 1 or 2 branches
with a non-branch, so there is a speed boost there of about 10-30%; no
branch prediction of any sort is needed here.

An x86 MTA would probably need 1 or 2 extra pipe stages to resolve the
more complex 8b instruction encodings. The codes I would leave out can
be microcoded in firmware, and the penalty for those is not so bad if
a little help is given in HW trapping. Probably not so practical in an
FPGA, but worth some thinking. The FPU is still a problem, though. If an
FPGA can hit 250MHz after P/R, then a full-effort custom VLSI should be
able to go 3-10x faster, limited only by the cycle rates of dual-port
RAM or 8b adds and similar.

It would allow all the benefits of MTA to combine with all the benefits
of having all that code, OSes, compilers etc. The details involved in
executing simplified x86 ops are really the same as clean-sheet ops
for a new MTA. The question is which is harder: writing a compiler for
a blank ISA, or just using x86 codes.

regards

johnjakson_usa_com

Robert Myers

Apr 11, 2004, 1:19:43 AM

On 10 Apr 2004 21:54:34 -0700, johnj...@yahoo.com (john jakson)
wrote:

<snip>


>
>I have been speculating that MTA will replace RISC just as it replaced
>CISC, now thats worth some flame.
>
>I had this horrible idea that might not be so horrible.
>
>Since x86 went from plain CISC to ride along with RISC (and beat all
>of them in the process), I wonder if it could jump again onto MTA. I
>see no reason why the actual instructions executed by the barrel
>engine in the MTA could not be x86 codes or atleast the RISC subset
>that most compilers emit for.
>

I've sort of taken it for granted that that's where x86 processors are
headed, but then, I'm not a computer architect. If you can't run the
pipeline faster because the energy costs of doing so are too high,
then you have to go to more pipelines. That's the paradigm shift or
whatever that Gelsinger was nattering on about.

There are only so many ways you can deploy multiple pipelines on a
single die. The only advantage that I can see to separate cores is
that the entire die doesn't have to be reachable in a single clock.
As a trade against that, you lose the possibility of close cooperation
among threads. I think we'll probably wind up with both (SMT and
CMP).

My reference to x86 as boring was really meant to refer to further
tweaks and die shrinks on the current architectures, not particularly
to the instruction set.

RM

David C. DiNucci

Apr 11, 2004, 3:27:49 AM

Robert Myers wrote:
> There are only so many ways you can deploy multiple pipelines on a
> single die. The only advantage that I can see to separate cores is
> that the entire die doesn't have to be reachable in a single clock.
> As a trade against that, you lose the possibility of close cooperation
> among threads. ...

I am apparently *still* (after literally all these years) missing
something very basic, so if someone has the time, maybe they can finally
help me get straight on this.

First, assuming that "separate cores" share no resources/stages, the
alternative is to share some resources/stages between threads, in which
case, to achieve the same maximum performance with multiple threads, the
clock needs to run fast enough to ensure that any such shared
resources/stages are not a bottleneck. That is one motivation for high
clock which is not present for multiple cores (and it seems to me to be
a major one).

Second, what kind of "close cooperation" is currently possible between
threads which share a core? I recall that the early Tera MTA only
allowed threads to interact through memory, so unless that has changed
significantly in hyperthreaded architectures (and I don't recall seeing
that it had), I don't know what "possibility" you are losing with
multiple cores.

So if my assumptions above are correct, and if multiple cores can be as
effective at sharing caches and integrating with other aspects of the
memory subsystem as their hyperthreaded counterparts, I am still led to
the conclusion that multiple cores get their bonus points from lower
clock speed (and therefore less heat) for similar (or better)
performance when there are enough threads to keep those cores busy, and
hyperthreaded cores get their bonus points from (1) possibly requiring
less floorplan/cost to manufacture support for n threads, and (2) the
ability to dynamically and productively allocate otherwise idle
resources/stages to active threads. I assume that these play off
against one another--i.e. that the ability to run the multiple cores
slower also allows them to be built smaller/simpler (with the Forth chip
cited earlier as an extreme in that direction), but that if multiple
cores do end up significantly larger than their hyperthreaded
counterparts, that would affect their ability to share caches as
effectively.

My sense is that the motivation for hyperthreading relates almost
entirely to #2, because having lots of threads is the exception rather
than the rule, because of the widely accepted belief that PARALLEL
PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
pipelines.

> ...I think we'll probably wind up with both (SMT and CMP).

Even with my silly assumptions, I can buy that: Enough CMP to handle
the minimum number of threads expected to be active, and SMT (i.e.
dynamically-allocatable pipe stages) to provide headroom. What I don't
buy (if it was implied) is that the CMP and SMT will be present in the
same processors, since there are good reasons for them to run at
different clock speeds. In fact, I might imagine SMTs (at high temp)
sprinkled among the CMPs (at lower temp) to even out the cooling.

-- Dave

Robert Myers

Apr 11, 2004, 9:24:13 AM

On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
>> There are only so many ways you can deploy multiple pipelines on a
>> single die. The only advantage that I can see to separate cores is
>> that the entire die doesn't have to be reachable in a single clock.
>> As a trade against that, you lose the possibility of close cooperation
>> among threads. ...
>
>I am apparently *still* (after literally all these years) still missing
>something very basic, so if someone has the time, maybe they can finally
>help me get straight on this.
>

I had forgotten you were in the Nick Maclaren camp on this one (might
as well put it that way, because that's the way I think of it). I had
(believe it or not) thought of starting the paragraph to which you are
replying more or less the way you started yours, which is that I
*still* don't understand the case for separate cores (except the one I
mentioned). Were die layout not a problem, you'd put everything in
the same place, and let who belonged to what sort itself out as
needed. Die layout, of course, is a consideration.

>First, assuming that "separate cores" share no resources/stages, the
>alternative is to share some resources/stages between threads, in which
>case, to achieve the same maximum performance with multiple threads, the
>clock needs to run fast enough to ensure that any such shared
>resources/stages are not a bottleneck. That is one motivation for high
>clock which is not present for multiple cores (and it seems to me to be
>a major one).
>

If you try to run helper threads on separate cores, you lose most of
the advantages because of the overhead of communicating through L2.
If you don't share L2, there is no point at all in helper threads.

I don't get your argument for a fast clock at all. You don't balance
resources by speeding up the clock. You balance resources by
balancing resources. Your best chance of getting an ideal match is
when all resources are in one big pool and can be assigned as needed
to the workload that requires them. Your worst chance of getting an
ideal match is to have the resource pool arbitrarily divided into
pieces. Simple queuing theory, as someone put it.
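
The queuing-theory point can be made concrete with a toy comparison
(a sketch with made-up arrival and service rates, not a model of any
real chip): one shared two-server queue beats two private single-server
queues carrying the same total load.

```python
def mm1_time_in_system(lam, mu):
    # Mean time in system for an M/M/1 queue: W = 1 / (mu - lam).
    assert lam < mu
    return 1.0 / (mu - lam)

def mm2_time_in_system(lam_total, mu):
    # Mean time in system for an M/M/2 queue (Erlang-C with c = 2).
    rho = lam_total / (2 * mu)
    assert rho < 1
    # Probability an arriving job has to wait (Erlang C, c = 2):
    c_wait = 2 * rho**2 / (1 + rho)
    return 1.0 / mu + c_wait / (2 * mu - lam_total)

# Same total load either way: two private queues at lam each,
# or one shared 2-server queue at 2*lam.
lam, mu = 0.8, 1.0
split = mm1_time_in_system(lam, mu)        # two arbitrary partitions
pooled = mm2_time_in_system(2 * lam, mu)   # one big resource pool
```

At this load the pooled pool roughly halves the mean time in system,
which is the "balance resources by pooling them" argument in one line
of algebra.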

>Second, what kind of "close cooperation" is currently possible between
>threads which share a core? I recall that the early Tera MTA only
>allowed threads to interact through memory, so unless that has changed
>significantly in hyperthreaded architectures (and I don't recall seeing
>that it had), I don't know what "possibility" you are losing with
>multiple cores.
>

Tera MTA is not a good way to think about the strategies that are
being discussed. The idea of Tera MTA was to have a slew of
_separate_ threads that advanced only every n clocks so that every
single instruction could have a latency of n without any other magic
at all.
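
That round-robin trick is easy to sketch (a toy cycle-counting model,
not the real Tera machine; the parameters are invented): with N threads
issued strictly round-robin, an operation whose result takes N cycles is
always ready by the time its thread comes around again, so the pipeline
never stalls.

```python
def barrel_run(n_threads, ops_per_thread, op_latency):
    # Count cycles to retire all ops under strict round-robin issue.
    cycle = 0
    ready = [0] * n_threads   # earliest cycle thread t may issue again
    done = [0] * n_threads    # ops retired per thread
    issued = 0
    total_ops = n_threads * ops_per_thread
    while issued < total_ops:
        t = cycle % n_threads              # barrel selects thread t
        if done[t] < ops_per_thread and ready[t] <= cycle:
            done[t] += 1
            issued += 1
            ready[t] = cycle + op_latency  # result lands op_latency later
        cycle += 1
    return cycle

# With thread count == latency, every slot issues: 800 ops in 800 cycles.
full = barrel_run(n_threads=8, ops_per_thread=100, op_latency=8)
# A single thread facing the same latency mostly idles.
alone = barrel_run(n_threads=1, ops_per_thread=100, op_latency=8)
```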

There is a paper I have cited so many times that I get tired of
looking it up that describes one of several strategies for using
helper threads to, in effect, expand the run-time scheduling window of
Itanium without introducing OoO scheduling. The one particular paper
examines using helper threads with SMT and CMP, and CMP loses big
time. Helper threads, by definition, are trying to advance a primary
thread and are not conceptually separate as would be the threads in
Tera MTA.
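
The flavor of the helper-thread idea can be sketched like this (the
address stream, the set-as-cache, and the run-ahead distance are all
invented for illustration; the real proposals run a stripped-down copy
of the main loop on a spare SMT context to touch memory early):

```python
def run_with_helper(addresses, run_ahead):
    # Simulate a primary thread whose helper prefetches run_ahead
    # iterations in front of it; count primary-thread cache hits.
    cache = set()
    hits = 0
    for i, addr in enumerate(addresses):
        # Helper thread: stay run_ahead iterations ahead, warming the cache.
        for j in range(i + 1, min(i + 1 + run_ahead, len(addresses))):
            cache.add(addresses[j])
        if addr in cache:
            hits += 1
        cache.add(addr)   # primary's own access also fills the cache
    return hits

addrs = list(range(100))                 # a cold, never-reused stream
misses_alone = len(addrs) - run_with_helper(addrs, run_ahead=0)
misses_helped = len(addrs) - run_with_helper(addrs, run_ahead=4)
```

On a cold stream the lone thread misses every time, while the helped
thread misses only on the very first access; the point is that the
speedup comes entirely from warming state, with no user-visible second
thread of control.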

The last time I mentioned the paper in question, Nick's response was,
more or less, "Oh that. Well, we've known for decades that you could
get that kind of speedup for a helper thread--only in privileged
mode." [Insert here a tart response to the "well we've known for
decades" line that I'm composing and going to attach to a hot key].

"Okay," I responded, drawing my breath in slowly, "what is it that we
need to do so that people can do these kinds of things in user space,
since I don't think the dozens of researchers publishing papers on
these strategies have been planning on running everything in
privileged mode." At which point Nick responded with a list of
requirements that look achievable to someone who is not a professional
computer architect.

>So if my assumptions above are correct, and if multiple cores can be as
>effective at sharing caches and integrating with other aspects of the
>memory subsystem as their hyperthreaded counterparts, I am still led to
>the conclusion that multiple cores get their bonus points from lower
>clock speed (and therefore less heat) for similar (or better)
>performance when there are enough threads to keep those cores busy,

As I've stated, if you could put everything in one place, there would
be no advantage at all to multiple cores (other than amortizing NRE by
printing cores on dies like postage stamps). Since you can't put
everything in one place, there is a balance between acceptable heat
distribution, die size, and being able to reach everything in a single
core in a single clock (although I suspect that even that constraint
is going to go by the board at some point).

>and
>hyperthreaded cores get their bonus points from (1) possibly requiring
>less floorplan/cost to manufacture support for n threads, and (2) the
>ability to dynamically and productively allocate otherwise idle
>resources/stages to active threads. I assume that these play off
>against one another--i.e. that the ability to run the multiple cores
>slower also allows them to be built smaller/simpler (with the forth chip
>sited earlier as an extreme in that direction), but that if multiple
>cores do end up significantly larger than the hyperthreaded
>counterparts, that would affect their ability to share caches as
>effectively.
>

The primary advantage of closely-coupled threads on a single core is
that data sharing is not done through L2 cache or memory.

>My sense is that the motivation for hyperthreading relates almost
>entirely to #2, because having lots of threads is the exception rather
>than the rule, because of the widely accepted belief that PARALLEL
>PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
>pipelines.
>

I think you are doing yourself and Software Cabling a disservice by
adopting this posture. While in theory you could use the machinery of
Software Cabling to address any level of parallelism, its natural
target is coarse-grained parallelism. The programmer can think
locally von Neumann, and let SC take care of the parallel programming
stuff globally. Helper threads are a way of letting the processor or
processor and supporting software do the fine-grained parallelism.

RM

john jakson

Apr 11, 2004, 11:27:31 AM

Robert Myers wrote in message

snipping

> I've sort of taken it for granted that that's where x86 processors are
> headed, but then, I'm not a computer architect. If you can't run the
> pipeline faster because the energy costs of doing so are too high,
> then you have to go to more pipelines. That's the paradigm shift or
> whatever that Gelsinger was nattering on about.
>

MTA doesn't have long pipelines; it's best to think of multiple wheels,
each with N faces. N might be 4, 8, or 16, but any small int may suffice.
Each of the N processes is different and is identified by a Pid. N is
chosen for the necessary pipeline depth needed to sustain reg-to-reg
typical int operations at highest speed. This means that if a process
returns N cycles later, it can pick up previous results, saving some
cache b/w. The multiple wheels may carry the same Pids or different Pid
sets.

The wheels have different functions: those on the instruction-fetch
side of the queue maintain the PC and push runs of ops terminated by
branches into the N queues faster than needed. Those on the exec side
are for int and branch cc ops, and possibly FPU, DSP or whatever. The
retired branch sends back offsets to the fetch unit. A process can't be
in fetch and exec at the same time, so the 2 sides are coupled only
through the exchange of branch targets, i.e. what to do next.

Is the pipeline N deep, or N*noOfWheels? And what about the queue? An op
may be fetched ahead and wait for 60 or so cycles before it goes to exec.
I'd call it an N-stage pipeline with maybe 60 latency stages.
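
A toy rendering of the decoupled fetch/exec arrangement (names and the
program encoding are made up for illustration, and the two sides run in
lockstep here, where the real design would overlap them): the fetch side
pushes branch-terminated runs of ops into a queue, the exec side drains
a run and hands the taken branch target back to fetch.

```python
from collections import deque

def run(program, start, steps):
    # program: {pc: (ops, next_pc)} -- each basic block is a list of ops
    # ending in a branch whose taken target is next_pc.
    queue = deque()
    pc = start
    executed = []
    for _ in range(steps):
        # Fetch side: push the next branch-terminated run into the queue.
        ops, target = program[pc]
        queue.append((ops, target))
        # Exec side: retire a whole run; the retired branch tells fetch
        # what to do next.
        ops, target = queue.popleft()
        executed.extend(ops)
        pc = target
    return executed

prog = {0: (["a", "b"], 4), 4: (["c"], 0)}   # two blocks branching in a loop
trace = run(prog, start=0, steps=3)
```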

regards

johnjakson_usa_com

Rupert Pigott

Apr 11, 2004, 12:16:55 PM

Robert Myers wrote:

> I had forgotten you were in the Nick Maclaren camp on this one (might
> as well put it that way, because that's the way I think of it). I had
> (believe it or not) thought of starting the paragraph to which you are
> replying more or less the way you started yours, which is that I
> *still* don't understand the case for separate cores (except the one I
> mentioned). Were die layout not a problem, you'd put everything in
> the same place, and let who belonged to what sort itself out as
> needed. Die layout, of course, is a consideration.

Hardware Schmardware. ;)

I've come to think that the real challenge is choosing a parallel
coding model that has as many lives as the von Neumann cat. The von
Neumann pussy needs to run out of lives first, and it appears to
have happened already. The current generation of killer micros are
basically a bunch of parallel pussies, I think the software will
follow, albeit slowly and painfully as it always has done. I have
some hope that I'll be quite happy with the state of affairs in a
couple of years time.

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 1:12:27 PM

On Sun, 11 Apr 2004 17:16:55 +0100, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

>Robert Myers wrote:
>
>> I had forgotten you were in the Nick Maclaren camp on this one (might
>> as well put it that way, because that's the way I think of it). I had
>> (believe it or not) thought of starting the paragraph to which you are
>> replying more or less the way you started yours, which is that I
>> *still* don't understand the case for separate cores (except the one I
>> mentioned). Were die layout not a problem, you'd put everything in
>> the same place, and let who belonged to what sort itself out as
>> needed. Die layout, of course, is a consideration.
>
>Hardware Schmardware. ;)
>

Except that, especially with OoO, hardware has done much of the heavy
lifting. As annoying as it may be to the real hardware architects to
have a bunch of software types going on about how hardware should be
designed, it's a subject that we software types simply cannot ignore.

I take a lot of heat, some of it clearly good-natured, about my
fixation on the Cray-1, but that was a machine that you could not
program effectively without understanding how it worked. Not only
that, the programming model was simple enough that I could get my
feeble brain around it quickly enough and get on with the physics.

Either life was a lot simpler then, or the world awaits another Seymour
to cut through the clutter and find the right choice of tricks that's
worth the bother and that can be utilized in actual practice. The
right answer is probably that it's not either/or, but both.

>I've come to think that the real challenge is choosing a parallel
>coding model that has as many lives as the von Neumann cat. The von
>Neumann pussy needs to run out of lives first, and it appears to
>have happened already. The current generation of killer micros are
>basically a bunch of parallel pussies, I think the software will
>follow, albeit slowly and painfully as it always has done. I have
>some hope that I'll be quite happy with the state of affairs in a
>couple of years time.
>

Once you could see the attack of the killer micros coming, _most_ of
what was going to happen could be foreseen with a five-and-dime crystal
ball, _including_ the fact that many processes would be running in
parallel. Two decades of fiddling and fumbling have passed, and you
think a couple of years is going to bring software bliss? Have you
been celebrating a religious holiday in some highly non-traditional
fashion? ;-).

RM

Rupert Pigott

Apr 11, 2004, 1:36:30 PM

Robert Myers wrote:

[SNIP]

> Once you could see the attack of the killer micros coming, _most_ of
> what was going to happen could be forseen with a five and dime crystal
> ball, _including_ the fact that many processes would be running in
> parallel. Two decades of fiddling and fumbling have passed, and you
> think a couple of years is going to bring software bliss? Have you
> been celebrating a religious holiday in some highly non-traditional
> fashion? ;-).

The key difference is : parallelism has been pervasive in hardware
that everyday programmers can get their hands on for a good decade
now. You can see the attitude changing, Threads are hip and cool,
if you look at the job ads on this side of the pond you'll see a
lot of them making explicit mention of "Threading Skills" in the
job specs. What's more : The hardware is actually getting even more
parallel too and in the not too distant future SMP (or at least
SMT systems) will be common as muck.

So while I think shared memory threads are the spawn of Satan, I
don't mind the side effect that they give me lots of parallel HW
to play with. This HW will probably run just as well with my CSP
monkey business as it will with shared memory threads too.

If you really want the Cray simplicity, I think you are barking up
the wrong tree with SMT. The threads of execution are intricately
coupled by (userspace-invisible) implementation-dependent resource
contention issues. I suspect that this helper-thread concept will
pan out the same way too, though that doesn't stop someone from
proving me wrong.

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 2:33:55 PM

On Sun, 11 Apr 2004 18:36:30 +0100, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

<snip>


>
>If you really want the Cray simplicity I think you are barking up
>the wrong tree with SMT. The threads of execution are intricately
>coupled by (userspace invisible) implementation dependant resource
>contention issues. I suspect that this helper thread concept will
>pan out the same too, that doesn't stop someone from proving me
>wrong though.
>

No, the days of sweetly crystalline Cray programming are gone forever.

Only people like Terje and Linus (what is it with these guys from
Scandinavia, anyway?) worry about the real effects of OoO on a
day-to-day basis. The rest of us simply accept that the processor
somehow gets away with running twenty times as fast as the memory bus
and get on with business.

When I mentioned my five-and-dime crystal ball, the one thing I don't
think most people would have been able to foresee was the ability to
keep a couple of hundred instructions in flight by trickery that's
invisible to all but the most detail-oriented of programmers.

My crystal ball says that helper threads are going to work the same
way. The helper threads that apparently so horrify you are just
another turn of the screw. If you've got a few hundred instructions
in flight with predication, speculation, hoisting, and fixup already,
what's a few hundred more here or there?

A bitch to debug? I would imagine so, but I just _know_ you're not
going to tell me that CSP threads are easy.

RM

Rupert Pigott

Apr 11, 2004, 5:44:53 PM

Robert Myers wrote:

> A bitch to debug? I would imagine so, but I just _know_ you're not
> going to tell me that CSP threads are easy.

Fundamentally you still have to tackle the same synchronisation
issues that the problem throws up, but you don't add a whole bunch
of *hidden* coupling (eg : memory contention).

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 7:52:47 PM

I _think_ you're worried about a non-problem, but I'm willing to be
educated. Helper threads don't introduce user-visible separate
threads or user-visible memory sharing, and the processor maintains
the illusion of in-order execution.

Things can go horrifyingly wrong for hardware architects and compiler
designers, but it's up to them to see to it that life is no harder for
end-users than life would be without helper threads.

Once you've lost deterministic execution--and you have with OoO--you
can't rely on single-step debugger execution to expose what's actually
going on, but that's not a problem that's newly-introduced by helper
threads.

RM

Hank Oredson

Apr 11, 2004, 8:56:34 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:cp2j701vcij39q676...@4ax.com...

And this all happens on each of the 64 processors on the motherboard.

But maybe the 32 processor chip, two per box, will not happen quite
as soon as I think it will ...

> A bitch to debug? I would imagine so, but I just _know_ you're not
> going to tell me that CSP threads are easy.
>
> RM


--

... Hank

http://horedson.home.att.net
http://w0rli.home.att.net


del cecchi

Apr 11, 2004, 11:17:43 PM

"David C. DiNucci" <da...@elepar.com> wrote in message
news:4078F375...@elepar.com...

> Robert Myers wrote:
> > ...I think we'll probably wind up with both (SMT and CMP).
>
> Even with my silly assumptions, I can buy that: Enough CMP to handle
> the minimum number of threads expected to be active, and SMT (i.e.
> dynamically-allocatable pipe stages) to provide headroom. What I don't
> buy (if it was implied) is that the CMP and SMP will be present in the
> same processors, since there are good reasons for them to run at
> different clock speeds. In fact, I might imagine SMTs (at high temp)
> sprinkled among the CMPs (at lower temp) to even out the cooling.
>
> -- Dave

You guys got a heck of a crystal ball. It can see all the way to 2003.
:-)
http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf


Robert Myers

Apr 12, 2004, 1:38:00 AM

A fact I took note of as recently as April 2, under the nominal
subject heading Re: [OT] Microsoft aggressive search plans revealed:

RM>Power 4 isn't multi-threaded, but Power 5 is, and I'll start to get
RM>really excited when this new openness makes a multi-threaded core
RM>available to play with as IP. ;-).
RM>
RM>(I know, Del, there's just no pleasing some people).

Slide Number 12 and the slides before it make it clear that the Power
5 implementation of SMT addresses the issue of what to do when having
multiple threads active is a disadvantage (has Nick said anything
about that?). Slide 11 shows an example with a max payoff of 25%,
pretty well in line with what has most often been reported with Intel
Hyperthreading.

RM

Maynard Handley

Apr 12, 2004, 5:14:45 PM

In article <4078F375...@elepar.com>,

"David C. DiNucci" <da...@elepar.com> wrote:

> My sense is that the motivation for hyperthreading relates almost
> entirely to #2, because having lots of threads is the exception rather
> than the rule, because of the widely accepted belief that PARALLEL
> PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
> pipelines.
>
> > ...I think we'll probably wind up with both (SMT and CMP).
>
> Even with my silly assumptions, I can buy that: Enough CMP to handle
> the minimum number of threads expected to be active, and SMT (i.e.
> dynamically-allocatable pipe stages) to provide headroom. What I don't
> buy (if it was implied) is that the CMP and SMP will be present in the
> same processors, since there are good reasons for them to run at
> different clock speeds. In fact, I might imagine SMTs (at high temp)
> sprinkled among the CMPs (at lower temp) to even out the cooling.

It is not useful to say that PARALLEL PROGRAMMING IS HARD when there are
very different things being discussed.
At the per-chip level, the issue is whether it is hard to have a CPU
running four threads or so at a time. At the Robert Myers level, the
issue is whether it is hard to have a computer running 10,000 threads at
a time. I can well believe that the second is a very hard problem.
The first, however, is right now an artificially hard problem. It is
artificially hard because while there are frequently plenty of fragments
of code that can be parallelized, they are maybe 1000 instructions long,
so the OS overhead in doing so makes the parallelization impractical.
The overhead comes from
- CPU-level overhead that assumes modifications made in this thread need
to propagate out (to another CPU), and the CPU has to wait till that has
happened, and
- OS overhead that assumes that it's an unacceptable waste of resources
to have a (virtual) processor sit idle, and so enforces switches into
and out of the kernel, moving threads onto and off run lists and so on, as
synchronization resources are acquired and released.

Now if we update the programming model to describe that this collection
of threads is tightly coupled and should run as a single unit on a
single physical CPU, we can get rid of the CPU level overhead because
we'd know that the mods made by thread 0A are in the cache of CPU 0 and
will be seen by thread 0B without any extra hard work on the part of the
CPU. The problem, of course, now is that one has to annotate the code
correctly to specify that THIS sync operation only involves syncing with
THAT thread, so is local, while some other sync operation is global and
needs to be propagated to all CPUs. One can see a programming model that
works well with say 2 or 4 threads but falls apart after that.
Likewise we need to be able to tell the OS that these two threads are
tightly coupled and when one waits on the other, the wait should occur
through some hardware wait mechanism, with the waiting thread going into
an HW idle state of some sort, rather than context switching to some
alternative waiting task.

Both of these are feasible, in the sense that it's not too much of a
stretch to modify the HW and SW to get them to happen.
The PROBLEM really, is that they aren't very fwd or bwd compatible.
They're not fwd compatible because the mechanisms they use to try to be
efficient (to make threading small fragments of code worthwhile) fall
apart when you have too many threads to keep track of and too many
interactions --- it's just too hard to ensure that these memory updates
only affect the threads local to this CPU and don't need to be
propagated to other CPUs. And they're not backward compatible because
code written assuming that it can run as two physical threads, with no
overhead cost for threading code fragments a thousand cycles long, will
run like a slug on a single-threaded CPU or a standard DP machine, with
OS intervention every 1000 cycles to switch to the other thread.

So that's the problem as I see it.
Maybe someone smarter than me can construct some primitives that do
allow scaling (in at least either the fwd or bwd direction) that will
allow one to thread these tiny fragments of code. But in the absence of
that, life sucks. The issue is not that the parallelism doesn't exist
but that one cannot usefully get to it through today's HW and OS
abstractions.

Maynard

Stephen Sprunk

unread,
Apr 12, 2004, 5:30:25 PM4/12/04
to
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:en9k70pm2khm2co4r...@4ax.com...

> On Sun, 11 Apr 2004 22:17:43 -0500, "del cecchi"
> <dcecchi...@att.net> wrote:
> >You guys got a heck of a crystal ball. It can see all the way to 2003.
> >:-)
> >http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf
>
> Slide 11 shows an example with a max payoff of 25%,
> pretty well in line with what has most often been reported with Intel
> Hyperthreading.

And with slide 5 saying that SMT adds 24% to the size of each core (as
opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
provides no net gain in performance per mm^2 (1.25x the performance for
1.24x the area).

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Tony Nelson

unread,
Apr 12, 2004, 6:05:43 PM4/12/04
to
In article <07c94d238480b9ff...@news.teranews.com>,
"Stephen Sprunk" <ste...@sprunk.org> wrote:
...

> And with slide 5 saying that SMT adds 24% to the size of each core (as
> opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
> provides no net gain in performance per mm^2.

Nowadays I bet designers are happy when they get performance increases
proportional to area increases.
____________________________________________________________________
TonyN.:' tony...@shore.net
'

Robert Myers

unread,
Apr 12, 2004, 6:33:51 PM4/12/04
to
On Mon, 12 Apr 2004 21:30:25 GMT, "Stephen Sprunk"
<ste...@sprunk.org> wrote:

>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:en9k70pm2khm2co4r...@4ax.com...
>> On Sun, 11 Apr 2004 22:17:43 -0500, "del cecchi"
>> <dcecchi...@att.net> wrote:
>> >You guys got a heck of a crystal ball. It can see all the way to 2003.
>> >:-)
>> >http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf
>>
>> Slide 11 shows an example with a max payoff of 25%,
>> pretty well in line with what has most often been reported with Intel
>> Hyperthreading.
>
>And with slide 5 saying that SMT adds 24% to the size of each core (as
>opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
>provides no net gain in performance per mm^2.
>

A study c. 1996 by Patterson et al. showed that even OoO processors
stalled 60% of the time on OLTP workloads, making OLTP workloads a
natural target for SMT. Whether IBM got it right or not this time
seems almost beside the point. It amazes me it took them this long to
get into the game.

RM

Robert Myers

unread,
Apr 12, 2004, 7:18:54 PM4/12/04
to
On Mon, 12 Apr 2004 21:14:45 GMT, Maynard Handley
<nam...@redheron.com> wrote:

<snip>


>
>So that's the problem as I see it.
>Maybe someone smarter than me can construct some primitives that do
>allow scaling (in at least either the fwd or bwd direction) that will
>allow one to thread these tiny fragments of code. But in the absence of
>that, life sucks. The issue is not that the parallelism doesn't exist
>but that one cannot usefully get to it through todays HW and OS
>abstractions.
>

So, you've got some dumb-ass von Neumann module that you want to run a
few thousand copies of. You modify the API, however it looks, to fit,
say, software cabling. Software cabling doesn't know or care what's
inside the module, it just knows and enforces the rules under which
the module can be invoked and how the module is allowed to transmit
and receive data. Software cabling invokes the module, assigns it to
a multi-threaded processor, and the magic of transparent
parallelization takes over. Software hasn't got a clue that there's
even more parallelization going on inside the box. Simple, no?

RM

del cecchi

unread,
Apr 12, 2004, 8:47:42 PM4/12/04
to

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:o46m70tusl425tiog...@4ax.com...
You will recall, I'm sure, that the farmers on the tundra had
multithreading (2) on Northstar. Yes, I recall that some in comp.arch
said it is not "real" multithreading because it was somewhat coarser
grained, switching threads on a cache miss. In the Power4 it was
decided to have two whole cores rather than two threads in one core.
Now in the Power5 they have upgraded the cores to have multithreading.

It's a funny thing. There is at least one department of folks whose job
it is to study the effects of microarchitecture trade offs on
performance. They are armed with a great deal of information about
workloads and with special simulators and well-honed models. Too bad
they can't do as well as some spectator Monday morning quarterback. I
guess they aren't smart enough. Or maybe the chip has to get out on
schedule.

del cecchi


Robert Myers

unread,
Apr 12, 2004, 9:24:33 PM4/12/04
to
On Mon, 12 Apr 2004 19:47:42 -0500, "del cecchi"
<dcecchi...@att.net> wrote:

>You will recall, I'm sure, that the farmers on the tundra had
>multithreading (2) on Northstar.

No, I didn't.

>Too bad
>they can't do as well as some spectator monday morning quarterback. I
>guess they aren't smart enough. Or maybe the chip has to get out on
>schedule.
>

You can read it that way if you want to, or you can read it as my
saying there's something about this I don't understand. The trades
are obviously complicated: people have taken hard stands on both sides
of the issue.

Neither IBM nor Intel says, "This is the real reason this chip does or
does not have this particular feature." Marketing will say stuff, of
course, but I don't think you would expect anyone to pay much
attention to it.

The chip comes out, and it's left to Monday morning quarterbacks to
try to figure out what's really going on. Hyperthreading for Intel is
kind of marginal. It's a natural thing for Intel to try, because it's
a way to try to recapture some of the IPC they gave up in going to a
longer pipeline.

My (Monday morning quarterback) belief, though, is that much more
significant hyperthreading is their game plan for Itanium, and that
Hyperthreading was really an R&D project for Itanium with a little
marketing pizazz as a gimme. In particular, since hyperthreading has
come out for the P4, it has been all but useless for a big chunk of P4
applications. That may change, especially with the new instructions
in SSE3.

For IBM, the story is different. As I understand SMT, it's a clear
win for OLTP workloads, and IBM won't have boxes in CompUSA with "SMT
included" on them. That means that SMT is included or not included
strictly on the merits, unless, like Intel, they are looking more to
the future than to the present.

The fact that IBM has had SMT, then didn't have SMT, and now has SMT,
all with OLTP presumably as the target application, doesn't leave me
thinking I'm smarter than the guys with all the numbers. It leaves me
wondering if I know what this is really all about.

RM

del cecchi

unread,
Apr 12, 2004, 11:43:13 PM4/12/04
to

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:guem70h7s9tmdjic8...@4ax.com...

Actually, you might not have been hanging around here in Northstar days.
It was a couple of processor iterations before Power4. Back in the Old
Days. :-) And the SMT was sort of coarse, to the point that some here
refused to call it that.

I think that on any design there are always a number of ways to meet the
objectives. And the objectives include cost (mfg and development) and
schedule as well as performance and power and that stuff. Northstar
designers thought they could get a performance boost for OS400 at an
affordable cost by building a two thread SMT that switched threads on a
cache miss. Power4 thought they would be better off buying more silicon
and putting two processor cores on the chip, with each core being more
complicated than the Nstar core. Different strokes for different folks,
as we used to say.

In the case of Power5, I'm sure the question was how best to get a
performance boost out of the chip using the enhanced density. Make it 4
cores? Add some resource and convert the existing core to SMT? Build a
new spiffy core with 57 stages in the pipe? 128 bit FPU?

decisions decisions. So many transistors, so little time.

del cecchi


Terje Mathisen

unread,
Apr 13, 2004, 1:55:48 AM4/13/04
to

Maynard, in my simplistic view, this should be feasible today, using the
current 'lockless' primitives:

Multiple threads on a single physical cpu will get _very_ good
performance using such primitives since all the relevant (L2) cache
lines will be exclusively owned, right?

The real problem is the need for the OS to get out of the way, and in
particular not do a lot of the _really_ stupid things that still happen
today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
over all available cpus, instead of giving it a little cpu affinity by
default. :-(

As soon as you have more cpu-hogging threads than you have physical
resources for, you'll lose, but allocating N-1 threads on a N-thread
machine, and then assuming the OS will be clueful enough to keep them
stable isn't too much to expect, is it?

When you need to setup an NxM array, with N threads on each of M cpus,
then you need a little more help, but as long as you're willing to (more
or less) manually allocate threads to (virtual) cpus that should be enough.

OTOH, so far I haven't found any way (under WinXP) to even figure out
how my cpus are numbered!

Is it (0,1),(2,3) for (cpu0-thread0, cpu0-thread1),(cpu1-thread0,
cpu1-thread1) or does it start by going across the physical cpus:

(cpu0-thread0, cpu1-thread0),(cpu0-thread1, cpu1-thread1)

Any hints?

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

David C. DiNucci

unread,
Apr 13, 2004, 4:14:23 AM4/13/04
to
Robert Myers wrote:
>
> On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
> <da...@elepar.com> wrote:
> >I am apparently *still* (after literally all these years) missing
> >something very basic, so if someone has the time, maybe they can finally
> >help me get straight on this.
> >
> I had forgotten you were in the Nick Maclaren camp on this one (might
> as well put it that way, because that's the way I think of it).

I am willing to have it characterized as so, though I have no doubts
that my thinking disagrees with Nick's in at least some important ways,
and I have no intention of putting words into his mouth (or thoughts
into his fingers).

> ... I had


> (believe it or not) thought of starting the paragraph to which you are
> replying more or less the way you started yours, which is that I
> *still* don't understand the case for separate cores (except the one I
> mentioned). Were die layout not a problem, you'd put everything in
> the same place, and let who belonged to what sort itself out as
> needed. Die layout, of course, is a consideration.

Describing the issue as "die layout" makes it sound more specific than
it really is. The issue is the latency required to schedule resources
to a thread and then the related issue of moving data between those
resources. You seem to be advocating allocating resources at a very fine
granularity, and then dynamically routing the data to and from them as
needed. This is essentially a traditional dataflow philosophy, and that
might explain any perceived similarity between Nick's aversion to it and
my own: We've seen it before. It's easy to go in pretending that
there's no latency because each portion is so tiny, but you experience
that latency with every operation, or in this case, every functional
unit, for both scheduling and routing. It's death by a million cuts.
Then you hope that you can hide that latency by parallelism, and maybe
you can to some extent, but the parallelism would still do you more good
if you didn't have the latency.

Pipelines are nice because you schedule several resources at a time
(i.e. all the stages in the pipe), yet you share them at a finer
granularity (one stage at a time), and since the sharing follows a fixed
pattern, successive stages (in time) can be made close to one another
(in space). To the extent that streaming architectures can be considered
as a configurable pipe, that's great, but even then you miss out on some
of those spacetime proximities.

Maybe this is the basis of our different views on this subject. You
consider the functional units more as a pool of resources allocatable at
will, while I regard them more as a pipe. If you share some stages of a
pipe, those stages need to operate faster if they are not to be a
bottleneck. Even *I* can see the merit of tossing instructions from
different threads into a single pipe, assuming that there would be
bubbles otherwise, but that's apparently not what we're discussing.

> If you try to run helper threads on separate cores, you lose most of
> the advantages because of the overhead of communicating through L2.
> If you don't share L2, there is no point at all in helper threads.

This is the very reason I stated my cache-sharing assumption below--i.e.


> >So if my assumptions above are correct, and if multiple cores can be as
> >effective at sharing caches and integrating with other aspects of the

> >memory subsystem as their hyperthreaded counterparts, ...

But returning back:

> I don't get your argument for a fast clock at all. You don't balance
> resources by speeding up the clock. You balance resources by
> balancing resources. Your best chance of getting an ideal match is
> when all resources are in one big pool and can be assigned as needed
> to the workload that requires them. Your worst chance of getting an
> ideal match is to have the resource pool arbitrarily divided into
> pieces. Simple queuing theory, as someone put it.

Except you're leaving the queues (pipes) out of your queuing theory. If
I remember, you were the one talking about the importance of minimizing
data movement, etc., and now you seem to be advocating routing data
almost arbitrarily to get it from one FU to the next. And you are
moving and changing more state in order to accommodate that dynamic
routing. I'm getting flashbacks to tags and matching store in dataflow
machines.

And the chips get hotter and hotter.

<snip>


> "Okay," I responded, drawing my breath in slowly, "what is it that we
> need to do so that people can do these kinds of things in user space,
> since I don't think the dozens of researchers publishing papers on
> these strategies have been planning on running everything in
> privileged mode." At which point Nick responded with a list of
> requirements that look achievable to someone who is not a professional
> computer architect.

So, I read this as an explanation of your earlier statement regarding


"the possibility of close cooperation among threads".

> >So if my assumptions above are correct, and if multiple cores can be as


> >effective at sharing caches and integrating with other aspects of the
> >memory subsystem as their hyperthreaded counterparts, I am still led to
> >the conclusion that multiple cores get their bonus points from lower
> >clock speed (and therefore less heat) for similar (or better)
> >performance when there are enough threads to keep those cores busy,
>
> As I've stated, if you could put everything in one place, there would
> be no advantage at all to multiple cores (other than amortizing NRE by
> printing cores on dies like postage stamps).

I won't spend time arguing statements which we agree are based on false
premises.

> Since you can't put
> everything in one place, there is a balance between acceptable heat
> distribution, die size, and being able to reach everything in a single
> core in a single clock (although I suspect that even that constraint
> is going to go by the board at some point).

And, as I state above, I believe you are oversimplifying. The problem is
not just (and maybe even not primarily) distributing the clock. It's
allocating resources and routing data between them, all in the least
amount of time and state changes. I believe that fine-grain allocation
and routing is not the way to best accomplish that.

> >and
> >hyperthreaded cores get their bonus points from (1) possibly requiring
> >less floorplan/cost to manufacture support for n threads, and (2) the
> >ability to dynamically and productively allocate otherwise idle
> >resources/stages to active threads. I assume that these play off
> >against one another--i.e. that the ability to run the multiple cores
> >slower also allows them to be built smaller/simpler (with the Forth chip
> >cited earlier as an extreme in that direction), but that if multiple
> >cores do end up significantly larger than the hyperthreaded
> >counterparts, that would affect their ability to share caches as
> >effectively.
> >
>
> The primary advantage of closely-coupled threads on a single core is
> that data sharing is not done through L2 cache or memory.

So, it seems clear that you regard threads running on a single SMT core
as being related differently (in terms of programming model) than those
that might run anywhere else.

> >My sense is that the motivation for hyperthreading relates almost
> >entirely to #2, because having lots of threads is the exception rather
> >than the rule, because of the widely accepted belief that PARALLEL
> >PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
> >pipelines.
> >
>
> I think you are doing yourself and Software Cabling a disservice by
> adopting this posture. While in theory you could use the machinery of
> Software Cabling to address any level of parallelism, its natural
> target is coarse-grained parallelism.

I didn't say anything about Software Cabling, and even if it had been
floating around in the back of my mind (as it is wont to do), I
certainly wasn't advocating its use in fine-grain parallelism. I do tend
to believe that there is a tendency to rely on fine-grained parallelism
when coarse-grain would work better, for the very reasons I've already
stated.

> ... Helper threads are a way of letting the processor or


> processor and supporting software do the fine-grained parallelism.

So am I to understand that you see the primary value of SMT as its
ability to support these helper threads to prefetch data into cache?

And thanks, I do believe that you answered my original question,
-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

David C. DiNucci

unread,
Apr 13, 2004, 4:15:58 AM4/13/04
to

Well, yes and no.

First, if all you want to do is run a few thousand copies of some von
Neumann module (dumb-ass or otherwise), Software Cabling is probably
overkill for you, but there are lots of other things that aren't. If you
don't want to do it by hand, Ninf comes to mind, but I assume projects
like Gridbus and Globus have similar tools, and for that matter, even
something like Linda Piranha.

Second, what you say is true, to the extent that the magic of
transparent parallelization exists. Obviously, if there's some black
box out there that will run my d-avNm faster than the yellow box next to
it, then all else being equal, I'll choose the black box. But if I'm
paying more to buy the black box, and/or power it, and/or cool it, my
choice might very well be different. And if I can make the cheaper,
cooler, less power-hungry yellow one actually go faster than the black
one just by providing my code to it in a different yet very programmable
and understandable form (and, yes, I am thinking of SC in this case), I
would personally be far more interested in the yellow one--though I'm
sure there are many others who would not be. In the end, it depends on
which market niche you're going for, and I've chosen mine.

Nick Maclaren

unread,
Apr 13, 2004, 4:10:18 AM4/13/04
to

In article <cp2j701vcij39q676...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|>
|> Only people like Terje and Linus (what is it with these guys from
|> Scandinavia, anyway?) worry about the real effects of OoO on a
|> day-to-day basis. The rest of us simply accept that the processor
|> somehow gets away with running twenty times as fast as the memory bus
|> and get on with business.

Well, I would, if I didn't spend most of my time dealing with much
more elementary implementation imbecilities :-(

The real effects of out-of-order execution can be VERY visible
to interrupt handlers, but how many systems provide competent
application-level interrupt handling nowadays?


Regards,
Nick Maclaren.

Maynard Handley

unread,
Apr 13, 2004, 4:39:24 AM4/13/04
to
In article <c5fvd5$em4$1...@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

> Maynard, in my simplistic view, this should be feasible today, using the
> current 'lockless' primitives:
>
> Multiple threads on a single physical cpu will get _very_ good
> performance using such primitives since all the relevant (L2) cache
> lines will be exclusively owned, right?
>
> The real problem is the need for the OS to get out of the way, and in
> particular not do a lot of the _really_ stupid things that still happen
> today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
> over all available cpus, instead of giving it a little cpu affinity by
> default. :-(

Well:
* Yes the HW stuff is available in that load-locked/store-conditional
will work. The problem is that you're not supposed to just use
load-locked/store-conditional, you're also supposed to throw some sort
of synchronizing instruction in there. I lose track of the details from
CPU to CPU, but my understanding is that you COULD get away without that
synchronizing instruction if you only cared about synching to this CPU.
Of course since IBM doesn't make any SMT CPUs it's a moot point.
Now two POWER4 cores sharing a die and an L2 presumably matches this
situation, but I don't know what IBM recommend has to be done to sync
one CPU of the die with the other CPU, and with no interest in off-die
CPUs. I don't know if they consider that an interesting problem ---
perhaps not, because they tend to sell POWER4s in boxes with many CPUs.
* Of course we agree on the need for the OS to get out the way.

But, as I said before, the summary is that the sweet spot for how to do
this is neither fwd nor bwd compatible, which is why I am uncertain
about how things will play out.

> As soon as you have more cpu-hogging threads than you have physical
> resources for, you'll lose, but allocating N-1 threads on a N-thread
> machine, and then assuming the OS will be clueful enough to keep them
> stable isn't too much to expect, is it?

Even that, however, assumes you're willing to write code for say 1, 2
and 4 virtual-CPU machines. That's tough when you want such fine-grained
threading. It's a real hassle to keep work like that in sync, and
current IDEs don't do a good job of allowing you either to hide one view
of the code (so you can concentrate on say only the 2-way threading
code) or conversely show the different (but equivalent) code paths side
by side.
Perhaps it's no different from writing code to run on three different
architectures, and appropriate factoring, macros and inline functions
can help? Time will tell.

> When you need to setup an NxM array, with N threads on each of M cpus,
> then you need a little more help, but as long as you're willing to (more
> or less) manually allocate threads to (virtual) cpus that should be enough.

But at this point your life gets really tough. The language primitives
(if you're using Java or COM) or the OS primitives (eg pthreads) really
aren't set up to express [this data structure (mutex, semaphore,
whatever) here needs to be synced with these local threads, while that
one there needs to be synced with all threads].
way to express that at this point I want the action of acquiring a mutex
to involve only a load-locked/store-conditional loop WITHOUT the extra
"publish to the other CPUs" work, or at this point I want the act of
waiting for a mutex to involve a spin-loop, not an OS queue.

> OTOH, so far I haven't found any way (under WinXP) to even figure out
> how my cpus are numbered!
>
> Is it (0,1),(2,3) for (cpu0-thread0, cpu0-thread1),(cpu1-thread0,
> cpu1-thread1) or does it start by going across the physical cpus:
>
> (cpu0-thread0, cpu1-thread0),(cpu0-thread1, cpu1-thread1)
>
> Any hints?

Terje, I'm an MacOS X PPC boy! What can I tell you about WinXP?
Of course in MacOS X land we haven't even hit this problem yet --- but
I'm sure we will soon enough.

Maynard

Maynard Handley

unread,
Apr 13, 2004, 4:42:31 AM4/13/04
to
In article <hd8m70tpn8avvpt24...@4ax.com>,
Robert Myers <rmy...@rustuck.com> wrote:

Sure. And if you're ray-tracing or web-serving life is likewise easy. As
we went through a big song-and-dance a few weeks ago, however, those
problems don't interest me much, and aren't much relevant to the world
that's the bread-and-butter of Windows and MacOS X. What is interesting
in that world (eg my MPEG <B>DECODE</B> example) is that there is
parallelism there, but it is fine-grained --- and can't efficiently be
accessed through current language/OS models.

Maynard

Maynard Handley

unread,
Apr 13, 2004, 4:51:03 AM4/13/04
to
In article <c5fnle$13su7$1...@ID-129159.news.uni-berlin.de>,
"del cecchi" <dcecchi...@att.net> wrote:


> Actually, you might not have been hanging around here in Northstar days.
> It was a couple of processor iterations before Power4. Back in the Old
> Days. :-) And the SMT was sort of coarse, to the point that some here
> refused to call it that.
>
> I think that on any design there are always a number of ways to meet the
> objectives. And the objectives include cost (mfg and development) and
> schedule as well as performance and power and that stuff. Northstar
> designers thought they could get a performance boost for OS400 at an
> affordable cost by building a two thread SMT that switched threads on a
> cache miss. Power4 thought they would be better off buying more silicon
> and putting two processor cores on the chip, with each core being more
> complicated than the Nstar core. Different strokes for different folks,
> as we used to say.
>
> In the case of Power5, I'm sure the question was how best to get a
> performance boost out of the chip using the enhanced density. Make it 4
> cores? Add some resource and convert the existing core to SMT? Build a
> new spiffy core with 57 stages in the pipe? 128 bit FPU?
>
> decisions decisions. So many transistors, so little time.
>
> del cecchi

Do you really think such drastic changes would occur? To me POWER4 looks
like the venerable P6 core: a really nice foundation that can have all
sorts of extras tricked onto it, but basically good enough for the next
five or more years. SMT is about as drastic a change as I'd expect.

What really happened from PPro through PII and PIII? Wasn't it just
minor changes and the addition of various poorly thought out MMX style
instructions?

Maynard

Robert Myers

unread,
Apr 13, 2004, 5:38:42 AM4/13/04
to
On Tue, 13 Apr 2004 01:14:23 -0700, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
>>
>> On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
>> <da...@elepar.com> wrote:

<snip>

>And, as I state above, I believe you are oversimplifying. The problem is
>not just (and maybe even not primarily) distributing the clock. It's
>allocating resources and routing data between them, all in the least
>amount of time and state changes. I believe that fine-grain allocation
>and routing is not the way to best accomplish that.
>

<snip>

>>
>> The primary advantage of closely-coupled threads on a single core is
>> that data sharing is not done through L2 cache or memory.
>
>So, it seems clear that you regard threads running on a single SMT core
>as being related differently (in terms of programming model) than those
>that might run anywhere else.
>

There are two sharply different ways you can use SMT (and even if the
world I'm describing is artificially black and white, indulge my
tendency to oversimplify for purposes of discussion). One of them is
to run two threads that are completely unrelated to one another (they
share as little data as possible and are at least uncorrelated in
demand on resources). That's a possible use of SMT, but it's not very
interesting to me. Deciding whether it's worth it or not probably
involves sitting through long meetings and coming up with an
inconclusive answer.

The other way to use threads (in this artificially black and white
universe) is to work with data that belong in the same cache and
shouldn't have to travel long distances to be used multiple times
because you have a (temporarily) small working set. When you enter
this small world, you can go through the instruction stream one step
at a time; you can, in effect, initiate threadlets and suspend them
with register renaming the way an OoO processor does; or you can just
jump into the pile of instructions and look for stuff you can do (I
hope you appreciate my use of terms of art).  The exciting possibility
of agile SMT is the latter.

There are many proposals floating around for how you parallelize a
nominally serial instruction stream. Forcing out long-latency memory
requests is the most obvious one and the one that has gotten the most
attention, but it's not the only possibility. Dick Wilmot has spoken
clearly and cogently for a more ambitious strategy, and I am sure
there are plenty more in the literature. For these strategies to
work, you need to be able to execute separate threads almost as
effortlessly as an OoO processor currently initiates, suspends, and
resumes execution of tiny groups of instructions.

If, in my suggesting creating a dataflow machine on the fly, you are
imagining a general dataflow architecture, with prospective dance
partners far apart on a crowded dance floor trying to find one
another, I can see why you would be horrified. I'm not suggesting
such a thing. I'm suggesting taking instructions and data that belong
close together in space and time and giving the processor more
flexibility as to how it executes them than a single pipe OoO
processor can. Just one more turn of the screw, as I said to Rupert
Pigott.

SMT as originally introduced in the P4 won't support the kind of
parallelism that I see as potentially extremely valuable, and I don't
know whether or not the thread synchronization instructions in SSE3
will change that situation at all, so I can see that I have probably
confused a lot of people.  I hope I've made myself a little clearer.

RM

Joe Seigh

unread,
Apr 13, 2004, 7:35:49 AM4/13/04
to

Maynard Handley wrote:
>
> In article <c5fvd5$em4$1...@osl016lin.hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
> > Maynard, in my simplistic view, this should be feasible today, using the
> > current 'lockless' primitives:
> >
> > Multiple threads on a single physical cpu will get _very_ good
> > performance using such primitives since all the relevant (L2) cache
> > lines will be exclusively owned, right?
> >
> > The real problem is the need for the OS to get out of the way, and in
> > particular not do a lot of the _really_ stupid things that still happen
> > today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
> > over all available cpus, instead of giving it a little cpu affinity by
> > default. :-(
>
> Well:
> * Yes the HW stuff is available in that load-locked/store-conditional
> will work. The problem is that you're not supposed to just use
> load-locked/store-conditional, you're also supposed to throw some sort
> of synchronizing instruction in there. I lose track of the details from
> CPU to CPU, but my understanding is that you COULD get away without that
> synchronizing instruction if you only cared about synching to this CPU.

...


>
> But at this point your life gets really tough. The language primitives
> (if you're using Java or COM) or the OS primitives (eg pthreads) really
> aren't set up to express [this data structure (mutex, semaphore,
> whatever) here needs to be synced with these local threads, and there
> needs to be synced with all threads]. What I mean is, there's no easy
> way to express that at this point I want the action of acquiring a mutex
> to involve only a load-locked/store-conditional loop WITHOUT the extra
> "publish to the other CPUs" work, or at this point I want the act of
> waiting for a mutex to involve a spin-loop, not an OS queue.
>

Part of the problem is that hardware designers have no idea how multi-threading
is actually used.  Take strongly coherent cache.  No correctly written portable
multi-threaded program depends on strongly coherent cache.  Yet hardware
designers insist that it is needed.  It's some kind of fetish thing as near as
I can figure out.

BTW, some of the mutex implementations are smart enough to spin when the holder
of the lock is running but otherwise suspend to avoid wasting cpu cycles. You
also might look into lock-free algorithms that allow forward progress without
regard to the status of other threads.

Joe Seigh

Stephen Fuld

unread,
Apr 13, 2004, 1:10:40 PM4/13/04
to

"del cecchi" <dcecchi...@att.net> wrote in message
news:c5fnle$13su7$1...@ID-129159.news.uni-berlin.de...

The distinction may be important to answering the question that Robert
posed. The Northstar multi threading was sefinitly multi threading, but it
wasn't *simultaneous* multi threading as there were instructions from only
one thread at a time at any given pipeline stage. I believe the technical
name for what NS did was "switch on event" (SOE) multithreading. In the NS
case, the event was an L2 cache miss.

Now the reason that may be relevant is that I expect converting a single
thread core to doing true SMT is more work on the core than doing SOEMT.
The scheduler is more complex, and you must add more stuff to keep track of
which thread things belong to. This requires additional development time.

Given that, it is reasonable to think that, given the requirements to get
something out, the Power 4 chose the simpler development of putting two
cores on the same die, thus allowing them to get the product out while doing
the, presumably harder, development work to get the core to be SMT. The
next logical step is to apply that work, the "SMTing" of the core, to the
existing Power 4 design to get to the next performance level.

Of course, this is pure speculation, but it seems reasonable.

--
- Stephen Fuld
e-mail address disguised to prevent spam


Nick Maclaren

unread,
Apr 13, 2004, 1:22:19 PM4/13/04
to
In article <kaVec.26739$i74.5...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>Now the reason that may be relevant is that I expect converting a single
>thread core to doing true SMT is more work on the core than doing SOEMT.
>The scheduler is more complex, and you must add more stuff to keep track of
>which thread things belong to. This requires additional development time.
>
>Given that, it is reasonable to think that, given the requirements to get
>something out, the Power 4 chose the simpler development of putting two
>cores on the same die, thus allowing them to get the product out while doing
>the, presumably harder, development work to get the core to be SMT. The
>next logical step is to apply that work, the "SMTing" of the core, to the
>existing Power 4 design to get to the next performance level.

That assumes that Eggers-style SMT DOES deliver "the next performance"
level over shared-cache SMP/CMP or MTA/Northstar-style SOEMT. One of
the consequences of additional complexity is that you may have to
omit optimisations in order to get correctness.

TANSTAAFL.


Regards,
Nick Maclaren.

Robert Myers

unread,
Apr 13, 2004, 7:43:26 PM4/13/04
to
On Mon, 12 Apr 2004 22:43:13 -0500, "del cecchi"
<dcecchi...@att.net> wrote:

>
>Actually, you might not have been hanging around here in Northstar days.
>It was a couple of processor iterations before Power4. Back in the Old
>Days. :-) And the SMT was sort of coarse, to the point that some here
>refused to call it that.
>

I was indeed not around for the original Northstar discussion, but I
do believe that the SOEMT aspect of Northstar has been discussed
whilst I have been present, much as the Dormouse was present in the
courtroom of the King and Queen of Hearts*. I don't think I had read
the Patterson paper at the time.

RM

*Well, at any rate, the Dormouse said-- the Hatter went on, looking
anxiously round to see if he would deny it too: but the Dormouse
denied nothing, being fast asleep.

Terje Mathisen

unread,
Apr 14, 2004, 1:51:58 AM4/14/04
to
Maynard Handley wrote:
> But at this point your life gets really tough. The language primitives
> (if you're using Java or COM) or the OS primitives (eg pthreads) really
> aren't set up to express [this data structure (mutex, semaphore,
> whatever) here needs to be synced with these local threads, and there
> needs to be synced with all threads]. What I mean is, there's no easy
> way to express that at this point I want the action of acquiring a mutex
> to involve only a load-locked/store-conditional loop WITHOUT the extra
> "publish to the other CPUs" work, or at this point I want the act of
> waiting for a mutex to involve a spin-loop, not an OS queue.

It isn't quite (or even nearly?) this bad:

All the lockless primitives (I assume you write your own code instead of
depending upon OS calls/libs!) can be used directly in your own code,
which means there's no need to involve the OS at all, or at least not
for thread/cpu syncing.

Anyway, the real reason this is a feasible way to go is that, due to
cache coherence, all of this will work transparently across threads
and/or cpus; the only difference is that a cpu-local lock means that the
L2 cache line will never need to change ownership, and this is crucial
for performance.

Joe Seigh

unread,
Apr 14, 2004, 7:17:39 AM4/14/04
to

Terje Mathisen wrote:
>
> All the lockless primitives (I assume you write your own code instead of
> depending upon OS calls/libs!) can be used directly in your own code,
> which means there's no need to involve the OS at all, or at least not
> for thread/cpu syncing.
>
> Anyway, the real reason this is a feasible way to go is that, due to
> cache coherence, all of this will work transparently across threads
> and/or cpus, the only difference is that a cpu-local lock means that the
> L2 cache line will never need to change ownership, and this is crucial
> for performance.

You'd still need memory barriers because of the out of order execution.
The performance hit from that isn't significant? Especially if you are
going for finer grained multi-threading.

Joe Seigh

Terje Mathisen

unread,
Apr 14, 2004, 9:51:29 AM4/14/04
to
Joe Seigh wrote:

All bus traffic is significant, but I'm not too worried about a write
barrier (WB) or two near each locked update.

I would like to know what's the most efficient WB on x86, i.e. cpuid is
serializing, but might not (probably cannot!) guarantee anything about
write order outside the current cpu.

What is Linux using these days?

Joe Seigh

unread,
Apr 14, 2004, 11:36:57 AM4/14/04
to

Terje Mathisen wrote:
>
> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> serializing, but might not (probably cannot!) guarantee anything about
> write order outside the current cpu.
>
> What is Linux using these days?
>

For general memory barrier I think they're using

lock add [esp + 0], 0

Not all pentiums have a real membar instruction yet.

Stores for x86 should be in order except for some of the special extended
instructions and the string stores.

Joe Seigh

Sander Vesik

unread,
Apr 14, 2004, 1:37:07 PM4/14/04
to
Stephen Fuld <s.f...@pleaseremove.att.net> wrote:
>
> The distinction may be important to answering the question that Robert
> posed.  The Northstar multi threading was definitely multi threading, but it
> wasn't *simultaneous* multi threading as there were instructions from only
> one thread at a time at any given pipeline stage. I believe the technical
> name for what NS did was "switch on event" (SOE) multithreading. In the NS
> case, the event was an L2 cache miss.

See, the problem is that when you go to shorten SOE MT you end up with SMT again ;-)

On a more serious note - why is the presence of instructions in a single
stage - and not the presence of instructions in the pipeline from different
threads - used as the metric? After all, both require you to tag results with
the thread and, even more, if there are execution units that do long async
computations - say a divider - you may *occasionally* have instructions from
two threads executing in the same pipeline stage.

--
Sander

+++ Out of cheese error +++

Eric

unread,
Apr 14, 2004, 1:19:42 PM4/14/04
to
Terje Mathisen wrote:
>
> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> serializing, but might not (probably cannot!) guarantee anything about
> write order outside the current cpu.

The P4 has LFENCE, SFENCE and MFENCE instructions, and a PTE
can be marked for weak ordered accesses, but I have not heard
of any OS that makes weak ordering available to apps.
(Going by their descriptions, the x86 SFENCE behaves differently
from an Alpha WMB, more like an Alpha MB.)

Eric

Eric

unread,
Apr 14, 2004, 1:18:01 PM4/14/04
to
Maynard Handley wrote:
>
> <snip>

>
> But at this point your life gets really tough. The language primitives
> (if you're using Java or COM) or the OS primitives (eg pthreads) really
> aren't set up to express [this data structure (mutex, semaphore,
> whatever) here needs to be synced with these local threads, and there
> needs to be synced with all threads]. What I mean is, there's no easy
> way to express that at this point I want the action of acquiring a mutex
> to involve only a load-locked/store-conditional loop WITHOUT the extra
> "publish to the other CPUs" work, or at this point I want the act of
> waiting for a mutex to involve a spin-loop, not an OS queue.
>
> <snip>

I don't follow you here. What you seem to be describing is an
ordinary cpu spinlock, but that would never be used by a thread.
If the mutex is not available then the acquiring thread needs to
yield and wait for the owner to release, and that involves the OS.

There is one situation where this is not strictly so, where the
lock duration is so short (e.g. a linked list push or pop) that
it can be worthwhile for the acquiring thread to try spinning for a
little while, just in case the lock is released quickly. However
since the owner thread could be rescheduled while holding the lock,
there is no point in spinning and awaiting release for too long.

Eric


Stephen Fuld

unread,
Apr 14, 2004, 1:46:07 PM4/14/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:c5h7kb$j4r$1...@pegasus.csx.cam.ac.uk...

> In article <kaVec.26739$i74.5...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
> >
> >Now the reason that may be relevant is that I expect converting a single
> >thread core to doing true SMT is more work on the core than doing SOEMT.
> >The scheduler is more complex, and you must add more stuff to keep track of
> >which thread things belong to.  This requires additional development time.
> >
> >Given that, it is reasonable to think that, given the requirements to get
> >something out, the Power 4 chose the simpler development of putting two
> >cores on the same die, thus allowing them to get the product out while doing
> >the, presumably harder, development work to get the core to be SMT.  The
> >next logical step is to apply that work, the "SMTing" of the core, to the
> >existing Power 4 design to get to the next performance level.
>
> That assumes that Eggers-style SMT DOES deliver "the next performance"
> level over shared-cache SMP/CMP or MTA/Northstar-style SOEMT.

No, it actually only assumes that the IBM people who made that decision
thought that it did. :-) I don't want to get into that argument again.
Since IBM tends to put out white papers/redbooks, etc. we may find out more
information on the reality as time passes.

Alexander Terekhov

unread,
Apr 14, 2004, 3:12:01 PM4/14/04
to

Joe Seigh wrote:
[...]

> Stores for x86 should be in order except for some of the special extended
> instructions and the string stores.

On Itanic, ordinary x86 loads have acquire semantics (hoist-
load/store mbar for subsequent loads/stores in the program order)
and ordinary stores have release semantics (sink-load/store mbar
for preceding loads/stores in the program order).

regards,
alexander.

Joe Seigh

unread,
Apr 14, 2004, 3:31:14 PM4/14/04
to

"sink" and Itanic are an appropriate conjunction.  It was nice that
Intel attempted to define their memory model semantics with examples,
but did anyone notice how bizarre some of those examples are?  None
of them are typical programming constructs.

Joe Seigh

Nick Maclaren

unread,
Apr 14, 2004, 3:50:55 PM4/14/04
to
In article <zNefc.23162$K_.6...@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>> >next logical step is to apply that work, the "SMTing" of the core, to the
>> >existing Power 4 design to get to the next performance level.
>>
>> That assumes that Eggers-style SMT DOES deliver "the next performance"
>> level over shared-cache SMP/CMP or MTA/Northstar-style SOEMT.
>
>No, it actually only assumes that the IBM people who made that decision
>thought that it did. :-) I don't want to get into that argument again.

A fair point - well, two of them :-)

>Since IBM tends to put out white papers/redbooks, etc. we may find out more
>information on the reality as time passes.

Maybe. Those books tend to be what is supposed to happen, and are
reliable when things have gone well, but can be pretty misleading
when they haven't.


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Apr 14, 2004, 4:53:35 PM4/14/04
to
Joe Seigh wrote:

Not guaranteed if you write to multiple locations in the same cache
line: they can be combined and then written in a single burst, even if
you wrote them in non-sequential order, afaik.

I.e. Write Combining memory ranges can even be used for graphics frame
buffers, where this specific feature can cause problems for
memory-mapped control ports in the same range.

Iain McClatchie

unread,
Apr 14, 2004, 5:04:33 PM4/14/04
to
Terje> I would like to know what's the most efficient WB on x86, i.e. cpuid is
Terje> serializing, but might not (probably cannot!) guarantee anything about
Terje> write order outside the current cpu.

What are you two talking about?

x86 processors have processor ordered memory semantics. Write barriers
aren't needed. Write order apparent to any other CPU is the program order
in the other thread.

This is not complicated. It's a simple semantic. CPU designers like this
simple semantic because it's easier to explain than nearly anything else,
and because they mostly get it for free with their OoO cores.

To help bury this topic, let me point out that the Alpha guys made a big deal
of not promising processor ordering. I don't see that it ever would have
bought them anything, except a lot of excitement among software weenies who
thought they had a new problem to think about.

So far as I know, processor ordering does not give you trivial locking, so
that's still an issue.

Joe Seigh

unread,
Apr 14, 2004, 5:41:38 PM4/14/04
to

Iain McClatchie wrote:
>
> Terje> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> Terje> serializing, but might not (probably cannot!) guarantee anything about
> Terje> write order outside the current cpu.
>
> What are you two talking about?

The x86 memory model.


>
> x86 processors have processor ordered memory semantics. Write barriers
> aren't needed. Write order apparent to any other CPU is the program order
> in the other thread.

processor order != program order

Processor order is used when discussing whether cache is transparent or not,
i.e. whether cache preserves processor order.

But x86 stores are in program order AFAIK except for the write combining stuff
which sounded to me like it was taking place in the processor's store buffers
not cache (from the Intel docs).

Joe Seigh

Alexander Terekhov

unread,
Apr 14, 2004, 6:21:59 PM4/14/04
to

Iain McClatchie wrote:
[...]

> What are you two talking about?

For starters,

http://www.crhc.uiuc.edu/ece412/papers/models_tutorial.pdf

regards,
alexander.

Maynard Handley

unread,
Apr 14, 2004, 7:10:49 PM4/14/04
to
In article <407D9276...@xemaps.com>,
Joe Seigh <jsei...@xemaps.com> wrote:


What are "the examples" you refer to above?
No-one is claiming that these are an important issue for "normal" code.
The point is that they ARE an important issue when you are trying to
write code that involves some sort of synchronization between 2 CPUs
that do not share a cache, for example if you are writing a mutex
acquire/mutex release primitive. If you have never written or pondered
such code and the various things that can go wrong in the sequence of
events, you're really not in a position to judge.

Maynard

Joe Seigh

unread,
Apr 14, 2004, 7:39:11 PM4/14/04
to

Intel Itanium Architecture Software Developer's Manual
Volume 2: System Architecture
section 2.2 Memory Ordering in the Intel Itanium Architecture

Actually I have. Numerous times. In a commercial operating system
even. I've even done a formal definition of memory model semantics.

Joe Seigh

Eric Gouriou

unread,
Apr 14, 2004, 8:35:39 PM4/14/04
to
Joe Seigh wrote:
> Maynard Handley wrote:
>
>>In article <407D9276...@xemaps.com>,
>> Joe Seigh <jsei...@xemaps.com> wrote:
>>
>>
>>>[...] It was nice that

>>>Intel attempted to define their memory model semantics with examples
>>>but did anyone notice how bizarre some of those examples are? None
>>>of them are typical programming contructs.
>>>
>>>Joe Seigh
>>
>>What are "the examples" you refer to above?
[...]

> Intel Itanium Architecture Software Developer's Manual
> Volume 2: System Architecture
> section 2.2 Memory Ordering in the Intel Itanium Architecture

That's not the best place to learn about those semantics. Instead
I'd recommend the formal specification document:
<URL:http://www.intel.com/design/Itanium/Downloads/251429.htm>

You should feel at home in the formalism.

Eric

Joe Seigh

unread,
Apr 14, 2004, 8:33:02 PM4/14/04
to

Eric Gouriou wrote:


>
> Joe Seigh wrote:
>