
Another place where there's a lot of money

Robert Myers

Apr 9, 2004, 11:40:24 PM

And, I might add, probably a perpetually insatiable appetite for
throughput.

http://www.nytimes.com/2004/04/10/technology/10GAME.html?pagewanted=1&hp

<quote>

Computer games represent one of the fastest-growing, most profitable
entertainment businesses. Making movies, by contrast, is getting
tougher and more expensive, now costing, with marketing fees, an
average of $103 million a film. That is one reason, among others, that
those with power in Hollywood are avidly seeking to get into the game
business while also reshaping standard movie contracts so they can
grab a personal share of game rights.

<snip>

Ridley Scott, best known for science fiction fantasies like "Blade
Runner" and "Alien" as well as the historical epic "Gladiator," has
been meeting with video game company executives, too, arguing that
games offer greater creative opportunities these days because they are
less expensive to make and not constrained by the roughly two-hour
time frame of a conventional movie.

"The idea that a world, the characters that inhabit it, and the
stories those characters share can evolve with the audience's
participation and, perhaps, exist in a perpetual universe is indeed
very exciting to me," said Mr. Scott, who is seeking a video game
maker to form a partnership with him and his brother Tony.

</quote>

The article moves on to make some cautionary comments, likening the
current world of game authors to that of comic book writers.

The talent will go where the money is. If computer games attract the
best creative talent (apologies to those who believe that comic books
are high art), the money will go there, too. We ain't see nuttin'
yet.

RM

john jakson

Apr 10, 2004, 11:03:47 AM
Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...

Now this is where MTA could really take off; I wouldn't be surprised
if it's already in there somewhere.

regards

johnjakson_usa_com

Robert Myers

Apr 10, 2004, 5:27:30 PM
On 10 Apr 2004 08:03:47 -0700, johnj...@yahoo.com (john jakson)
wrote:

>Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...

<snip>


>>
>> The talent will go where the money is. If computer games attract the
>> best creative talent (apologies to those who believe that comic books
>> are high art), the money will go there, too. We ain't see nuttin'
>> yet.
>>

>


>Now this is where MTA could really take off, I wouldn't be surprised
>if its already in there somewhere.
>

I think Ian McClatchie has explained to us that indeed it is:

>Ian>I think some of the research you are talking about is happening
>for the gaming market, right now.
>
>Robert> Hardware and software support for lightweight threads.
>
>Ian>Graphics chips spawn multiple "threads" PER CYCLE.
>
>Robert> Hardware and software support for streaming computation.
>
>Ian>These threads coordinate access to memories.
>
>Robert> Tools, strategies, and infrastructure to make specialized hardware
>Robert> like ASIC's more easily available for use by science.
>
>Ian>Not here quite yet, since the gaming guys do single precision and
>you folks want DP mostly. But check out www.gpgpu.org. I honestly
>think that there is a real chance the graphics folks are going to
>sneak up on and overwhelm the CPU guys for physics codes, maybe
>unintentionally.

That's why I can't take IBM's entries into the supercomputer market
altogether seriously. Their best technologies are going into games.
The golden rule, you know.

Servers? Supercomputers? x86? B-o-o-o-ring.

RM

Rupert Pigott

Apr 10, 2004, 7:04:33 PM

The other day I was grinding my teeth at the minimal level of
commitment to open-source from the 3D GFX card makers because
the bloody drivers locked up when I dragged a window AND I
wanted to run them on something other than x86 Linux.

Then it occurred to me : You could probably get some mileage
out of running WireGL/Chromium on a small BG/L style system on
a PCI-Express card. So there's one cool app that is games related
and could harness BG/L type nodes. Anything to stick two fingers
up at the 3D hardware vendors would be good right now. Unhappy
customer. :/

Cheers,
Rupert

Robert Myers

Apr 10, 2004, 8:09:35 PM

One way to read that, of course, is that they think they've really got
something worth hiding.

I just (like, last night, or, really, this morning) got my Promise
SATA RAID array running on a Promise-supplied driver compiled
from source (against an Enterprise Server SMP kernel, no less), and
(as you may or may not know), the history of Promise controllers and
Linux has not been a smooth one. It is, on the other hand, x86 Linux,
and I haven't yet booted from the RAID array.

The fact that Promise has completely caved and that there will be a
GPL driver native to the 2.6 Kernel fills me with hope.

I don't generally suffer from a shortage of self-confidence, but
graphics drivers is where I draw the line. I don't know what it is
_other_ than x86 Linux you are using, but, by the time you've gotten
to the desktop in a Linux system running X, you've got so many layers
and configuration files that serially redefine what the previous layer
defined that you're lucky if you've still got your sanity in hand.

I figure that using a GPU to do physics should, by comparison, be a
walk in the park.

>Then it occurred to me : You could probably get some mileage
>out of running WireGL/Chromium on a small BG/L style system on
>PCI-Express card. So there's one cool app that is games related
>and could harness BG/L type nodes. Anything to stick two fingers
>up at the 3D hardware vendors would be good right now. Unhappy
>customer. :/
>

It will never happen, of course, but as a way of throwing some
real-world, er, light ;-) on the whole issue, I'll bet you'd get top
billing on slashdot for such a project. That way, we could get the
discussion down to some meaningful benchmarks, like frame rate on
Quake.

One day of fame and a swamped server in return for a hefty slice of
your sanity? You just didn't seem quite _that_ far gone. ;-).

On the _other_ hand, a small BG/L system on a PCI-Express card sounds
like it should be a day and a half's work for Del's shop, and I'll bet
IBM would sell a few.

RM

john jakson

Apr 11, 2004, 12:54:34 AM
Robert Myers <rmy...@rustuck.com> wrote in message news:<51pg70t6bevebrvb2...@4ax.com>...


I have been speculating that MTA will replace RISC just as RISC replaced
CISC; now that's worth some flame.

I had this horrible idea that might not be so horrible.

Since x86 went from plain CISC to ride along with RISC (and beat all
of them in the process), I wonder if it could jump again onto MTA. I
see no reason why the actual instructions executed by the barrel
engine in the MTA could not be x86 codes or at least the RISC subset
that most compilers emit for.

In my design I already allow for some variable-length codes (16b x 1-4),
so that's not a problem. My branches cost near 0 cycles (1/8 cycle for
fetch ahead) taken or not. The execution unit can bind 1 or 2 branches
with a non-branch, so there is a speed boost there of about 10-30%; no
branch prediction of any sort is needed here.

An x86 MTA would probably need 1 or 2 extra pipe stages to resolve the
more complex 8b instruction encodings. The codes I would leave out can
be microcoded in firmware, and the penalty for those is not so bad if
a little help is given in HW trapping. Probably not so practical in
FPGA but worth some thinking. FPU is still a problem though. If FPGA
can hit 250MHz after P/R, then a full-effort custom VLSI should be
able to go 3-10x faster, limited only by the cycle rates of dual-port
RAM or 8b adds and similar.

It would combine all the benefits of MTA with all the benefits of
having all that code, OSes, compilers, etc. The details involved in
executing simplified x86 ops are really the same as for clean-sheet
ops in a new MTA. The question is which is harder: writing a compiler for
a blank ISA or just using x86 codes.

regards

johnjakson_usa_com

Robert Myers

Apr 11, 2004, 1:19:43 AM
On 10 Apr 2004 21:54:34 -0700, johnj...@yahoo.com (john jakson)
wrote:

<snip>


>
>I have been speculating that MTA will replace RISC just as it replaced
>CISC, now thats worth some flame.
>
>I had this horrible idea that might not be so horrible.
>
>Since x86 went from plain CISC to ride along with RISC (and beat all
>of them in the process), I wonder if it could jump again onto MTA. I
>see no reason why the actual instructions executed by the barrel
>engine in the MTA could not be x86 codes or atleast the RISC subset
>that most compilers emit for.
>

I've sort of taken it for granted that that's where x86 processors are
headed, but then, I'm not a computer architect. If you can't run the
pipeline faster because the energy costs of doing so are too high,
then you have to go to more pipelines. That's the paradigm shift or
whatever that Gelsinger was nattering on about.

There are only so many ways you can deploy multiple pipelines on a
single die. The only advantage that I can see to separate cores is
that the entire die doesn't have to be reachable in a single clock.
As a trade against that, you lose the possibility of close cooperation
among threads. I think we'll probably wind up with both (SMT and
CMP).

My reference to x86 as boring was really meant to refer to further
tweaks and die shrinks on the current architectures, not particularly
to the instruction set.

RM

David C. DiNucci

Apr 11, 2004, 3:27:49 AM
Robert Myers wrote:
> There are only so many ways you can deploy multiple pipelines on a
> single die. The only advantage that I can see to separate cores is
> that the entire die doesn't have to be reachable in a single clock.
> As a trade against that, you lose the possibility of close cooperation
> among threads. ...

I am apparently *still* (after literally all these years) missing
something very basic, so if someone has the time, maybe they can finally
help me get straight on this.

First, assuming that "separate cores" share no resources/stages, the
alternative is to share some resources/stages between threads, in which
case, to achieve the same maximum performance with multiple threads, the
clock needs to run fast enough to ensure that any such shared
resources/stages are not a bottleneck. That is one motivation for a high
clock which is not present for multiple cores (and it seems to me to be
a major one).

Second, what kind of "close cooperation" is currently possible between
threads which share a core? I recall that the early Tera MTA only
allowed threads to interact through memory, so unless that has changed
significantly in hyperthreaded architectures (and I don't recall seeing
that it had), I don't know what "possibility" you are losing with
multiple cores.

So if my assumptions above are correct, and if multiple cores can be as
effective at sharing caches and integrating with other aspects of the
memory subsystem as their hyperthreaded counterparts, I am still led to
the conclusion that multiple cores get their bonus points from lower
clock speed (and therefore less heat) for similar (or better)
performance when there are enough threads to keep those cores busy, and
hyperthreaded cores get their bonus points from (1) possibly requiring
less floorplan/cost to manufacture support for n threads, and (2) the
ability to dynamically and productively allocate otherwise idle
resources/stages to active threads. I assume that these play off
against one another--i.e. that the ability to run the multiple cores
slower also allows them to be built smaller/simpler (with the Forth chip
cited earlier as an extreme in that direction), but that if multiple
cores do end up significantly larger than the hyperthreaded
counterparts, that would affect their ability to share caches as
effectively.

My sense is that the motivation for hyperthreading relates almost
entirely to #2, because having lots of threads is the exception rather
than the rule, because of the widely accepted belief that PARALLEL
PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
pipelines.

> ...I think we'll probably wind up with both (SMT and CMP).

Even with my silly assumptions, I can buy that: Enough CMP to handle
the minimum number of threads expected to be active, and SMT (i.e.
dynamically-allocatable pipe stages) to provide headroom. What I don't
buy (if it was implied) is that the CMP and SMT will be present in the
same processors, since there are good reasons for them to run at
different clock speeds. In fact, I might imagine SMTs (at high temp)
sprinkled among the CMPs (at lower temp) to even out the cooling.

-- Dave

Robert Myers

Apr 11, 2004, 9:24:13 AM
On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
>> There are only so many ways you can deploy multiple pipelines on a
>> single die. The only advantage that I can see to separate cores is
>> that the entire die doesn't have to be reachable in a single clock.
>> As a trade against that, you lose the possibility of close cooperation
>> among threads. ...
>
>I am apparently *still* (after literally all these years) still missing
>something very basic, so if someone has the time, maybe they can finally
>help me get straight on this.
>

I had forgotten you were in the Nick Maclaren camp on this one (might
as well put it that way, because that's the way I think of it). I had
(believe it or not) thought of starting the paragraph to which you are
replying more or less the way you started yours, which is that I
*still* don't understand the case for separate cores (except the one I
mentioned). Were die layout not a problem, you'd put everything in
the same place, and let who belonged to what sort itself out as
needed. Die layout, of course, is a consideration.

>First, assuming that "separate cores" share no resources/stages, the
>alternative is to share some resources/stages between threads, in which
>case, to achieve the same maximum performance with multiple threads, the
>clock needs to run fast enough to ensure that any such shared
>resources/stages are not a bottleneck. That is one motivation for high
>clock which is not present for multiple cores (and it seems to me to be
>a major one).
>

If you try to run helper threads on separate cores, you lose most of
the advantages because of the overhead of communicating through L2.
If you don't share L2, there is no point at all in helper threads.

I don't get your argument for a fast clock at all. You don't balance
resources by speeding up the clock. You balance resources by
balancing resources. Your best chance of getting an ideal match is
when all resources are in one big pool and can be assigned as needed
to the workload that requires them. Your worst chance of getting an
ideal match is to have the resource pool arbitrarily divided into
pieces. Simple queuing theory, as someone put it.

>Second, what kind of "close cooperation" is currently possible between
>threads which share a core? I recall that the early Tera MTA only
>allowed threads to interact through memory, so unless that has changed
>significantly in hyperthreaded architectures (and I don't recall seeing
>that it had), I don't know what "possibility" you are losing with
>multiple cores.
>

Tera MTA is not a good way to think about the strategies that are
being discussed. The idea of Tera MTA was to have a slew of
_separate_ threads that advanced only every n clocks so that every
single instruction could have a latency of n without any other magic
at all.

There is a paper I have cited so many times that I get tired of
looking it up that describes one of several strategies for using
helper threads to, in effect, expand the run-time scheduling window of
Itanium without introducing OoO scheduling. The one particular paper
examines using helper threads with SMT and CMP, and CMP loses big
time. Helper threads, by definition, are trying to advance a primary
thread and are not conceptually separate as would be the threads in
Tera MTA.

The last time I mentioned the paper in question, Nick's response was,
more or less, "Oh that. Well, we've known for decades that you could
get that kind of speedup for a helper thread--only in privileged
mode." [Insert here a tart response to the "well we've known for
decades" line that I'm composing and going to attach to a hot key].

"Okay," I responded, drawing my breath in slowly, "what is it that we
need to do so that people can do these kinds of things in user space,
since I don't think the dozens of researchers publishing papers on
these strategies have been planning on running everything in
privileged mode." At which point Nick responded with a list of
requirements that look achievable to someone who is not a professional
computer architect.

>So if my assumptions above are correct, and if multiple cores can be as
>effective at sharing caches and integrating with other aspects of the
>memory subsystem as their hyperthreaded counterparts, I am still led to
>the conclusion that multiple cores get their bonus points from lower
>clock speed (and therefore less heat) for similar (or better)
>performance when there are enough threads to keep those cores busy,

As I've stated, if you could put everything in one place, there would
be no advantage at all to multiple cores (other than amortizing NRE by
printing cores on dies like postage stamps). Since you can't put
everything in one place, there is a balance between acceptable heat
distribution, die size, and being able to reach everything in a single
core in a single clock (although I suspect that even that constraint
is going to go by the board at some point).

>and
>hyperthreaded cores get their bonus points from (1) possibly requiring
>less floorplan/cost to manufacture support for n threads, and (2) the
>ability to dynamically and productively allocate otherwise idle
>resources/stages to active threads. I assume that these play off
>against one another--i.e. that the ability to run the multiple cores
>slower also allows them to be built smaller/simpler (with the forth chip
>sited earlier as an extreme in that direction), but that if multiple
>cores do end up significantly larger than the hyperthreaded
>counterparts, that would affect their ability to share caches as
>effectively.
>

The primary advantage of closely-coupled threads on a single core is
that data sharing is not done through L2 cache or memory.

>My sense is that the motivation for hyperthreading relates almost
>entirely to #2, because having lots of threads is the exception rather
>than the rule, because of the widely accepted belief that PARALLEL
>PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
>pipelines.
>

I think you are doing yourself and Software Cabling a disservice by
adopting this posture. While in theory you could use the machinery of
Software Cabling to address any level of parallelism, its natural
target is coarse-grained parallelism. The programmer can think
locally von Neumann, and let SC take care of the parallel programming
stuff globally. Helper threads are a way of letting the processor or
processor and supporting software do the fine-grained parallelism.

RM

john jakson

Apr 11, 2004, 11:27:31 AM
Robert Myers wrote in message

snipping

> I've sort of taken it for granted that that's where x86 processors are
> headed, but then, I'm not a computer architect. If you can't run the
> pipeline faster because the energy costs of doing so are too high,
> then you have to go to more pipelines. That's the paradigm shift or
> whatever that Gelsinger was nattering on about.
>

MTA doesn't have long pipelines; it's best to think of multiple wheels,
each with N faces. N might be 4, 8, or 16, but any small int may suffice.
Each of the N processes is different and is identified by a Pid. N is chosen
for the necessary pipeline depth needed to sustain reg-to-reg typical int
operations at highest speed. This means that if a P returns N cycles
later, it can pick up its previous results, saving some cache b/w. The
multiple wheels may carry the same Pids or different Pid sets.

The wheels have different functions: those on the instruction-fetch
side of the queue maintain the PC and push N ops terminated by
branches into the N queues faster than needed. Those on the exec side
are for int and branch cc ops, and possibly fpu, dsp, or whatever. The
retired bra sends back offsets to the fetch unit. A process can't be
in fetch and exec at the same time, so the 2 sides are coupled only
through the exchange of bra targets, i.e. what to do next.

Is the pipeline N, or is it N*noOfWheels? What about the queue: an op may
be fetched ahead and wait for 60 or so cycles before it goes to exec. I'd
call it an N-stage pipeline with maybe 60 latency stages.

regards

johnjakson_usa_com

Rupert Pigott

Apr 11, 2004, 12:16:55 PM
Robert Myers wrote:

> I had forgotten you were in the Nick Maclaren camp on this one (might
> as well put it that way, because that's the way I think of it). I had
> (believe it or not) thought of starting the paragraph to which you are
> replying more or less the way you started yours, which is that I
> *still* don't understand the case for separate cores (except the one I
> mentioned). Were die layout not a problem, you'd put everything in
> the same place, and let who belonged to what sort itself out as
> needed. Die layout, of course, is a consideration.

Hardware Schmardware. ;)

I've come to think that the real challenge is choosing a parallel
coding model that has as many lives as the von Neumann cat. The von
Neumann pussy needs to run out of lives first, and it appears to
have happened already. The current generation of killer micros are
basically a bunch of parallel pussies, I think the software will
follow, albeit slowly and painfully as it always has done. I have
some hope that I'll be quite happy with the state of affairs in a
couple of years time.

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 1:12:27 PM
On Sun, 11 Apr 2004 17:16:55 +0100, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

>Robert Myers wrote:
>
>> I had forgotten you were in the Nick Maclaren camp on this one (might
>> as well put it that way, because that's the way I think of it). I had
>> (believe it or not) thought of starting the paragraph to which you are
>> replying more or less the way you started yours, which is that I
>> *still* don't understand the case for separate cores (except the one I
>> mentioned). Were die layout not a problem, you'd put everything in
>> the same place, and let who belonged to what sort itself out as
>> needed. Die layout, of course, is a consideration.
>
>Hardware Schmardware. ;)
>

Except that, especially with OoO, hardware has done much of the heavy
lifting. As annoying as it may be to the real hardware architects to
have a bunch of software types going on about how hardware should be
designed, it's a subject that we software types simply cannot ignore.

I take a lot of heat, some of it clearly good-natured, about my
fixation on the Cray-1, but that was a machine that you could not
program effectively without understanding how it worked. Not only
that, the programming model was simple enough that I could get my
feeble brain around it quickly enough and get on with the physics.

Either life was a lot simpler then, or the world awaits another Seymour
to cut through the clutter and find the right choice of tricks that's
worth the bother and that can be utilized in actual practice. The
right answer is probably not that it's either/or, but both.

>I've come to think that the real challenge is choosing a parallel
>coding model that has as many lives as the von Neumann cat. The von
>Neumann pussy needs to run out of lives first, and it appears to
>have happened already. The current generation of killer micros are
>basically a bunch of parallel pussies, I think the software will
>follow, albeit slowly and painfully as it always has done. I have
>some hope that I'll be quite happy with the state of affairs in a
>couple of years time.
>

Once you could see the attack of the killer micros coming, _most_ of
what was going to happen could be foreseen with a five-and-dime crystal
ball, _including_ the fact that many processes would be running in
parallel. Two decades of fiddling and fumbling have passed, and you
think a couple of years is going to bring software bliss? Have you
been celebrating a religious holiday in some highly non-traditional
fashion? ;-).

RM

Rupert Pigott

Apr 11, 2004, 1:36:30 PM
Robert Myers wrote:

[SNIP]

> Once you could see the attack of the killer micros coming, _most_ of
> what was going to happen could be forseen with a five and dime crystal
> ball, _including_ the fact that many processes would be running in
> parallel. Two decades of fiddling and fumbling have passed, and you
> think a couple of years is going to bring software bliss? Have you
> been celebrating a religious holiday in some highly non-traditional
> fashion? ;-).

The key difference is : parallelism has been pervasive in hardware
that everyday programmers can get their hands on for a good decade
now. You can see the attitude changing: threads are hip and cool, and
if you look at the job ads on this side of the pond you'll see a
lot of them making explicit mention of "Threading Skills" in the
job specs. What's more : The hardware is actually getting even more
parallel too, and in the not-too-distant future SMP (or at least
SMT) systems will be common as muck.

So while I think shared memory threads are the spawn of Satan, I
don't mind the side effect that they give me lots of parallel HW
to play with. This HW will probably run just as well with my CSP
monkey business as it will with shared memory threads too.

If you really want the Cray simplicity I think you are barking up
the wrong tree with SMT. The threads of execution are intricately
coupled by (userspace-invisible) implementation-dependent resource
contention issues. I suspect that this helper-thread concept will
pan out the same way, but that doesn't stop someone from proving me
wrong.

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 2:33:55 PM
On Sun, 11 Apr 2004 18:36:30 +0100, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

<snip>


>
>If you really want the Cray simplicity I think you are barking up
>the wrong tree with SMT. The threads of execution are intricately
>coupled by (userspace invisible) implementation dependant resource
>contention issues. I suspect that this helper thread concept will
>pan out the same too, that doesn't stop someone from proving me
>wrong though.
>

No, the days of sweetly crystalline Cray programming are gone forever.

Only people like Terje and Linus (what is it with these guys from
Scandinavia, anyway?) worry about the real effects of OoO on a
day-to-day basis. The rest of us simply accept that the processor
somehow gets away with running twenty times as fast as the memory bus
and get on with business.

When I mentioned my five and dime crystal ball, the one thing I don't
think most people would have been able to foresee was the ability to
keep a couple of hundred instructions in flight by trickery that's
invisible to all but the most detail-oriented of programmers.

My crystal ball says that helper threads are going to work the same
way. The helper threads that apparently so horrify you are just
another turn of the screw. If you've got a few hundred instructions
in flight with predication, speculation, hoisting, and fixup already,
what's a few hundred more here or there?

A bitch to debug? I would imagine so, but I just _know_ you're not
going to tell me that CSP threads are easy.

RM

Rupert Pigott

Apr 11, 2004, 5:44:53 PM
Robert Myers wrote:

> A bitch to debug? I would imagine so, but I just _know_ you're not
> going to tell me that CSP threads are easy.

Fundamentally you still have to tackle the same synchronisation
issues that the problem throws up, but you don't add a whole bunch
of *hidden* coupling (eg : memory contention).

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 7:52:47 PM

I _think_ you're worried about a non-problem, but I'm willing to be
educated. Helper threads don't introduce user-visible separate
threads or user-visible memory sharing, and the processor maintains
the illusion of in-order execution.

Things can go horrifyingly wrong for hardware architects and compiler
designers, but it's up to them to see to it that life is no harder for
end-users than life would be without helper threads.

Once you've lost deterministic execution--and you have with OoO--you
can't rely on single-step debugger execution to expose what's actually
going on, but that's not a problem that's newly-introduced by helper
threads.

RM

Hank Oredson

Apr 11, 2004, 8:56:34 PM
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:cp2j701vcij39q676...@4ax.com...

And this all happens on each of the 64 processors on the motherboard.

But maybe the 32 processor chip, two per box, will not happen quite
as soon as I think it will ...

> A bitch to debug? I would imagine so, but I just _know_ you're not
> going to tell me that CSP threads are easy.
>
> RM


--

... Hank

http://horedson.home.att.net
http://w0rli.home.att.net


del cecchi

Apr 11, 2004, 11:17:43 PM

"David C. DiNucci" <da...@elepar.com> wrote in message
news:4078F375...@elepar.com...

> Robert Myers wrote:
> > ...I think we'll probably wind up with both (SMT and CMP).
>
> Even with my silly assumptions, I can buy that: Enough CMP to handle
> the minimum number of threads expected to be active, and SMT (i.e.
> dynamically-allocatable pipe stages) to provide headroom. What I
don't
> buy (if it was implied) is that the CMP and SMP will be present in the
> same processors, since there are good reasons for them to run at
> different clock speeds. In fact, I might imagine SMTs (at high temp)
> sprinkled among the CMPs (at lower temp) to even out the cooling.
>
> -- Dave

You guys got a heck of a crystal ball. It can see all the way to 2003.
:-)
http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf


Robert Myers

Apr 12, 2004, 1:38:00 AM

A fact I took note of as recently as April 2, under the nominal
subject heading Re: [OT] Microsoft aggressive search plans revealed:

RM>Power 4 isn't multi-threaded, but Power 5 is, and I'll start to get
RM>really excited when this new openness makes a multi-threaded core
RM>available to play with as IP. ;-).
RM>
RM>(I know, Del, there's just no pleasing some people).

Slide Number 12 and the slides before it make it clear that the Power
5 implementation of SMT addresses the issue of what to do when having
multiple threads active is a disadvantage (has Nick said anything
about that?). Slide 11 shows an example with a max payoff of 25%,
pretty well in line with what has most often been reported with Intel
Hyperthreading.

RM

Maynard Handley

Apr 12, 2004, 5:14:45 PM
In article <4078F375...@elepar.com>,

"David C. DiNucci" <da...@elepar.com> wrote:

> My sense is that the motivation for hyperthreading relates almost
> entirely to #2, because having lots of threads is the exception rather
> than the rule, because of the widely accepted belief that PARALLEL
> PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
> pipelines.
>
> > ...I think we'll probably wind up with both (SMT and CMP).
>
> Even with my silly assumptions, I can buy that: Enough CMP to handle
> the minimum number of threads expected to be active, and SMT (i.e.
> dynamically-allocatable pipe stages) to provide headroom. What I don't
> buy (if it was implied) is that the CMP and SMP will be present in the
> same processors, since there are good reasons for them to run at
> different clock speeds. In fact, I might imagine SMTs (at high temp)
> sprinkled among the CMPs (at lower temp) to even out the cooling.

It is not useful to say that PARALLEL PROGRAMMING IS HARD when there are
very different things being discussed.
At the per-chip level, the issue is whether it is hard to have a CPU
running four threads or so at a time. At the Robert Myers level, the
issue is whether it is hard to have a computer running 10,000 threads at
a time. I can well believe that the second is a very hard problem.
The first, however, is right now an artificially hard problem. It is
artificially hard because while there are frequently plenty of fragments
of code that can be parallelized, they are maybe 1000 instructions long,
so the OS overhead in doing so makes the parallelization impractical.
The overhead comes from
* CPU level overhead that assumes modifications made in this thread need
to propagate out (to another CPU) and the CPU has to wait till that has
happened, and
* OS overhead that assumes that it's an unacceptable waste of resources
to have a (virtual) processor sit idle, and so enforces switches into
and out of kernel, moving threads onto and off run lists and so on, as
synchronization resources are acquired and released.

Now if we update the programming model to describe that this collection
of threads is tightly coupled and should run as a single unit on a
single physical CPU, we can get rid of the CPU level overhead because
we'd know that the mods made by thread 0A are in the cache of CPU 0 and
will be seen by thread 0B without any extra hard work on the part of the
CPU. The problem, of course, now is that one has to annotate the code
correctly to specify that THIS sync operation only involves syncing with
THAT thread, so is local, while some other sync operation is global and
needs to be propagated to all CPUs. One can see a programming model that
works well with say 2 or 4 threads but falls apart after that.
Likewise we need to be able to tell the OS that these two threads are
tightly coupled and when one waits on the other, the wait should occur
through some hardware wait mechanism, with the waiting thread going into
an HW idle state of some sort, rather than context switching to some
alternative waiting task.

Both of these are feasible, in the sense that it's not too much of a
stretch to modify the HW and SW to get them to happen.
The PROBLEM, really, is that they aren't very fwd or bwd compatible.
They're not fwd compatible because the mechanisms they use to try to be
efficient (to make threading small fragments of code worthwhile) fall
apart when you have too many threads to keep track of and too many
interactions --- it's just too hard to ensure that these memory updates
only affect the threads local to this CPU and don't need to be
propagated to other CPUs. And they're not backward compatible because
code written assuming that it can run as two physical threads, with no
overhead cost for threading code fragments a thousand cycles long, will
run like a slug on a single-threaded CPU or a standard DP machine, with
OS intervention every 1000 cycles to switch to the other thread.

So that's the problem as I see it.
Maybe someone smarter than me can construct some primitives that do
allow scaling (in at least either the fwd or bwd direction) that will
allow one to thread these tiny fragments of code. But in the absence of
that, life sucks. The issue is not that the parallelism doesn't exist
but that one cannot usefully get to it through today's HW and OS
abstractions.

Maynard

Stephen Sprunk

Apr 12, 2004, 5:30:25 PM
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:en9k70pm2khm2co4r...@4ax.com...

> On Sun, 11 Apr 2004 22:17:43 -0500, "del cecchi"
> <dcecchi...@att.net> wrote:
> >You guys got a heck of a crystal ball. It can see all the way to 2003.
> >:-)
> >http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf
>
> Slide 11 shows an example with a max payoff of 25%,
> pretty well in line with what has most often been reported with Intel
> Hyperthreading.

And with slide 5 saying that SMT adds 24% to the size of each core (as
opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
provides no net gain in performance per mm^2.
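
Spelling that out as a back-of-the-envelope check (a rough reading, taking
the roughly 25% maximum payoff quoted above against the 24% area figure):

  1.25 / 1.24 = ~1.008, i.e. less than a 1% gain in throughput per mm^2
  even at the quoted maximum.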

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Tony Nelson

Apr 12, 2004, 6:05:43 PM
In article <07c94d238480b9ff...@news.teranews.com>,
"Stephen Sprunk" <ste...@sprunk.org> wrote:
...

> And with slide 5 saying that SMT adds 24% to the size of each core (as
> opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
> provides no net gain in performance per mm^2.

Nowadays I bet designers are happy when they get performance increases
proportional to area increases.
____________________________________________________________________
TonyN.:' tony...@shore.net
'

Robert Myers

Apr 12, 2004, 6:33:51 PM
On Mon, 12 Apr 2004 21:30:25 GMT, "Stephen Sprunk"
<ste...@sprunk.org> wrote:

>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:en9k70pm2khm2co4r...@4ax.com...
>> On Sun, 11 Apr 2004 22:17:43 -0500, "del cecchi"
>> <dcecchi...@att.net> wrote:
>> >You guys got a heck of a crystal ball. It can see all the way to 2003.
>> >:-)
>> >http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf
>>
>> Slide 11 shows an example with a max payoff of 25%,
>> pretty well in line with what has most often been reported with Intel
>> Hyperthreading.
>
>And with slide 5 saying that SMT adds 24% to the size of each core (as
>opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
>provides no net gain in performance per mm^2.
>

A study c. 1996 by Patterson et al. showed even OoO processors
stalled 60% of the time on OLTP workloads, making OLTP workloads a
natural target for SMT. Whether IBM got it right or not this time
seems almost beside the point. It amazes me it took them this long to
get into the game.

RM

Robert Myers

Apr 12, 2004, 7:18:54 PM
On Mon, 12 Apr 2004 21:14:45 GMT, Maynard Handley
<nam...@redheron.com> wrote:

<snip>


>
>So that's the problem as I see it.
>Maybe someone smarter than me can construct some primitives that do
>allow scaling (in at least either the fwd or bwd direction) that will
>allow one to thread these tiny fragments of code. But in the absence of
>that, life sucks. The issue is not that the parallelism doesn't exist
>but that one cannot usefully get to it through todays HW and OS
>abstractions.
>

So, you've got some dumb-ass von Neumann module that you want to run a
few thousand copies of. You modify the API, however it looks, to fit,
say, software cabling. Software cabling doesn't know or care what's
inside the module, it just knows and enforces the rules under which
the module can be invoked and how the module is allowed to transmit
and receive data. Software cabling invokes the module, assigns it to
a multi-threaded processor, and the magic of transparent
parallelization takes over. Software hasn't got a clue that there's
even more parallelization going on inside the box. Simple, no?

RM

del cecchi

Apr 12, 2004, 8:47:42 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:o46m70tusl425tiog...@4ax.com...
You will recall, I'm sure, that the farmers on the tundra had
multithreading (2) on Northstar. Yes, I recall that some in comp.arch
said it is not "real" multithreading because it was somewhat coarser
grained, switching threads on a cache miss. In the Power4 it was
decided to have two whole cores rather than two threads in one core.
Now in the Power5 they have upgraded the cores to have multithreading.

It's a funny thing. There is at least one department of folks whose job
it is to study the effects of microarchitecture trade-offs on
performance. They are armed with a great deal of information about
workloads and with special simulators and well honed models. Too bad
they can't do as well as some spectator monday morning quarterback. I
guess they aren't smart enough. Or maybe the chip has to get out on
schedule.

del cecchi


Robert Myers

Apr 12, 2004, 9:24:33 PM
On Mon, 12 Apr 2004 19:47:42 -0500, "del cecchi"
<dcecchi...@att.net> wrote:

>You will recall, I'm sure, that the farmers on the tundra had
>multithreading (2) on Northstar.

No, I didn't.

>Too bad
>they can't do as well as some spectator monday morning quarterback. I
>guess they aren't smart enough. Or maybe the chip has to get out on
>schedule.
>

You can read it that way if you want to, or you can read it as my
saying there's something about this I don't understand. The trades
are obviously complicated: people have taken hard stands on both sides
of the issue.

Neither IBM nor Intel says, "This is the real reason this chip does or
does not have this particular feature." Marketing will say stuff, of
course, but I don't think you would expect anyone to pay much
attention to it.

The chip comes out, and it's left to Monday morning quarterbacks to
try to figure out what's really going on. Hyperthreading for Intel is
kind of marginal. It's a natural thing for Intel to try, because it's
a way to try to recapture some of the IPC they gave up in going to a
longer pipeline.

My (Monday morning quarterback) belief, though, is that much more
significant hyperthreading is their game plan for Itanium, and that
Hyperthreading was really an R&D project for Itanium with a little
marketing pizazz as a gimme. In particular, since hyperthreading has
come out for the P4, it has been all but useless for a big chunk of P4
applications. That may change, especially with the new instructions
in SSE3.

For IBM, the story is different. As I understand SMT, it's a clear
win for OLTP workloads, and IBM won't have boxes in CompUSA with "SMT
included" on them. That means that SMT is included or not included
strictly on the merits, unless, like Intel, they are looking more to
the future than to the present.

The fact that IBM has had SMT, then didn't have SMT, and now has SMT,
all with OLTP presumably as the target application, doesn't leave me
thinking I'm smarter than the guys with all the numbers. It leaves me
wondering if I know what this is really all about.

RM

del cecchi

Apr 12, 2004, 11:43:13 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:guem70h7s9tmdjic8...@4ax.com...

Actually, you might not have been hanging around here in Northstar days.
It was a couple of processor iterations before Power4. Back in the Old
Days. :-) And the SMT was sort of coarse, to the point that some here
refused to call it that.

I think that on any design there are always a number of ways to meet the
objectives. And the objectives include cost (mfg and development) and
schedule as well as performance and power and that stuff. Northstar
designers thought they could get a performance boost for OS400 at an
affordable cost by building a two thread SMT that switched threads on a
cache miss. Power4 thought they would be better off buying more silicon
and putting two processor cores on the chip, with each core being more
complicated than the Nstar core. Different strokes for different folks,
as we used to say.

In the case of Power5, I'm sure the question was how best to get a
performance boost out of the chip using the enhanced density. Make it 4
cores? Add some resource and convert the existing core to SMT? Build a
new spiffy core with 57 stages in the pipe? 128 bit FPU?

decisions decisions. So many transistors, so little time.

del cecchi


Terje Mathisen

Apr 13, 2004, 1:55:48 AM

Maynard, in my simplistic view, this should be feasible today, using the
current 'lockless' primitives:

Multiple threads on a single physical cpu will get _very_ good
performance using such primitives since all the relevant (L2) cache
lines will be exclusively owned, right?

The real problem is the need for the OS to get out of the way, and in
particular not do a lot of the _really_ stupid things that still happen
today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
over all available cpus, instead of giving it a little cpu affinity by
default. :-(

As soon as you have more cpu-hogging threads than you have physical
resources for, you'll lose, but allocating N-1 threads on an N-thread
machine, and then assuming the OS will be clueful enough to keep them
stable isn't too much to expect, is it?

When you need to setup an NxM array, with N threads on each of M cpus,
then you need a little more help, but as long as you're willing to (more
or less) manually allocate threads to (virtual) cpus that should be enough.
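
For what it's worth, a minimal Win32 sketch of that manual allocation;
the choice of logical CPU 0 is arbitrary, and exactly which core/thread
pair that number lands on is the numbering question below.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    DWORD_PTR mask, prev;

    GetSystemInfo(&si);
    printf("%lu logical CPUs reported\n", si.dwNumberOfProcessors);

    /* Pin the calling thread to logical CPU 0 so the scheduler stops
     * ping-pong'ing it across the machine. */
    mask = (DWORD_PTR)1 << 0;
    prev = SetThreadAffinityMask(GetCurrentThread(), mask);
    if (prev == 0)
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n",
                GetLastError());

    /* ... cpu-hogging work goes here ... */
    return 0;
}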

OTOH, so far I haven't found any way (under WinXP) to even figure out
how my cpus are numbered!

Is it (0,1),(2,3) for (cpu0-thread0, cpu0-thread1),(cpu1-thread0,
cpu1-thread1) or does it start by going across the physical cpus:

(cpu0-thread0, cpu1-thread0),(cpu0-thread1, cpu1-thread1)

Any hints?

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

David C. DiNucci

Apr 13, 2004, 4:14:23 AM
Robert Myers wrote:
>
> On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
> <da...@elepar.com> wrote:
> >I am apparently *still* (after literally all these years) still missing
> >something very basic, so if someone has the time, maybe they can finally
> >help me get straight on this.
> >
> I had forgotten you were in the Nick Maclaren camp on this one (might
> as well put it that way, because that's the way I think of it).

I am willing to have it characterized as so, though I have no doubts
that my thinking disagrees with Nick's in at least some important ways,
and I have no intention of putting words into his mouth (or thoughts
into his fingers).

> ... I had


> (believe it or not) thought of starting the paragraph to which you are
> replying more or less the way you started yours, which is that I
> *still* don't understand the case for separate cores (except the one I
> mentioned). Were die layout not a problem, you'd put everything in
> the same place, and let who belonged to what sort itself out as
> needed. Die layout, of course, is a consideration.

Describing the issue as "die layout" makes it sound more specific than
it really is. The issue is the latency required to schedule resources
to a thread and then the related issue of moving data between those
resources. You seem to be advocating allocating resources at a very fine
granularity, and then dynamically routing the data to and from them as
needed. This is essentially a traditional dataflow philosophy, and that
might explain any perceived similarity between Nick's aversion to it and
my own: We've seen it before. It's easy to go in pretending that
there's no latency because each portion is so tiny, but you experience
that latency with every operation, or in this case, every functional
unit, for both scheduling and routing. It's death by a million cuts.
Then you hope that you can hide that latency by parallelism, and maybe
you can to some extent, but the parallelism would still do you more good
if you didn't have the latency.

Pipelines are nice because you schedule several resources at a time
(i.e. all the stages in the pipe), yet you share them at a finer
granularity (one stage at a time), and since the sharing follows a fixed
pattern, successive stages (in time) can be made close to one another
(in space). To the extent that streaming architectures can be considered
as a configurable pipe, that's great, but even then you miss out on some
of those spacetime proximities.

Maybe this is the basis of our different views on this subject. You
consider the functional units more as a pool of resources allocatable at
will, while I regard them more as a pipe. If you share some stages of a
pipe, those stages need to operate faster if they are not to be a
bottleneck. Even *I* can see the merit of tossing instructions from
different threads into a single pipe, assuming that there would be
bubbles otherwise, but that's apparently not what we're discussing.

> If you try to run helper threads on separate cores, you lose most of
> the advantages because of the overhead of communicating through L2.
> If you don't share L2, there is no point at all in helper threads.

This is the very reason I stated my cache-sharing assumption below--i.e.


> >So if my assumptions above are correct, and if multiple cores can be as
> >effective at sharing caches and integrating with other aspects of the

> >memory subsystem as their hyperthreaded counterparts, ...

But returning back:

> I don't get your argument for a fast clock at all. You don't balance
> resources by speeding up the clock. You balance resources by
> balancing resources. Your best chance of getting an ideal match is
> when all resources are in one big pool and can be assigned as needed
> to the workload that requires them. Your worst chance of getting an
> ideal match is to have the resource pool arbitrarily divided into
> pieces. Simple queuing theory, as someone put it.

Except you're leaving the queues (pipes) out of your queuing theory. If
I remember, you were the one talking about the importance of minimizing
data movement, etc., and now you seem to be advocating routing data
almost arbitrarily to get it from one FU to the next. And you are
moving and changing more state in order to accommodate that dynamic
routing. I'm getting flashbacks to tags and matching store in dataflow
machines.

And the chips get hotter and hotter.

<snip>


> "Okay," I responded, drawing my breath in slowly, "what is it that we
> need to do so that people can do these kinds of things in user space,
> since I don't think the dozens of researchers publishing papers on
> these strategies have been planning on running everything in
> privileged mode." At which point Nick responded with a list of
> requirements that look achievable to someone who is not a professional
> computer architect.

So, I read this as an explanation of your earlier statement regarding
"the possibility of close cooperation among threads".

> >So if my assumptions above are correct, and if multiple cores can be as


> >effective at sharing caches and integrating with other aspects of the
> >memory subsystem as their hyperthreaded counterparts, I am still led to
> >the conclusion that multiple cores get their bonus points from lower
> >clock speed (and therefore less heat) for similar (or better)
> >performance when there are enough threads to keep those cores busy,
>
> As I've stated, if you could put everything in one place, there would
> be no advantage at all to multiple cores (other than amortizing NRE by
> printing cores on dies like postage stamps).

I won't spend time arguing statements which we agree are based on false
premises.

> Since you can't put
> everything in one place, there is a balance between acceptable heat
> distribution, die size, and being able to reach everything in a single
> core in a single clock (although I suspect that even that constraint
> is going to go by the board at some point).

And, as I state above, I believe you are oversimplifying. The problem is
not just (and maybe even not primarily) distributing the clock. It's
allocating resources and routing data between them, all in the least
amount of time and state changes. I believe that fine-grain allocation
and routing is not the way to best accomplish that.

> >and
> >hyperthreaded cores get their bonus points from (1) possibly requiring
> >less floorplan/cost to manufacture support for n threads, and (2) the
> >ability to dynamically and productively allocate otherwise idle
> >resources/stages to active threads. I assume that these play off
> >against one another--i.e. that the ability to run the multiple cores
> >slower also allows them to be built smaller/simpler (with the forth chip
> >sited earlier as an extreme in that direction), but that if multiple
> >cores do end up significantly larger than the hyperthreaded
> >counterparts, that would affect their ability to share caches as
> >effectively.
> >
>
> The primary advantage of closely-coupled threads on a single core is
> that data sharing is not done through L2 cache or memory.

So, it seems clear that you regard threads running on a single SMT core
as being related differently (in terms of programming model) than those
that might run anywhere else.

> >My sense is that the motivation for hyperthreading relates almost
> >entirely to #2, because having lots of threads is the exception rather
> >than the rule, because of the widely accepted belief that PARALLEL
> >PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
> >pipelines.
> >
>
> I think you are doing yourself and Software Cabling a disservice by
> adopting this posture. While in theory you could use the machinery of
> Software Cabling to address any level of parallelism, its natural
> target is coarse-grained parallelism.

I didn't say anything about Software Cabling, and even if it had been
floating around in the back of my mind (as it is wont to do), I
certainly wasn't advocating its use in fine-grain parallelism. I do tend
to believe that there is a tendency to rely on fine-grained parallelism
when coarse-grain would work better, for the very reasons I've already
stated.

> ... Helper threads are a way of letting the processor or


> processor and supporting software do the fine-grained parallelism.

So am I to understand that you see the primary value of SMT as its
ability to support these helper threads to prefetch data into cache?

And thanks, I do believe that you answered my original question,
-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

David C. DiNucci

Apr 13, 2004, 4:15:58 AM

Well, yes and no.

First, if all you want to do is run a few thousand copies of some von
Neumann module (dumb-ass or otherwise), Software Cabling is probably
overkill for you, but there are lots of other things that aren't. If you
don't want to do it by hand, Ninf comes to mind, but I assume projects
like Gridbus and Globus have similar tools, and for that matter, even
something like Linda Piranha.

Second, what you say is true, to the extent that the magic of
transparent parallelization exists. Obviously, if there's some black
box out there that will run my d-avNm faster than the yellow box next to
it, then all else being equal, I'll choose the black box. But if I'm
paying more to buy the black box, and/or power it, and/or cool it, my
choice might very well be different. And if I can make the cheaper,
cooler, less power-hungry yellow one actually go faster than the black
one just by providing my code to it in a different yet very programmable
and understandable form (and, yes, I am thinking of SC in this case), I
would personally be far more interested in the yellow one--though I'm
sure there are many others who would not be. In the end, it depends on
which market niche you're going for, and I've chosen mine.

Nick Maclaren

Apr 13, 2004, 4:10:18 AM

In article <cp2j701vcij39q676...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|>
|> Only people like Terje and Linus (what is it with these guys from
|> Scandinavia, anyway?) worry about the real effects of OoO on a
|> day-to-day basis. The rest of us simply accept that the processor
|> somehow gets away with running twenty times as fast as the memory bus
|> and get on with business.

Well, I would, if I didn't spend most of my time dealing with much
more elementary implementation imbecilities :-(

The real effects of out-of-order execution can be VERY visible
to interrupt handlers, but how many systems provide competent
application-level interrupt handling nowadays?


Regards,
Nick Maclaren.

Maynard Handley

Apr 13, 2004, 4:39:24 AM
In article <c5fvd5$em4$1...@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

> Maynard, in my simplistic view, this should be feasible today, using the
> current 'lockless' primitives:
>
> Multiple threads on a single physical cpu will get _very_ good
> performance using such primitives since all the relevant (L2) cache
> lines will be exclusively owned, right?
>
> The real problem is the need for the OS to get out of the way, and in
> particular not do a lot of the _really_ stupid things that still happen
> today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
> over all available cpus, instead of giving it a little cpu affinity by
> default. :-(

Well:
* Yes the HW stuff is available in that load-locked/store-conditional
will work. The problem is that you're not supposed to just use
load-locked/store-conditional, you're also supposed to throw some sort
of synchronizing instruction in there. I lose track of the details from
CPU to CPU, but my understanding is that you COULD get away without that
synchronizing instruction if you only cared about synching to this CPU.
Of course since IBM doesn't make any SMT CPUs it's a moot point.
Now two POWER4 cores sharing a die and an L2 presumably match this
situation, but I don't know what IBM recommends for syncing one CPU of
the die with the other while ignoring off-die CPUs. I don't know if they
consider that an interesting problem --- perhaps not, because they tend
to sell POWER4s in boxes with many CPUs.
* Of course we agree on the need for the OS to get out the way.

But, as I said before, the summary is that the sweet spot for how to do
this is neither fwd nor bwd compatible, which is why I am uncertain
about how things will play out.

> As soon as you have more cpu-hogging threads than you have physical
> resources for, you'll lose, but allocating N-1 threads on a N-thread
> machine, and then assuming the OS will be clueful enough to keep them
> stable isn't too much to expect, is it?

Even that, however, assumes you're willing to write code for say 1, 2
and 4 virtual-CPU machines. That's tough when you want such fine-grained
threading. It's a real hassle to keep work like that in sync, and
current IDEs don't do a good job of allowing you either to hide one view
of the code (so you can concentrate on say only the 2-way threading
code) or conversely show the different (but equivalent) code paths side
by side.
Perhaps it's no different from writing code to run on three different
architectures, and appropriate factoring, macros and inline functions
can help? Time will tell.

> When you need to setup an NxM array, with N threads on each of M cpus,
> then you need a little more help, but as long as you're willing to (more
> or less) manually allocate threads to (virtual) cpus that should be enough.

But at this point your life gets really tough. The language primitives
(if you're using Java or COM) or the OS primitives (eg pthreads) really
aren't set up to express [this data structure (mutex, semaphore,
whatever) here needs to be synced with these local threads, and there
needs to be synced with all threads]. What I mean is, there's no easy
way to express that at this point I want the action of acquiring a mutex
to involve only a load-locked/store-conditional loop WITHOUT the extra
"publish to the other CPUs" work, or at this point I want the act of
waiting for a mutex to involve a spin-loop, not an OS queue.
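
To make that concrete, here is a rough sketch in C11 <stdatomic.h>
notation (a later notation than anything in this thread, and every name
in it is invented) of a lock whose acquire is nothing but a test-and-set
loop with acquire ordering: no full "publish to every other CPU" fence,
and no OS wait queue anywhere.

#include <stdatomic.h>

typedef struct { atomic_int held; } local_lock;

static void local_lock_acquire(local_lock *l)
{
    for (;;) {
        /* test-and-test-and-set: spin on an ordinary load first */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;                              /* pure spin, never an OS queue */
        /* the ll/sc (or cmpxchg) part, with acquire ordering only */
        if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            return;
    }
}

static void local_lock_release(local_lock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

On PPC the acquire side should map to roughly an lwarx/stwcx. loop plus
an isync rather than a heavyweight sync, which is exactly the cheap,
CPU-local flavour I'm after.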

> OTOH, so far I haven't found any way (under WinXP) to even figure out
> how my cpus are numbered!
>
> Is it (0,1),(2,3) for (cpu0-thread0, cpu0-thread1),(cpu1-thread0,
> cpu1-thread1) or does it start by going across the physical cpus:
>
> (cpu0-thread0, cpu1-thread0),(cpu0-thread1, cpu1-thread1)
>
> Any hints?

Terje, I'm a MacOS X PPC boy! What can I tell you about WinXP?
Of course in MacOS X land we haven't even hit this problem yet --- but
I'm sure we will soon enough.

Maynard

Maynard Handley

unread,
Apr 13, 2004, 4:42:31 AM4/13/04
to
In article <hd8m70tpn8avvpt24...@4ax.com>,
Robert Myers <rmy...@rustuck.com> wrote:

Sure. And if you're ray-tracing or web-serving life is likewise easy. As
we went through a big song-and-dance a few weeks ago, however, those
problems don't interest me much, and aren't much relevant to the world
that's the bread-and-butter of Windows and MacOS X. What is interesting
in that world (eg my MPEG DECODE example) is that there is
parallelism there, but it is fine-grained --- and can't efficiently be
accessed through current language/OS models.

Maynard

Maynard Handley

unread,
Apr 13, 2004, 4:51:03 AM4/13/04
to
In article <c5fnle$13su7$1...@ID-129159.news.uni-berlin.de>,
"del cecchi" <dcecchi...@att.net> wrote:


> Actually, you might not have been hanging around here in Northstar days.
> It was a couple of processor iterations before Power4. Back in the Old
> Days. :-) And the SMT was sort of coarse, to the point that some here
> refused to call it that.
>
> I think that on any design there are always a number of ways to meet the
> objectives. And the objectives include cost (mfg and development) and
> schedule as well as performance and power and that stuff. Northstar
> designers thought they could get a performance boost for OS400 at an
> affordable cost by building a two thread SMT that switched threads on a
> cache miss. Power4 thought they would be better off buying more silicon
> and putting two processor cores on the chip, with each core being more
> complicated than the Nstar core. Different strokes for different folks,
> as we used to say.
>
> In the case of Power5, I'm sure the question was how best to get a
> performance boost out of the chip using the enhanced density. Make it 4
> cores? Add some resource and convert the existing core to SMT? Build a
> new spiffy core with 57 stages in the pipe? 128 bit FPU?
>
> decisions decisions. So many transistors, so little time.
>
> del cecchi

Do you really think such drastic changes would occur? To me POWER4 looks
like the venerable P6 core: a really nice foundation that can have all
sorts of extras tricked onto it, but basically good enough for the next
five or more years. SMT is about as drastic a change as I'd expect.

What really happened from PPro through PII and PIII? Wasn't it just
minor changes and the addition of various poorly thought out MMX style
instructions?

Maynard

Robert Myers

unread,
Apr 13, 2004, 5:38:42 AM4/13/04
to
On Tue, 13 Apr 2004 01:14:23 -0700, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
>>
>> On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
>> <da...@elepar.com> wrote:

<snip>

>And, as I state above, I believe you are oversimplifying. The problem is
>not just (and maybe even not primarily) distributing the clock. It's
>allocating resources and routing data between them, all in the least
>amount of time and state changes. I believe that fine-grain allocation
>and routing is not the way to best accomplish that.
>

<snip>

>>
>> The primary advantage of closely-coupled threads on a single core is
>> that data sharing is not done through L2 cache or memory.
>
>So, it seems clear that you regard threads running on a single SMT core
>as being related differently (in terms of programming model) than those
>that might run anywhere else.
>

There are two sharply different ways you can use SMT (and even if the
world I'm describing is artificially black and white, indulge my
tendency to oversimplify for purposes of discussion). One of them is
to run two threads that are completely unrelated to one another (they
share as little data as possible and are at least uncorrelated in
demand on resources). That's a possible use of SMT, but it's not very
interesting to me. Deciding whether it's worth it or not probably
involves sitting through long meetings and coming up with an
inconclusive answer.

The other way to use threads (in this artificially black and white
universe) is to work with data that belong in the same cache and
shouldn't have to travel long distances to be used multiple times
because you have a (temporarily) small working set. When you enter
this small world, you can go through the instruction stream one step
at a time, you can, in effect, initiate threadlets and suspend them
with register renaming the way an OoO processor does, or you can just
jump into the pile of instructions and look for stuff you can do (I
hope you appreciate my use of terms of art). The exciting possibility
of agile SMT is the latter.

There are many proposals floating around for how you parallelize a
nominally serial instruction stream. Forcing out long-latency memory
requests is the most obvious one and the one that has gotten the most
attention, but it's not the only possibility. Dick Wilmot has spoken
clearly and cogently for a more ambitious strategy, and I am sure
there are plenty more in the literature. For these strategies to
work, you need to be able to execute separate threads almost as
effortlessly as an OoO processor currently initiates, suspends, and
resumes execution of tiny groups of instructions.

If, in my suggesting creating a dataflow machine on the fly, you are
imagining a general dataflow architecture, with prospective dance
partners far apart on a crowded dance floor trying to find one
another, I can see why you would be horrified. I'm not suggesting
such a thing. I'm suggesting taking instructions and data that belong
close together in space and time and giving the processor more
flexibility as to how it executes them than a single pipe OoO
processor can. Just one more turn of the screw, as I said to Rupert
Pigott.

SMT as originally introduced in the P4 won't support the kind of
parallelism that I see as potentially extremely valuable, and I don't
know whether or not the thread synchronization instructions in SSE3
will change that situation at all, so I can see that I have probably
confused a lot of people. I hope I've made myself a little more clear.

RM

Joe Seigh

unread,
Apr 13, 2004, 7:35:49 AM4/13/04
to

Maynard Handley wrote:
>
> In article <c5fvd5$em4$1...@osl016lin.hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
> > Maynard, in my simplistic view, this should be feasible today, using the
> > current 'lockless' primitives:
> >
> > Multiple threads on a single physical cpu will get _very_ good
> > performance using such primitives since all the relevant (L2) cache
> > lines will be exclusively owned, right?
> >
> > The real problem is the need for the OS to get out of the way, and in
> > particular not do a lot of the _really_ stupid things that still happen
> > today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
> > over all available cpus, instead of giving it a little cpu affinity by
> > default. :-(
>
> Well:
> * Yes the HW stuff is available in that load-locked/store-conditional
> will work. The problem is that you're not supposed to just use
> load-locked/store-conditional, you're also supposed to throw some sort
> of synchronizing instruction in there. I lose track of the details from
> CPU to CPU, but my understanding is that you COULD get away without that
> synchronizing instruction if you only cared about synching to this CPU.

...


>
> But at this point your life gets really tough. The language primitives
> (if you're using Java or COM) or the OS primitives (eg pthreads) really
> aren't set up to express [this data structure (mutex, semaphore,
> whatever) here needs to be synced with these local threads, and there
> needs to be synced with all threads]. What I mean is, there's no easy
> way to express that at this point I want the action of acquiring a mutex
> to involve only a load-locked/store-conditional loop WITHOUT the extra
> "publish to the other CPUs" work, or at this point I want the act of
> waiting for a mutex to involve a spin-loop, not an OS queue.
>

Part of the problem is that hardware designers have no idea how multi-threading
is actually used. Take strongly coherent cache. No correctly written portable
multi-threaded program depends on strongly coherent cache. Yet hardware
designers insist that it is needed. It's some kind of fetish thing as near as
I can figure out.

BTW, some of the mutex implementations are smart enough to spin when the holder
of the lock is running but otherwise suspend to avoid wasting cpu cycles. You
also might look into lock-free algorithms that allow forward progress without
regard to the status of other threads.
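
One concrete example of the "spin a bit, then block" flavour: glibc on
Linux exposes it as an adaptive mutex type. This is a GNU extension, and
the sketch below is from memory, so treat the details as approximate:

#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t lock;

void init_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* On contention, spin a bounded amount in user space before
     * falling back to a kernel sleep. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

As I recall, the glibc version just spins a bounded number of times,
while the Solaris adaptive locks go further and check whether the owner
is actually running before deciding to spin at all.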

Joe Seigh

Stephen Fuld

unread,
Apr 13, 2004, 1:10:40 PM4/13/04
to

"del cecchi" <dcecchi...@att.net> wrote in message
news:c5fnle$13su7$1...@ID-129159.news.uni-berlin.de...

The distinction may be important to answering the question that Robert
posed. The Northstar multi threading was definitely multi threading, but it
wasn't *simultaneous* multi threading as there were instructions from only
one thread at a time at any given pipeline stage. I believe the technical
name for what NS did was "switch on event" (SOE) multithreading. In the NS
case, the event was an L2 cache miss.

Now the reason that may be relevant is that I expect converting a single
thread core to doing true SMT is more work on the core than doing SOEMT.
The scheduler is more complex, and you must add more stuff to keep track of
which thread things belong to. This requires additional development time.

Given that, it is reasonable to think that, given the requirements to get
something out, the Power 4 chose the simpler development of putting two
cores on the same die, thus allowing them to get the product out while doing
the, presumably harder, development work to get the core to be SMT. The
next logical step is to apply that work, the "SMTing" of the core, to the
existing Power 4 design to get to the next performance level.

Of course, this is pure speculation, but it seems reasonable.

--
- Stephen Fuld
e-mail address disguised to prevent spam


Nick Maclaren

unread,
Apr 13, 2004, 1:22:19 PM4/13/04
to
In article <kaVec.26739$i74.5...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>Now the reason that may be relevant is that I expect converting a single
>thread core to doing true SMT is more work on the core than doing SOEMT.
>The scheduler is more complex, and you must add more stuff to keep track of
>which thread things belong to. This requires additional development time.
>
>Given that, it is reasonable to think that, given the requirements to get
>something out, the Power 4 chose the simpler development of putting two
>cores on the same die, thus allowing them to get the product out while doing
>the, presumably harder, development work to get the core to be SMT. The
>next logical step is to apply that work, the "SMTing" of the core, to the
>existing Power 4 design to get to the next performance level.

That assumes that Eggers-style SMT DOES deliver "the next performance"
level over shared-cache SMP/CMP or MTA/Northstar-style SOEMT. One of
the consequences of additional complexity is that you may have to
omit optimisations in order to get correctness.

TANSTAAFL.


Regards,
Nick Maclaren.

Robert Myers

unread,
Apr 13, 2004, 7:43:26 PM4/13/04
to
On Mon, 12 Apr 2004 22:43:13 -0500, "del cecchi"
<dcecchi...@att.net> wrote:

>
>Actually, you might not have been hanging around here in Northstar days.
>It was a couple of processor iterations before Power4. Back in the Old
>Days. :-) And the SMT was sort of coarse, to the point that some here
>refused to call it that.
>

I was indeed not around for the original Northstar discussion, but I
do believe that the SOEMT aspect of Northstar has been discussed
whilst I have been present, much as the Dormouse was present in the
courtroom of the King and Queen of Hearts*. I don't think I had read
the Patterson paper at the time.

RM

*Well, at any rate, the Dormouse said-- the Hatter went on, looking
anxiously round to see if he would deny it too: but the Dormouse
denied nothing, being fast asleep.

Terje Mathisen

unread,
Apr 14, 2004, 1:51:58 AM4/14/04
to
Maynard Handley wrote:
> But at this point your life gets really tough. The language primitives
> (if you're using Java or COM) or the OS primitives (eg pthreads) really
> aren't set up to express [this data structure (mutex, semaphore,
> whatever) here needs to be synced with these local threads, and there
> needs to be synced with all threads]. What I mean is, there's no easy
> way to express that at this point I want the action of acquiring a mutex
> to involve only a load-locked/store-conditional loop WITHOUT the extra
> "publish to the other CPUs" work, or at this point I want the act of
> waiting for a mutex to involve a spin-loop, not an OS queue.

It isn't quite (or even nearly?) this bad:

All the lockless primitives (I assume you write your own code instead of
depending upon OS calls/libs!) can be used directly in your own code,
which means there's no need to involve the OS at all, or at least not
for thread/cpu syncing.

Anyway, the real reason this is a feasible way to go is that, due to
cache coherence, all of this will work transparently across threads
and/or cpus, the only difference is that a cpu-local lock means that the
L2 cache line will never need to change ownership, and this is crucial
for performance.
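
As a concrete (and entirely made-up) example of the kind of primitive I
mean, here is a CAS-based LIFO push, written in C11-atomics notation
purely for clarity. The point is that there is no OS call anywhere in it:

#include <stdatomic.h>

struct node { struct node *next; int payload; };

static _Atomic(struct node *) top;      /* shared LIFO head */

void push(struct node *n)
{
    struct node *old = atomic_load_explicit(&top, memory_order_relaxed);
    do {
        n->next = old;                   /* link behind the current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &top, &old, n,
                 memory_order_release,   /* publish n->next with the swap */
                 memory_order_relaxed));
}

Pop is where it gets hairy (the ABA problem), which is part of why you
end up writing these yourself instead of getting them from an OS library.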

Joe Seigh

unread,
Apr 14, 2004, 7:17:39 AM4/14/04
to

Terje Mathisen wrote:
>
> All the lockless primitives (I assume you write your own code instead of
> depending upon OS calls/libs!) can be used directly in your own code,
> which means there's no need to involve the OS at all, or at least not
> for thread/cpu syncing.
>
> Anyway, the real reason this is a feasible way to go is that, due to
> cache coherence, all of this will work transparently across threads
> and/or cpus, the only difference is that a cpu-local lock means that the
> L2 cache line will never need to change ownership, and this is crucial
> for performance.

You'd still need memory barriers because of the out of order execution.
The performance hit from that isn't significant? Especially if you are
going for finer grained multi-threading.

Joe Seigh

Terje Mathisen

unread,
Apr 14, 2004, 9:51:29 AM4/14/04
to
Joe Seigh wrote:

All bus traffic is significant, but I'm not too worried about a write
barrier (WB) or two near each locked update.

I would like to know what's the most efficient WB on x86, i.e. cpuid is
serializing, but might not (probably cannot!) guarantee anything about
write order outside the current cpu.

What is Linux using these days?

Joe Seigh

unread,
Apr 14, 2004, 11:36:57 AM4/14/04
to

Terje Mathisen wrote:
>
> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> serializing, but might not (probably cannot!) guarantee anything about
> write order outside the current cpu.
>
> What is Linux using these days?
>

For a general memory barrier I think they're using

lock add [esp + 0], 0

Not all pentiums have a real membar instruction yet.
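
From C that idiom looks roughly like this (GCC inline asm, 32-bit; a
sketch, not lifted from the kernel sources):

static inline void full_fence(void)
{
    /* A locked read-modify-write of the word at the top of the stack:
     * always present, always writable, and a full barrier on parts
     * that predate a dedicated mfence instruction. */
    __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc");
}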

Stores for x86 should be in order except for some of the special extended
instructions and the string stores.

Joe Seigh

Sander Vesik

unread,
Apr 14, 2004, 1:37:07 PM4/14/04
to
Stephen Fuld <s.f...@pleaseremove.att.net> wrote:
>
> The distinction may be important to answering the question that Robert
> posed. The Northstar multi threading was sefinitly multi threading, but it
> wasn't *simultaneous* multi threading as there were instructions from only
> one thread at a time at any given pipeline stage. I believe the technical
> name for what NS did was "switch on event" (SOE) multithreading. In the NS
> case, the event was an L2 cache miss.

See the problem is that when you go to shorten SOE MT you end up with SMT again ;-)

On a more serious note - why is the presence of instructions in a single
stage - and not the presence of instructions in the pipeline from different
threads used as the metric? After all, both require you to tag results with
teh thread and even more, if there are execution units that do long async
computations - say a divider - you may *occasionaly* have instructions from
two threads executing in the same pipeline stage.

--
Sander

+++ Out of cheese error +++

Eric

unread,
Apr 14, 2004, 1:19:42 PM4/14/04
to
Terje Mathisen wrote:
>
> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> serializing, but might not (probably cannot!) guarantee anything about
> write order outside the current cpu.

The P4 has LFENCE, SFENCE and MFENCE instructions, and a PTE
can be marked for weak ordered accesses, but I have not heard
of any OS that makes weak ordering available to apps.
(Going by their descriptions, the x86 SFENCE behaves differently
from an Alpha WMB, more like an Alpha MB.)

Eric

Eric

unread,
Apr 14, 2004, 1:18:01 PM4/14/04
to
Maynard Handley wrote:
>
> <snip>

>
> But at this point your life gets really tough. The language primitives
> (if you're using Java or COM) or the OS primitives (eg pthreads) really
> aren't set up to express [this data structure (mutex, semaphore,
> whatever) here needs to be synced with these local threads, and there
> needs to be synced with all threads]. What I mean is, there's no easy
> way to express that at this point I want the action of acquiring a mutex
> to involve only a load-locked/store-conditional loop WITHOUT the extra
> "publish to the other CPUs" work, or at this point I want the act of
> waiting for a mutex to involve a spin-loop, not an OS queue.
>
> <snip>

I don't follow you here. What you seem to be describing is an
ordinary cpu spinlock, but that would never be used by a thread.
If the mutex is not available then the acquiring thread needs to
yield and wait for the owner to release, and that involves the OS.

There is one situation where this is not strictly so, where the
lock duration is so short (e.g. a linked list push or pop) that
it can be worthwhile for the acquiring thread to try spinning for a
little while, just in case the lock is released quickly. However
since the owner thread could be rescheduled while holding the lock,
there is no point in spinning and awaiting release for too long.

Eric


Stephen Fuld

unread,
Apr 14, 2004, 1:46:07 PM4/14/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:c5h7kb$j4r$1...@pegasus.csx.cam.ac.uk...

> In article <kaVec.26739$i74.5...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
> >
> >Now the reason that may be relevant is that I expect converting a single
> >thread core to doing true SMT is more work on the core than doing SOEMT.
> >The scheduler is more complex, and you must add more stuff to keep track of
> >which thread things belong to. This requires additional development time.
> >
> >Given that, it is reasonable to think that, given the requirements to get
> >something out, the Power 4 chose the simpler development of putting two
> >cores on the same die, thus allowing them to get the product out while doing
> >the, presumably harder, development work to get the core to be SMT. The
> >next logical step is to apply that work, the "SMTing" of the core, to the
> >existing Power 4 design to get to the next performance level.
>
> That assumes that Eggers-style SMT DOES deliver "the next performance"
> level over shared-cache SMP/CMP or MTA/Northstar-style SOEMT.

No, it actually only assumes that the IBM people who made that decision
thought that it did. :-) I don't want to get into that argument again.
Since IBM tends to put out white papers/redbooks, etc. we may find out more
information on the reality as time passes.

Alexander Terekhov

unread,
Apr 14, 2004, 3:12:01 PM4/14/04
to

Joe Seigh wrote:
[...]

> Stores for x86 should be in order except for some of the special extended
> instructions and the string stores.

On Itanic, ordinary x86 loads have acquire semantics (hoist-
load/store mbar for subsequent loads/stores in the program order)
and ordinary stores have release semantics (sink-load/store mbar
for preceding loads/stores in the program order).

regards,
alexander.

Joe Seigh

unread,
Apr 14, 2004, 3:31:14 PM4/14/04
to

"sink" and Itanic are an appropiate conjunction. It was nice that
Intel attempted to define their memory model semantics with examples
but did anyone notice how bizarre some of those examples are? None
of them are typical programming contructs.

Joe Seigh

Nick Maclaren

unread,
Apr 14, 2004, 3:50:55 PM4/14/04
to
In article <zNefc.23162$K_.6...@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>> >next logical step is to apply that work, the "SMTing" of the core, to the
>> >existing Power 4 design to get to the next performance level.
>>
>> That assumes that Eggers-style SMT DOES deliver "the next performance"
>> level over shared-cache SMP/CMP or MTA/Northstar-style SOEMT.
>
>No, it actually only assumes that the IBM people who made that decision
>thought that it did. :-) I don't want to get into that argument again.

A fair point - well, two of them :-)

>Since IBM tends to put out white papers/redbooks, etc. we may find out more
>information on the reality as time passes.

Maybe. Those books tend to be what is supposed to happen, and are
reliable when things have gone well, but can be pretty misleading
when they haven't.


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Apr 14, 2004, 4:53:35 PM4/14/04
to
Joe Seigh wrote:

Not guaranteed if you write to multiple locations in the same cache
line: They can be combined and then written in a single burst, even if
you wrote them in a non-sequential order, afaik.

I.e. Write Combining memory ranges can even be used for graphics frame
buffers, where this specific feature can cause problems for
memory-mapped control ports in the same range.

Iain McClatchie

unread,
Apr 14, 2004, 5:04:33 PM4/14/04
to
Terje> I would like to know what's the most efficient WB on x86, i.e. cpuid is
Terje> serializing, but might not (probably cannot!) guarantee anything about
Terje> write order outside the current cpu.

What are you two talking about?

x86 processors have processor ordered memory semantics. Write barriers
aren't needed. Write order apparent to any other CPU is the program order
in the other thread.

This is not complicated. It's a simple semantic. CPU designers like this
simple semantic because it's easier to explain than nearly anything else,
and because they mostly get it for free with their OoO cores.

To help bury this topic, let me point out that the Alpha guys made a big deal
of not promising processor ordering. I don't see that it ever would have
bought them anything, except a lot of excitement among software weenies who
thought they had a new problem to think about.

So far as I know, processor ordering does not give you trivial locking, so
that's still an issue.

Joe Seigh

unread,
Apr 14, 2004, 5:41:38 PM4/14/04
to

Iain McClatchie wrote:
>
> Terje> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> Terje> serializing, but might not (probably cannot!) guarantee anything about
> Terje> write order outside the current cpu.
>
> What are you two talking about?

The x86 memory model.


>
> x86 processors have processor ordered memory semantics. Write barriers
> aren't needed. Write order apparent to any other CPU is the program order
> in the other thread.

processor order != program order

Processor order is used when discussing whether cache is transparent or not,
i.e. whether cache preserves processor order.

But x86 stores are in program order AFAIK except for the write combining stuff
which sounded to me like it was taking place in the processor's store buffers
not cache (from the Intel docs).

Joe Seigh

Alexander Terekhov

unread,
Apr 14, 2004, 6:21:59 PM4/14/04
to

Iain McClatchie wrote:
[...]

> What are you two talking about?

For starters,

http://www.crhc.uiuc.edu/ece412/papers/models_tutorial.pdf

regards,
alexander.

Maynard Handley

unread,
Apr 14, 2004, 7:10:49 PM4/14/04
to
In article <407D9276...@xemaps.com>,
Joe Seigh <jsei...@xemaps.com> wrote:


What are "the examples" you refer to above?
No-one is claiming that these are an important issue for "normal" code.
The point is that they ARE an important issue when you are trying to
write code that involves some sort of synchronization between 2 CPUs
that do not share a cache, for example if you are writing a mutex
acquire/mutex release primitive. If you have never written or pondered
such code and the various things that can go wrong in the sequence of
events, you're really not in a position to judge.

Maynard

Joe Seigh

unread,
Apr 14, 2004, 7:39:11 PM4/14/04
to

Intel Itanium Architecture Software Developer's Manual
Volume 2: System Architecture
section 2.2 Memory Ordering in the Intel Itanium Architecture

Actually I have. Numerous times. In a commercial operating system
even. I've even done a formal definition of memory model semantics.

Joe Seigh

Eric Gouriou

unread,
Apr 14, 2004, 8:35:39 PM4/14/04
to
Joe Seigh wrote:
> Maynard Handley wrote:
>
>>In article <407D9276...@xemaps.com>,
>> Joe Seigh <jsei...@xemaps.com> wrote:
>>
>>
>>>[...] It was nice that

>>>Intel attempted to define their memory model semantics with examples
>>>but did anyone notice how bizarre some of those examples are? None
>>>of them are typical programming constructs.
>>>
>>>Joe Seigh
>>
>>What are "the examples" you refer to above?
[...]

> Intel Itanium Architecture Software Developer's Manual
> Volume 2: System Architecture
> section 2.2 Memory Ordering in the Intel Itanium Architecture

That's not the best place to learn about those semantics. Instead
I'd recommend the formal specification document:
<URL:http://www.intel.com/design/Itanium/Downloads/251429.htm>

You should feel at home in the formalism.

Eric

Joe Seigh

unread,
Apr 14, 2004, 8:33:02 PM4/14/04
to

Eric Gouriou wrote:


>
> Joe Seigh wrote:
> [...]
> > Intel Itanium Architecture Software Developer's Manual
> > Volume 2: System Architecture
> > section 2.2 Memory Ordering in the Intel Itanium Architecture
>
> That's not the best place to learn about those semantics. Instead
> I'd recommend the formal specification document:
> <URL:http://www.intel.com/design/Itanium/Downloads/251429.htm>
>
> You should feel at home in the formalism.
>

Thanks. That is a lot nicer.

Joe Seigh

Nate D. Tuck

unread,
Apr 14, 2004, 9:24:40 PM4/14/04
to
In article <guem70h7s9tmdjic8...@4ax.com>,

Robert Myers <rmy...@rustuck.com> wrote:
>My (Monday morning quarterback) belief, though, is that much more
>significant hyperthreading is their game plan for Itanium, and that
>Hyperthreading was really an R&D project for Itanium with a little
>marketing pizazz as a gimme. In particular, since hyperthreading has
>come out for the P4, it has been all but useless for a big chunk of P4
>applications. That may change, especially with the new instructions
>in SSE3.

For Itanium, it's all about coming up with some kind of story for performance.
They haven't had a good one there yet and the architecture is not very OOO
friendly.

>The fact that IBM has had SMT, then didn't have SMT, and now has SMT,

I don't think Northstar fit the definition of SMT. It's a coarse grain
multithreader like the Tera. Nothing wrong with that, just trying
to keep definitions straight.

nate

Robert Myers

unread,
Apr 14, 2004, 10:33:06 PM4/14/04
to
On Thu, 15 Apr 2004 01:24:40 -0000, n...@turing.cs.hmc.edu (Nate D.
Tuck) wrote:

>In article <guem70h7s9tmdjic8...@4ax.com>,
>Robert Myers <rmy...@rustuck.com> wrote:
>>My (Monday morning quarterback) belief, though, is that much more
>>significant hyperthreading is their game plan for Itanium, and that
>>Hyperthreading was really an R&D project for Itanium with a little
>>marketing pizazz as a gimme. In particular, since hyperthreading has
>>come out for the P4, it has been all but useless for a big chunk of P4
>>applications. That may change, especially with the new instructions
>>in SSE3.
>
>For Itanium, it's all about coming up with some kind of story for performance.
>They haven't had a good one there yet and the architecture is not very OOO
>friendly.
>

Right. With SMT (of a kind not currently available in stores) and a
bit of cleverness, you can get a hefty slice of the advantages of an
OoO processor without actual OoO. Through profiling and compile-time
analysis, you identify problematical loads and create a thread to deal
with them. The spun-off thread doesn't have enough information to be
the final calculation, but it does have enough information to force
out memory requests that couldn't be generated without run-time
analysis.
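
A toy version of the idea, with everything invented for illustration and
assuming the helper shares a cache with the main thread (the SMT case):
the helper runs the same pointer chase as the real computation but does
none of the arithmetic, so all it accomplishes is dragging the cache
lines in early.

#include <pthread.h>

struct node { struct node *next; double payload[7]; };

static void *prefetch_slice(void *arg)
{
    /* The spun-off thread: it can't produce the final answer, but it
     * can force the misses out ahead of the main loop. */
    for (struct node *p = arg; p; p = p->next)
        __builtin_prefetch(p->payload);
    return 0;
}

double real_work(struct node *head)
{
    pthread_t helper;
    pthread_create(&helper, 0, prefetch_slice, head);

    double sum = 0.0;
    for (struct node *p = head; p; p = p->next)
        sum += p->payload[0];            /* the actual computation */

    pthread_join(helper, 0);
    return sum;
}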

The fact that published papers with Intel coauthorship exploring such
strategies for Itanium have appeared and various hints about SMT and
Itanium that have appeared here and there (some of them here) lead me
to conclude that such a strategy is being actively explored for
Itanium.

As Dick Wilmot pointed out, SMT seems like a better use of transistors
for getting the right things into cache than just adding more
megabytes of cache.

RM

Pheonix

unread,
Apr 15, 2004, 12:06:26 AM4/15/04
to
Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...
> And, I might add, probably a perpetually instatiable appetite for
> throughput.
>
> http://www.nytimes.com/2004/04/10/technology/10GAME.html?pagewanted=1&hp
>
> <quote>
>

I noticed that the article appears to compare box office ticket
sales to the sales of video game hw and software combined. Does anyone
know of a more specific breakdown of that 10 billion figure between hw
vs. sw on the game consoles? For a wag of 30 million units at $200,
that could be a high-end estimate.

There are indeed similarities between the two industries. A 1st-run movie
has a shelf life of a few weeks when it has the opportunity to be a
hit or a miss. If a movie sells in the range of 10 to 20 mln tickets
(at $8 a pop) or more during its 1st run, it's probably labeled a hit. Each
game title has a few months of shelf-life to be sold at $30-50 a pop;
after that it's deep discounted and getting squeezed out of shelf
space. What's the threshold of sales volume for a game during its 1st
run to be considered a hit with follow-on sequels? I don't know. Is it 1
million?

The movie industry has reached a plateau of roughly 1 billion
person/view annually. I'm tempted to estimate the game software
industry is probably standing at somewhere between 100-200 mln boxes a
year.

There are aspects of these comparisons that suggest game software looks
attractive to new players, but I can't help wondering about other aspects
that beg questions...

The cost to develop and market a game title is probably substantial as
well. Perhaps less substantial on the marketing side. But the trend
seems to be that developing a game title takes much longer than
shooting a movie. A movie shoot typically is completed in a few
months. There was a time when typical game development took 18 months
or so. Now it's much longer. A movie production budget may range from
several mln to 100 mln. (I recall LOTR was ~300 mln for 3 movies.) I
believe game development budgets have passed that low end already; are
there indications that they're heading toward the same watermark in the
not very distant future?

Nate D. Tuck

unread,
Apr 15, 2004, 1:41:26 AM4/15/04
to
In article <o8sr705sqo18phq33...@4ax.com>,

Robert Myers <rmy...@rustuck.com> wrote:
>Right. With SMT (of a kind not currently available in stores) and a
>bit of cleverness, you can get a hefty slice of the advantages of an
>OoO processor without actual OoO. Through profiling and compile-time
>analysis, you identify problematical loads and create a thread to deal
>with them. The spun-off thread doesn't have enough information to be
>the final calculation, but it does have enough information to force
>out memory requests that couldn't be generated without run-time
>analysis.

Yes. See Jamison Collins' paper. There's a lot of other people doing
a lot of different spins on this idea. Wisconsin people will say it's
all shades of Multiscalar. They may be right.

>The fact that published papers with Intel coauthorship exploring such
>strategies for Itanium have appeared and various hints about SMT and
>Itanium that have appeared here and there (some of the here) lead me
>to conclude that such a strategy is being actively explored for
>Itanium.

This was presented at IDF this year, so it isn't a big secret.

>As Dick Wilmot pointed out, SMT seems like a better use of transistors
>for getting the right things into cache than just adding more
>megabytes of cache.

One thing to beware of is that most of these papers don't try very hard
to include good hardware prefetchers.

nate

Terje Mathisen

unread,
Apr 15, 2004, 2:17:31 AM4/15/04
to
Iain McClatchie wrote:

> Terje> I would like to know what's the most efficient WB on x86, i.e. cpuid is
> Terje> serializing, but might not (probably cannot!) guarantee anything about
> Terje> write order outside the current cpu.
>
> What are you two talking about?
>
> x86 processors have processor ordered memory semantics. Write barriers
> aren't needed. Write order apparent to any other CPU is the program order
> in the other thread.

So why does SMP Linux have write barriers on x86?

Why can you have multiple writes to the same location (i.e. a stack
variable?) within a short period of time, and this turns into a single
full cache line read followed by a full cache line write?

Why does Intel even designate memory ranges as Write Combining, unless
multiple writes to the same area can be coalesced into a single actual
bus transfer?

> This is not complicated. It's a simple semantic. CPU designers like this
> simple semantic because it's easier to explain than nearly anything else,
> and because they mostly get it for free with their OoO cores.

Where, oh where, is Andy Glew when we need him!
:-)

Iain, please help me!

Life would be so much simpler if you were correct, so I'd really like at
least a pointer to the relevant page(s) of the Intel/AMD manuals where
all writes are promised to propagate all the way onto the actual RAM
chips in program order.

Terje Mathisen

unread,
Apr 15, 2004, 2:44:26 AM4/15/04
to
Alexander Terekhov wrote:

> Iain McClatchie wrote:
> [...]
>
>>What are you two talking about?
>
>
> For starters,
>
> http://www.crhc.uiuc.edu/ece412/papers/models_tutorial.pdf

Thanks!

After reading all 23 pages of that, it does seem like x86 is in the
class of cpus that provide strong cross-cpu ordering, but the
(relatively) recent introduction of memory fence/barrier opcodes
indicates that these ordering guarantees might disappear in the future.

Terje Mathisen

unread,
Apr 15, 2004, 2:52:02 AM4/15/04
to
Eric Gouriou wrote:
> That's not the best place to learn about those semantics. Instead
> I'd recommend the formal specification document:
> <URL:http://www.intel.com/design/Itanium/Downloads/251429.htm>

On page 6 of that document:
"Other models might allow a separate visibility order for each
processor, examples include DASH processor consistency, and Intel
achitecture memory ordering. The visibility order associated with a
processor orders only the operations that could be observed by that
processor (eg. its own reads and writes)."

So, from this it does seem like some form of write barrier is needed?

Stephen Fuld

unread,
Apr 15, 2004, 2:54:20 AM4/15/04
to

"Sander Vesik" <san...@haldjas.folklore.ee> wrote in message
news:10819643...@haldjas.folklore.ee...

> Stephen Fuld <s.f...@pleaseremove.att.net> wrote:
> >
> > The distinction may be important to answering the question that Robert
> > posed. The Northstar multi threading was definitely multi threading, but it
> > wasn't *simultaneous* multi threading as there were instructions from only
> > one thread at a time at any given pipeline stage. I believe the technical
> > name for what NS did was "switch on event" (SOE) multithreading. In the NS
> > case, the event was an L2 cache miss.
>
> See the problem is that when you go to shorten SOE MT you end up with SMT again ;-)

Yes - the problem with "overloaded" acronyms. :-(

> On a more serious note - why is the presence of instructions in a single
> stage - and not the presence of instructions in the pipeline from different
> threads used as the metric? After all, both require you to tag results with
> the thread and even more, if there are execution units that do long async
> computations - say a divider - you may *occasionally* have instructions from
> two threads executing in the same pipeline stage.

As to the second point, you have to ask Del how that was handled. But since
they only switched on an L2 miss, it probably wouldn't be too much of a
performance penalty if they waited for an in-process long instruction to
finish. But I really don't know.

As to the first point, they would only need one tag per pipestage versus one
tag per instruction in flight. I am speculating that the former would be
simpler. And if you were willing to wait a few cycles to drain the pipeline
(remember L2 miss, not even L1 miss), you could avoid even that, having only
one tag per core. But again, Del would have to provide information as to
what was actually done, as I freely admit mine is pure speculation.

Nick Maclaren

unread,
Apr 15, 2004, 3:45:01 AM4/15/04
to

In article <107rp2o...@corp.supernews.com>,

n...@turing.cs.hmc.edu (Nate D. Tuck) writes:
|>
|> >The fact that IBM has had SMT, then didn't have SMT, and now has SMT,
|>
|> I don't think Northstar fit the definition of SMT. It's a coarse grain
|> multithreader like the Tera. Nothing wrong with that, just trying
|> to keep definitions straight.

One of the many reasons that I loathe the unnecessary invention of
technical terms is that it constrains many people's thinking.
Eggers-style SMT is just one example among many of automatic
thread interleaving. Northstar-style SOEMT is another, the
POWER4/Regatta memory management system is another (just), some
of the vector and semi-separate floating-point machines were
others[*], and so on.

Both Sun and IBM have said things in public, so I am not breaking
NDA in saying that their 'SMT' isn't going to use quite the same
model as any of Eggers, Intel Hyperthreading or the EV8. And the
same is true of other vendors I could mention ....


[*] If you have a system with 2 CPUs (on the same chip or not)
and a single vector/floating-point unit that can have instructions
passed to it by either CPU, in what sense is it not partially SMT?
Let's have a mathematically precise definition of SMT :-)


Regards,
Nick Maclaren.

Anton Ertl

unread,
Apr 15, 2004, 4:37:48 AM4/15/04
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:

>Iain McClatchie wrote:
>> Write order apparent to any other CPU is the program order
>> in the other thread.
>
>So why does SMP Linux have write barriers on x86?

No idea. Does it? I don't know that x86 has memory barrier
instructions.

>Why can you have multiple writes to the same location (i.e. a stack
>variable?) within a short period of time, and this turns into a single
>full cache line read followed by a full cache line write?

Why not? The cache coherency protocol ensures that other CPUs don't
see the writes in the wrong order (actually, for writes to the same
location, does any architecture require a write barrier between the
writes?).

>Why does Intel even designate memory ranges as Write Combining, unless
>multiple writes to the same area can be coalesced into a single actual
>bus transfer?

Write-through and write-combining are used for devices that don't
participate in the cache-coherency protocol, i.e., memory-mapped I/O
devices.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Michael S

unread,
Apr 15, 2004, 5:56:58 AM4/15/04
to
Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<c5k8cf$12p$1...@osl016lin.hda.hydro.com>...

Terje,
Stores for x86 (with the exception of SSE/SSE2 non-temporal stores and
some cases of REP MOVS and REP STOS) aren't guaranteed to be written
to memory in the program order but _are_ guaranteed (by the cache
coherency protocol) to be observed in the program order by other
processors. And for IPC that's all that matters.

If the program at CPU A writes to location X and then to location Y
then CPU B will never see the situation of {X=old;Y=new}. On the
other hand, I/O devices which know nothing about cache coherency
observe stores to WB-cached areas in arbitrary order, but WB-cached
areas aren't very useful for communication with I/O devices anyway.
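
Spelled out as the usual two-CPU litmus test (a sketch; plain C
assignments stand in for the machine accesses, so as portable C this
would be a data race; read it as a description of the hardware
guarantee, not as a program):

int X = 0, Y = 0;                 /* both start "old" */

void cpu_A(void) { X = 1; Y = 1; }        /* new X, then new Y */

void cpu_B(void)
{
    int y = Y;
    int x = X;
    /* Forbidden on x86: y == 1 && x == 0, i.e. {X=old;Y=new}.
     * Every other combination may be observed. */
    (void)x; (void)y;
}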

Joe Seigh

unread,
Apr 15, 2004, 7:47:34 AM4/15/04
to

Terje Mathisen wrote:
> > http://www.crhc.uiuc.edu/ece412/papers/models_tutorial.pdf
>
> Thanks!
>
> After reading all 23 pages of that, it does seem like x86 is in the
> class of cpus that provide strong cross-cpu ordering, but the
> (relatively) recent introduction of memory fence/barrier opcodes
> indicates that these ordering guarantees might disappear in the future.
>

You should assume so. Nothing is cast in concrete, so you can assume that
the rules can change out from under you. Intel doesn't look at software
the way the rest of us do. To them it's just model dependent device/graphics
drivers and have an ROI the same as hardware. So the rest of us who want
our programs to last more than a year just aren't part of that consideration
for the most part with respect to multi-threading.

If you want portability, you should be using an abstraction layer to handle
any portability issues. The ones I use, or would use, are mutexes, event
signaling, various kinds of smart pointers, and pseudo memory barriers. You
create a platform independent formal definition, code to that, and deal with
the api implementation on a platform basis.

I don't really like the memory barriers as an abstraction as they're rather
low level and near enough to the real ones that confusion between the abstract
api and the real hw might be an issue. But there are situations that the
other abstract api's won't handle.

Linux has four, I think. These are probably not the actual pseudo ops:
rmb (load/load ordering)
wmb (store/store ordering)
rmb_dependent (load/load dependent ordering)
mb (everything/everything ordering)

They're whatever is required on the various platforms including nothing
if nothing is required.
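
The shape of such a layer, as a sketch (the my_* names are invented and
the expansions are illustrative guesses, not a vetted port):

#if defined(__i386__)
#  define my_mb()   __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#  define my_rmb()  my_mb()                                /* conservative */
#  define my_wmb()  __asm__ __volatile__("" ::: "memory")  /* x86 stores already ordered */
#elif defined(__powerpc__)
#  define my_mb()   __asm__ __volatile__("sync"  ::: "memory")
#  define my_rmb()  __asm__ __volatile__("sync"  ::: "memory")
#  define my_wmb()  __asm__ __volatile__("eieio" ::: "memory")
#else
#  error "no barrier definitions for this platform yet"
#endif

The application codes to my_mb()/my_rmb()/my_wmb() and never mentions
the platform; each port fills in whatever (possibly nothing) it needs.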

This won't guarantee that some future hw implementation won't break your
api. Nothing will. HW engineers aren't programmers and some programming
issues aren't on their scope. But if you stick close to some of the
common conventions you should have safety in numbers at least.

Joe Seigh

Alexander Terekhov

unread,
Apr 15, 2004, 8:40:37 AM4/15/04
to

Terje Mathisen wrote:
[...]

> After reading all 23 pages of that, it does seem like x86 is in the
> class of cpus that provide strong cross-cpu ordering, ...

It's quite confusing. Weakly ordered memory and streaming stuff aside
for a moment... (in Itanic terms)

- loads have acquire semantics

- stores have release semantics

- lock instructions have release and acquire semantics (fully fenced).

It's way too strong but they aren't going to break tons of software.

regards,
alexander.

Alexander Terekhov

unread,
Apr 15, 2004, 8:45:53 AM4/15/04
to

Joe Seigh wrote:
[...]

> Linux has four, I think. These are probably not the actual pseudo ops:
> rmb (load/load ordering)
> wmb (store/store ordering)
> rmb_dependent (load/load dependent ordering)
> mb (everything/everything ordering)

They also have "op.acquire" and "op.release". All in all, it's pretty
brain-damaged. {std::}atomic<> is the way to go. ;-)

regards,
alexander.

Del Cecchi

unread,
Apr 15, 2004, 9:54:13 AM4/15/04
to

"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:wkqfc.32522$i74.7...@bgtnsc04-news.ops.worldnet.att.net...
snip

> As to the second point, you have to ask Del how that was handled. But since
> they only switched on an L2 miss, it probably wouldn't be too much of a
> performance penalty if they waited for an in-process long instruction to
> finish. But I really don't know.
>
> As to the first point, they would only need one tag per pipestage versus one
> tag per instruction in flight. I am speculating that the former would be
> simpler. And if you were willing to wait a few cycles to drain the pipeline
> (remember L2 miss, not even L1 miss), you could avoid even that, having only
> one tag per core. But again, Del would have to provide information as to
> what was actually done, as I freely admit mine is pure speculation.
>
> --
> - Stephen Fuld
> e-mail address disguised to prevent spam
>
>
Sorry, I'm just a po' country circuit designer. I was primarily a spectator
at the Nstar et al design, although perhaps one of the papers from something
like the IBM journal of R and D or Microprocessor Reports might discuss it.
As I recall, these were processors that had short pipelines. Short pipes
are easier and they are a better fit with "commercial" applications that
Nstar was targeted towards.

Perhaps someone who really knows will post, but my assumption would be that
they did the simplest alternative.

del


Eric

unread,
Apr 15, 2004, 10:08:12 AM4/15/04
to
Alexander Terekhov wrote:
>
> For starters,
>
> http://www.crhc.uiuc.edu/ece412/papers/models_tutorial.pdf

And for the real keeners in the audience, there is
"Memory Consistency Models for Shared-Memory Multiprocessors"
also by Kourosh Gharachorloo (300 pages)

http://www.research.compaq.com/wrl/techreports/abstracts/95.9.html

Eric

Nate D. Tuck

unread,
Apr 15, 2004, 11:09:17 AM4/15/04
to
In article <c5leht$1i6$1...@pegasus.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>One of the many reasons that I loathe the unecessary invention of
>technical terms is that it constrains many people's thinking.

Or it distinguishes between distinct concepts. Take your pick.
Is it important to have any distinct concepts in your mind
or is everything kind of the same?

>Eggers-style SMT is just one example among many of automatic
>thread interleaving.

Very true.

>[*] If you have a system with 2 CPUs (on the same chip or not)
>and a single vector/floating-point unit that can have instructions
>passed to it by either CPU, in what sense is it not partially SMT?
>Let's have a mathematically precise definition of SMT :-)

SMT as is commonly accepted is an OOO machine that performs rename
for multiple threads and is capable of issuing instructions
from those threads on the same cycle.

That's not mathematically precise, but it doesn't fit most of the examples
you gave.

nate


Nick Maclaren

unread,
Apr 15, 2004, 11:21:58 AM4/15/04
to

In article <107t9ct...@corp.supernews.com>,

n...@turing.cs.hmc.edu (Nate D. Tuck) writes:
|>
|> >[*] If you have a system with 2 CPUs (on the same chip or not)
|> >and a single vector/floating-point unit that can have instructions
|> >passed to it by either CPU, in what sense is it not partially SMT?
|> >Let's have a mathematically precise definition of SMT :-)
|>
|> SMT as is commonly accepted is an OOO machine that performs rename
|> for multiple threads and is capable of issuing instructions
|> from those threads on the same cycle.

Why necessarily out-of-order? Renaming what? And what's a cycle?
Seriously.

There is a good case for saying that the above description includes
the POWER4 memory system, so we have the interesting question of
whether the memory system is part of the CPU or not.

|> That's not mathematically precise, but it doesn't fit most of the
|> examples you gave.

It fits at least some of the separate vector/floating-point unit
systems. Most didn't allow distinct threads to use the attached
unit in alternate cycles, of course, but some of that was imposed
by the software and it was in theory possible.

I can't remember if it was possible for the System/390 vector
unit (assuming that the multiple threads were tasks in the same
address space, of course), but it may have been.


Regards,
Nick Maclaren.

Mike Haertel

unread,
Apr 15, 2004, 12:34:57 PM4/15/04
to
On 2004-04-15, Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> Life would be so much simpler if you were correct, so I'd really like at
> least a pointer to the relevant page(s) of the Intel/AMD manuals where
> all writes are promised to propagate all the way onto the actual RAM
> chips in program order.

They aren't. Cache evictions to memory occur in an order completely
unrelated to the order in which the cache lines were dirtied.

Ignoring the special case of WC memory for a moment, for regular WB
cached memory the x86 semantics are simply that:

if (I, cpu A) do:
    store -> X
    store -> Y
and someone else (cpu B) does:
    load Y    (and sees the result of CPU A's "store Y")
then:
    if B subsequently does "load X" it must see CPU A's "store X"
    (or something else that later overwrote location X, but it
    cannot see the value that was in X before CPU A's store X)

Basically the x86 semantics are a formalization of the 486's store queue.

Each processor is allowed to have a (potentially very long) queue
of stores it has performed, that have not yet been committed to the
coherent shared memory image. Loads on each processor snoop their
own store queue before taking values from the coherent memory image,
so that loads always see a version of memory coherent with stores
done by the local processor at least.

However, loads do not snoop other processors' queues of uncommitted stores.

So the x86 memory ordering guarantee is simply that you will observe
stores from other processors in the same order those processors make
the stores. It doesn't guarantee that you will observe them right away,
and also doesn't guarantee that all processors in the system will observe
the stores at the same time.

X86 also doesn't guarantee that if you (cpu A) and another observer (cpu B)
are observing streams of stores from cpus C and D, that A and B will both see
the same interleaving of the stores from C and D.

Store fence instructions drain the store queue.
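
The draining matters because of the classic store-buffer case (a sketch;
plain ints stand in for memory locations, so as C this is a race; read
it as a machine-level description):

int X = 0, Y = 0;
int r1, r2;

void cpu_A(void) { X = 1; /* mfence or a locked op here to drain */ r1 = Y; }
void cpu_B(void) { Y = 1; /* ditto */                               r2 = X; }

/* Without the fences, each store can still be sitting in its own CPU's
 * queue when the other CPU's load executes, so r1 == 0 && r2 == 0 is a
 * legal outcome.  Draining the queue before the load rules it out. */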

Joe Seigh

unread,
Apr 15, 2004, 5:01:46 PM4/15/04
to

One way anyway. But I think I'll just chill on this semantics stuff.
Somebody mentioned formal semantics for a certain multi-threaded construct
and I sent them some stuff on it and pretty much scared them off. Heck,
even Posix doesn't do formal semantics. Not knowing the actual semantics
hasn't stopped anyone from using it. And as long as you cannot recreate
an actual bug then it's not a bug (as anyone who's ever called in a bug
to a software vendor knows).

Joe Seigh

Nate D. Tuck

unread,
Apr 15, 2004, 6:52:39 PM4/15/04
to
In article <c5m9am$rnh$1...@pegasus.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>Why necessarily out-of-order? Renaming what? And what's a cycle?
>Seriously.

I'm not going to fight intentional obtuseness. Develop your own
taxonomy if you want. However, there is a fairly large group of people
who have a definition of SMT that seems to differ from yours.

nate

Brannon Batson

unread,
Apr 15, 2004, 7:02:44 PM4/15/04
to
Terje Mathisen <terje.m...@hda.hydro.com> wrote in message news:<c5lbej$jck$1...@osl016lin.hda.hydro.com>...

> Eric Gouriou wrote:
> > That's not the best place to learn about those semantics. Instead
> > I'd recommend the formal specification document:
> > <URL:http://www.intel.com/design/Itanium/Downloads/251429.htm>
>
> On page 6 of that document:
> "Other models might allow a separate visibility order for each
> processor, examples include DASH processor consistency, and Intel
> achitecture memory ordering. The visibility order associated with a
> processor orders only the operations that could be observed by that
> processor (eg. its own reads and writes)."
>
> So, from this it does seem like some form of write barrier is needed?
>
> Terje

What's the question? Can't find the original thread...

Brannon
not speaking for Intel

Robert Myers

unread,
Apr 15, 2004, 7:03:12 PM4/15/04
to
On Thu, 15 Apr 2004 21:01:46 GMT, Joe Seigh <jsei...@xemaps.com>
wrote:

<snip>

>Heck,
>even Posix doesn't do formal semantics. Not knowing the actual semantics
>hasn't stopped anyone from using it.

<cue Lento movement from Chopin Sonata Op. 35 (Funeral March)>

http://www.virtualsheetmusic.com/downloads/Chopin/FMarch.html

Not...knowing...the...actual...semantics...hasn't...stopped...anyone...from...using...it.

All kinds of fatuous comparisons come to mind. I won't indulge them.
If human beings specify actions for *computers* without knowing the
exact meaning of the actions they are specifying, when is it that we
do anything where we even know what we claim to be the intended
outcome of our actions?

<if you haven't done so already, you can click on the link and listen
to the midi file>

RM

Brannon Batson

unread,
Apr 15, 2004, 7:22:03 PM4/15/04
to
Joe Seigh <jsei...@xemaps.com> wrote in message news:<407E774F...@xemaps.com>...

> Terje Mathisen wrote:
> > > http://www.crhc.uiuc.edu/ece412/papers/models_tutorial.pdf
> >
> > Thanks!
> >
> > After reading all 23 pages of that, it does seem like x86 is in the
> > class of cpus that provide strong cross-cpu ordering, but the
> > (relatively) recent introduction of memory fence/barrier opcodes
> > indicates that these ordering guarantees might disappear in the future.
> >
>
> You should assume so. Nothing is cast in concrete, so you can assume that
> the rules can change out from under you. Intel doesn't look at software
> the way the rest of us do. To them it's just model dependent device/graphics
> drivers and have an ROI the same as hardware. So the rest of us who want
> our programs to last more than a year just aren't part of that consideration
> for the most part with respect to multi-threading.

I disagree with this. Intel goes to spectacular lengths to make sure
that 20 year old software still runs. There are cases where software
was coded to undefined or unsupported behavior, and there Intel does a
cost/benefit analysis on breaking the legacy--but breaking a legacy
(even an unintentional or clearly bad legacy) is frowned upon.

> If you want portability, you should be using an abstraction layer to handle
> any portability issues. The ones I use, or would use, are mutexes, event
> signaling, various kinds of smart pointers, and pseudo memory barriers. You
> create a platform independent formal definition, code to that, and deal with
> the api implementation on a platform basis.

That is all true.

> [snip]


>
> This won't guarantee that some future hw implementation won't break your
> api. Nothing will. HW engineers aren't programmers and some programming
> issues aren't on their scope.

Hmmmm, if I was less mature I would make the statement that "CPU HW
engineers know vastly more about programming than programmers know
about HW"--but instead I'll say: There's no mysticism about memory
ordering amongst the people that make those HW decisions. We also
track carefully the common software perversions of architected memory
models to avoid (unless necessary) breaking 'bad' software.

> But if you stick close to some of the
> common conventions you should have safety in numbers at least.

Always good advice.

> Joe Seigh

Joe Seigh

unread,
Apr 15, 2004, 7:53:57 PM4/15/04
to

Brannon Batson wrote:
>
> Joe Seigh <jsei...@xemaps.com> wrote in message news:<407E774F...@xemaps.com>...

> > You should assume so. Nothing is cast in concrete, so you can assume that
> > the rules can change out from under you. Intel doesn't look at software
> > the way the rest of us do. To them it's just model dependent device/graphics
> > drivers and have an ROI the same as hardware. So the rest of us who want
> > our programs to last more than a year just aren't part of that consideration
> > for the most part with respect to multi-threading.
>
> I disagree with this. Intel goes to spectacular lengths to make sure
> that 20 year old software still runs. There are cases where software
> was coded to undefined or unsupported behavior, and there Intel does a
> cost/benefit analysis on breaking the legacy--but breaking a legacy
> (even an unintentional or clearly bad legacy) is frowned upon.

Yes, your 8088 program will still run, provided you only run it
single-threaded.

> >
> > This won't guarantee that some future hw implementation won't break your
> > api. Nothing will. HW engineers aren't programmers and some programming
> > issues aren't on their scope.
>
> Hmmmm, if I was less mature I would make the statement that "CPU HW
> engineers know vastly more about programming than programmers know
> about HW"--but instead I'll say: There's no mysticism about memory
> ordering amongst the people that make those HW decisions. We also
> track carefully the common software perversions of architected memory
> models to avoid (unless necessary) breaking 'bad' software.
>

'bad' software? You mean software that was correct according to the memory model
in effect at the time it was written? Oh, I see. If you call it 'bad' then it's
ok to break it. It deserved to be broken. :)

Seriously, is this 'bad' formally defined? So then all I would have to do is
avoid this badness and my programs would be future proof. That would truly
be cool.

Joe Seigh

del cecchi

unread,
Apr 15, 2004, 8:41:01 PM4/15/04
to

"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:407F218D...@xemaps.com...

(didn't snip because I couldn't decide where. )

Sort of snarky, eh? Maybe you should open a cold beer and sit down
with The Mythical Man-Month. Examples of bad programming, let me count
the ways:

Using undocumented instructions or side effects
Storing data in the upper 8 bits of S/360 et al 24 bit addresses
Counting on memory ordering more rigid than specified.
Using timing loops.

Bad programming must be like Pornography to the Supreme Court... I know
it when I see it. One of those things that seemed like a good idea at
the time, sort of like edge triggered latches.

del cecchi


del cecchi

unread,
Apr 15, 2004, 8:43:08 PM4/15/04
to

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:o8sr705sqo18phq33...@4ax.com...

> Itanium.
>
> As Dick Wilmot pointed out, SMT seems like a better use of transistors
> for getting the right things into cache than just adding more
> megabytes of cache.
>
> RM
>

Remember, things are not always what they seem. VLIW seemed like a good
idea too.


Joe Seigh

unread,
Apr 15, 2004, 8:49:20 PM4/15/04
to

Robert Myers wrote:
>
> On Thu, 15 Apr 2004 21:01:46 GMT, Joe Seigh <jsei...@xemaps.com>
> wrote:
>
> <snip>
>
> >Heck,
> >even Posix doesn't do formal semantics. Not knowing the actual semantics
> >hasn't stopped anyone from using it.
>
> <cue Lento movement from Chopin Sonata Op. 35 (Funeral March)>
>
> http://www.virtualsheetmusic.com/downloads/Chopin/FMarch.html
>

(snip)

I'd prefer "Cantus in memory of Benjamin Britten" by Arvo Part instead
if we're going to go that route.

Joe Seigh

Robert Myers

unread,
Apr 15, 2004, 9:29:03 PM4/15/04
to
On Thu, 15 Apr 2004 19:43:08 -0500, "del cecchi"
<dcecchi...@att.net> wrote:

>
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:o8sr705sqo18phq33...@4ax.com...
>> Itanium.
>>
>> As Dick Wilmot pointed out, SMT seems like a better use of transistors
>> for getting the right things into cache than just adding more
>> megabytes of cache.
>>
>>
>

>Remember, things are not always what they seem. VLIW seemed like a good
>idea too.
>

It depends on from whose perspective you view these things. IBM seems
to have lost much of its investment in VLIW, but then, IBM is
_forever_ losing its investment in things because it doesn't stick
with them. To be a little less "snarky" about it, IBM has almost
always been doing so many different things that it can lose more ideas
than most companies will ever think of and still thrive.

Intel doesn't seem to be prepared to lose its investment in VLIW.
Whether that's a good thing or a bad thing for Intel and its
stockholders isn't my problem.

In my not-so-humble opinion, VLIW and the research it has instigated
have been very good things for many subject areas related to
computer architecture and to the theory and practice of computation.

Or, to narrow the matter down selfishly, VLIW has been a _very_ good
thing for me to think about because it has caused me to explore issues
that turn out to be of both potential practical and mathematical
interest and that it otherwise might never have occurred to me to
think about at all.

It's possible that another questionable design decision on the part of
Intel, viz. trading clock speed and a deeper stack for IPC on NetBurst,
would have led me down the same path, and at this point I'm prepared
to reiterate my original (naive) judgment of the P4: an Edsel of a
processor if ever there was one.

So I'll just be my usual eccentric, iconoclastic self and repeat the
unoriginal observation that lemons are the key ingredient of lemonade.

RM

Joe Seigh

unread,
Apr 15, 2004, 9:51:57 PM4/15/04
to

del cecchi wrote:
>
> "Joe Seigh" <jsei...@xemaps.com> wrote in message

> news:407F218D...@xemaps.com...


> >
> >
> > Seriously, is this 'bad' formally defined? So then all I would have
> > to do is avoid this badness and my programs would be future proof.
> > That would truly be cool.
> >
> > Joe Seigh
>
> (didn't snip because I couldn't decide where. )
>
> Sort of snarky, eh? Maybe you should open a cold beer and sit down
> with the Mythical Man Month. Examples of bad programming, let me count
> the ways.
>

I think the problem with the term "future proof" is that it is hard
to imagine the future (otherwise we'd be there by now). Since no one
can imagine it, it must be the 'bad' code that is the problem, not that
we couldn't predict something that is by definition unpredictable.

I can't imagine, in the current memory model, that your code would ever
break if you used memory barriers correctly. Of course, I imagined that
several times in the past, too.

> Storing data in the upper 8 bits of S/360 et al 24 bit addresses

Nobody imagined that you'd need more than 24 bits for an address. Ever.

Joe Seigh

john jakson

unread,
Apr 16, 2004, 2:02:41 AM4/16/04
to
"del cecchi" <dcecchi...@att.net> wrote in message news:<c5na3n$3pmkj$1...@ID-129159.news.uni-berlin.de>...

Aha, fond memories of MacOS programming when the 68K had 24-bit addresses,
only Apple broke all their own rules too, hiding SW mmu stuff up there
in the top byte. Every single OS upgrade broke endless apps precisely
because of all the above, and most Mac SW never seemed to survive more
than 2 OS upgrades without overhauling. There used to be an endless list
of no-nos, but my fav was always self-modifying code; I was as
guilty as most of em.

For some peculiar reason I got the impression that MS-Intel bent over
backwards to keep really old stuff alive much longer than Apple did.
Of course broken SW usually works ok in a suitable VM, thank goodness
for emulation.

> Bad programming must be like Pornography to the Supreme Court... I know
> it when I see it. One of those things that seemed like a good idea at
> the time, sort of like edge triggered latches.
>
> del cecchi


johnjakson_usa_com

Terje Mathisen

unread,
Apr 16, 2004, 2:09:18 AM4/16/04
to
Mike Haertel wrote:
> So the x86 memory ordering guarantee is simply that you will observe
> stores from other processors in the same order those processors make
> the stores. It doesn't guarantee that you will observe them right away,
> and also doesn't guarantee that all processors in the system will observe
> the stores at the same time.

Thanks Mike, this is the way I have been (tacitly) programming for ~20
years, it is good to know that it is still valid. :-)

I.e. the guarantee of ordered observation of stores is sufficient to
ensure that you can copy an active structure to a new location, make
whatever updates you desire, and then do an atomic write to redirect a
global pointer to the current version.

As long as you have a single process making updates (but potentially
many readers), you don't even need any form of locking on any of these
write operations. :-)

OTOH, it is much harder to know exactly when it is safe to
reclaim/garbage collect the old version; this is where the RCU OS hack
comes in.
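A minimal sketch of that single-writer/many-reader scheme, written with
C++11 atomics rather than the plain x86 stores under discussion; the
Config type and function names are illustrative only:

#include <atomic>

struct Config { int a; int b; };

std::atomic<Config*> g_current{new Config{0, 0}};

// Writer (only one thread ever calls this):
void update_config(int new_a, int new_b) {
    // Copy the active structure to a new location...
    Config* fresh = new Config(*g_current.load(std::memory_order_relaxed));
    fresh->a = new_a;          // ...make whatever updates you desire...
    fresh->b = new_b;
    // ...then do one atomic write to redirect the global pointer.
    // Release ordering makes the initialized copy visible before the pointer.
    Config* old = g_current.exchange(fresh, std::memory_order_release);
    // 'old' cannot be freed here without knowing that no reader still uses
    // it -- that reclamation problem is exactly what RCU addresses.
    (void)old;
}

// Readers (any number of threads, no locking):
int read_a() {
    return g_current.load(std::memory_order_acquire)->a;
}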

Terje Mathisen

unread,
Apr 16, 2004, 2:27:53 AM4/16/04
to
Brannon Batson wrote:

No problem, you already answered in the original thread. :-)

I.e. x86 SMP machines use a cache coherency protocol that ensures that
all cpus see stores in the order they have been programmed, without
needing any write barrier.

Nick Maclaren

unread,
Apr 16, 2004, 3:51:00 AM4/16/04
to
In article <107u4hn...@corp.supernews.com>,

Are you very young, very new in the area or very forgetful?

As I have posted before, this SMT terminology fiasco is precisely
the same as the RISC one on a smaller scale. I am telling you (from
certain knowledge) that the term SMT will be applied within a few
years to technologies that are very different from what falls within
your "definition".

No, I don't know what your definition is, precisely, but I don't need
to for that statement to be correct :-)

SMT, like RISC, started off as an academic political term designed to
make old and/or minor techniques sound like major new advances. It
was then taken over by the marketdroids, and is fast heading towards
meaninglessness.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Apr 16, 2004, 4:07:55 AM4/16/04
to
In article <c5ntaf$3nn$1...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Mike Haertel wrote:
>> So the x86 memory ordering guarantee is simply that you will observe
>> stores from other processors in the same order those processors make
>> the stores. It doesn't guarantee that you will observe them right away,
>> and also doesn't guarantee that all processors in the system will observe
>> the stores at the same time.
>
>Thanks Mike, this is the way I have been (tacitly) programming for ~20
>years, it is good to know that it is still valid. :-)
>
>I.e. the guarantee of ordered observation of stores is sufficient to
>ensure that you can copy an active structure to a new location, make
>whatever updates you desire, and then do an atomic write to redirect a
>global pointer to the current version.
>
>As long as you have a single process making updates (but potentially
>many readers), you don't even need any form of locking on any of these
>write operations. :-)

But essentially only if you are operating in cache bypass mode. What
I think a lot of the complications are for is to try to specify
the caching semantics - and I agree that they aren't a great success.

This is one of the reasons that I would favour a much simpler shared
model, without automatic cache coherence. I don't believe that
there are a significant number of programs that are both correct and
rely on automatic cache coherence - largely because the latter has
no specification in most programming languages!


Regards,
Nick Maclaren.

Alexander Terekhov

unread,
Apr 16, 2004, 5:43:14 AM4/16/04
to

Brannon Batson wrote:
[...]

> Hmmmm, if I was less mature I would make the statement that "CPU HW
> engineers know vastly more about programming than programmers know
> about HW"--but instead I'll say: There's no mysticism about memory
> ordering amongst the people that make those HW decisions.

Lack of "naked" (and also sort of "less heavier than acquire/release"
hoist/sink for loads and stores) read-modify-write atomics aside for
a meoment, would you please tell me why IA-64 doesn't provide xchg
with release semantics (it provides only acquire semantics for xchg).

class swap_based_mutex_for_windows { // noncopyable

  atomic<int>      m_lock_status;  // 0: free, 1/-1: locked/contention
  auto_reset_event m_retry_event;

public:

  // ctor/dtor [w/o lazy event init]

  void lock() throw() {
    if (m_lock_status.swap(1, msync::acq))
      while (m_lock_status.swap(-1, msync::acq))
        m_retry_event.wait();
  }

  bool trylock() throw() {
    return !m_lock_status.swap(1, msync::acq) ?
        true : !m_lock_status.swap(-1, msync::acq);
  }

  bool timedlock(absolute_timeout const & timeout) throw() {
    if (m_lock_status.swap(1, msync::acq)) {
      while (m_lock_status.swap(-1, msync::acq))
        if (!m_retry_event.timedwait(timeout))
          return false;
    }
    return true;
  }

  void unlock() throw() {
    if (m_lock_status.swap(0, msync::rel) < 0)
      m_retry_event.set();
  }

};

msync::things (enums) are used for overloading (resolved at compile
time). I mean things like

template<typename min_msync, typename update_msync>
bool decrement(min_msync mms, update_msync ums) throw() {
  numeric val;
  do {
    val = m_value.load(msync::none);
    assert(min() < val);
    if (min() + 1 == val) {
      m_value.store(min(), mms);
      return false;
    }
  } while (!m_value.attempt_update(val, val - 1, ums));
  return true;
}

template<typename min_msync, typename update_msync>
bool decrement(min_msync mms, update_msync ums, may_not_store_min_t) throw() {
  numeric val;
  do {
    val = m_value.load(mms);
    assert(min() < val);
    if (min() + 1 == val)
      return false;
  } while (!m_value.attempt_update(val, val - 1, ums));
  return true;
}

[...]

bool decrement(msync::acq_t) throw() {
  return decrement(msync::acq, msync::none);
}

bool decrement(msync::rel_t) throw() {
  return decrement(msync::none, msync::rel);
}

bool decrement(msync::none_t) throw() {
  return decrement(msync::none, msync::none);
}

bool decrement(may_not_store_min_t) throw() {
  return decrement(msync::acq, msync::rel, may_not_store_min);
}

bool decrement(msync::acq_t, may_not_store_min_t) throw() {
  return decrement(msync::acq, msync::none, may_not_store_min);
}

bool decrement(msync::rel_t, may_not_store_min_t) throw() {
  return decrement(msync::none, msync::rel, may_not_store_min);
}

bool decrement(msync::none_t, may_not_store_min_t) throw() {
  return decrement(msync::none, msync::none, may_not_store_min);
}

http://www.terekhov.de/pthread_refcount_t/experimental/refcount.cpp

regards,
alexander.

Terje Mathisen

unread,
Apr 16, 2004, 7:38:31 AM4/16/04
to
Nick Maclaren wrote:
> In article <c5ntaf$3nn$1...@osl016lin.hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>>I.e. the guarantee of ordered observation of stores is sufficient to
>>ensure that you can copy an active structure to a new location, make
>>whatever updates you desire, and then do an atomic write to redirect a
>>global pointer to the current version.
>>
>>As long as you have a single process making updates (but potentially
>>many readers), you don't even need any form of locking on any of these
>>write operations. :-)
>
> But essentially only if you are operating in cache bypass mode. What

No, I don't think so.

> I think a lot of the complications are for is to try to specify
> the caching semantics - and I agree that they aren't a great success.

The whole idea behind cpu caches is that they should be totally
transparent to the cpu(s), beyond improving (at least on average) the
performance.

I.e. I believe you're wrong here Nick: x86 SMP systems behave as if all
of the cpus were cache-less with unbuffered in-order writes and no
internal forwarding.

Joe Seigh

unread,
Apr 16, 2004, 7:57:09 AM4/16/04
to

Nick Maclaren wrote:
>
> But essentially only if you are operating in cache bypass mode. What
> I think a lot of the complications are for is to try to specify
> the caching semantics - and I agree that they aren't a great success.
>
> This is one of the reasons that I would favour a much simpler shared
> model, without automatic cache coherence. I don't believe that
> there are a significant number of programs that are both correct and
> rely on automatic cache coherence - largely because the latter has
> no specification in most programming languages!
>

Any program that relies on cache coherence is probably incorrect. The
only exception would be some forward progress or performance thing that's
outside the memory model anyway, like spin lock waiting.

For strongly coherent cache, probably the best strategy is a simple
test loop until you see the lock available, then attempt to use an
interlocked instruction to get the lock. Add membars as needed.
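A minimal sketch of that test-and-test-and-set strategy, using C++11
atomics for the interlocked instruction and acquire/release ordering in
place of explicit membars; the class name is illustrative only:

#include <atomic>

class tatas_spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // Simple test loop: spin read-only until the lock looks free,
            // so contended waiters sit in their own cached copy of the line.
            while (locked.load(std::memory_order_relaxed))
                ;   // a pause/yield hint would normally go here
            // Then attempt the interlocked operation to actually take it.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};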

For non automatic cache, you'd need a mechanism to get the cache
line updated. Membar, cpu serialization, cache directive, or whatever.

If you went the latter route, presumably there would be some efficient
mechanism put in place. Like a Prescott style MONITOR/MWAIT that didn't
suck.

Joe Seigh

Nick Maclaren

unread,
Apr 16, 2004, 8:06:29 AM4/16/04
to

In article <c5ogjn$ffs$1...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|>
|> >>I.e. the guarantee of ordered observation of stores is sufficient to
|> >>ensure that you can copy an active structure to a new location, make
|> >>whatever updates you desire, and then do an atomic write to redirect a
|> >>global pointer to the current version.
|> >>
|> >>As long as you have a single process making updates (but potentially
|> >>many readers), you don't even need any form of locking on any of these
|> >>write operations. :-)
|> >
|> > But essentially only if you are operating in cache bypass mode. What
|>
|> No, I don't think so.

We may be talking about different things. See below.

|> > I think that a lot of the complications are for are to try to specify
|> > the caching semantics - and I agree that they aren't a great success.
|>
|> The whole idea behind cpu caches is that they should be totally
|> transparent to the cpu(s), beyond improving (at least on average) the
|> performance.

A nice idea in theory, not so good in practice. Few designs are
completely transparent, at least to privileged code. Applications
thus rely on the system not exposing the effects to them.

|> I.e. I believe you're wrong here Nick: x86 SMP systems behave as if all
|> of the cpus were cache-less with unbuffered in-order writes and no
|> internal forwarding.

This is where we are at cross-purposes; I was actually talking about
the more complex models that are typically specified in 'RISC'
architectures. My comment was that they aren't a great success
above the simple x86 model.

My point about the cache bypass was that write ordering is NOT
enough to enable reliable structure update; you also require that no
reads move ahead of a write to that location, or some similar
constraint imposed another way. In your first paragraph above,
you don't say that. Consider a circular chain of threads, each
of which does the following:

Updates a global pointer, as you describe
Wakes up the next thread and sleeps (atomically)
Loads the global pointer, and makes changes
Repeat

This is an old and good design, but breaks if the load can get out
of step with the updates.
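What lets it break is a load moving ahead of the earlier store; a
minimal Dekker-style litmus sketch of just that reordering, in C++11
atomics (the flag names are illustrative, and this is not Nick's design
itself):

#include <atomic>

std::atomic<int> flag0{0}, flag1{0};
int seen_by_0, seen_by_1;

void thread0() {
    flag0.store(1, std::memory_order_relaxed);
    // Without a full barrier here (mfence / seq_cst), the load below may
    // effectively be satisfied before the store above becomes visible.
    seen_by_0 = flag1.load(std::memory_order_relaxed);
}

void thread1() {
    flag1.store(1, std::memory_order_relaxed);
    seen_by_1 = flag0.load(std::memory_order_relaxed);
}

// The outcome seen_by_0 == 0 && seen_by_1 == 0 is allowed with relaxed
// (or plain x86) ordering, and is forbidden only if both threads use
// sequentially consistent operations or an explicit full fence.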


Regards,
Nick Maclaren.

Eric

unread,
Apr 16, 2004, 10:52:07 AM4/16/04
to

I am not so sure. This is exactly what I was looking for in the
Intel docs (IA-32 Vol. 3 System Programming Guide, 24547209.pdf).
It also explicitly states in section 7.2.2, Memory Ordering for
Pentium 4, Xeon and P6 Family, that, in lieu of serializing instructions,
reads can be carried out speculatively and in any order. It does
NOT mention a replay trap if a variable that was pre-read changes.

That says to me that the following sequence is possible
(as per the spec, which may differ from what actually occurs)

Program Order:
      P1              P2
      ----------      ---------
      store data      load ptr
      store ptr       load data

Execution Order:
      P1              P2
      ----------      ---------
                      load data   (read bypasses, picks up old data value)
      store data
                      load ptr    (read picks up new ptr)
      store ptr

Note that the stores do become visible to P2 in their program order,
as per the spec, but because P2's loads interleave with them, the net
effect appears out of order.

The net result is that you could wind up with a new pointer to
the old record data (most likely the uninitialized field values).

An LFENCE between the load ptr and load data serializes and
prevents the read bypass.
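A sketch of where that reader-side LFENCE would sit, using the SSE2
intrinsic _mm_lfence; the Record type and names are illustrative, and
whether the fence is truly required for this pattern is exactly the
question being raised:

#include <emmintrin.h>

struct Record { int field; };

Record* volatile g_ptr;   // P1 publishes: store data, then store ptr

int consume() {           // P2: load ptr, then load data
    Record* p = g_ptr;    // load ptr
    if (!p) return -1;
    _mm_lfence();         // proposed fence: keep the data load from being
                          // satisfied before the ptr load
    return p->field;      // load data
}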

Does this sound correct? (Did I just break a lot of people's code?)

Eric

Eric

unread,
Apr 16, 2004, 11:08:32 AM4/16/04
to
Eric wrote:
>

Oops I meant:

Execution Order:
      P1              P2
      ----------      ---------
                      load data   (read bypasses, picks up old data value)
      store data
      store ptr
                      load ptr    (read picks up new ptr)


Eric

Alexander Terekhov

unread,
Apr 16, 2004, 12:12:13 PM4/16/04
to

Eric wrote:
[...]

> Does this sound correct? (Did I just break a lot of peoples code?)

http://www.well.com/~aleks/CompOsPlan9/0005.html

regards,
alexander.
