Another place where there's a lot of money


Robert Myers

Apr 9, 2004, 11:40:24 PM

And, I might add, probably a perpetually insatiable appetite for
throughput.

http://www.nytimes.com/2004/04/10/technology/10GAME.html?pagewanted=1&hp

<quote>

Computer games represent one of the fastest-growing, most profitable
entertainment businesses. Making movies, by contrast, is getting
tougher and more expensive, now costing, with marketing fees, an
average of $103 million a film. That is one reason, among others, that
those with power in Hollywood are avidly seeking to get into the game
business while also reshaping standard movie contracts so they can
grab a personal share of game rights.

<snip>

Ridley Scott, best known for science fiction fantasies like "Blade
Runner" and "Alien" as well as the historical epic "Gladiator," has
been meeting with video game company executives, too, arguing that
games offer greater creative opportunities these days because they are
less expensive to make and not constrained by the roughly two-hour
time frame of a conventional movie.

"The idea that a world, the characters that inhabit it, and the
stories those characters share can evolve with the audience's
participation and, perhaps, exist in a perpetual universe is indeed
very exciting to me," said Mr. Scott, who is seeking a video game
maker to form a partnership with him and his brother Tony.

</quote>

The article moves on to make some cautionary comments, likening the
current world of game authors to that of comic book writers.

The talent will go where the money is. If computer games attract the
best creative talent (apologies to those who believe that comic books
are high art), the money will go there, too. We ain't seen nuttin'
yet.

RM

john jakson

Apr 10, 2004, 11:03:47 AM
Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...

Now this is where MTA could really take off. I wouldn't be surprised
if it's already in there somewhere.

regards

johnjakson_usa_com

Robert Myers

Apr 10, 2004, 5:27:30 PM
On 10 Apr 2004 08:03:47 -0700, johnj...@yahoo.com (john jakson)
wrote:

>Robert Myers <rmy...@rustuck.com> wrote in message news:<klqe705f63oher239...@4ax.com>...

<snip>


>>
>> The talent will go where the money is. If computer games attract the
>> best creative talent (apologies to those who believe that comic books
>> are high art), the money will go there, too. We ain't seen nuttin'
>> yet.
>>

>


>Now this is where MTA could really take off. I wouldn't be surprised
>if it's already in there somewhere.
>

I think Ian McClatchie has explained to us that indeed it is:

>Ian>I think some of the research you are talking about is happening
>for the gaming market, right now.
>
>Robert> Hardware and software support for lightweight threads.
>
>Ian>Graphics chips spawn multiple "threads" PER CYCLE.
>
>Robert> Hardware and software support for streaming computation.
>
>Ian>These threads coordinate access to memories.
>
>Robert> Tools, strategies, and infrastructure to make specialized hardware
>Robert> like ASIC's more easily available for use by science.
>
>Ian>Not here quite yet, since the gaming guys do single precision and
>you folks want DP mostly. But check out www.gpgpu.org. I honestly
>think that there is a real chance the graphics folks are going to
>sneak up on and overwhelm the CPU guys for physics codes, maybe
>unintentionally.

That's why I can't take IBM's entries into the supercomputer market
altogether seriously. Their best technologies are going into games.
The golden rule, you know.

Servers? Supercomputers? x86? B-o-o-o-ring.

RM

Rupert Pigott

Apr 10, 2004, 7:04:33 PM

The other day I was grinding my teeth at the minimal level of
commitment to open source from the 3D GFX card makers, because
the bloody drivers locked up when I dragged a window AND I
wanted to run them on something other than x86 Linux.

Then it occurred to me: you could probably get some mileage
out of running WireGL/Chromium on a small BG/L-style system on a
PCI-Express card. So there's one cool app that is games-related
and could harness BG/L-type nodes. Anything to stick two fingers
up at the 3D hardware vendors would be good right now. Unhappy
customer. :/

Cheers,
Rupert

Robert Myers

Apr 10, 2004, 8:09:35 PM

One way to read that, of course, is that they think they've really got
something worth hiding.

I just (like, last night, or, really, this morning) got my Promise
SATA RAID array running on a Promise-supplied driver compiled
from source (against an Enterprise Server SMP kernel, no less), and
(as you may or may not know), the history of Promise controllers and
Linux has not been a smooth one. It is, on the other hand, x86 Linux,
and I haven't yet booted from the RAID array.

The fact that Promise has completely caved and that there will be a
GPL driver native to the 2.6 Kernel fills me with hope.

I don't generally suffer from a shortage of self-confidence, but
graphics drivers are where I draw the line. I don't know what it is
_other_ than x86 Linux you are using, but, by the time you've gotten
to the desktop in a Linux system running X, you've got so many layers
and configuration files that serially redefine what the previous layer
defined that you're lucky if you've still got your sanity in hand.

I figure that using a GPU to do physics should, by comparison, be a
walk in the park.

>Then it occurred to me: you could probably get some mileage
>out of running WireGL/Chromium on a small BG/L-style system on a
>PCI-Express card. So there's one cool app that is games-related
>and could harness BG/L-type nodes. Anything to stick two fingers
>up at the 3D hardware vendors would be good right now. Unhappy
>customer. :/
>

It will never happen, of course, but as a way of throwing some
real-world, er, light ;-) on the whole issue, I'll bet you'd get top
billing on Slashdot for such a project. That way, we could get the
discussion down to some meaningful benchmarks, like frame rate on
Quake.

One day of fame and a swamped server in return for a hefty slice of
your sanity? You just didn't seem quite _that_ far gone. ;-).

On the _other_ hand, a small BG/L system on a PCI-Express card sounds
like it should be a day and a half's work for Del's shop, and I'll bet
IBM would sell a few.

RM

john jakson

Apr 11, 2004, 12:54:34 AM
Robert Myers <rmy...@rustuck.com> wrote in message news:<51pg70t6bevebrvb2...@4ax.com>...


I have been speculating that MTA will replace RISC just as it replaced
CISC; now that's worth some flame.

I had this horrible idea that might not be so horrible.

Since x86 went from plain CISC to ride along with RISC (and beat all
of them in the process), I wonder if it could jump again onto MTA. I
see no reason why the actual instructions executed by the barrel
engine in the MTA could not be x86 codes, or at least the RISC subset
that most compilers emit for.

In my design I already allow for some variable-length codes, 16b x 1-4,
so that's not a problem. My branches cost near 0 cycles (1/8 cycle for
fetch-ahead) taken or not. The execution unit can bind 1 or 2 branches
with a non-branch op, so there is a speed boost there of about 10-30%;
no branch prediction of any sort is needed here.

An x86 MTA would probably need 1 or 2 extra pipe stages to resolve the
more complex 8b instruction encodings. The codes I would leave out can
be microcoded in firmware, and the penalty for those is not so bad if
a little help is given in HW trapping. Probably not so practical in an
FPGA, but worth some thinking. The FPU is still a problem, though. If an
FPGA can hit 250 MHz after P&R, then a full-effort custom VLSI should be
able to go 3-10x faster, limited only by the cycle rates of dual-port
RAM or 8b adds and similar.

It would combine all the benefits of MTA with all the benefits of
having all that code, OSes, compilers, etc. The details involved in
executing simplified x86 ops are really the same as clean-sheet
ops for a new MTA. The question is which is harder: writing a compiler
for a blank ISA or just using x86 codes.

regards

johnjakson_usa_com

Robert Myers

Apr 11, 2004, 1:19:43 AM
On 10 Apr 2004 21:54:34 -0700, johnj...@yahoo.com (john jakson)
wrote:

<snip>


>
>I have been speculating that MTA will replace RISC just as it replaced
>CISC; now that's worth some flame.
>
>I had this horrible idea that might not be so horrible.
>
>Since x86 went from plain CISC to ride along with RISC (and beat all
>of them in the process), I wonder if it could jump again onto MTA. I
>see no reason why the actual instructions executed by the barrel
>engine in the MTA could not be x86 codes, or at least the RISC subset
>that most compilers emit for.
>

I've sort of taken it for granted that that's where x86 processors are
headed, but then, I'm not a computer architect. If you can't run the
pipeline faster because the energy costs of doing so are too high,
then you have to go to more pipelines. That's the paradigm shift or
whatever that Gelsinger was nattering on about.

There are only so many ways you can deploy multiple pipelines on a
single die. The only advantage that I can see to separate cores is
that the entire die doesn't have to be reachable in a single clock.
As a trade against that, you lose the possibility of close cooperation
among threads. I think we'll probably wind up with both (SMT and
CMP).

My reference to x86 as boring was really meant to refer to further
tweaks and die shrinks on the current architectures, not particularly
to the instruction set.

RM

David C. DiNucci

Apr 11, 2004, 3:27:49 AM
Robert Myers wrote:
> There are only so many ways you can deploy multiple pipelines on a
> single die. The only advantage that I can see to separate cores is
> that the entire die doesn't have to be reachable in a single clock.
> As a trade against that, you lose the possibility of close cooperation
> among threads. ...

I am apparently *still* (after literally all these years) missing
something very basic, so if someone has the time, maybe they can finally
help me get straight on this.

First, assuming that "separate cores" share no resources/stages, the
alternative is to share some resources/stages between threads, in which
case, to achieve the same maximum performance with multiple threads, the
clock needs to run fast enough to ensure that any such shared
resources/stages are not a bottleneck. That is one motivation for high
clock which is not present for multiple cores (and it seems to me to be
a major one).

Second, what kind of "close cooperation" is currently possible between
threads which share a core? I recall that the early Tera MTA only
allowed threads to interact through memory, so unless that has changed
significantly in hyperthreaded architectures (and I don't recall seeing
that it had), I don't know what "possibility" you are losing with
multiple cores.

So if my assumptions above are correct, and if multiple cores can be as
effective at sharing caches and integrating with other aspects of the
memory subsystem as their hyperthreaded counterparts, I am still led to
the conclusion that multiple cores get their bonus points from lower
clock speed (and therefore less heat) for similar (or better)
performance when there are enough threads to keep those cores busy, and
hyperthreaded cores get their bonus points from (1) possibly requiring
less floorplan/cost to manufacture support for n threads, and (2) the
ability to dynamically and productively allocate otherwise idle
resources/stages to active threads. I assume that these play off
against one another--i.e. that the ability to run the multiple cores
slower also allows them to be built smaller/simpler (with the Forth chip
cited earlier as an extreme in that direction), but that if multiple
cores do end up significantly larger than the hyperthreaded
counterparts, that would affect their ability to share caches as
effectively.

My sense is that the motivation for hyperthreading relates almost
entirely to #2, because having lots of threads is the exception rather
than the rule, because of the widely accepted belief that PARALLEL
PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
pipelines.

> ...I think we'll probably wind up with both (SMT and CMP).

Even with my silly assumptions, I can buy that: Enough CMP to handle
the minimum number of threads expected to be active, and SMT (i.e.
dynamically-allocatable pipe stages) to provide headroom. What I don't
buy (if it was implied) is that the CMP and SMT will be present in the
same processors, since there are good reasons for them to run at
different clock speeds. In fact, I might imagine SMTs (at high temp)
sprinkled among the CMPs (at lower temp) to even out the cooling.

-- Dave

Robert Myers

Apr 11, 2004, 9:24:13 AM
On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
<da...@elepar.com> wrote:

>Robert Myers wrote:
>> There are only so many ways you can deploy multiple pipelines on a
>> single die. The only advantage that I can see to separate cores is
>> that the entire die doesn't have to be reachable in a single clock.
>> As a trade against that, you lose the possibility of close cooperation
>> among threads. ...
>
>I am apparently *still* (after literally all these years) missing
>something very basic, so if someone has the time, maybe they can finally
>help me get straight on this.
>

I had forgotten you were in the Nick Maclaren camp on this one (might
as well put it that way, because that's the way I think of it). I had
(believe it or not) thought of starting the paragraph to which you are
replying more or less the way you started yours, which is that I
*still* don't understand the case for separate cores (except the one I
mentioned). Were die layout not a problem, you'd put everything in
the same place, and let who belonged to what sort itself out as
needed. Die layout, of course, is a consideration.

>First, assuming that "separate cores" share no resources/stages, the
>alternative is to share some resources/stages between threads, in which
>case, to achieve the same maximum performance with multiple threads, the
>clock needs to run fast enough to ensure that any such shared
>resources/stages are not a bottleneck. That is one motivation for high
>clock which is not present for multiple cores (and it seems to me to be
>a major one).
>

If you try to run helper threads on separate cores, you lose most of
the advantages because of the overhead of communicating through L2.
If you don't share L2, there is no point at all in helper threads.

I don't get your argument for a fast clock at all. You don't balance
resources by speeding up the clock. You balance resources by
balancing resources. Your best chance of getting an ideal match is
when all resources are in one big pool and can be assigned as needed
to the workload that requires them. Your worst chance of getting an
ideal match is to have the resource pool arbitrarily divided into
pieces. Simple queuing theory, as someone put it.
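
To put numbers on that "simple queuing theory" remark -- this is just
the standard textbook comparison, not anything new to the thread: at
per-server utilization $\rho$, two separate M/M/1 queues hold, on
average,

$$ L_{2\times M/M/1} = \frac{2\rho}{1-\rho} $$

jobs, while the pooled M/M/2 system holds

$$ L_{M/M/2} = \frac{2\rho}{1-\rho^{2}} = \frac{L_{2\times M/M/1}}{1+\rho}, $$

so pooling wins by a factor approaching two as the load approaches
saturation.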

>Second, what kind of "close cooperation" is currently possible between
>threads which share a core? I recall that the early Tera MTA only
>allowed threads to interact through memory, so unless that has changed
>significantly in hyperthreaded architectures (and I don't recall seeing
>that it had), I don't know what "possibility" you are losing with
>multiple cores.
>

Tera MTA is not a good way to think about the strategies that are
being discussed. The idea of Tera MTA was to have a slew of
_separate_ threads that advanced only every n clocks so that every
single instruction could have a latency of n without any other magic
at all.

There is a paper I have cited so many times that I get tired of
looking it up that describes one of several strategies for using
helper threads to, in effect, expand the run-time scheduling window of
Itanium without introducing OoO scheduling. The one particular paper
examines using helper threads with SMT and CMP, and CMP loses big
time. Helper threads, by definition, are trying to advance a primary
thread and are not conceptually separate as would be the threads in
Tera MTA.

The last time I mentioned the paper in question, Nick's response was,
more or less, "Oh that. Well, we've known for decades that you could
get that kind of speedup for a helper thread--only in privileged
mode." [Insert here a tart response to the "well we've known for
decades" line that I'm composing and going to attach to a hot key].

"Okay," I responded, drawing my breath in slowly, "what is it that we
need to do so that people can do these kinds of things in user space,
since I don't think the dozens of researchers publishing papers on
these strategies have been planning on running everything in
privileged mode." At which point Nick responded with a list of
requirements that look achievable to someone who is not a professional
computer architect.

>So if my assumptions above are correct, and if multiple cores can be as
>effective at sharing caches and integrating with other aspects of the
>memory subsystem as their hyperthreaded counterparts, I am still led to
>the conclusion that multiple cores get their bonus points from lower
>clock speed (and therefore less heat) for similar (or better)
>performance when there are enough threads to keep those cores busy,

As I've stated, if you could put everything in one place, there would
be no advantage at all to multiple cores (other than amortizing NRE by
printing cores on dies like postage stamps). Since you can't put
everything in one place, there is a balance between acceptable heat
distribution, die size, and being able to reach everything in a single
core in a single clock (although I suspect that even that constraint
is going to go by the board at some point).

>and
>hyperthreaded cores get their bonus points from (1) possibly requiring
>less floorplan/cost to manufacture support for n threads, and (2) the
>ability to dynamically and productively allocate otherwise idle
>resources/stages to active threads. I assume that these play off
>against one another--i.e. that the ability to run the multiple cores
>slower also allows them to be built smaller/simpler (with the Forth chip
>cited earlier as an extreme in that direction), but that if multiple
>cores do end up significantly larger than the hyperthreaded
>counterparts, that would affect their ability to share caches as
>effectively.
>

The primary advantage of closely-coupled threads on a single core is
that data sharing is not done through L2 cache or memory.

>My sense is that the motivation for hyperthreading relates almost
>entirely to #2, because having lots of threads is the exception rather
>than the rule, because of the widely accepted belief that PARALLEL
>PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
>pipelines.
>

I think you are doing yourself and Software Cabling a disservice by
adopting this posture. While in theory you could use the machinery of
Software Cabling to address any level of parallelism, its natural
target is coarse-grained parallelism. The programmer can think
locally von Neumann, and let SC take care of the parallel programming
stuff globally. Helper threads are a way of letting the processor or
processor and supporting software do the fine-grained parallelism.

RM

john jakson

Apr 11, 2004, 11:27:31 AM
Robert Myers wrote in message

snipping

> I've sort of taken it for granted that that's where x86 processors are
> headed, but then, I'm not a computer architect. If you can't run the
> pipeline faster because the energy costs of doing so are too high,
> then you have to go to more pipelines. That's the paradigm shift or
> whatever that Gelsinger was nattering on about.
>

MTA doesn't have long pipelines; it's best to think of multiple wheels,
each with N faces. N might be 4, 8, or 16, but any small int may suffice.
Each of the N processes is different and is identified by a Pid. N is
chosen for the necessary pipeline depth needed to sustain reg-to-reg int
operations at highest speed. This means that if a P returns N cycles
later, it can pick up previous results, saving some cache b/w. The
multiple wheels may carry the same Pids or different Pid sets.

The wheels have different functions: those on the instruction-fetch
side of the queue maintain the PC and push N ops terminated by
branches into the N queues faster than needed. Those on the exec side
are for int, branch/cc ops, and possibly FPU, DSP, or whatever. The
retired bra sends back offsets to the fetch unit. A process can't be
in fetch and exec at the same time, so the 2 sides are coupled only
through the exchange of bra targets, i.e., what to do next.

Is the pipeline N deep, or is it N*noOfWheels? What about the queue: an
op may be fetched ahead and wait 60 or so cycles before it goes to exec.
I'd call it an N-stage pipeline with maybe 60 latency stages.
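
For anyone who hasn't met a barrel processor, here is a minimal C sketch
of the round-robin "wheel" idea above. Everything in it (N=8, the struct,
the names) is illustrative, not taken from the actual design being
described.

#include <stdio.h>

#define N 8                        /* faces per wheel (illustrative) */

/* One hardware thread context -- a "face" of the wheel. */
struct face {
    int pid;                       /* process id owning this face    */
    unsigned pc;                   /* its program counter            */
};

int main(void)
{
    struct face wheel[N];
    for (int i = 0; i < N; i++) {
        wheel[i].pid = i;
        wheel[i].pc  = 0;
    }

    /* Issue one op per cycle, rotating through the faces. Any one
     * context is touched only every N cycles, so an op may take up
     * to N cycles to complete without ever stalling the wheel --
     * the "if a P returns N cycles later" property noted above. */
    for (unsigned cycle = 0; cycle < 3 * N; cycle++) {
        struct face *f = &wheel[cycle % N];
        printf("cycle %2u: issue pid %d, pc %u\n", cycle, f->pid, f->pc);
        f->pc++;                   /* stand-in for executing one op  */
    }
    return 0;
}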

regards

johnjakson_usa_com

Rupert Pigott

Apr 11, 2004, 12:16:55 PM
Robert Myers wrote:

> I had forgotten you were in the Nick Maclaren camp on this one (might
> as well put it that way, because that's the way I think of it). I had
> (believe it or not) thought of starting the paragraph to which you are
> replying more or less the way you started yours, which is that I
> *still* don't understand the case for separate cores (except the one I
> mentioned). Were die layout not a problem, you'd put everything in
> the same place, and let who belonged to what sort itself out as
> needed. Die layout, of course, is a consideration.

Hardware Schmardware. ;)

I've come to think that the real challenge is choosing a parallel
coding model that has as many lives as the von Neumann cat. The von
Neumann pussy needs to run out of lives first, and it appears to
have happened already. The current generation of killer micros are
basically a bunch of parallel pussies, I think the software will
follow, albeit slowly and painfully as it always has done. I have
some hope that I'll be quite happy with the state of affairs in a
couple of years time.

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 1:12:27 PM
On Sun, 11 Apr 2004 17:16:55 +0100, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

>Robert Myers wrote:
>
>> I had forgotten you were in the Nick Maclaren camp on this one (might
>> as well put it that way, because that's the way I think of it). I had
>> (believe it or not) thought of starting the paragraph to which you are
>> replying more or less the way you started yours, which is that I
>> *still* don't understand the case for separate cores (except the one I
>> mentioned). Were die layout not a problem, you'd put everything in
>> the same place, and let who belonged to what sort itself out as
>> needed. Die layout, of course, is a consideration.
>
>Hardware Schmardware. ;)
>

Except that, especially with OoO, hardware has done much of the heavy
lifting. As annoying as it may be to the real hardware architects to
have a bunch of software types going on about how hardware should be
designed, it's a subject that we software types simply cannot ignore.

I take a lot of heat, some of it clearly good-natured, about my
fixation on the Cray-1, but that was a machine that you could not
program effectively without understanding how it worked. Not only
that, the programming model was simple enough that I could get my
feeble brain around it quickly enough and get on with the physics.

Either life was a lot simpler then, or the world awaits another Seymour
to cut through the clutter and find the right choice of tricks that's
worth the bother and that can be utilized in actual practice. The
right answer is probably not that it's either/or, but both.

>I've come to think that the real challenge is choosing a parallel
>coding model that has as many lives as the von Neumann cat. The von
>Neumann pussy needs to run out of lives first, and it appears to
>have happened already. The current generation of killer micros are
>basically a bunch of parallel pussies, I think the software will
>follow, albeit slowly and painfully as it always has done. I have
>some hope that I'll be quite happy with the state of affairs in a
>couple of years time.
>

Once you could see the attack of the killer micros coming, _most_ of
what was going to happen could be foreseen with a five-and-dime crystal
ball, _including_ the fact that many processes would be running in
parallel. Two decades of fiddling and fumbling have passed, and you
think a couple of years is going to bring software bliss? Have you
been celebrating a religious holiday in some highly non-traditional
fashion? ;-).

RM

Rupert Pigott

Apr 11, 2004, 1:36:30 PM
Robert Myers wrote:

[SNIP]

> Once you could see the attack of the killer micros coming, _most_ of
> what was going to happen could be foreseen with a five-and-dime crystal
> ball, _including_ the fact that many processes would be running in
> parallel. Two decades of fiddling and fumbling have passed, and you
> think a couple of years is going to bring software bliss? Have you
> been celebrating a religious holiday in some highly non-traditional
> fashion? ;-).

The key difference is: parallelism has been pervasive in hardware
that everyday programmers can get their hands on for a good decade
now. You can see the attitude changing. Threads are hip and cool;
if you look at the job ads on this side of the pond you'll see a
lot of them making explicit mention of "Threading Skills" in the
job specs. What's more: the hardware is actually getting even more
parallel too, and in the not-too-distant future SMP (or at least
SMT) systems will be common as muck.

So while I think shared memory threads are the spawn of Satan, I
don't mind the side effect that they give me lots of parallel HW
to play with. This HW will probably run just as well with my CSP
monkey business as it will with shared memory threads too.

If you really want the Cray simplicity I think you are barking up
the wrong tree with SMT. The threads of execution are intricately
coupled by (userspace-invisible) implementation-dependent resource
contention issues. I suspect that this helper-thread concept will
pan out the same too; that doesn't stop someone from proving me
wrong though.

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 2:33:55 PM
On Sun, 11 Apr 2004 18:36:30 +0100, Rupert Pigott
<r...@dark-try-removing-this-boong.demon.co.uk> wrote:

<snip>


>
>If you really want the Cray simplicity I think you are barking up
>the wrong tree with SMT. The threads of execution are intricately
>coupled by (userspace-invisible) implementation-dependent resource
>contention issues. I suspect that this helper-thread concept will
>pan out the same too; that doesn't stop someone from proving me
>wrong though.
>

No, the days of sweetly crystalline Cray programming are gone forever.

Only people like Terje and Linus (what is it with these guys from
Scandinavia, anyway?) worry about the real effects of OoO on a
day-to-day basis. The rest of us simply accept that the processor
somehow gets away with running twenty times as fast as the memory bus
and get on with business.

When I mentioned my five-and-dime crystal ball, the one thing I don't
think most people would have been able to foresee was the ability to
keep a couple of hundred instructions in flight by trickery that's
invisible to all but the most detail-oriented of programmers.

My crystal ball says that helper threads are going to work the same
way. The helper threads that apparently so horrify you are just
another turn of the screw. If you've got a few hundred instructions
in flight with predication, speculation, hoisting, and fixup already,
what's a few hundred more here or there?

A bitch to debug? I would imagine so, but I just _know_ you're not
going to tell me that CSP threads are easy.

RM

Rupert Pigott

Apr 11, 2004, 5:44:53 PM
Robert Myers wrote:

> A bitch to debug? I would imagine so, but I just _know_ you're not
> going to tell me that CSP threads are easy.

Fundamentally you still have to tackle the same synchronisation
issues that the problem throws up, but you don't add a whole bunch
of *hidden* coupling (e.g., memory contention).

Cheers,
Rupert

Robert Myers

Apr 11, 2004, 7:52:47 PM

I _think_ you're worried about a non-problem, but I'm willing to be
educated. Helper threads don't introduce user-visible separate
threads or user-visible memory sharing, and the processor maintains
the illusion of in-order execution.

Things can go horrifyingly wrong for hardware architects and compiler
designers, but it's up to them to see to it that life is no harder for
end-users than life would be without helper threads.

Once you've lost deterministic execution--and you have with OoO--you
can't rely on single-step debugger execution to expose what's actually
going on, but that's not a problem that's newly-introduced by helper
threads.

RM

Hank Oredson

Apr 11, 2004, 8:56:34 PM
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:cp2j701vcij39q676...@4ax.com...

And this all happens on each of the 64 processors on the motherboard.

But maybe the 32-processor chip, two per box, will not happen quite
as soon as I think it will ...

> A bitch to debug? I would imagine so, but I just _know_ you're not
> going to tell me that CSP threads are easy.
>
> RM


--

... Hank

http://horedson.home.att.net
http://w0rli.home.att.net


del cecchi

Apr 11, 2004, 11:17:43 PM

"David C. DiNucci" <da...@elepar.com> wrote in message
news:4078F375...@elepar.com...

> Robert Myers wrote:
> > ...I think we'll probably wind up with both (SMT and CMP).
>
> Even with my silly assumptions, I can buy that: Enough CMP to handle
> the minimum number of threads expected to be active, and SMT (i.e.
> dynamically-allocatable pipe stages) to provide headroom. What I don't
> buy (if it was implied) is that the CMP and SMT will be present in the
> same processors, since there are good reasons for them to run at
> different clock speeds. In fact, I might imagine SMTs (at high temp)
> sprinkled among the CMPs (at lower temp) to even out the cooling.
>
> -- Dave

You guys got a heck of a crystal ball. It can see all the way to 2003.
:-)
http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf


Robert Myers

Apr 12, 2004, 1:38:00 AM

A fact I took note of as recently as April 2, under the nominal
subject heading Re: [OT] Microsoft aggressive search plans revealed:

RM>Power 4 isn't multi-threaded, but Power 5 is, and I'll start to get
RM>really excited when this new openness makes a multi-threaded core
RM>available to play with as IP. ;-).
RM>
RM>(I know, Del, there's just no pleasing some people).

Slide Number 12 and the slides before it make it clear that the Power
5 implementation of SMT addresses the issue of what to do when having
multiple threads active is a disadvantage (has Nick said anything
about that?). Slide 11 shows an example with a max payoff of 25%,
pretty well in line with what has most often been reported with Intel
Hyperthreading.

RM

Maynard Handley

Apr 12, 2004, 5:14:45 PM
In article <4078F375...@elepar.com>,

"David C. DiNucci" <da...@elepar.com> wrote:

> My sense is that the motivation for hyperthreading relates almost
> entirely to #2, because having lots of threads is the exception rather
> than the rule, because of the widely accepted belief that PARALLEL
> PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
> pipelines.
>
> > ...I think we'll probably wind up with both (SMT and CMP).
>
> Even with my silly assumptions, I can buy that: Enough CMP to handle
> the minimum number of threads expected to be active, and SMT (i.e.
> dynamically-allocatable pipe stages) to provide headroom. What I don't
> buy (if it was implied) is that the CMP and SMT will be present in the
> same processors, since there are good reasons for them to run at
> different clock speeds. In fact, I might imagine SMTs (at high temp)
> sprinkled among the CMPs (at lower temp) to even out the cooling.

It is not useful to say that PARALLEL PROGRAMMING IS HARD when there are
very different things being discussed.
At the per-chip level, the issue is whether it is hard to have a CPU
running four threads or so at a time. At the Robert Myers level, the
issue is whether it is hard to have a computer running 10,000 threads at
a time. I can well believe that the second is a very hard problem.
The first, however, is right now an artificially hard problem. It is
artificially hard because while there are frequently plenty of fragments
of code that can be parallelized, they are maybe 1000 instructions long,
so the OS overhead in doing so makes the parallelization impractical.
The overhead comes from
* CPU-level overhead that assumes modifications made in this thread need
to propagate out (to another CPU), and the CPU has to wait till that has
happened; and
* OS overhead that assumes that it's an unacceptable waste of resources
to have a (virtual) processor sit idle, and so enforces switches into
and out of the kernel, moving threads onto and off run lists and so on,
as synchronization resources are acquired and released.

Now if we update the programming model to describe that this collection
of threads is tightly coupled and should run as a single unit on a
single physical CPU, we can get rid of the CPU-level overhead, because
we'd know that the mods made by thread 0A are in the cache of CPU 0 and
will be seen by thread 0B without any extra hard work on the part of the
CPU. The problem, of course, now is that one has to annotate the code
correctly to specify that THIS sync operation only involves syncing with
THAT thread, so is local, while some other sync operation is global and
needs to be propagated to all CPUs. One can see a programming model that
works well with say 2 or 4 threads but falls apart after that.
Likewise we need to be able to tell the OS that these two threads are
tightly coupled and that when one waits on the other, the wait should
occur through some hardware wait mechanism, with the waiting thread
going into an HW idle state of some sort, rather than context switching
to some alternative waiting task.
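
To make that concrete, here is a C sketch of what such annotations might
look like. Every name in it (couple_threads, local_mutex_t, and so on)
is hypothetical, invented for illustration only; the __sync builtins are
GCC's (which postdate this thread), standing in for whatever LL/SC loop
a compiler would emit.

#include <pthread.h>

/* HYPOTHETICAL: declare two threads tightly coupled, i.e. ask the OS
 * to schedule them as a unit on one physical CPU, sharing its cache. */
int couple_threads(pthread_t a, pthread_t b);

/* HYPOTHETICAL: a mutex local to one coupled group. Acquire/release
 * can skip the cross-CPU "publish" work, and a waiter spins instead
 * of taking an OS context switch. */
typedef struct { volatile int held; } local_mutex_t;

static void local_lock(local_mutex_t *m)
{
    /* Spin in user space: no syscall, no run-list traffic. On real
     * HW this loop would sit in a low-power wait state. */
    while (__sync_lock_test_and_set(&m->held, 1))
        ;
}

static void local_unlock(local_mutex_t *m)
{
    /* No global barrier needed if the only waiters share our cache. */
    __sync_lock_release(&m->held);
}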

Both of these are feasible, in the sense that it's not too much of a
stretch to modify the HW and SW to get them to happen.
The PROBLEM, really, is that they aren't very fwd or bwd compatible.
They're not fwd compatible because the mechanisms they use to try to be
efficient (to make threading small fragments of code worthwhile) fall
apart when you have too many threads to keep track of and too many
interactions --- it's just too hard to ensure that these memory updates
only affect the threads local to this CPU and don't need to be
propagated to other CPUs. And they're not backward compatible because
code written assuming that it can run as two physical threads, with no
overhead cost for threading code fragments a thousand cycles long, will
run like a slug on a single-threaded CPU or a standard DP machine, with
OS intervention every 1000 cycles to switch to the other thread.

So that's the problem as I see it.
Maybe someone smarter than me can construct some primitives that do
allow scaling (in at least either the fwd or bwd direction) that will
allow one to thread these tiny fragments of code. But in the absence of
that, life sucks. The issue is not that the parallelism doesn't exist
but that one cannot usefully get to it through today's HW and OS
abstractions.

Maynard

Stephen Sprunk

Apr 12, 2004, 5:30:25 PM
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:en9k70pm2khm2co4r...@4ax.com...

> On Sun, 11 Apr 2004 22:17:43 -0500, "del cecchi"
> <dcecchi...@att.net> wrote:
> >You guys got a heck of a crystal ball. It can see all the way to 2003.
> >:-)
> >http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf
>
> Slide 11 shows an example with a max payoff of 25%,
> pretty well in line with what has most often been reported with Intel
> Hyperthreading.

And with slide 5 saying that SMT adds 24% to the size of each core (as
opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
provides no net gain in performance per mm^2.
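
(Spelling the arithmetic out: a 25% throughput gain on a core that grew
by 24% is 1.25/1.24 ~ 1.008, i.e., a wash in performance per unit area.)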

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin

Tony Nelson

Apr 12, 2004, 6:05:43 PM
In article <07c94d238480b9ff...@news.teranews.com>,
"Stephen Sprunk" <ste...@sprunk.org> wrote:
...

> And with slide 5 saying that SMT adds 24% to the size of each core (as
> opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
> provides no net gain in performance per mm^2.

Nowadays I bet designers are happy when they get performance increases
proportional to area increases.
____________________________________________________________________
TonyN.: tony...@shore.net

Robert Myers

Apr 12, 2004, 6:33:51 PM
On Mon, 12 Apr 2004 21:30:25 GMT, "Stephen Sprunk"
<ste...@sprunk.org> wrote:

>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:en9k70pm2khm2co4r...@4ax.com...
>> On Sun, 11 Apr 2004 22:17:43 -0500, "del cecchi"
>> <dcecchi...@att.net> wrote:
>> >You guys got a heck of a crystal ball. It can see all the way to 2003.
>> >:-)
>> >http://www.hotchips.org/archive/hc15/pdf/11.ibm.pdf
>>
>> Slide 11 shows an example with a max payoff of 25%,
>> pretty well in line with what has most often been reported with Intel
>> Hyperthreading.
>
>And with slide 5 saying that SMT adds 24% to the size of each core (as
>opposed to the 5% commonly claimed on comp.arch), that means Power5's SMT
>provides no net gain in performance per mm^2.
>

A study c. 1996 by Patterson et al. showed even OoO processors
stalled 60% of the time on OLTP workloads, making OLTP workloads a
natural target for SMT. Whether IBM got it right or not this time
seems almost beside the point. It amazes me it took them this long to
get into the game.

RM

Robert Myers

Apr 12, 2004, 7:18:54 PM
On Mon, 12 Apr 2004 21:14:45 GMT, Maynard Handley
<nam...@redheron.com> wrote:

<snip>


>
>So that's the problem as I see it.
>Maybe someone smarter than me can construct some primitives that do
>allow scaling (in at least either the fwd or bwd direction) that will
>allow one to thread these tiny fragments of code. But in the absence of
>that, life sucks. The issue is not that the parallelism doesn't exist
>but that one cannot usefully get to it through today's HW and OS
>abstractions.
>

So, you've got some dumb-ass von Neumann module that you want to run a
few thousand copies of. You modify the API, however it looks, to fit,
say, software cabling. Software cabling doesn't know or care what's
inside the module, it just knows and enforces the rules under which
the module can be invoked and how the module is allowed to transmit
and receive data. Software cabling invokes the module, assigns it to
a multi-threaded processor, and the magic of transparent
parallelization takes over. Software hasn't got a clue that even more
parallelization is going on inside the box. Simple, no?

RM

del cecchi

Apr 12, 2004, 8:47:42 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:o46m70tusl425tiog...@4ax.com...
You will recall, I'm sure, that the farmers on the tundra had
multithreading (2) on Northstar. Yes, I recall that some in comp.arch
said it is not "real" multithreading because it was somewhat coarser
grained, switching threads on a cache miss. In the Power4 it was
decided to have two whole cores rather than two threads in one core.
Now in the Power5 they have upgraded the cores to have multithreading.

It's a funny thing. There is at least one department of folks whose job
it is to study the effects of microarchitecture trade-offs on
performance. They are armed with a great deal of information about
workloads and with special simulators and well-honed models. Too bad
they can't do as well as some spectator Monday morning quarterback. I
guess they aren't smart enough. Or maybe the chip has to get out on
schedule.

del cecchi


Robert Myers

Apr 12, 2004, 9:24:33 PM
On Mon, 12 Apr 2004 19:47:42 -0500, "del cecchi"
<dcecchi...@att.net> wrote:

>You will recall, I'm sure, that the farmers on the tundra had
>multithreading (2) on Northstar.

No, I didn't.

>Too bad
>they can't do as well as some spectator monday morning quarterback. I
>guess they aren't smart enough. Or maybe the chip has to get out on
>schedule.
>

You can read it that way if you want to, or you can read it as my
saying there's something about this I don't understand. The trades
are obviously complicated: people have taken hard stands on both sides
of the issue.

Neither IBM nor Intel says, "This is the real reason this chip does or
does not have this particular feature." Marketing will say stuff, of
course, but I don't think you would expect anyone to pay much
attention to it.

The chip comes out, and it's left to Monday morning quarterbacks to
try to figure out what's really going on. Hyperthreading for Intel is
kind of marginal. It's a natural thing for Intel to try, because it's
a way to try to recapture some of the IPC they gave up in going to a
longer pipeline.

My (Monday morning quarterback) belief, though, is that much more
significant hyperthreading is their game plan for Itanium, and that
Hyperthreading was really an R&D project for Itanium with a little
marketing pizazz as a gimme. In particular, since hyperthreading has
come out for the P4, it has been all but useless for a big chunk of P4
applications. That may change, especially with the new instructions
in SSE3.

For IBM, the story is different. As I understand SMT, it's a clear
win for OLTP workloads, and IBM won't have boxes in CompUSA with "SMT
included" on them. That means that SMT is included or not included
strictly on the merits, unless, like Intel, they are looking more to
the future than to the present.

The fact that IBM has had SMT, then didn't have SMT, and now has SMT,
all with OLTP presumably as the target application, doesn't leave me
thinking I'm smarter than the guys with all the numbers. It leaves me
wondering if I know what this is really all about.

RM

del cecchi

Apr 12, 2004, 11:43:13 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:guem70h7s9tmdjic8...@4ax.com...

Actually, you might not have been hanging around here in Northstar days.
It was a couple of processor iterations before Power4. Back in the Old
Days. :-) And the SMT was sort of coarse, to the point that some here
refused to call it that.

I think that on any design there are always a number of ways to meet the
objectives. And the objectives include cost (mfg and development) and
schedule as well as performance and power and that stuff. Northstar
designers thought they could get a performance boost for OS400 at an
affordable cost by building a two thread SMT that switched threads on a
cache miss. Power4 thought they would be better off buying more silicon
and putting two processor cores on the chip, with each core being more
complicated than the Nstar core. Different strokes for different folks,
as we used to say.

In the case of Power5, I'm sure the question was how best to get a
performance boost out of the chip using the enhanced density. Make it 4
cores? Add some resource and convert the existing core to SMT? Build a
new spiffy core with 57 stages in the pipe? 128 bit FPU?

decisions decisions. So many transistors, so little time.

del cecchi


Terje Mathisen

Apr 13, 2004, 1:55:48 AM

Maynard, in my simplistic view, this should be feasible today, using the
current 'lockless' primitives:

Multiple threads on a single physical cpu will get _very_ good
performance using such primitives since all the relevant (L2) cache
lines will be exclusively owned, right?

The real problem is the need for the OS to get out of the way, and in
particular not do a lot of the _really_ stupid things that still happen
today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
over all available cpus, instead of giving it a little cpu affinity by
default. :-(
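
For what it's worth, you can hand out that affinity yourself. A minimal
Win32 sketch (the choice of mask is illustrative):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Pin the current thread to logical CPU 0 (bit 0 of the mask) so
     * the scheduler stops migrating it between CPUs. */
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (prev == 0)
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
    else
        printf("previous affinity mask: 0x%lx\n", (unsigned long)prev);

    /* ... CPU-bound work now stays on one (virtual) CPU ... */
    return 0;
}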

As soon as you have more cpu-hogging threads than you have physical
resources for, you'll lose, but allocating N-1 threads on an N-thread
machine, and then assuming the OS will be clueful enough to keep them
stable, isn't too much to expect, is it?

When you need to setup an NxM array, with N threads on each of M cpus,
then you need a little more help, but as long as you're willing to (more
or less) manually allocate threads to (virtual) cpus that should be enough.

OTOH, so far I haven't found any way (under WinXP) to even figure out
how my cpus are numbered!

Is it (0,1),(2,3) for (cpu0-thread0, cpu0-thread1),(cpu1-thread0,
cpu1-thread1) or does it start by going across the physical cpus:

(cpu0-thread0, cpu1-thread0),(cpu0-thread1, cpu1-thread1)

Any hints?

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

David C. DiNucci

Apr 13, 2004, 4:14:23 AM
Robert Myers wrote:
>
> On Sun, 11 Apr 2004 00:27:49 -0700, "David C. DiNucci"
> <da...@elepar.com> wrote:
> >I am apparently *still* (after literally all these years) missing
> >something very basic, so if someone has the time, maybe they can finally
> >help me get straight on this.
> >
> I had forgotten you were in the Nick Maclaren camp on this one (might
> as well put it that way, because that's the way I think of it).

I am willing to have it characterized as so, though I have no doubts
that my thinking disagrees with Nick's in at least some important ways,
and I have no intention of putting words into his mouth (or thoughts
into his fingers).

> ... I had
> (believe it or not) thought of starting the paragraph to which you are
> replying more or less the way you started yours, which is that I
> *still* don't understand the case for separate cores (except the one I
> mentioned). Were die layout not a problem, you'd put everything in
> the same place, and let who belonged to what sort itself out as
> needed. Die layout, of course, is a consideration.

Describing the issue as "die layout" makes it sound more specific than
it really is. The issue is the latency required to schedule resources
to a thread and then the related issue of moving data between those
resources. You seem to be advocating allocating resources at a very fine
granularity, and then dynamically routing the data to and from them as
needed. This is essentially a traditional dataflow philosophy, and that
might explain any perceived similarity between Nick's aversion to it and
my own: We've seen it before. It's easy to go in pretending that
there's no latency because each portion is so tiny, but you experience
that latency with every operation, or in this case, every functional
unit, for both scheduling and routing. It's death by a million cuts.
Then you hope that you can hide that latency by parallelism, and maybe
you can to some extent, but the parallelism would still do you more good
if you didn't have the latency.

Pipelines are nice because you schedule several resources at a time
(i.e. all the stages in the pipe), yet you share them at a finer
granularity (one stage at a time), and since the sharing follows a fixed
pattern, successive stages (in time) can be made close to one another
(in space). To the extent that streaming architectures can be considered
as a configurable pipe, that's great, but even then you miss out on some
of those spacetime proximities.

Maybe this is the basis of our different views on this subject. You
consider the functional units more as a pool of resources allocatable at
will, while I regard them more as a pipe. If you share some stages of a
pipe, those stages need to operate faster if they are not to be a
bottleneck. Even *I* can see the merit of tossing instructions from
different threads into a single pipe, assuming that there would be
bubbles otherwise, but that's apparently not what we're discussing.

> If you try to run helper threads on separate cores, you lose most of
> the advantages because of the overhead of communicating through L2.
> If you don't share L2, there is no point at all in helper threads.

This is the very reason I stated my cache-sharing assumption below--i.e.


> >So if my assumptions above are correct, and if multiple cores can be as
> >effective at sharing caches and integrating with other aspects of the

> >memory subsystem as their hyperthreaded counterparts, ...

But returning to the point:

> I don't get your argument for a fast clock at all. You don't balance
> resources by speeding up the clock. You balance resources by
> balancing resources. Your best chance of getting an ideal match is
> when all resources are in one big pool and can be assigned as needed
> to the workload that requires them. Your worst chance of getting an
> ideal match is to have the resource pool arbitrarily divided into
> pieces. Simple queuing theory, as someone put it.

Except you're leaving the queues (pipes) out of your queuing theory. If
I remember, you were the one talking about the importance of minimizing
data movement, etc., and now you seem to be advocating routing data
almost arbitrarily to get it from one FU to the next. And you are
moving and changing more state in order to accommodate that dynamic
routing. I'm getting flashbacks to tags and matching store in dataflow
machines.

And the chips get hotter and hotter.

<snip>


> "Okay," I responded, drawing my breath in slowly, "what is it that we
> need to do so that people can do these kinds of things in user space,
> since I don't think the dozens of researchers publishing papers on
> these strategies have been planning on running everything in
> privileged mode." At which point Nick responded with a list of
> requirements that look achievable to someone who is not a professional
> computer architect.

So, I read this as an explanation of your earlier statement regarding
"the possibility of close cooperation among threads".

> >So if my assumptions above are correct, and if multiple cores can be as
> >effective at sharing caches and integrating with other aspects of the
> >memory subsystem as their hyperthreaded counterparts, I am still led to
> >the conclusion that multiple cores get their bonus points from lower
> >clock speed (and therefore less heat) for similar (or better)
> >performance when there are enough threads to keep those cores busy,
>
> As I've stated, if you could put everything in one place, there would
> be no advantage at all to multiple cores (other than amortizing NRE by
> printing cores on dies like postage stamps).

I won't spend time arguing statements which we agree are based on false
premises.

> Since you can't put
> everything in one place, there is a balance between acceptable heat
> distribution, die size, and being able to reach everything in a single
> core in a single clock (although I suspect that even that constraint
> is going to go by the board at some point).

And, as I state above, I believe you are oversimplifying. The problem is
not just (and maybe even not primarily) distributing the clock. It's
allocating resources and routing data between them, all in the least
amount of time and state changes. I believe that fine-grain allocation
and routing is not the way to best accomplish that.

> >and
> >hyperthreaded cores get their bonus points from (1) possibly requiring
> >less floorplan/cost to manufacture support for n threads, and (2) the
> >ability to dynamically and productively allocate otherwise idle
> >resources/stages to active threads. I assume that these play off
> >against one another--i.e. that the ability to run the multiple cores
> >slower also allows them to be built smaller/simpler (with the Forth chip
> >cited earlier as an extreme in that direction), but that if multiple
> >cores do end up significantly larger than the hyperthreaded
> >counterparts, that would affect their ability to share caches as
> >effectively.
> >
>
> The primary advantage of closely-coupled threads on a single core is
> that data sharing is not done through L2 cache or memory.

So, it seems clear that you regard threads running on a single SMT core
as being related differently (in terms of programming model) than those
that might run anywhere else.

> >My sense is that the motivation for hyperthreading relates almost
> >entirely to #2, because having lots of threads is the exception rather
> >than the rule, because of the widely accepted belief that PARALLEL
> >PROGRAMMING IS HARD!, and as long as that belief exists, so will shared
> >pipelines.
> >
>
> I think you are doing yourself and Software Cabling a disservice by
> adopting this posture. While in theory you could use the machinery of
> Software Cabling to address any level of parallelism, its natural
> target is coarse-grained parallelism.

I didn't say anything about Software Cabling, and even if it had been
floating around in the back of my mind (as it is wont to do), I
certainly wasn't advocating its use in fine-grain parallelism. I do tend
to believe that there is a tendency to rely on fine-grained parallelism
when coarse-grain would work better, for the very reasons I've already
stated.

> ... Helper threads are a way of letting the processor or
> processor and supporting software do the fine-grained parallelism.

So am I to understand that you see the primary value of SMT as its
ability to support these helper threads to prefetch data into cache?

And thanks, I do believe that you answered my original question,
-- Dave
-----------------------------------------------------------------
David C. DiNucci Elepar Tools for portable grid,
da...@elepar.com http://www.elepar.com parallel, distributed, &
503-439-9431 Beaverton, OR 97006 peer-to-peer computing

David C. DiNucci

Apr 13, 2004, 4:15:58 AM

Well, yes and no.

First, if all you want to do is run a few thousand copies of some von
Neumann module (dumb-ass or otherwise), Software Cabling is probably
overkill for you, but there are lots of other things that aren't. If you
don't want to do it by hand, Ninf comes to mind, but I assume projects
like Gridbus and Globus have similar tools, and for that matter, even
something like Linda Piranha.

Second, what you say is true, to the extent that the magic of
transparent parallelization exists. Obviously, if there's some black
box out there that will run my d-avNm faster than the yellow box next to
it, then all else being equal, I'll choose the black box. But if I'm
paying more to buy the black box, and/or power it, and/or cool it, my
choice might very well be different. And if I can make the cheaper,
cooler, less power-hungry yellow one actually go faster than the black
one just by providing my code to it in a different yet very programmable
and understandable form (and, yes, I am thinking of SC in this case), I
would personally be far more interested in the yellow one--though I'm
sure there are many others who would not be. In the end, it depends on
which market niche you're going for, and I've chosen mine.

Nick Maclaren

Apr 13, 2004, 4:10:18 AM

In article <cp2j701vcij39q676...@4ax.com>,

Robert Myers <rmy...@rustuck.com> writes:
|>
|> Only people like Terje and Linus (what is it with these guys from
|> Scandinavia, anyway?) worry about the real effects of OoO on a
|> day-to-day basis. The rest of us simply accept that the processor
|> somehow gets away with running twenty times as fast as the memory bus
|> and get on with business.

Well, I would, if I didn't spend most of my time dealing with much
more elementary implementation imbecilities :-(

The real effects of out-of-order execution can be VERY visible
to interrupt handlers, but how many systems provide competent
application-level interrupt handling nowadays?


Regards,
Nick Maclaren.

Maynard Handley

Apr 13, 2004, 4:39:24 AM
In article <c5fvd5$em4$1...@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:

> Maynard, in my simplistic view, this should be feasible today, using the
> current 'lockless' primitives:
>
> Multiple threads on a single physical cpu will get _very_ good
> performance using such primitives since all the relevant (L2) cache
> lines will be exclusively owned, right?
>
> The real problem is the need for the OS to get out of the way, and in
> particular not do a lot of the _really_ stupid things that still happen
> today, i.e. like WinXP ping-pong'ing a single cpu-limited thread all
> over all available cpus, instead of giving it a little cpu affinity by
> default. :-(

Well:
* Yes, the HW stuff is available, in that load-locked/store-conditional
will work. The problem is that you're not supposed to just use
load-locked/store-conditional, you're also supposed to throw some sort
of synchronizing instruction in there (see the sketch after this list).
I lose track of the details from CPU to CPU, but my understanding is
that you COULD get away without that synchronizing instruction if you
only cared about synching to this CPU.
Of course since IBM doesn't make any SMT CPUs it's a moot point.
Now two POWER4 cores sharing a die and an L2 presumably match this
situation, but I don't know what IBM recommends has to be done to sync
one CPU of the die with the other CPU, with no interest in off-die
CPUs. I don't know if they consider that an interesting problem ---
perhaps not, because they tend to sell POWER4s in boxes with many CPUs.
* Of course we agree on the need for the OS to get out of the way.
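
For concreteness, a sketch of the sequence in question. The lwarx/stwcx.
pattern is the standard PowerPC load-locked/store-conditional idiom, but
treat the details (GCC inline asm, the choice of isync as the barrier)
as illustrative rather than IBM's official recommendation:

/* Atomic increment via PowerPC load-locked/store-conditional. */
static inline int atomic_inc(volatile int *p)
{
    int t;
    __asm__ __volatile__(
        "1: lwarx   %0,0,%2\n"   /* load word and reserve         */
        "   addic   %0,%0,1\n"   /* increment                     */
        "   stwcx.  %0,0,%2\n"   /* store only if still reserved  */
        "   bne-    1b\n"        /* reservation lost? retry       */
        "   isync\n"             /* the cross-CPU synchronizing   */
                                 /* step that same-core threads   */
                                 /* might be able to skip         */
        : "=&r" (t), "+m" (*p)
        : "r" (p)
        : "cr0", "memory");
    return t;
}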

But, as I said before, the summary is that the sweet spot for how to do
this is neither fwd nor bwd compatible, which is why I am uncertain
about how things will play out.

> As soon as you have more cpu-hogging threads than you have physical
> resources for, you'll lose, but allocating N-1 threads on a N-thread
> machine, and then assuming the OS will be clueful enough to keep them
> stable isn't too much to expect, is it?

Even that, however, assumes you're willing to write code for say 1, 2
and 4 virtual-CPU machines. That's tough when you want such fine-grained
threading. It's a real hassle to keep work like that in sync, and
current IDEs don't do a good job of allowing you either to hide one view
of the code (so you can concentrate on say only the 2-way threading
code) or conversely show the different (but equivalent) code paths side
by side.
Perhaps it's no different from writing code to run on three different
architectures, and appropriate factoring, macros and inline functions
can help? Time will tell.

> When you need to setup an NxM array, with N threads on each of M cpus,
> then you need a little more help, but as long as you're willing to (more
> or less) manually allocate threads to (virtual) cpus that should be enough.

But at this point your life gets really tough. The language primitives
(if you're using Java or COM) or the OS primitives (e.g., pthreads)
really aren't set up to express [this data structure (mutex, semaphore,
whatever) here needs to be synced with these local threads, and that one
there needs to be synced with all threads]. What I mean is, there's no
easy way to express that at this point I want the action of acquiring a
mutex to involve only a load-locked/store-conditional loop WITHOUT the
extra "publish to the other CPUs" work, or that at this point I want the
act of waiting for a mutex to involve a spin-loop, not an OS queue.
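
pthreads does at least expose the spin-loop-versus-OS-queue half of that
choice (the local-versus-global "publish" half has no portable knob at
all). A minimal sketch:

#include <pthread.h>

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_spinlock_t slock;

void demo(void)
{
    pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);

    /* A contended mutex may park the caller on an OS wait queue:
     * kernel entry, run-list manipulation, context switch. */
    pthread_mutex_lock(&qlock);
    pthread_mutex_unlock(&qlock);

    /* A spinlock just loops in user space until the lock frees:
     * no syscall, at the cost of burning a (virtual) CPU. */
    pthread_spin_lock(&slock);
    pthread_spin_unlock(&slock);
}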

> OTOH, so far I haven't found any way (under WinXP) to even figure out
> how my cpus are numbered!
>
> Is it (0,1),(2,3) for (cpu0-thread0, cpu0-thread1),(cpu1-thread0,
> cpu1-thread1) or does it start by going across the physical cpus:
>
> (cpu0-thread0, cpu1-thread0),(cpu0-thread1, cpu1-thread1)
>
> Any hints?

Terje, I'm a MacOS X PPC boy! What can I tell you about WinXP?
Of course in MacOS X land we haven't even hit this problem yet --- but
I'm sure we will soon enough.

Maynard

Maynard Handley

Apr 13, 2004, 4:42:31 AM