
A better SPU?


Maynard Handley

unread,
Nov 21, 2006, 1:32:13 AM11/21/06
to
When I spoke to the guys who designed Cell, I was told that the
load-store logic on a modern CPU was just horrifying, and that because
it was so large and slowed critical paths so much, the decision was made
on Cell to toss it. It was this decision that gave Cell its extreme
programmer hostility --- a programmer visible mere 256KB, along with
having to write DMA programs to shunt data to/from this very limited
local store to real RAM, thus giving us a programming model very much
like the old days of 256KB x86 systems with manually managed segment
swapping.

My question is: can one get the various supposed benefits of an SPU
without such a hostile programming model?
My idea is something along the following lines:
We will assume that most of what is in the Cell SPU is perfectly good,
and retain it. Thus we'll stick with 32 vector registers each 16 bytes
wide, the same basic operations, the same branching model etc. Likewise
we'll include only the most basic OS support instructions, with the
assumption that the OS will be running on some "real CPU" co-processor.
Where we'll differ is in the load-store model.

Rather than 256KB of local memory, we'll have 256KB of L2. We'll
likewise have standard load/store instructions and I and D TLBs. BUT
rather than TLB and L2<->RAM support on each SPU, we'll have one block
of that for, say, 4 SPUs. We won't (for now) have an L1 (or maybe just
an I L1).
The point of this model is to get load-store logic that appears to me
(who has never designed a CPU in his life) not to have to be much more
complex than the Cell SPU's (given that it's meant to have the same
performance, and the same weaknesses), but which has programmer-transparent
support for VM and a large address space. Such complicated logic as has
to exist to support the TLBs, RAM access, and perhaps some sort of
inter-SPU locking mechanism can be amortized over four SPUs.

Naturally one can imagine situations where this architecture could
perform worse than 4 Cell SPUs. The question is whether those situations
do, in fact, represent anything in the real world. For example, it's not
enough to show a real world algorithm that has the DMA engines on 4 Cell
SPUs running simultaneously --- you also have to demonstrate a real
world memory system that could handle that bandwidth, rather than just
switching, round-robin or so, between each DMA engine, in which case you
have a performance that's, in principle, no different from my
centralized L2+TLB controller with a large enough internal queue.

The great advantage of what I propose is that it represents much less of
a jolt to the programming model we've all grown to expect. Furthermore,
by making the underlying hardware less visible, there's ample room to
improve things (mostly) invisibly as time goes by. Obvious examples
include larger TLBs, or adding a D L1 cache, or sharing the L2+TLB
controller between 2 rather than 4 SPUs.

So: is this a great new idea that I've been very foolish not to patent;
or are there clear and obvious flaws that make it basically a
non-starter compared to the Cell SPU?

Chris Thomasson

unread,
Nov 21, 2006, 3:57:49 AM11/21/06
to
"Maynard Handley" <nam...@name99.org> wrote in message
news:name99-6D4B30.22320920112006@localhost...

> When I spoke to the guys who designed Cell, I was told that the
> load-store logic on a modern CPU was just horrifying, and that because
> it was so large and slowed critical paths so much, the decision was made
> on Cell to toss it. It was this decision that gave Cell its extreme
> programmer hostility

I don't really see the hostility... I just use a very strict distributed
programming paradigm with the Cell. 256KB of 100% private memory is fine for
an on-chip distributed processing unit, IMHO... As for DMA, well that's fine
too. Systems programmers and distributed algorithms will go a long way with
the Cell Architecture...


Chris Thomasson

unread,
Nov 21, 2006, 4:04:12 AM11/21/06
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:I-ydnXBZZuL9If_Y...@comcast.com...

> "Maynard Handley" <nam...@name99.org> wrote in message
> news:name99-6D4B30.22320920112006@localhost...
>> When I spoke to the guys who designed Cell, I was told that the
>> load-store logic on a modern CPU was just horrifying, and that because
>> it was so large and slowed critical paths so much, the decision was made
>> on Cell to toss it. It was this decision that gave Cell its extreme
>> programmer hostility
>
> I don't really see the hostility... I just use very strict distributed
> programming paradigm with the Cell.

Plus, I kind of like having the instruction-set for the SPU differ from the
instruction-set of the "main" system processor (e.g., PowerPC)... It
separates things nicely...

;^)


Andrew Reilly

unread,
Nov 21, 2006, 4:33:50 AM11/21/06
to
On Tue, 21 Nov 2006 07:32:13 +0000, Maynard Handley wrote:

> When I spoke to the guys who designed Cell, I was told that the
> load-store logic on a modern CPU was just horrifying, and that because
> it was so large and slowed critical paths so much, the decision was made
> on Cell to toss it. It was this decision that gave Cell its extreme
> programmer hostility --- a programmer visible mere 256KB, along with
> having to write DMA programs to shunt data to/from this very limited
> local store to real RAM, thus giving us a programming model very much
> like the old days of 256KB x86 systems with manually managed segment
> swapping.
>
> My question is: can one get the various supposed benefits of an SPU
> without such a hostile programming model?

I've always assumed, given some of the things that IBM has said in
conjunction with Cell, that they see much of what you're asking for moving
into the software architecture, through dynamic (re)compilation and
pointer swizzling, both of which they have considerable experience with
(their Java and AS400 exercises).

Is that an inherently unreasonable approach? Does one *have* to program
these things in raw assembly language and DMA control chains?

[P.S.: anyone know if the up-coming YellowDog Linux/PS3 combination
provides access to the SPUs, or does that just play with the PPC core?]

Cheers,

--
Andrew

Thomas Lindgren

unread,
Nov 21, 2006, 7:26:13 AM11/21/06
to

Andrew Reilly <andrew-...@areilly.bpc-users.org> writes:

> On Tue, 21 Nov 2006 07:32:13 +0000, Maynard Handley wrote:
>
> > When I spoke to the guys who designed Cell, I was told that the
> > load-store logic on a modern CPU was just horrifying, and that because
> > it was so large and slowed critical paths so much, the decision was made
> > on Cell to toss it. It was this decision that gave Cell its extreme
> > programmer hostility --- a programmer visible mere 256KB, along with
> > having to write DMA programs to shunt data to/from this very limited
> > local store to real RAM, thus giving us a programming model very much
> > like the old days of 256KB x86 systems with manually managed segment
> > swapping.
> >
> > My question is: can one get the various supposed benefits of an SPU
> > without such a hostile programming model?
>
> I've always assumed, given some of the things that IBM has said in
> conjunction with Cell, that they see much of what you're asking for moving
> into the software architecture, through dynamic (re)compilation and
> pointer swizzling, both of which they have considerable experience with
> (their Java and AS400 exercises).

At least one of the articles in the IBM journals discusses the use of
MPI with Cell.

http://www.research.ibm.com/journal/sj/451/eichenberger.html
http://www.research.ibm.com/journal/sj/451/ohara.html

> Is that an inherently unreasonable approach? Does one *have* to program
> these things in raw assembly language and DMA control chains?

Given past form, Sony would probably say you'd have to do that ... But
maybe they are right. Should we encourage the old "uniform access to
infinite memory" paradigm when reality is moving away? (And some
programmers may even realize that message passing actually is a bit
familiar.)

Best,
Thomas
--
Thomas Lindgren

"Ever tried. Ever failed. No matter. Try again. Fail again. Fail better."

Elcaro Nosille

unread,
Nov 21, 2006, 7:58:22 AM11/21/06
to
Maynard Handley schrieb:

> When I spoke to the guys who designed Cell, I was told that

> the load-store logic on a modern CPU was just horrifying, ...

Yes, but the L1-latencies could be hidden to a large extent on OoOE-CPUs.
But the Cell's SPUs aren't really OoOE; afaik they work like "OoOE" on the
Pentium-1 where a pair of independent instructions could be executed in
parallel. On such a CPU, load-latencies could hurt more; so having the
load-/store-logic of a modern CPU (with a TLB and an associative lookup
in the cache) in a SPU would be really "horrifying", but not on a full
OoOE-CPU.

> My question is: can one get the various supposed benefits of an SPU
> without such a hostile programming model?

I don't think this is possible without extra latencies.

Joe Seigh

unread,
Nov 21, 2006, 9:29:37 AM11/21/06
to

You're programming on a Cell processor now? Are you still using the
T2000? Can I have it? :)


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Jason Lee Eckhardt

unread,
Nov 21, 2006, 11:26:58 AM11/21/06
to
In article <pan.2006.11.21....@areilly.bpc-users.org>,

Andrew Reilly <andrew-...@areilly.bpc-users.org> wrote:
>On Tue, 21 Nov 2006 07:32:13 +0000, Maynard Handley wrote:
>
>> When I spoke to the guys who designed Cell, I was told that the
>> load-store logic on a modern CPU was just horrifying, and that because
>> it was so large and slowed critical paths so much, the decision was made
>> on Cell to toss it. It was this decision that gave Cell its extreme
>> programmer hostility --- a programmer visible mere 256KB, along with
>> having to write DMA programs to shunt data to/from this very limited
>> local store to real RAM, thus giving us a programming model very much
>> like the old days of 256KB x86 systems with manually managed segment
>> swapping.
>>
>> My question is: can one get the various supposed benefits of an SPU
>> without such a hostile programming model?
>
>I've always assumed, given some of the things that IBM has said in
>conjunction with Cell, that they see much of what you're asking for moving
>into the software architecture, through dynamic (re)compilation and
>pointer swizzling, both of which they have considerable experience with
>(their Java and AS400 exercises).
>
>Is that an inherently unreasonable approach? Does one *have* to program
>these things in raw assembly language and DMA control chains?

IBM has a prototype compiler that will orchestrate all of these
details automatically for the programmer. See the publications at:
http://domino.research.ibm.com/comm/research_projects.nsf/pages/cellcompiler.index.html

Of course, for those critical parts of the code where the compiler
can't quite get the job done, a highly competent hand coder can
manage the details explicitly. Managing DMA transfers and so forth
can be taxing, but it is no worse than writing assembly language :)

At Rice, we are also working on improving compiler technology for
Cell.

>
>[P.S.: anyone know if the up-coming YellowDog Linux/PS3 combination
>provides access to the SPUs, or does that just play with the PPC core?]
>

I don't know about YellowDog Linux, but this is standard in the
Linux kernel distributed with IBM's Cell SDK. There are APIs
for managing threads on the SPE's, managing DMA transfers, etc.

Eric P.

unread,
Nov 21, 2006, 11:35:11 AM11/21/06
to
Maynard Handley wrote:
>
> When I spoke to the guys who designed Cell, I was told that the
> load-store logic on a modern CPU was just horrifying, and that because
> it was so large and slowed critical paths so much, the decision was made
> on Cell to toss it. It was this decision that gave Cell its extreme
> programmer hostility --- a programmer visible mere 256KB, along with
> having to write DMA programs to shunt data to/from this very limited
> local store to real RAM, thus giving us a programming model very much
> like the old days of 256KB x86 systems with manually managed segment
> swapping.

What is extremely programmer hostile about this model?

It may be slightly inconvenient to chunk up the processing,
and I can envision that there are certain algorithms where that
chunking could cause big headaches because of interdependencies,
but it seems to me that the majority of work would be quite
straightforward.

You would split the work buffer up into 3 areas: local program
variables and 2 working buffers, and be chewing on one buffer
while the other was DMA-unloading a result or loading the next item.

I have never programmed a Cell, but have built systems like this
in the past. Where do you see the difficulty?
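[The double-buffering scheme described above can be sketched in plain C.
This is a hedged host-side sketch only: on a real SPU the memcpy calls
would be asynchronous mfc_get() DMA transfers overlapped with compute;
process(), CHUNK, and the buffer layout are illustrative assumptions.]

```c
#include <string.h>

#define CHUNK 8  /* illustrative chunk size; a real SPU would use KB-sized chunks */

/* stand-in for the real computation done on one chunk */
static long process(const char *b, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += b[i];
    return s;
}

/* Chew on buf[cur] while buf[next] is "loading"; memcpy stands in for
   an asynchronous DMA (mfc_get) that would overlap with process(). */
long stream(const char *src, int nchunks)
{
    char buf[2][CHUNK];
    int cur = 0;
    long total = 0;
    memcpy(buf[cur], src, CHUNK);                       /* prefetch first chunk */
    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                            /* start "loading" the next */
            memcpy(buf[next], src + (size_t)(i + 1) * CHUNK, CHUNK);
        total += process(buf[cur], CHUNK);              /* chew on the current one */
        cur = next;
    }
    return total;
}
```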

Eric

Maynard Handley

unread,
Nov 21, 2006, 2:23:03 PM11/21/06
to
In article <45633b09$0$1349$834e...@reader.greatnowhere.com>,
"Eric P." <eric_p...@sympaticoREMOVE.ca> wrote:

Have you ever written a modern video codec, say H264?
While the data you are writing OUT follows a nice linear pattern, the
source data is all over the place, in any of a dozen earlier frames at
any spatial offset. There IS locality, so that it runs acceptably on a
cached architecture, but on the Cell architecture you cannot rely on
this, you have to test that the material is in local memory before
trying to access it.

Do we know for a fact that the BluRay decoder in PS3 runs on Cell, not
on dedicated silicon? I could believe it either way, but have heard
nothing on the subject.

Maynard Handley

unread,
Nov 21, 2006, 2:32:50 PM11/21/06
to
In article <58mdnZgb5qVCIP_Y...@comcast.com>,
"Chris Thomasson" <cri...@comcast.net> wrote:


There will always be a market for weirdo chips programmed with great
effort. I'm not interested in that. I'm interested in the claim that
Cell was supposed to be a new paradigm that would map well, not to all
code in the world, but to a large body of it.
I think I was given IBM's Cell lecture 6 yrs ago, maybe 7. Since then,
everything I've seen has only added to my initial dismissal of the
architecture. I'm not interested in dissing IBM here --- I love PPC, I
love AltiVec. I'm interested in trying to rescue the good ideas in Cell
from their current embodiment which I think dooms them to irrelevance in
the larger world.

It is interesting to hear that various people are working on compilers
and tools that they believe will make this all automatic; but of course
this is what we were told back then as well. Meanwhile I have yet to
hear of someone taking a block of code that should, in theory, be a
natural match to Cell's strengths, like a video decoder or audio
encoder, compiling it, and having it just work. I'd be pleased to hear
about this occurring for some code where this is actually not that
difficult, say an AAC encoder or a JPEG decoder; but what I really want
is to hear about it in the context of a rather harder decoder like H264.

As the obvious example, as far as I know, there is no flocking to buy
the Cell blades. This is obviously in part because of their insane
pricing, but the corollary is: why is IBM going with pricing that makes
no sense? It seems to imply that they don't want some sort of large base
of customers angry at what they just bought.

Richard

unread,
Nov 21, 2006, 4:37:10 PM11/21/06
to
[Please do not mail me a copy of your followup]

Maynard Handley <nam...@name99.org> spake the secret code
<name99-4EEE7D.11230321112006@localhost> thusly:

>Have you ever written a modern video codec, say H264?

Nope.

>While the data you are writing OUT follows a nice linear pattern, the
>source data is all over the place, in any of a dozen earlier frames at
>any spatial offset. There IS locality, so that it runs acceptably on a
>cached architecture, but on the Cell architecture you cannot rely on
>this, you have to test that the material is in local memory before
>trying to access it.

This sounds cache-friendly to the PowerPC. There's no reason that the
PowerPC portion of the codec can't gather all the scattered data
fragments into a single buffer for the SPU to decode.

What would be the problem with that?
--
"The Direct3D Graphics Pipeline" -- DirectX 9 draft available for download
<http://www.xmission.com/~legalize/book/download/index.html>

Legalize Adulthood! <http://blogs.xmission.com/legalize/>

Del Cecchi

unread,
Nov 21, 2006, 4:41:15 PM11/21/06
to
Maynard Handley wrote:
snip

> As the obvious example, as far as I know, there is no flocking to buy
> the Cell blades. This is obviously in part because of their insane
> pricing, but the corollary is why is IBM going with a pricing they no
> makes no sense --- it seems to imply that they don't want some sort of
> large base of customers angry at what they just bought.

I won't pretend to have a clue as to why IBM prices things the way they
do. Been trying for 30+ years and still can't. However it could be as
simple as estimating the number of folks that would buy a cell blade at
various price points and optimizing profit. Or it could be part of the
arrangement with Sony. Or it could be mistaken. Perhaps they believed
that folks interested in "regular" computing would be more likely to
purchase an Opteron or Intel based blade or even a PowerPC blade. So
only those with specialized applications that a more conventional blade
couldn't satisfy would want cell and compared to the fancy programming
bill the cell would look cheap.
--
Del Cecchi
"This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions.”

Jason Lee Eckhardt

unread,
Nov 21, 2006, 5:42:55 PM11/21/06
to
In article <name99-4EEE7D.11230321112006@localhost>,

I'm not Eric, but I'll chime in and submit the observation that
a number of groups have implemented H.264 (and many other) codecs
on Cell very successfully. For example, see:
http://ieeexplore.ieee.org/iel5/10651/33618/01598453.pdf?arnumber=1598453
It seems that standard definition H.264 encode requires only 2 SPEs,
while SD decode requires only 1 SPE (code written in C + intrinsics).
That seems to qualify as having "run acceptably" on Cell.

Also see these slides (slides 19-), which show one way of decomposing
an H.264 encoder onto the SPEs:
http://www.hotchips.org/archives/hc17/2_Mon/HC17.S1/HC17.S1T4.pdf
Admittedly, this is a combination of data and code partitioning
that might be beyond current auto compiler technology, though it
is still relatively straightforward and easy to understand for a
reasonable programmer.

Note that the effort spent tuning H.264 for Cell does not appear to
be significantly worse than, say, tuning H.264 for a Pentium4 (e.g.,
see the discussion of a P4 implementation in "Implementation of H.264
Encoder and Decoder on Personal Computers" by Chen, et. al).

jsa...@ecn.ab.ca

unread,
Nov 21, 2006, 10:07:11 PM11/21/06
to
Maynard Handley wrote:
> It is interesting to hear that various people are working on compilers
> and tools that they believe will make this all automatic; but of course
> this is what we were told back then as well. Meanwhile I have yet to
> hear of someone taking a block of code that should, in theory, be a
> natural match to Cell's strengths, like a video decoder or audio
> encoder, compiling it, and having it just work. I'd be pleased to hear
> about this occurring for some code where this is actually not that
> difficult, say an AAC encoder or a JPEG decoder; but what I really want
> is to hear about it in the context of a rather harder decoder like H264.

I don't think the Cell is *programmer-hostile*. I think it's
*applications-hostile*. Its problem is that it has weaknesses in
addition to its strengths.

I think that we *will* see what the Cell can do, because soon enough
people are going to be working with Linux on their PlayStation 3
computers to make them do interesting things.

And 256 kB is not *that* bad. It's almost twice the memory of a 7090.

John Savard

Terje Mathisen

unread,
Nov 21, 2006, 5:04:33 PM11/21/06
to

Yeah, this is basically a network programming model.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Nov 22, 2006, 2:39:46 AM11/22/06
to
Maynard Handley wrote:
> Have you ever written a modern video codec, say H264?
> While the data you are writing OUT follows a nice linear pattern, the
> source data is all over the place, in any of a dozen earlier frames at
> any spatial offset. There IS locality, so that it runs acceptably on a
> cached architecture, but on the Cell architecture you cannot rely on
> this, you have to test that the material is in local memory before
> trying to access it.
>
> Do we know for a fact that the BluRay decoder in PS3 runs on Cell, not
> on dedicated silicon? I could believe it either way, but have heard
> nothing on the subject.

If I wrote it, I'd almost certainly split it up by letting the SPUs (one
or more) handle all decoding, then leave the final work, including
fetching all MVs, to the host CPU.

Terje Mathisen

unread,
Nov 22, 2006, 4:15:37 AM11/22/06
to
Richard wrote:
> [Please do not mail me a copy of your followup]
>
> Maynard Handley <nam...@name99.org> spake the secret code
> <name99-4EEE7D.11230321112006@localhost> thusly:
>
>> Have you ever written a modern video codec, say H264?
>
> Nope.
>
>> While the data you are writing OUT follows a nice linear pattern, the
>> source data is all over the place, in any of a dozen earlier frames at
>> any spatial offset. There IS locality, so that it runs acceptably on a
>> cached architecture, but on the Cell architecture you cannot rely on
>> this, you have to test that the material is in local memory before
>> trying to access it.
>
> This sounds cache-friendly to the PowerPC. There's no reason that the
> PowerPC portion of the codec can't gather all the scattered data
> fragments into a single buffer for the SPU to decode.
>
> What would be the problem with that?
See my previous post: You would need to invert the architecture, with
the SPUs generating work for the PPC to clean up at the end, i.e.
handling all the weird, possibly non-local accesses.

The SPU would do everything it could, but since just the luminance part
of a 1080p video frame needs at least 2 MB (more if you use 10 or 12
bits/pixel), an SPU cannot even handle MV lookups inside the current frame.

What it could do would be to sort all MV addresses (within the current
~100-200 K frame segment), and then either handle them in sorted order,
or (much more likely), send the partially-completed data out to the host
PPC.

The final alternative is only feasible if the Cell DMA engine can handle
a _lot_ of outstanding requests, in which case you'd have a front end
and a back end decoder thread in each SPU, with the back end applying
the MV data as it arrives.

I.e. having both small-granularity fast random access and a large flat
memory space is a _good_ thing. :-)
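[The "sort all MV addresses, then handle them in sorted order" step above
can be sketched as follows; the mv_req layout and field names are my own
illustrative assumptions, not any real codec's data structures.]

```c
#include <stdlib.h>

/* one pending motion-vector reference: where in the reference frame it
   reads from, and which macroblock wants the data */
typedef struct {
    unsigned ref_off;   /* byte offset into the reference frame */
    unsigned blk;       /* macroblock index needing this data */
} mv_req;

static int by_addr(const void *a, const void *b) {
    unsigned x = ((const mv_req *)a)->ref_off;
    unsigned y = ((const mv_req *)b)->ref_off;
    return (x > y) - (x < y);
}

/* Sort the segment's MV requests by reference-frame address so the
   subsequent DMA fetches walk the frame mostly sequentially. */
void schedule_mv_fetches(mv_req *reqs, int n) {
    qsort(reqs, n, sizeof *reqs, by_addr);
}
```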

Terje Mathisen

unread,
Nov 22, 2006, 5:05:00 AM11/22/06
to
Jason Lee Eckhardt wrote:
> http://ieeexplore.ieee.org/iel5/10651/33618/01598453.pdf?arnumber=1598453
> It seems that standard definition H.264 encode requires only 2 SPEs,
> while SD decode requires only 1 SPE (code written in C + intrinsics).
> That seems to qualify as having "run acceptably" on Cell.
>
> Also see these slides (slides 19-), which show one way of decomposing
> an H.264 encoder onto the SPEs:
> http://www.hotchips.org/archives/hc17/2_Mon/HC17.S1/HC17.S1T4.pdf
> Admittedly, this is a combination of data and code partitioning
> that might be beyond current auto compiler technology, though it
> is still relatively straightforward and easy to understand for a
> reasonable programmer.

Interesting indeed!

I did note though that encoding cabac bits is easier than decoding. :-)

The other thing is that they obviously had to work within the 256 K
memory constraints, and they did it by (a) limiting resolution to where
a frame (or at least a sufficiently high resolution MIP-mapped version)
would fit in the RAM buffer (MPEG2 at 720x480 has ~350 KB of luminance
data; a standard-definition TV signal might be less?), and according to
the slides, they only considered a single reference frame, which (this
being a cross-coder) was already marked in the data stream in the form
of an MPEG2 key frame. They note that the motion estimation was the
effective bottleneck, here they could also have used the MPEG2 MVs as a
hint.


>
> Note that the effort spent tuning H.264 for Cell does not appear to
> be significantly worse than, say, tuning H.264 for a Pentium4 (e.g.,
> see the discussion of a P4 implementation in "Implementation of H.264
> Encoder and Decoder on Personal Computers" by Chen, et. al).

Right.

Eric P.

unread,
Nov 22, 2006, 11:47:52 AM11/22/06
to
Terje Mathisen wrote:

>
> Eric P. wrote:
> > You would split the work buffer up into 3 areas, local program
> > variables and 2 working buffers, and be chewing on one buffer
> > while the other was DMA unloading result or loading the next item.
>
> Yeah, this is basically a network programming model.

Or real time systems. For the majority of work in games
it should work well.

There are going to be some areas where the Cell architecture
is no good, so it is interesting to know where those limits are.
For example, where an algorithm requires access to a single
whole data object, and the object size exceed 2 MB (the total
of all SPU ram) as that would force excessive buffer swapping.
Or if it performs a small amount of work on a large amount of
data then the in-out DMA cost would totally outweigh any benefit
the SPU could offer.

It just would have been a bit surprising if it had run into
such a problem so soon, and on such an important application.

Eric

Stephen Sprunk

unread,
Nov 22, 2006, 6:45:33 PM11/22/06
to
"Maynard Handley" <nam...@name99.org> wrote in message
news:name99-6D4B30.22320920112006@localhost...
> Rather than 256KB of local memory, we'll have 256KB of L2.

AIUI, the reason the SPU has local memory instead of a cache is to make
load delays predictable and short. The programmer (or compiler) knows
exactly how many cycles a load will take, so they can schedule to that
number. This removes the need to have the extensive OOO logic that
general-purpose processors need to hide variable memory latency, and you
always know the exact performance you're going to get because you don't
have to worry about cache misses.

You can't change one part of a CPU design, particularly one as
fundamental as the memory structure, without considering the
implications that has for all the other parts. Everything about the
Cell SPE was designed around certain assumptions, and if you
intentionally break one assumption (i.e. fast, consistent access to
local memory) all of the other design decisions fall apart like a house
of cards.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

--
Posted via a free Usenet account from http://www.teranews.com

Chris Thomasson

unread,
Nov 22, 2006, 9:54:51 PM11/22/06
to
"Joe Seigh" <jsei...@xemaps.com> wrote in message
news:ovSdnTGDeJwzkf7Y...@comcast.com...

> Chris Thomasson wrote:
>> "Maynard Handley" <nam...@name99.org> wrote in message
>> news:name99-6D4B30.22320920112006@localhost...
>>
>>>When I spoke to the guys who designed Cell, I was told that the
>>>load-store logic on a modern CPU was just horrifying, and that because
>>>it was so large and slowed critical paths so much, the decision was made
>>>on Cell to toss it. It was this decision that gave Cell its extreme
>>>programmer hostility
>>
>>
>> I don't really see the hostility... I just use very strict distributed
>> programming paradigm with the Cell. [...]

>>
>>
>
> You're programming on a Cell processor now?

Whenever I get the chance. I have several friends that have them. I was
introduced to the Cell when a colleague asked me for my advice on a
lock-free synchronization algorithm for the PPC...


> Are you still using the T2000?

Yup.


> Can I have it? :)

I am still waiting for the CoolThreads contest....

;^)


Chris Thomasson

unread,
Nov 22, 2006, 11:23:21 PM11/22/06
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:I-ydnXBZZuL9If_Y...@comcast.com...

DMA algorithms can be abstracted behind an assembly language based API... You
can "orchestrate" things nicely with the Cell... It forces you to, though, so
Maynard, I do think you have a point here. Communication from shared memory
has to be short and sweet: just tell the SPU what to do, how to fill its
256KB buffer, and where to post a flag to shared memory via DMA when it's
done, or whatever distributed rendezvous synchronization mechanism you
decide to use... Communication from the SPU to shared memory can be set up
quite nicely.

However, I always fall back on the fact that a systems programmer should have
no problems with it... DMA, bah... Been there, done that...

;^)


Chris Thomasson

unread,
Nov 29, 2006, 2:00:46 AM11/29/06
to
"Chris Thomasson" <cri...@comcast.net> wrote in message
news:JPOdnSWTxemagvjY...@comcast.com...

> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:I-ydnXBZZuL9If_Y...@comcast.com...
>> "Maynard Handley" <nam...@name99.org> wrote in message
>> news:name99-6D4B30.22320920112006@localhost...
>>> When I spoke to the guys who designed Cell, I was told that the
>>> load-store logic on a modern CPU was just horrifying, and that because
>>> it was so large and slowed critical paths so much, the decision was made
>>> on Cell to toss it. It was this decision that gave Cell its extreme
>>> programmer hostility
>>
>> I don't really see the hostility... I just use very strict distributed
>> programming paradigm with the Cell. 256KB of 100% private memory is fine
>> for an on-chip distributed processing unit, IMHO... As for DMA, well
>> that's fine too. Systems programmers and distributed algorithms will go a
>> long way with the Cell Architecture...
>
> DMA algorithm can be abstracted behind an assembly language based API...

I did a DMA abstraction for the Cell that turns it all into a simple stream
interface... Cell can be friendly after you create an API/ABI out of PPC
and AltiVec assembly language that abstracts the DMA stuff into a well-known,
simple stream interface... You can set up a distributed algorithm... The PPC
program assigns a nicely sized chunk of shared memory to each SPU... It also
creates a DMA stream interface object for each SPU. Each SPU holds a small
piece of metadata which contains a descriptor for the DMA stream... The SPU
can stream data into its assigned shared memory... Another SPU can request
that the PPC program stream it data from the first SPU's shared memory:
inter-SPU communication via a one-to-many distributed algorithm controlled
by the PPC.

Any thoughts?
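[For what it's worth, the per-SPU descriptor such a stream interface needs
can be tiny. Here is a minimal host-side sketch; the struct layout, names,
and full/empty convention are all my own assumptions, and a real SPU-side
version would issue the DMA where the comment indicates.]

```c
/* a small per-SPU descriptor into that SPU's region of shared memory */
typedef struct {
    unsigned long long base;  /* effective address of the SPU's region */
    unsigned size;            /* region size in bytes */
    unsigned head, tail;      /* producer / consumer byte counters */
} dma_stream;

/* Try to reserve nbytes in the stream; returns 1 on success, 0 if full.
   A real SPU-side version would follow the reservation with an
   mfc_put() of local-store data to base + (head % size). */
int stream_put(dma_stream *s, unsigned nbytes)
{
    unsigned used = s->head - s->tail;   /* counters wrap consistently */
    if (used + nbytes > s->size)
        return 0;                        /* full: caller retries later */
    s->head += nbytes;
    return 1;
}
```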


Jan Vorbrüggen

unread,
Nov 29, 2006, 4:16:49 AM11/29/06
to
> I did a DMA abstraction for the Cell that turns it all into a simple stream
> interface... Cell can be friendly, after you create an API/ABI out of PPC
> and AltiVec assembly language that abstracts the DMA stuff into a well known
> simple stream interface...

IMO, that's only about a quarter of the job. The whole thing is only
worthwhile if you can overlap computation and communication to the largest
degree possible. And that requires actual use of the programmer's brain for
each application, again and again. Part is allocating the proper amount of
resources (which might require some experimenting), part is getting the
synchronization of the different threads that execute _on_one_SPU_ correct.
Even in the simplest case, you have three of them.

Jan

Chris Thomasson

Nov 29, 2006, 5:34:53 AM

"Jan Vorbrüggen" <jvorbr...@not-mediasec.de> wrote in message
news:4t51efF...@mid.individual.net...

Good points. One thing: I believe you can remove the need for most thread
synchronization on the Cell if you bind a thread to a specific SPE via an
affinity mask... The API for this is something like:

spe_create_thread()


Chris Thomasson

Nov 29, 2006, 5:47:29 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:qfCdnRIaXJfUl_jY...@comcast.com...

> "Joe Seigh" <jsei...@xemaps.com> wrote in message
> news:ovSdnTGDeJwzkf7Y...@comcast.com...
>> Chris Thomasson wrote:
>>> "Maynard Handley" <nam...@name99.org> wrote in message
>>> news:name99-6D4B30.22320920112006@localhost...
>>>
>>>>When I spoke to the guys who designed Cell, I was told that the
>>>>load-store logic on a modern CPU was just horrifying, and that because
>>>>it was so large and slowed critical paths so much, the decision was made
>>>>on Cell to toss it. It was this decision that gave Cell its extreme
>>>>programmer hostility
>>>
>>>
>>> I don't really see the hostility... I just use a very strict distributed
>>> programming paradigm with the Cell. [...]
>>>
>>>
>>
>> You're programming on a Cell processor now?
>
> Whenever I get the chance.

I finally downloaded a Cell system simulator from IBM:

http://www.alphaworks.ibm.com/tech/cellsystemsim


Works fairly well. Why don't you download it and give the Cell a spin?

lol..


;^)


Jan Vorbrüggen

Nov 29, 2006, 5:37:29 AM

> Good points. One thing: I believe you can remove the need for most thread
> synchronization on the Cell if you bind a thread to a specific SPE via an
> affinity mask... The API for this is something like:
>
> spe_create_thread()

Nonetheless, the input (DMA) task, the output task, and the computational task
that run on a single SPU still need to synchronize among themselves properly.
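A standard way to keep those tasks overlapped is double buffering: compute on one buffer while the next chunk is in flight. Here is a sketch in plain C, where fetch_chunk() is a hypothetical stand-in for an asynchronous mfc_get plus its tag-status wait; the loop structure, not the API, is the point:

```c
/* Double-buffering sketch: "DMA" chunk i+1 in while computing on
   chunk i, swapping buffers each iteration. */
#include <stddef.h>
#include <string.h>

enum { CHUNK = 4 };

/* Stand-in for kicking off a DMA of `n` ints into local store. */
static void fetch_chunk(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Sum `total` ints from `src`, fetching the next chunk while
   "computing" on the current one. */
static long process(const int *src, size_t total)
{
    int buf[2][CHUNK];
    long sum = 0;
    size_t done = 0;
    int cur = 0;

    if (total == 0)
        return 0;
    fetch_chunk(buf[cur], src, total < CHUNK ? total : CHUNK);
    while (done < total) {
        int nxt = cur ^ 1;
        if (done + CHUNK < total) {           /* start the next "DMA" */
            size_t left = total - (done + CHUNK);
            fetch_chunk(buf[nxt], src + done + CHUNK,
                        left < CHUNK ? left : CHUNK);
        }
        for (size_t i = 0; i < CHUNK && done + i < total; i++)
            sum += buf[cur][i];               /* compute on current */
        done += CHUNK;
        cur = nxt;                            /* swap buffers */
    }
    return sum;
}
```

On real hardware the fetch would return immediately and the tag wait would sit just before the compute loop, which is exactly where the input/compute synchronization Jan mentions comes in.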

Jan

Chris Thomasson

Nov 29, 2006, 5:57:48 AM

"Jan Vorbrüggen" <jvorbr...@not-mediasec.de> wrote in message
news:4t565nF...@mid.individual.net...

Indeed.


Chris Thomasson

Nov 29, 2006, 6:44:02 AM

"Jan Vorbrüggen" <jvorbr...@not-mediasec.de> wrote in message
news:4t565nF...@mid.individual.net...

FWIW, here is a simple code sample for DMA:


Chris Thomasson

Nov 29, 2006, 6:45:52 AM

ARGH! forgot the link!


> FWIW, here is a simple code sample for DMA:

http://www.power.org/resources/devcorner/cellcorner/codesample2

Chris Thomasson

Nov 29, 2006, 6:49:57 AM

"Jan Vorbrüggen" <jvorbr...@not-mediasec.de> wrote in message
news:4t565nF...@mid.individual.net...

Well, that's basic DMA stuff, not thread synchronization per se; we can
handle it. Humm... The per-SPU mailboxes could be used for this kind of
thing, but IMHO they're a no-go because they only take 32-bit words... WTF
is up with that? It almost renders them useless... Humm...
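For what it's worth, the 32-bit limit can be worked around by splitting larger payloads, such as a 64-bit effective address, across consecutive mailbox writes. A plain-C sketch; mbox_push()/mbox_pop() are hypothetical stand-ins for the real mailbox write/read channels, modeled here as a tiny FIFO:

```c
/* Carrying a 64-bit effective address over a 32-bit mailbox by
   splitting it into two words, high word first. */
#include <stdint.h>

static uint32_t fifo[8];      /* stand-in for the mailbox queue */
static unsigned wr_i, rd_i;

static void     mbox_push(uint32_t w) { fifo[wr_i++ % 8] = w; }
static uint32_t mbox_pop(void)        { return fifo[rd_i++ % 8]; }

/* Sender side: high word first, then low word. */
static void send_ea(uint64_t ea)
{
    mbox_push((uint32_t)(ea >> 32));
    mbox_push((uint32_t)ea);
}

/* Receiver side: reassemble in the same order. */
static uint64_t recv_ea(void)
{
    uint64_t hi = mbox_pop();
    uint32_t lo = mbox_pop();
    return (hi << 32) | lo;
}
```

The obvious caveat is that the two halves must stay paired, so this only works cleanly with a single producer per mailbox.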


Chris Thomasson

Nov 29, 2006, 6:52:20 AM

"Chris Thomasson" <cri...@comcast.net> wrote in message
news:cJmdnU-vY-hEsfDY...@comcast.com...

> "Chris Thomasson" <cri...@comcast.net> wrote in message
> news:JPOdnSWTxemagvjY...@comcast.com...
>> "Chris Thomasson" <cri...@comcast.net> wrote in message
>> news:I-ydnXBZZuL9If_Y...@comcast.com...
>>> "Maynard Handley" <nam...@name99.org> wrote in message
>>> news:name99-6D4B30.22320920112006@localhost...
>>>> When I spoke to the guys who designed Cell, I was told that the
>>>> load-store logic on a modern CPU was just horrifying, and that because
>>>> it was so large and slowed critical paths so much, the decision was
>>>> made
>>>> on Cell to toss it. It was this decision that gave Cell its extreme
>>>> programmer hostility
>>>
>>> I don't really see the hostility... I just use a very strict distributed
>>> programming paradigm with the Cell. 256KB of 100% private memory is fine
>>> for an on-chip distributed processing unit, IMHO... As for DMA, well
>>> that's fine too. Systems programmers and distributed algorithms will go
>>> a long way with the Cell Architecture...
>>
>> A DMA algorithm can be abstracted behind an assembly-language-based API...
>
> I did a DMA abstraction for the Cell that turns it all into a simple
> stream interface...

Humm... It seems my custom DMA abstraction API performs somewhat better
than the mfc_XXX API...

Not quite sure why yet... Interesting...


dave

Nov 29, 2006, 9:59:16 AM


Looks interesting, but requires registration to download.

Chris Thomasson

Nov 29, 2006, 10:30:28 AM

>>>> You're programming on a Cell processor now?
>>>
>>> Whenever I get the chance.
>>
>> I finally downloaded a Cell system simulator from IBM:
>>
>> http://www.alphaworks.ibm.com/tech/cellsystemsim
>>
>>
>> Works fairly well. Why don't you download it and give the Cell a spin?
>
> Looks interesting, but requires registration to download.

The registration process is simple and free... The simulator is a lot of
fun!

I can program the Cell on my x86!

YeeeHaw!

;^)


dave

Nov 29, 2006, 10:23:22 AM


How long does it take to compile the simulator?

Chris Thomasson

Nov 29, 2006, 10:53:03 AM

"dave" <d...@a64.comcast.net> wrote in message
news:U5qdnfTUEJN3OPDY...@comcast.com...

Read this:

http://dl.alphaworks.ibm.com/technologies/cellsw/CBE_SDK_Guide_1.1.pdf

You install Fedora Core 5. Then you install the SDK, and then you mount the
disk image, which contains the simulator and Fedora Core 5 for the Cell BE...


Sadly, the source code for the simulator is not provided at all...


;^(....


I still think it's worth your time to check all of this out...


Chris Thomasson

Nov 29, 2006, 10:59:05 AM

"dave" <d...@a64.comcast.net> wrote in message
news:U5qdnfTUEJN3OPDY...@comcast.com...

The simulator has support for emulating a system with 2 Cell BEs.

So, you can simulate a dual-PPC system with 16 SPUs...

Pretty cool...


Eric P.

Nov 29, 2006, 11:59:14 AM

Chris Thomasson wrote:
>
> Well, that's basic DMA stuff, not thread synchronization per se; we can
> handle it. Humm... The per-SPU mailboxes could be used for this kind of
> thing, but IMHO they're a no-go because they only take 32-bit words... WTF
> is up with that? It almost renders them useless... Humm...

Be careful leaping to conclusions about Cell performance.
There are a number of resources and bottlenecks to be balanced.

For example, "The EIB data network consists of four 16-byte-wide
data rings: two running clockwise, and the other two counterclockwise.
Each ring potentially allows up to three concurrent data transfers,
as long as their paths don't overlap."

Each DMA engine allows up to 16 operations, concurrent or serialized.
DMA system addresses are virtual and are translated via the page table.
Each DMA operation has a start-up overhead cost of 10 to 100 clocks
(20 to 200 instruction times) and then a cost per 128-byte transfer.
However, the start-up of one operation can be overlapped with the transfer
of another.
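A back-of-envelope cost model makes the batching argument concrete. The start-up and per-unit constants used below are illustrative assumptions, not measured Cell figures:

```c
/* Toy cost model: each DMA pays a fixed start-up cost plus a cost
   per 128-byte unit moved, so batching small transfers into one
   larger transfer amortizes the start-up. */
static unsigned dma_clocks(unsigned bytes, unsigned startup,
                           unsigned per_128b)
{
    unsigned units = (bytes + 127) / 128;   /* 128-byte units moved */
    return startup + units * per_128b;
}
```

With a start-up of 100 clocks and 8 clocks per 128-byte unit, sixteen separate 128-byte transfers cost 16 x 108 = 1728 clocks, while a single batched 2048-byte transfer costs only 228.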

There are two very well written articles on Cell DMA I came
across in my travels:

This gives very good detail on the DMA and interconnect architecture
and many of the issues involved with programming. It is a must read.

CELL Multiprocessor Communication Network: Built for speed.
Michael Kistler, Michael Perrone, Fabrizio Petrini, IBM
IEEE Micro, 26(3), May/June 2006.
http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf

The following provides a high level look at DMA overheads:

Optimizing the Use of Static Buffers for DMA on a CELL Chip
Tong Chen, Zehra Sura, Kathryn O'Brien, Kevin O'Brien
IBM T.J. Watson Research Center
http://domino.research.ibm.com/comm/research_projects.nsf/pages/cellcompiler.refs.html/$FILE/paper-chen-lcpc06.pdf

There are summary pages here too:

Cell Broadband Engine processor DMA engines
http://www-128.ibm.com/developerworks/library/pa-celldmas/
http://www-128.ibm.com/developerworks/library/pa-celldmas2/

Eric

Chris Thomasson

Nov 29, 2006, 12:21:56 PM

"Eric P." <eric_p...@sympaticoREMOVE.ca> wrote in message
news:456dcc49$0$1350$834e...@reader.greatnowhere.com...

> Chris Thomasson wrote:
>>
>> Well, that's basic DMA stuff, not thread synchronization per se; we can
>> handle it. Humm... The per-SPU mailboxes could be used for this kind of
>> thing, but IMHO they're a no-go because they only take 32-bit words...
>> WTF is up with that? It almost renders them useless... Humm...
>
> Be careful leaping to conclusions about Cell performance.
> There are a number of resources and bottlenecks to be balanced.

[...]

> There are two very well written articles on Cell DMA I came
> across in my travels:
>
> This gives very good detail on the DMA and interconnect architecture
> and many of the issues involved with programming. It is a must read.

^^^^^^^^^^^^^^^^^^^^^^
[...]

Must read indeed!

Thank you.

:^)


dave

Nov 29, 2006, 1:23:36 PM


This is a showstopper for me: since I run only OpenBSD, all applications
have to be compiled for that platform, and that every six months for each
new release. Also, 64-bit OpenBSD has no Linux emulation at present.

Chris Thomasson

Dec 20, 2006, 11:28:57 PM


> You're programming on a Cell processor now? Are you still using the
> T2000?

Well, now I'm actually using the Cell emulator more than the T2000.


> Can I have it? :)

Happen to know anybody who might want to buy it?


$1000.00 or so off perhaps?


;^)

