Cell Documents

Del Cecchi

Aug 25, 2005, 3:00:50 PM

For those interested in Cell, there is a document dump at

http://www-128.ibm.com/developerworks/power/cell/

for a little bedtime reading. I think you have to register, but it is
free.

--------------------------------------------------------------------
Download: Cell Broadband Engine documentation

The following papers define the Cell specification and will be posted
to the IBM Semiconductor Solutions Technical Library in September.
Readers with a current IBM ID are invited to see them early and gain
access to participate in the Power Architecture™ zone's Cell discussion
forum. If you are not already a registered user, you can register now
(Note: Registered users will need to sign in to download).

Cell architecture from 20,000 feet
A high-level description of the Cell Broadband Engine (CBE), the
Synergistic Processing Elements (SPEs), and how they work together.
2 pages, 27KB | HTML (no registration required)

Cell Broadband Engine Architecture V1.0
Like the Power Architecture, but different -- the CBE Architecture
builds upon knowledge contained in the Power Architecture "books" and
describes the app-level User Mode Environment (UME) and the OS-level
Privileged Mode Environment (PME) in astonishingly rich detail.
327 pages, 4.51MB | PDF (registration required)

Synergistic Processor Unit (SPU) Instruction Set Architecture V1.0
Somewhere between a general-purpose processor and special-purpose
hardware lies the Cell SPU: designed to provide leadership performance
in game, media, and broadband applications, this document describes the
Application Binary Interface (ABI) of the Synergistic Processor Unit
(SPU). Get to know all of its instructions.
30 pages, 1.89MB | PDF (registration required)

SPU Application Binary Interface Specification V1.3
Including low-level system and language binding information, information
on loading and linking, and coding examples, this specification defines
the system interface for SPU-targeted object files to help ensure
maximum binary portability across implementations.
37 pages, 357KB | PDF (registration required)

SPU Assembly Language Specification V1.2
Unleash the full processing power of the SPUs -- you know you want to!
This specification will prove an indispensable aid in your efforts as it
takes you on a carefully-worded journey describing SPU assembly-level
syntax and machine-dependent features for the GNU assembler (but serves
as an example specification for other SPU assemblers as well).
30 pages, 122KB | PDF (registration required)

SPU C/C++ Language Extensions V2.0
Describes the basic data types, operations on these data types, and
directives and program controls required by the CBE specification;
includes sample code.
98 pages, 462KB | PDF (registration required)

Cell forum
Technical discussion of these documents is going on now at the Cell
Architecture forum; comments and errata are welcome there also, or by
e-mail to CBE_Docu...@us.ibm.com.
1 forum, HTML format (registration required)

--
Del Cecchi
"This post is my own and doesn’t necessarily represent IBM’s positions,
strategies or opinions."

Paul

Aug 25, 2005, 3:24:55 PM

"Del Cecchi" <cecchi...@us.ibm.com> wrote in message
news:3n6ir2...@individual.net...

> For those interested in Cell, there is a document dump at
>
> http://www-128.ibm.com/developerworks/power/cell/
>
> for a little bedtime reading. I think you have to register, but it is
> free.

If you don't fancy registering, it's also available here:
http://cell.scei.co.jp/index_e.html

Alexander Terekhov

Aug 25, 2005, 4:14:04 PM

Paul wrote:
[...]

> If you don't fancy registering, it's also available here:
> http://cell.scei.co.jp/index_e.html

Do you know the laws of Japan? ;-)

regards,
alexander.

Terje Mathisen

Aug 26, 2005, 2:52:58 AM

Paul wrote:

Thanks, I just got them all.

Naturally, I started reading the SPU asm manual, and that makes it
immediately obvious that this is a CPU directly targeted at MPEG-style
video processing:

absdb Absolute difference of bytes
avgb Average bytes: dest = (a+b+1) >> 1 (MPEG interpolation)
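
To make those two concrete, here is a minimal Python model of the per-byte semantics as listed above (a sketch, not SPU code). The function names simply mirror the mnemonics, sad is a hypothetical helper of my own, and a real register would hold 16 bytes rather than the 4 used here.

```python
def absdb(a: bytes, b: bytes) -> bytes:
    """Per-byte absolute difference, modelling the absdb description."""
    return bytes(abs(x - y) for x, y in zip(a, b))


def avgb(a: bytes, b: bytes) -> bytes:
    """Per-byte rounded average: dest = (a + b + 1) >> 1."""
    return bytes((x + y + 1) >> 1 for x, y in zip(a, b))


def sad(a: bytes, b: bytes) -> int:
    """Sum of absolute differences (hypothetical helper, not an opcode)."""
    return sum(absdb(a, b))


cur = bytes([10, 200, 30, 40])   # pixels from the current block
ref = bytes([12, 190, 30, 45])   # pixels from the reference block
print(list(absdb(cur, ref)), sad(cur, ref), list(avgb(cur, ref)))
```

A sum of absolute differences is the usual block-matching cost in motion estimation, and the rounded average is what half-pel interpolation needs, which is why these two stand out.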

cg Carry Generate: Target = carry out of (A+B)
addx Add word extended: Target = A+B+(Target & 1)

Notice the last one! It uses the least significant bit of each part of
the target register as input to an AddWithCarry operation, which means
that you need three read ports.

This pair of opcodes seems to me to be meant as building blocks for
extended/arbitrary precision calculations.
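
A rough sketch of how the pair could chain into a wider add, modelled in Python on 32-bit words. The function names follow the two descriptions above, add64 is a hypothetical helper, and the sketch ignores how the carry would be moved from one word slot to the next on the real hardware.

```python
MASK32 = 0xFFFFFFFF


def carry_generate(a: int, b: int) -> int:
    """Model of Carry Generate: 1 if a + b overflows 32 bits, else 0."""
    return ((a & MASK32) + (b & MASK32)) >> 32


def addx(a: int, b: int, target: int) -> int:
    """Model of addx: a + b + (target & 1), truncated to 32 bits."""
    return (a + b + (target & 1)) & MASK32


def add64(a: int, b: int) -> int:
    """Hypothetical helper: 64-bit add built from two 32-bit halves."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = (a_lo + b_lo) & MASK32          # ordinary word add
    carry = carry_generate(a_lo, b_lo)   # carry out of the low words
    hi = addx(a_hi, b_hi, carry)         # carry comes in via the target
    return (hi << 32) | lo


print(hex(add64(0xFFFFFFFF, 1)))
```

Because addx reads its carry from the target register, the high-word add needs no separate flags register, only the third read port mentioned above.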

It has a full set of branch instructions that as a side-effect either
enable or disable interrupts, i.e. critical sections are supposed to be
handled this way.

It seems to handle sub-register size operations with a set of opcodes,
where one of a group of GenerateMask operations is used to generate an
input mask for a general shuffle operation.
...
There's a bunch of generalized three-input FMAC opcodes, all working on
SIMD data, like fnms (T = Acc - (a * b)).

It has frsqest and frest to generate approximate reciprocal square root
and reciprocal lookup values. However, these operations do not seem to
deliver results in a standard format, instead each resulting element
consists of two parts, a base and a step, so that a following fi
(Floating Interpolate) can improve upon the table lookup results.

I'm guessing you'd then want one NR iteration to get somewhere close to
IEEE single precision.
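
For reference, a minimal Python sketch of that Newton-Raphson step for a reciprocal. The starting accuracy of about 12 bits is my assumption for illustration, not a figure taken from the manual, and newton_reciprocal is a hypothetical name.

```python
def newton_reciprocal(a: float, x0: float, steps: int = 1) -> float:
    """Refine an estimate x0 of 1/a; each step roughly squares the error."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - a * x)   # x_next = x * (2 - a*x)
    return x


a = 3.0
x0 = (1.0 / a) * (1.0 + 2.0 ** -12)   # stand-in for a ~12-bit table estimate
x1 = newton_reciprocal(a, x0)
print(abs(a * x0 - 1.0), abs(a * x1 - 1.0))
```

If a*x0 = 1 + e, then a*x1 = (1 + e)(1 - e) = 1 - e^2, so a relative error of 2^-12 drops to about 2^-24 after one step, which is in the neighbourhood of a single-precision mantissa.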

The shufb (Shuffle bytes) opcode seems like a small extension to the
Altivec Permute: in addition to using 5 bits to select one of 32
possible input bytes, it can also specify three different immediate
values (0, 0x80 and 0xFF), which would be needed to make it work with
the GenerateMask operations mentioned above.
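
A small Python model of that behaviour, as a sketch only. The bit patterns used to pick the three constants (10xxxxxx, 110xxxxx, 111xxxxx) are how I recall the encoding and should be treated as an assumption until checked against the manual.

```python
def shufb(a: bytes, b: bytes, ctl: bytes) -> bytes:
    """Model of a byte shuffle: each control byte picks a source byte
    from a followed by b, or one of three constants."""
    src = a + b                        # 32 candidate input bytes
    out = bytearray()
    for c in ctl:
        if c & 0xC0 == 0x80:           # 10xxxxxx -> 0x00 (assumed encoding)
            out.append(0x00)
        elif c & 0xE0 == 0xC0:         # 110xxxxx -> 0xFF (assumed encoding)
            out.append(0xFF)
        elif c & 0xE0 == 0xE0:         # 111xxxxx -> 0x80 (assumed encoding)
            out.append(0x80)
        else:
            out.append(src[c & 0x1F])  # low 5 bits select one of 32 bytes
    return bytes(out)


ra = bytes(range(0, 16))
rb = bytes(range(16, 32))
print(shufb(ra, rb, bytes([0, 31, 0x80, 0xC0, 0xE0, 16])).hex())
```

With the constants available in the control word, zero-extension and sign-fill patterns can come out of the same shuffle instead of a separate mask-and-merge step.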

All in all a pretty general set of opcodes for SIMD data processing. This
is particularly obvious in the way each of the possible operations has
forms to work on either a set of input data (reg or immediate), or on
its complement. This saves a lot of bubble-introducing mask setup
operations, but is normally not considered to be required on a regular cpu.

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Zeb

Aug 26, 2005, 3:51:26 AM

One significant point is that single precision (32-bit) floating point
arithmetic is only available as round-to-zero mode. This may be fine for
some graphics algorithms, but for large scale computing it just won't
do. If you want round-to-nearest mode on CELL you'll have to go to
double precision, at half the throughput. At that point you are up
against commoner quad processors like POWER, etc.
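
To illustrate why the rounding mode matters for long accumulations, here is a minimal Python sketch that emulates both modes for 32-bit floats. It models the arithmetic only, not Cell itself, and the helper names are hypothetical.

```python
import struct


def to_f32_nearest(x: float) -> float:
    """Round a double to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f32_toward_zero(x: float) -> float:
    """Round a double to a 32-bit float, truncating toward zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    if abs(y) > abs(x):
        bits -= 1   # sign-magnitude format: one ulp closer to zero
        y = struct.unpack("<f", struct.pack("<I", bits))[0]
    return y


def accumulate(step: float, count: int, rnd) -> float:
    """Add step to a running total count times, rounding after every add."""
    total = 0.0
    s = rnd(step)
    for _ in range(count):
        total = rnd(total + s)
    return total


nearest = accumulate(0.1, 10000, to_f32_nearest)
toward_zero = accumulate(0.1, 10000, to_f32_toward_zero)
print(nearest, toward_zero)
```

With truncation every rounding error has the same sign, so for a sum of positive terms the result can only drift below the exact value; round-to-nearest lets errors fall on either side.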

Eric P.

Aug 26, 2005, 8:30:55 AM

Terje Mathisen wrote:
>
> ...

> It has a full set of branch instructions that as a side-effect either
> enable or disable interrupts, i.e. critical sections are supposed to be
> handled this way.

Why would diddling interrupt enable as a branch side effect be a
benefit, as compared to the normal explicit disable & enable
instructions?

Eric

Maynard Handley

Aug 26, 2005, 9:56:45 AM

What I find absolutely bizarre (and not at all encouraging for the
future of Cell as a general purpose processor as IBM and Sony people
have occasionally claimed) is that there is STILL no document in this
lot that describes how to handle the very real issues of models for
handling the minuscule memory space available to each SPU.
I've said it before and will say it again; this is the Achilles' heel of
these beasts. In this day and age, people are simply not interested in
dicking around with segments, overlays and all that weird crap from the
80's. Sure, they will do it for games; I don't deny the value of this
part in games and game-like boxes (PVRs, DTVs, audio-mixing consoles and
so on), but a general purpose box (running, presumably, Linux), where I
care about all round performance --- I want Apache and MySQL and gcc and
perl and php to run fast --- I just don't see it.

My guess is that some of these "workstations" will ship with some very
specialized code on them that does one thing and one thing only well,
maybe H264 encode, maybe some bio algorithm, and that'll be the
face-saving exit strategy from this bizarre claim that never made sense
in the first place.

Ricardo Bugalho

Aug 26, 2005, 10:09:40 AM

Hi,
peak double precision throughput in CELL is expected to be one tenth of
peak single precision throughput.

Terje Mathisen

Aug 26, 2005, 2:54:36 PM

Eric P. wrote:

It is only a benefit if this pair of instructions must be done a _lot_,
which is why I assume it is the intended way of operation.

Christian Bau

Aug 26, 2005, 4:30:28 PM

In article <pan.2005.08.26....@ibili.uc.pt>,
Ricardo Bugalho <rbug...@ibili.uc.pt> wrote:

> Hi,
> peak double precision throughput in CELL is expected to be one tenth of
> peak single precision throughput.

However, peak double precision throughput will be very easy to achieve,
while getting peak single precision throughput is really really hard.

anonymous

Aug 26, 2005, 6:03:58 PM

I noticed while reading your response, about CELL's application
specific instruction set ( language extensions ) included with using
IBM's (Toshiba/IBM) CELL processor, for MPEG and bit slice operations,
in contrast,

VLIW SMP MPP FORTH is a hypothetical solution for the MIMD shared
memory problem, synchronizing data access AND maintaining memory cache
consistency.

These problems were both SIMPLY solved by using a ( SMP MPP )
matrix microchip ( VLIW FORTH ) microcode engine architecture,
similarly, as in the following references from Mr. Moore and myself,(
URLs, *SMP MPP VLIW for machine code Java, Forth, C, Scheme, etc.,
http://groups.google.com/group/comp.lang.java.machine/msg/b400d03ddc0f5a4f?dmode=source&hl=en
, *Java decode alternative,
http://groups.google.com/group/comp.lang.java.machine/msg/38236e7c4267bb08?dmode=source&hl=en
) ( Or, google usenet, ask The Senate, write IBM/Defense, request
copies of my notes, etc., for VLIW SMP MPP and FORTH information. )

The essence of VLIW SMP MPP FORTH is an efficient scalable parallel
microprocessor architecture, and, importantly, is for manufacturer
customized instruction, such like that of the new IBM/Toshiba CELL
processor. ( only a few hundred of four thousand instruction openings
are defined, anyway, by me.) Those open, undefined instructions,
provide an unlimited sub-expression reduction possibility. ( If a
vertical market application needs a certain instruction feature, maybe,
for IBM or Intel, (
http://groups.google.com/group/comp.sys.ibm.pc.hardware.chips/msg/34df90dd71a0e1cd?dmode=source&hl=en
), FPU16, VID16, NET16, ..., CPU16s function as the traffic light
network for maybe a wide variety of vehicles, ( super scalable
architecture )),.

Simply, IBM/Toshiba CELL is easily outperformed, surpassed thru an
ultra high fabrication efficiency. ( Theory of co-divisional of
electronics and math expression limit(s),
http://groups.google.com/group/sci.math/msg/b5d2f119b8eeee56?dmode=source&hl=en
)

---

President Clinton is a jerk.

Andrew Reilly

Aug 26, 2005, 7:42:38 PM

On Fri, 26 Aug 2005 20:54:36 +0200, Terje Mathisen wrote:

> Eric P. wrote:
>
>> Terje Mathisen wrote:
>>
>>>...
>>>It has a full set of branch instructions that as a side-effect either
>>>enable or disable interrupts, i.e. critical sections are supposed to be
>>>handled this way.
>>
>>
>> Why would diddling interrupt enable as a branch side effect be a
>> benefit, as compared to the normal explicit disable & enable
>> instructions?
>
> It is only a benefit if this pair of instructions must be done a _lot_,
> which is why I assume it is the intended way of operation.

It would also make it easy to do atomic OS-trap-style things.
When masking off interrupts is separate from branching, you often
(well, I've encountered it, anyway) have to muck about to check pipeline
latencies to make sure that the interrupts are truly off at the time you
branch.

Since the SPEs have unshared memory, interrupt masking is as much as you
should need for atomic operations.

Mind you: I didn't think that the SPEs were intended to be doing much
multiprogramming themselves: they've got quite heavyweight state (BIG
register file) and no memory protection. They're clearly intended for
running single tasks to completion.

And then on the other hand, the obvious programming model is to operate
entirely within completion call-backs from the DMA engine that's running
the off-chip memory access program. Maybe it's reasonable to code small
queue manipulation routines without doing a full state swap.

Interesting beastie, anyway. I'd love to see some information about the
software architecture that they've obviously got in mind to hold
everything together.

--
Andrew

Andrew Reilly

Aug 26, 2005, 7:50:07 PM

They could reasonably ship with an optimised BLAS/LAPACK/FFTPACK library
that made good use of them. Then you'd just use it like a vector
supercomputer of some sort. Probably get quite good performance on that
sort of program.

I saw an article recently where ClearSpeed had done something like this
with their "50GFlop" co-processor card: they wrote a library that was
compatible with the Intel Performance Primitive (IPP) numeric library.
Code written against that would "just work (faster)".

Years and years ago, that was the model for interaction with the various
AT&T DSP32C and i860 floating point accelerator cards that were available.

Not as flexible as getting your own inner loops sped-up, but lots of real
work can be done anyway.

--
Andrew

Wes Felter

Aug 26, 2005, 10:10:00 PM

On 2005-08-26 08:56:45 -0500, Maynard Handley <nam...@name99.org> said:

> What I find absolutely bizarre (and not at all encouraging for the
> future of Cell as a general purpose processor as IBM and Sony people
> have occasionally claimed) is that there is STILL no document in this
> lot that describes how to handle the very real issues of models for
> handling the minuscule memory space available to each SPU.

Apparently this is still a research topic:

http://www.research.ibm.com/cellcompiler/compiler-mem-abstract.htm

--
Wes Felter - wes...@felter.org - http://felter.org/wesley/

Eric P.

Aug 27, 2005, 9:28:27 AM

Andrew Reilly wrote:
>
> On Fri, 26 Aug 2005 20:54:36 +0200, Terje Mathisen wrote:
>
> > Eric P. wrote:
> >
> >> Terje Mathisen wrote:
> >>
> >>>...
> >>>It has a full set of branch instructions that as a side-effect either
> >>>enable or disable interrupts, i.e. critical sections are supposed to be
> >>>handled this way.
> >>
> >>
> >> Why would diddling interrupt enable as a branch side effect be a
> >> benefit, as compared to the normal explicit disable & enable
> >> instructions?
> >
> > It is only a benefit if this pair of instructions must be done a _lot_,
> > which is why I assume it is the intended way of operation.
>
> It would also make it easy to do atomic os trap style things pretty
> easily. When masking off interrupts is separate from branching, you often
> (well, I've encountered it, anyway) have to muck about to check pipeline
> latencies to make sure that the interrupts are truly off at the time you
> branch.

The docs don't say anything about it helping with pipelines.
Typically if pipelines need to be drained that is done by the
hardware because otherwise it makes the software realllly
dependent on a specific hardware implementation. Not good.

It is also very unusual to manage interrupts this way.
Usually you want to save/push the current interrupt enable state
and disable, then restore the prior state. Yet I see no
ability to do this in the SPE. The only time interrupts
are ever explicitly enabled is during the boot sequence.
Doing explicit enables at any other time makes interrupt subroutines
impossible because the lower level routines reenable interrupts
when they should not. That is an interrupt management 101 mistake.

So I still don't see why it is designed this way.

> Since the SPEs have unshared memory, interrupt masking is as much as you
> should need for atomic operations.
>
> Mind you: I didn't think that the SPEs were intended to be doing much
> multiprogramming themselves: they've got quite heavyweight state (BIG
> register file) and no memory protection. They're clearly intended for
> running single tasks to completion.

At a minimum a slave needs at least 1 Master triggered interrupt
so it can interrupt a running task to terminate it and cause the
slave to move to the next item in its work queue. This would happen
if a parent thread died in the master cpu or during code development
to abort an errant infinite loop. In practice this would be a
general Command Message From Master interrupt, and then the
message would say what to do.

You would also want debugging and single stepping capabilities
and need to be able to force a register bank dump and/or load.

There are also exceptions: floating point, maybe invalid address,
maybe integer overflow, stack overflow, etc.

So there are a variety of interrupts and traps even the simplest
slave processor needs.

> And then on the other hand, the obvious programming model is to operate
> entirely within completion call-backs from the DMA engine that's running
> the off-chip memory access program. Maybe it's reasonable to code small
> queue manipulation routines without doing a full state swap.

It still needs an absolutely minimal monitor for control
and communications.

> Interesting beastie, anyway. I'd love to see some information about the
> software architecture that they've obviously got in mind to hold
> everything together.
>
> --
> Andrew

Eric

Alex Gibson

Aug 28, 2005, 12:59:49 AM

"Maynard Handley" <nam...@name99.org> wrote in message
news:name99-0F2E35.06564226082005@localhost...

Sounds pretty similar to the existing Cradle MDSP chips.

www.cradle.com
http://www.cradle.com/documentation/datasheets.shtml
http://www.cradle.com/documentation/application_notes.shtml

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
http://www.cradle.com/downloads/3056_CT3600_pbV3.pdf

The CT3600 supposedly can handle 16 streams of MP4 SP L3 at once.

They give you semaphore registers in the PE (processing elements)
which are risc cores which you can use to manage the DSE's ("dsp cores")

In the CT3400 chips you get one processing quad (4 PE's and 8 DSE's)
and one IO quad (2 PE's and 2 MTE's (memory transfer engines)).

PE's are called GPP in CT36xx chips.

You can only program the PE's in C; the rest is done using CLA, a C-like
assembly language.

They claim 29.5 billion MACs per second on 8 bit data for the CT3400.
They claim up to 96 Giga MACs for the CT3616.

Alex

Alexander Kjeldaas

Aug 29, 2005, 5:11:45 AM

Terje Mathisen wrote:
> Eric P. wrote:
>
>
>>Terje Mathisen wrote:
>>
>>
>>>...
>>>It has a full set of branch instructions that as a side-effect either
>>>enable or disable interrupts, i.e. critical sections are supposed to be
>>>handled this way.
>>
>>
>>Why would diddling interrupt enable as a branch side effect be a
>>benefit, as compared to the normal explicit disable & enable
>>instructions?
>
>
> It is only a benefit if this pair of instructions must be done a _lot_,
> which is why I assume it is the intended way of operation.
>

For one thing, the "Branch Indirect and Set Link if External Data" can
be very useful to reduce the cost of having "safe points" for
synchronization with a garbage collector. If you know you will only be
interrupted by your GC at specific points, you can do all sorts of
"illegal" stuff with your registers between these points.

You get a safe point with one instruction in this case, compared to
load, compare, branch on "normal" architectures. The cost is having to
reserve one register for the indirect branch address.
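
For comparison, a toy Python model of the "normal" polling scheme, where every safe point is an explicit load, compare and branch on a flag. All names here are hypothetical and nothing in it is specific to the SPU.

```python
class Mutator:
    """Toy mutator that only lets the collector in at explicit safe points."""

    def __init__(self) -> None:
        self.gc_requested = False   # set by the "external" collector
        self.collections = 0

    def safe_point(self) -> None:
        # On an ordinary CPU this is the load, compare, branch sequence.
        if self.gc_requested:
            self.gc_requested = False
            self.collections += 1   # stands in for the real GC handshake

    def run(self, items, request_at: int) -> int:
        total = 0
        for i, item in enumerate(items):
            if i == request_at:
                self.gc_requested = True   # simulated external request
            total += item       # between safe points, registers are private
            self.safe_point()   # the only place the collector may step in
        return total


m = Mutator()
print(m.run([1, 2, 3, 4], request_at=2), m.collections)
```

The single-instruction version folds the whole body of safe_point's test into the branch itself, at the price of the reserved register mentioned above.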

astor

Joe Seigh

Aug 29, 2005, 8:46:50 AM

RCU, which is a form of GC, has the concept of safe points. They're called
quiescent states. And the overhead is even lower, usually just a store
into local storage. No call to another function for "safe point" processing.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Eric P.

Aug 29, 2005, 10:30:39 AM

"Eric P." wrote:
>
> At a minimum a slave needs at least 1 Master triggered interrupt
> so it can interrupt a running task to terminate it and cause the
> slave to move to the next item in its work queue. This would happen
> if a parent thread died in the master cpu or during code development
> to abort an errant infinite loop. In practice this would be a
> general Command Message From Master interrupt, and then the
> message would say what to do.
>
> You would also want debugging and single stepping capabilities
> and need to be able to force a register bank dump and/or load.
>
> There are also exceptions: floating point, maybe invalid address,
> maybe integer overflow, stack overflow, etc.
>
> So there are a variety of interrupts and traps even the simplest
> slave processor needs.

On closer inspection, the SPUs are simpler than a slave processor.
This is not a classic master slave asymmetric multiprocessor.
SPUs do not have exceptions nor does it appear they require their
own control program.

The PPE (master) has direct control over each SPU through 3
control/status registers. These allow the PPE to load/read an SPU
program counter, run/stop the SPU, and a status register shows
the reason why the SPU stopped (e.g. illegal instruction).
There is also a way for the PPE to single step an SPU.
SPUs run until there is a problem or the job is complete and
just stop. If anything goes wrong, a status bit indicates so and
the PPE must diagnose the problem.

If an SPU job wants to take action based on any condition,
e.g. floating point underflow, then it must manually test
for the condition and branch to handler code.

The docs indicate there is some method for PPE to load and
unload a whole register context but don't say what it is.
This would only be needed for debugging as SPUs are not
intended to context switch, just run a single job to completion.

In such an architecture, that the SPU even supports interrupts seems
somewhat anomalous because the PPE is responsible for its control.
The architects may be thinking that an SPU can also act as a real
time controller and/or IO coprocessor. Needs further investigation.

Eric