The following papers define the Cell specification and will be posted to the IBM Semiconductor Solutions Technical Library in September. Readers with a current IBM ID are invited to see them early and gain access to participate in the Power Architecture™ zone's Cell discussion forum. If you are not already a registered user, you can register now (Note: Registered users will need to sign in to download).
Cell architecture from 20,000 feet A high-level description of the Cell Broadband Engine (CBE), the Synergistic Processing Elements (SPEs), and how they work together. 2 pages, 27KB | HTML (no registration required)
Cell Broadband Engine Architecture V1.0 Like the Power Architecture, but different -- the CBE Architecture builds upon knowledge contained in the Power Architecture "books" and describes the app-level User Mode Environment (UME) and the OS-level Privileged Mode Environment (PME) in astonishingly rich detail. 327 pages, 4.51MB | PDF (registration required)
Synergistic Processor Unit (SPU) Instruction Set Architecture V1.0 Somewhere between a general-purpose processor and special-purpose hardware lies the Cell SPU, designed to provide leadership performance in game, media, and broadband applications. This document describes the Instruction Set Architecture (ISA) of the Synergistic Processor Unit (SPU). Get to know all of its instructions. 30 pages, 1.89MB | PDF (registration required)
SPU Application Binary Interface Specification V1.3 Including low-level system and language binding information, information on loading and linking, and coding examples, this specification defines the system interface for SPU-targeted object files to help ensure maximum binary portability across implementations. 37 pages, 357KB | PDF (registration required)
SPU Assembly Language Specification V1.2 Unleash the full processing power of the SPUs -- you know you want to! This specification will prove an indispensable aid in your efforts as it takes you on a carefully-worded journey describing SPU assembly-level syntax and machine-dependent features for the GNU assembler (but serves as an example specification for other SPU assemblers as well). 30 pages, 122KB | PDF (registration required)
SPU C/C++ Language Extensions V2.0 Describes the basic data types, operations on these data types, and directives and program controls required by the CBE specification; includes sample code. 98 pages, 462KB | PDF (registration required)
Cell forum Technical discussion of these documents is going on now at the Cell Architecture forum; comments and errata are welcome there also, or by e-mail to CBE_Documentat...@us.ibm.com. 1 forum, HTML format (registration required)
-- Del Cecchi "This post is my own and doesn’t necessarily represent IBM’s positions, strategies or opinions."
Naturally, I started reading the SPU asm manual, and that makes it immediately obvious that this is a cpu directly targeted at MPEG-style video processing:
absdb Absolute difference of bytes
avgb Average bytes: dest = (a+b+1) >> 1 (MPEG interpolation)
cg Carry Generate: Target = carry out of (A+B)
addx Add word extended: Target = A+B+(Target & 1)
Notice the last one! It uses the least significant bit of each part of the target register as input to an AddWithCarry operation, which means that you need three read ports.
This pair of opcodes seems to me to be meant as building blocks for extended/arbitrary precision calculations.
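As a rough illustration of the extended-precision idea, here is a minimal Python sketch that models the carry-generate and add-extended operations on plain 32-bit integers and chains them into a 64-bit add. The function names are made up for this sketch. It also ignores the fact that the real instructions operate on four word slots at once, so on the actual hardware the carry would presumably have to be moved into the slot of the more significant word first.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit word slot


def carry_generate(a, b):
    """Model of Carry Generate: 1 if the 32-bit sum a + b carries out, else 0."""
    return ((a & MASK32) + (b & MASK32)) >> 32


def add_extended(a, b, target):
    """Model of Add Extended: a + b + (target & 1), truncated to 32 bits."""
    return (a + b + (target & 1)) & MASK32


def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit add built from two 32-bit limbs (hypothetical helper)."""
    lo = (a_lo + b_lo) & MASK32          # ordinary word add on the low limbs
    carry = carry_generate(a_lo, b_lo)   # carry out of the low limbs
    hi = add_extended(a_hi, b_hi, carry) # high limbs plus the incoming carry
    return hi, lo
```

Only the least significant bit of the third operand is used, which is why the carry word can double as the destination register.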
It has a full set of branch instructions that as a side-effect either enable or disable interrupts, i.e. critical sections are supposed to be handled this way.
It seems to handle sub-register size operations with a set of opcodes, where one of a group of GenerateMask operations is used to generate an input mask for a general shuffle operation.
...
There's a bunch of generalized three-input FMAC opcodes, all working on SIMD data, like fnms (T = Acc - (a * b)).
It has frsqest and frest to generate approximate reciprocal square root and reciprocal lookup values. However, these operations do not seem to deliver results in a standard format; instead, each resulting element consists of two parts, a base and a step, so that a following fi (Floating Interpolate) can improve upon the table lookup results.
I'm guessing you'd then want one NR iteration to get somewhere close to IEEE single precision.
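The reason a single Newton-Raphson (NR) step might be enough is that each step roughly squares the relative error, so the number of correct bits doubles. The sketch below shows the standard NR updates for 1/a and 1/sqrt(a) in Python. The starting accuracy of about 12 bits is an assumption chosen purely for illustration, not a figure taken from the SPU documents, and the arithmetic here is Python double precision rather than SPU single precision.

```python
import math


def nr_reciprocal_step(a, x):
    """One Newton-Raphson step towards 1/a, starting from estimate x."""
    return x * (2.0 - a * x)


def nr_rsqrt_step(a, y):
    """One Newton-Raphson step towards 1/sqrt(a), starting from estimate y."""
    return y * (1.5 - 0.5 * a * y * y)


def relative_error(estimate, exact):
    return abs(estimate - exact) / abs(exact)


a = 3.0
assumed_error = 2.0 ** -12                   # assumption: estimate good to ~12 bits
x0 = (1.0 / a) * (1.0 + assumed_error)       # stand-in for a table-based estimate
y0 = (1.0 / math.sqrt(a)) * (1.0 + assumed_error)

x1 = nr_reciprocal_step(a, x0)
y1 = nr_rsqrt_step(a, y0)
```

With a 12-bit starting point, one step lands at roughly 24 bits, which is about the size of a single precision mantissa.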
The shufb (Shuffle bytes) opcode seems like a small extension to the Altivec Permute: in addition to using 5 bits to select one of 32 possible input bytes, it can also specify three different immediate values (0, 0x80 and 0xFF), which would be needed to make it work with the GenerateMask operations mentioned above.
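A minimal Python model of such a byte shuffle may make the description concrete. The function name is hypothetical, and the exact control-byte bit patterns used below for the three constants (10xxxxxx for 0x00, 110xxxxx for 0xFF, 111xxxxx for 0x80) are an assumption that should be checked against the ISA document.

```python
def shuffle_bytes(a, b, control):
    """Model of a 16-byte shuffle: each control byte picks one result byte.

    a, b and control are 16-byte sequences. A control byte either selects
    one of the 32 bytes of a followed by b (low 5 bits), or one of three
    constants (assumed encodings, see the lead-in).
    """
    source = bytes(a) + bytes(b)            # 32 candidate input bytes
    result = bytearray(16)
    for i, c in enumerate(control):
        if (c & 0xC0) == 0x80:              # 10xxxxxx -> constant 0x00
            result[i] = 0x00
        elif (c & 0xE0) == 0xC0:            # 110xxxxx -> constant 0xFF
            result[i] = 0xFF
        elif (c & 0xE0) == 0xE0:            # 111xxxxx -> constant 0x80
            result[i] = 0x80
        else:                               # otherwise select by low 5 bits
            result[i] = source[c & 0x1F]
    return bytes(result)
```

For example, a control vector of `bytes(range(16))` simply copies `a`, while `bytes(range(16, 32))` copies `b`; inserting a single byte into a register amounts to a control vector that takes one position from the other operand.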
All in all, a pretty general set of opcodes for SIMD data processing. This is particularly obvious in the way each of the possible operations has forms to work on either a set of input data (reg or immediate) or on its complement. This saves a lot of bubble-introducing mask setup operations, but is normally not considered to be required on a regular cpu.
Terje -- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
One significant point is that single precision (32-bit) floating point arithmetic is only available as round-to-zero mode. This may be fine for some graphics algorithms, but for large scale computing it just won't do. If you want round-to-nearest mode on CELL you'll have to go to double precision, at half the throughput. At that point you are up against commoner quad processors like POWER, etc.
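To see why the rounding mode matters for long computations, the following Python sketch emulates single precision with both rounding modes and accumulates a running sum. It is only a software model built on the standard `struct` module; the helper names are made up here, and nothing in it reflects how the SPU hardware is implemented. Values too large for single precision are not handled.

```python
import struct


def f32_nearest(x):
    """Round a Python float to single precision, round-to-nearest."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_toward_zero(x):
    """Round a Python float to single precision, round-toward-zero."""
    y = f32_nearest(x)
    if abs(y) > abs(x):
        # Nearest rounding moved away from zero: step back one unit in
        # the last place by decrementing the raw bit pattern.
        bits = struct.unpack("<I", struct.pack("<f", y))[0]
        y = struct.unpack("<f", struct.pack("<I", bits - 1))[0]
    return y


def accumulate(step, count, rounding):
    """Add `step` to a running total `count` times, rounding every sum."""
    step = f32_nearest(step)
    total = 0.0
    for _ in range(count):
        total = rounding(total + step)
    return total


count = 10000
truncated = accumulate(0.1, count, f32_toward_zero)
nearest = accumulate(0.1, count, f32_nearest)
exact = count * f32_nearest(0.1)
```

Because truncation always errs in the same direction for same-signed terms, the truncated total can never exceed either the round-to-nearest total or the exact sum, and the bias grows with the length of the computation.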
> ... > It has a full set of branch instructions that as a side-effect either > enable or disable interrupts, i.e. critical sections are supposed to be > handled this way.
Why would diddling interrupt enable as a branch side effect be a benefit, as compared to the normal explicit disable & enable instructions?
What I find absolutely bizarre (and not at all encouraging for the future of Cell as a general purpose processor as IBM and Sony people have occasionally claimed) is that there is STILL no document in this lot that describes how to handle the very real issues of models for handling the minuscule memory space available to each SPU. I've said it before and will say it again; this is the Achilles' heel of these beasts. In this day and age, people are simply not interested in dicking around with segments, overlays and all that weird crap from the 80's. Sure, they will do it for games; I don't deny the value of this part in games and game-like boxes (PVRs, DTVs, audio-mixing consoles and so on), but a general purpose box (running, presumably, Linux), where I care about all-round performance --- I want Apache and MySQL and gcc and perl and php to run fast --- I just don't see it.
My guess is that some of these "workstations" will ship with some very specialized code on them that does one thing and one thing only well, maybe H264 encode, maybe some bio algorithm, and that'll be the face-saving exit strategy from this bizarre claim that never made sense in the first place.
On Fri, 26 Aug 2005 07:51:26 +0000, Zeb wrote: > One significant point is that single precision (32-bit) floating point > arithmetic is only available as round-to-zero mode. This may be fine for > some graphics algorithms, but for large scale computing it just won't do. > If you want round-to-nearest mode on CELL you'll have to go to double > precision, at half the throughput. At that point you are up against > commoner quad processors like POWER, etc.
>>... >>It has a full set of branch instructions that as a side-effect either >>enable or disable interrupts, i.e. critical sections are supposed to be >>handled this way.
> Why would diddling interrupt enable as a branch side effect be a > benefit, as compared to the normal explicit disable & enable > instructions?
It is only a benefit if this pair of instructions must be done a _lot_, which is why I assume it is the intended way of operation.
Terje
-- - <Terje.Mathi...@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
While reading your response, I noticed CELL's application-specific instruction set (language extensions) for MPEG and bit-slice operations, which comes with using the IBM (Toshiba/IBM) CELL processor. In contrast, VLIW SMP MPP FORTH is a hypothetical solution for the MIMD shared-memory problem: synchronizing data access AND maintaining memory cache consistency.
These problems were both SIMPLY solved by using a ( SMP MPP ) matrix microchip ( VLIW FORTH ) microcode engine architecture, as in the following references from Mr. Moore and myself ( URLs, *SMP MPP VLIW for machine code Java, Forth, C, Scheme, etc., http://groups.google.com/group/comp.lang.java.machine/msg/b400d03ddc0... , *Java decode alternative, http://groups.google.com/group/comp.lang.java.machine/msg/38236e7c426... ) ( Or, google usenet, ask The Senate, write IBM/Defense, request copies of my notes, etc., for VLIW SMP MPP and FORTH information. )
The essence of VLIW SMP MPP FORTH is an efficient, scalable parallel microprocessor architecture that, importantly, is meant for manufacturer-customized instructions, such as those of the new IBM/Toshiba CELL processor ( only a few hundred of four thousand instruction openings are defined, anyway, by me ). Those open, undefined instructions provide an unlimited sub-expression reduction possibility. ( If a vertical market application needs a certain instruction feature, maybe, for IBM or Intel, ( http://groups.google.com/group/comp.sys.ibm.pc.hardware.chips/msg/34d... ), FPU16, VID16, NET16, ..., CPU16s function as the traffic light network for maybe a wide variety of vehicles ( super scalable architecture ). )
On Fri, 26 Aug 2005 20:54:36 +0200, Terje Mathisen wrote: > Eric P. wrote:
>> Terje Mathisen wrote:
>>>... >>>It has a full set of branch instructions that as a side-effect either >>>enable or disable interrupts, i.e. critical sections are supposed to be >>>handled this way.
>> Why would diddling interrupt enable as a branch side effect be a >> benefit, as compared to the normal explicit disable & enable >> instructions?
> It is only a benefit if this pair of instructions must be done a _lot_, > which is why I assume it is the intended way of operation.
It would also make it easy to do atomic OS-trap-style things. When masking off interrupts is separate from branching, you often (well, I've encountered it, anyway) have to muck about to check pipeline latencies to make sure that the interrupts are truly off at the time you branch.
Since the SPEs have unshared memory, interrupt masking is as much as you should need for atomic operations.
Mind you: I didn't think that the SPEs were intended to be doing much multiprogramming themselves: they've got quite heavyweight state (BIG register file) and no memory protection. They're clearly intended for running single tasks to completion.
And then on the other hand, the obvious programming model is to operate entirely within completion call-backs from the DMA engine that's running the off-chip memory access program. Maybe it's reasonable to code small queue manipulation routines without doing a full state swap.
Interesting beastie, anyway. I'd love to see some information about the software architecture that they've obviously got in mind to hold everything together.
On Fri, 26 Aug 2005 13:56:45 +0000, Maynard Handley wrote: > What I find absolutely bizarre (and not at all encouraging for the > future of Cell as a general purpose processor as IBM and Sony people > have occasionally claimed) is that there is STILL no document in this > lot that describes how to handle the very real issues of models for > handling the minuscule memory space available to each SPU. > I've said it before and will say it again; this is the Achilles' heel of > these beasts. In this day and age, people are simply not interested in > dicking around with segments, overlays and all that weird crap from the > 80's. Sure, they will do it for games; I don't deny the value of this > part in games and game-like boxes (PVRs, DTVs, audio-mixing consoles and > so on), but a general purpose box (running, presumably, Linux), where I > care about all-round performance --- I want Apache and MySQL and gcc and > perl and php to run fast --- I just don't see it.
> My guess is that some of these "workstations" will ship with some very > specialized code on them that does one thing and one thing only well, > maybe H264 encode, maybe some bio algorithm, and that'll be the > face-saving exit strategy from this bizarre claim that never made sense > in the first place.
They could reasonably ship with an optimised BLAS/LAPACK/FFTPACK library that made good use of them. Then you'd just use it like a vector supercomputer of some sort. Probably get quite good performance on that sort of program.
I saw an article recently where ClearSpeed had done something like this with their "50GFlop" co-processor card: they wrote a library that was compatible with the Intel Integrated Performance Primitives (IPP) numeric library. Code written against that would "just work (faster)".
Years and years ago, that was the model for interaction with the various AT&T DSP32C and i860 floating point accelerator cards that were available.
Not as flexible as getting your own inner loops sped-up, but lots of real work can be done anyway.
On 2005-08-26 08:56:45 -0500, Maynard Handley <nam...@name99.org> said:
> What I find absolutely bizarre (and not at all encouraging for the > future of Cell as a general purpose processor as IBM and Sony people > have occasionally claimed) is that there is STILL no document in this > lot that describes how to handle the very real issues of models for > handling the minuscule memory space available to each SPU.
> On Fri, 26 Aug 2005 20:54:36 +0200, Terje Mathisen wrote:
> > Eric P. wrote:
> >> Terje Mathisen wrote:
> >>>... > >>>It has a full set of branch instructions that as a side-effect either > >>>enable or disable interrupts, i.e. critical sections are supposed to be > >>>handled this way.
> >> Why would diddling interrupt enable as a branch side effect be a > >> benefit, as compared to the normal explicit disable & enable > >> instructions?
> > It is only a benefit if this pair of instructions must be done a _lot_, > > which is why I assume it is the intended way of operation.
> It would also make it easy to do atomic os trap style things pretty > easily. When masking off interrupts is separate from branching, you often > (well, I've encountered it, anyway) have to muck about to check pipeline > latencies to make sure that the interrupts are truly off at the time you > branch.
The docs don't say anything about it helping with pipelines. Typically, if pipelines need to be drained, that is done by the hardware, because otherwise it makes the software really dependent on a specific hardware implementation. Not good.
It is also very unusual to manage interrupts this way. Usually you want to save/push the current interrupt enable state and disable, then restore the prior state. Yet I see no ability to do this in the SPE. The only time interrupts are ever explicitly enabled is during the boot sequence. Doing explicit enables at any other time makes interrupt subroutines impossible because the lower level routines reenable interrupts when they should not. That is an interrupt management 101 mistake.
So I still don't see why it is designed this way.
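The nesting problem with explicit enables can be shown with a small Python model. Everything here is hypothetical (the class, the method names, the single boolean standing in for the interrupt-enable state); it only illustrates the save/disable/restore idiom versus an unconditional enable, not any real SPE mechanism.

```python
class InterruptFlag:
    """Toy stand-in for a processor's interrupt-enable bit."""

    def __init__(self):
        self.enabled = True

    def save_and_disable(self):
        """Return the previous state and mask interrupts."""
        previous = self.enabled
        self.enabled = False
        return previous

    def restore(self, previous):
        """Put back whatever state the caller saved earlier."""
        self.enabled = previous


def inner_with_restore(flag):
    previous = flag.save_and_disable()
    # ... short critical section ...
    flag.restore(previous)       # leaves the caller's masking intact


def inner_with_explicit_enable(flag):
    flag.enabled = False
    # ... short critical section ...
    flag.enabled = True          # unconditionally re-enables: the mistake


def outer(flag, inner):
    """Outer critical section that calls a lower-level routine.

    Returns True if interrupts were still masked after the call.
    """
    previous = flag.save_and_disable()
    inner(flag)
    still_masked = not flag.enabled
    flag.restore(previous)
    return still_masked
```

With the restore-based routine the outer section stays protected; with the explicit enable, interrupts are back on in the middle of the outer critical section.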
> Since the SPEs have unshared memory, interrupt masking is as much as you > should need for atomic operations.
> Mind you: I didn't think that the SPEs were intended to be doing much > multiprogramming themselves: they've got quite heavyweight state (BIG > register file) and no memory protection. They're clearly intended for > running single tasks to completion.
At a minimum a slave needs at least 1 Master triggered interrupt so it can interrupt a running task to terminate it and cause the slave to move to the next item in its work queue. This would happen if a parent thread died in the master cpu or during code development to abort an errant infinite loop. In practice this would be a general Command Message From Master interrupt, and then the message would say what to do.
You would also want debugging and single stepping capabilities and need to be able to force a register bank dump and/or load.
There are also exceptions: floating point, maybe invalid address, maybe integer overflow, stack overflow, etc.
So there are a variety of interrupts and traps even the simplest slave processor needs.
> And then on the other hand, the obvious programming model is to operate > entirely within completion call-backs from the DMA engine that's running > the off-chip memory access program. Maybe it's reasonable to code small > queue manipulation routines without doing a full state swap.
It still needs an absolutely minimal monitor for control and communications.
> Interesting beastie, anyway. I'd love to see some information about the > software architecture that they've obviously got in mind to hold > everything together.
> What I find absolutely bizarre (and not at all encouraging for the > future of Cell as a general purpose processor as IBM and Sony people > have occasionally claimed) is that there is STILL no document in this > lot that describes how to handle the very real issues of models for > handling the minuscule memory space available to each SPU. > I've said it before and will say it again; this is the Achilles' heel of > these beasts. In this day and age, people are simply not interested in > dicking around with segments, overlays and all that weird crap from the > 80's. Sure, they will do it for games; I don't deny the value of this > part in games and game-like boxes (PVRs, DTVs, audio-mixing consoles and > so on), but a general purpose box (running, presumably, Linux), where I > care about all-round performance --- I want Apache and MySQL and gcc and > perl and php to run fast --- I just don't see it.
> My guess is that some of these "workstations" will ship with some very > specialized code on them that does one thing and one thing only well, > maybe H264 encode, maybe some bio algorithm, and that'll be the > face-saving exit strategy from this bizarre claim that never made sense > in the first place.
Sounds pretty similar to the existing Cradle MDSP chips.
>>>... >>>It has a full set of branch instructions that as a side-effect either >>>enable or disable interrupts, i.e. critical sections are supposed to be >>>handled this way.
>>Why would diddling interrupt enable as a branch side effect be a >>benefit, as compared to the normal explicit disable & enable >>instructions?
> It is only a benefit if this pair of instructions must be done a _lot_, > which is why I assume it is the intended way of operation.
For one thing, the "Branch Indirect and Set Link if External Data" can be very useful to reduce the cost of having "safe points" for synchronization with a garbage collector. If you know you will only be interrupted by your GC at specific points, you can do all sorts of "illegal" stuff with your registers between these points.
You get a safe point with one instruction in this case, compared to load, compare, branch on "normal" architectures. The cost is having to reserve one register for the indirect branch address.
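For comparison, the conventional "load, compare, branch" safe point looks roughly like the Python sketch below. The collector object, its flag and the mutator loop are all invented for illustration; setting the flag inside the loop merely stands in for a request that would really arrive asynchronously from elsewhere.

```python
class Collector:
    """Toy collector that asks the mutator to stop at its next safe point."""

    def __init__(self):
        self.stop_requested = False
        self.stops = 0

    def handle_safe_point(self, live_value):
        # A real collector would scan roots here; the model only counts.
        self.stops += 1
        self.stop_requested = False


def mutator(collector, items, request_at):
    """Sum `items`, polling for a collector request once per iteration."""
    total = 0
    for index, item in enumerate(items):
        if index == request_at:
            collector.stop_requested = True   # stand-in for an async request
        total += item                          # registers may hold anything here
        # Safe point: load the flag, compare it, branch if set.
        if collector.stop_requested:
            collector.handle_safe_point(total)
    return total
```

The poll costs three operations on every pass even when no request is pending, which is the overhead a single conditional branch-and-link on an external event would fold into one instruction.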
>>>> ... >>>> It has a full set of branch instructions that as a side-effect either >>>> enable or disable interrupts, i.e. critical sections are supposed to be >>>> handled this way.
>>> Why would diddling interrupt enable as a branch side effect be a >>> benefit, as compared to the normal explicit disable & enable >>> instructions?
>> It is only a benefit if this pair of instructions must be done a _lot_, >> which is why I assume it is the intended way of operation.
> For one thing, the "Branch Indirect and Set Link if External Data" can > be very useful to reduce the cost of having "safe points" for > synchronization with a garbage collector. If you know you will only be > interrupted by your GC at specific points, you can do all sorts of > "illegal" stuff with your registers between these points.
> You get a safe point with one instruction in this case, compared to > load, compare, branch on "normal" architectures. The cost is having to > reserve one register for the indirect branch address.
RCU, which is a form of GC, has the concept of safe points. They're called quiescent states. And the overhead is even lower, usually just a store into local storage. No call to another function for "safe point" processing.
-- Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
> At a minimum a slave needs at least 1 Master triggered interrupt > so it can interrupt a running task to terminate it and cause the > slave to move to the next item in its work queue. This would happen > if a parent thread died in the master cpu or during code development > to abort an errant infinite loop. In practice this would be a > general Command Message From Master interrupt, and then the > message would say what to do.
> You would also want debugging and single stepping capabilities > and need to be able to force a register bank dump and/or load.
> There are also exceptions: floating point, maybe invalid address, > maybe integer overflow, stack overflow, etc.
> So there are a variety of interrupts and traps even the simplest > slave processor needs.
On closer inspection, the SPUs are simpler than a slave processor. This is not a classic master slave asymmetric multiprocessor. SPUs do not have exceptions nor does it appear they require their own control program.
The PPE (master) has direct control over each SPU through 3 control/status registers. These allow the PPE to load/read an SPU program counter, run/stop the SPU, and a status register shows the reason why the SPU stopped (e.g. illegal instruction). There is also a way for the PPE to single-step an SPU. SPUs run until there is a problem or the job is complete and just stop. If anything goes wrong, a status bit indicates so and the PPE must diagnose the problem.
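The control flow described above can be pictured with a toy Python model. Every name in it (the class, the stop reasons, the three-opcode "instruction set") is made up for illustration; it mirrors only the idea of a loadable program counter, a run/stop control and a status value, not the actual register layout.

```python
from enum import Enum


class StopReason(Enum):
    NONE = 0
    JOB_DONE = 1
    ILLEGAL_INSTRUCTION = 2


class ToySpu:
    """Toy SPU: a program counter, a run/stop bit and a status value."""

    def __init__(self, program):
        self.program = program        # list of opcode strings, ending in "stop"
        self.pc = 0                   # loadable and readable by the master
        self.running = False          # run/stop control
        self.status = StopReason.NONE

    def step(self):
        """Execute one instruction (also usable for single-stepping)."""
        opcode = self.program[self.pc]
        if opcode == "nop":
            self.pc += 1
        elif opcode == "stop":
            self.running = False
            self.status = StopReason.JOB_DONE
        else:                          # anything unknown halts the SPU
            self.running = False
            self.status = StopReason.ILLEGAL_INSTRUCTION


def ppe_run(spu, entry):
    """Master side: load the PC, start the SPU, wait, then read the status."""
    spu.pc = entry
    spu.status = StopReason.NONE
    spu.running = True
    while spu.running:
        spu.step()
    return spu.status
```

In this picture the SPU never handles its own faults: it just halts, and the master reads the status and the stopped program counter to work out what happened.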
If an SPU job wants to take action based on any condition, e.g. floating point underflow, then it must manually test for the condition and branch to handler code.
The docs indicate there is some method for PPE to load and unload a whole register context but don't say what it is. This would only be needed for debugging as SPUs are not intended to context switch, just run a single job to completion.
In such an architecture, that the SPU even supports interrupts seems somewhat anomalous because the PPE is responsible for its control. The architects may be thinking that an SPU can also act as a real time controller and/or IO coprocessor. Needs further investigation.