
The next ARM-coders Challenge


Richard Earnshaw

Feb 10, 1994, 5:06:17 AM
Does anyone know of a faster implementation of ffs () for the ARM than the
one below (15 instructions).

rsb r1, r0, #0 @ 2's complement of r0
and r1, r0, r1 @ r1 = lowest non-zero bit in r0, or 0
movs r2, r1, lsr #16 @ r2 = 0 if top 16 bits zero
movne r2, #16 @ Otherwise r2 = 16
moveqs r1, r1, asl #16 @ if r2 == 0 put lowest 16 bits in top
addne r2, r2, #1 @ if r0 != 0 add 1
tst r1, #0xff000000 @ Start binary search of result....
addne r2, r2, #8
orr r1, r1, r1, asl #8 @ (Top byte now contains bit)
tst r1, #0xf0000000 @ (%11110000)
addne r2, r2, #4
tst r1, #0xcc000000 @ (%11001100)
addne r2, r2, #2
tst r1, #0xaa000000 @ (%10101010)
addne r2, r2, #1

Note: r2 and r0 can be made the same register if the value in r0 is no longer
required; alternatively, r0 and r1 could be merged by using r2 as the scratch
register in the first instruction.
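
For anyone who would rather follow the algorithm in portable code, here is a
rough C rendering of the same isolate-then-binary-search idea (an illustrative
sketch, not the posted routine; the wide masks play the role of the
%1100/%1010 patterns above):

unsigned ffs_binsearch(unsigned x)
{
    unsigned n = 1;            /* result is 1-based; 0 means no bit set */
    x &= 0u - x;               /* isolate the lowest set bit (0 if x == 0) */
    if (x == 0)
        return 0;
    if (x & 0xFFFF0000u) n += 16;
    if (x & 0xFF00FF00u) n += 8;
    if (x & 0xF0F0F0F0u) n += 4;
    if (x & 0xCCCCCCCCu) n += 2;
    if (x & 0xAAAAAAAAu) n += 1;
    return n;
}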

p.a.harris, CAD Centre

Feb 11, 1994, 5:48:09 AM
Richard Earnshaw asks:

>Does anyone know of a faster implementation of ffs () for the ARM than the
>one below (15 instructions).


What is the ffs() routine supposed to do?

Peter

Tribbeck J P

Feb 14, 1994, 5:50:47 PM
In article I...@cs.utwente.nl, r...@cs.utwente.nl (Richard Earnshaw) writes:
>Does anyone know of a faster implementation of ffs () for the ARM than the
>one below (15 instructions).

Excuse me for being ignorant, but what does the ffs() routine do?

Cheers,

Jason Tribbeck.

Richard Earnshaw

Feb 15, 1994, 4:32:30 AM

ffs (x) returns the bit position of the lowest bit in x that is not zero.

As an example of its use, if x is known to have exactly one bit non-zero,
then log2 (x) = ffs(x) - 1, and is significantly faster to calculate.

Given that there are 32 possible positions for the lowest non-zero bit to be
in, the trick is to try to find implementations that handle more than one bit
per instruction.
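
A straight C rendering of that spec (1 for the least significant bit, 32 for
the most significant, 0 when the argument is zero) might look like the sketch
below - a reference version only, with no attempt at speed:

unsigned ffs_ref(unsigned x)
{
    unsigned n = 1;
    if (x == 0)
        return 0;              /* ffs(0) is defined to be 0 */
    while ((x & 1u) == 0) {    /* shift until the lowest set bit reaches bit 0 */
        x >>= 1;
        n++;
    }
    return n;                  /* 1-based bit position */
}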

Richard.

Oliver Betts

Feb 15, 1994, 9:26:52 AM
In article <1994Feb15.0...@infodev.cam.ac.uk>,

Richard Earnshaw <rw...@cl.cam.ac.uk> wrote:
>Given that there are 32 possible positions for the lowest non-zero bit to be
>in, the trick is to try to find implementations that handle more than one bit
>per instruction.

Arguably 33 possible positions, since the lowest non-zero bit may be in
any of the 32 bits, or the input may be zero. There are 33 outputs
anyway (ffs(1<<31)=32, ffs(1)=1, ffs(0)=0).

To answer the original question, I haven't found a quicker algorithm.

I'm curious as to why you're interested in optimising this particular
routine - it seems a tad obscure. Or is there a major use I haven't
spotted?

Ol
--
Wanted: Reality - must be in good condition. Can collect from Cambridge area.

Moeroa

Feb 15, 1994, 11:14:54 AM
Richard Earnshaw (r...@cs.utwente.nl) wrote:

>Does anyone know of a faster implementation of ffs () for the ARM than the
>one below (15 instructions).
>

> rsb r1, r0, #0 @ 2's complement of r0
> and r1, r0, r1 @ r1 = lowest non-zero bit in r0, or 0
> movs r2, r1, lsr #16 @ r2 = 0 if top 16 bits zero

> ..etc..
> [routine to return the number of the lowest bit set in r0]


Yep. I can do it in 6 instructions!
If bit 0 is set, the routine only takes 3 instructions.

My routine is:

ffs
MOVS r1, r0, ASR #1
MOV r2, #0
BCS out
loop
ADD r2, r2,#1
MOVS r1, r1,ASR#1
BCC loop
out


Simple, eh? But on average it's faster than yours...

Why?

Don't forget that 50% of <N>-bit numbers have bit 0 set,
and ~88% of numbers have one of the bottom three bits set.

ie.

---------------------------------------------
Bit     Fraction that have      Total so far
        this bit as lowest set
---------------------------------------------
 0      0.5                     0.5
 1      0.25                    0.75
 2      0.125                   0.875
 3      0.0625                  0.9375
 .
 .
31      0.0000000002..          1.0
---------------------------------------------


So for each of the 32 possibilities,
(and ignoring *exact* instruction timings):


Bit     Instructions
(b)     (n)             n * P(b)
------  --------------  ------------------------
 0       3               3 * 0.5       = 1.5
 1       6               6 * 0.25      = 1.5
 2       9               9 * 0.125     = 1.125
..      ..etc..
31      96              96 * 2x10^-10  = ~2x10^-8
-------------------------------------------------
                        Total          = 5.99999..


So for random values of r0, the average number of
instructions executed is 6.
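
That average is just the sum of 3*(b+1)*2^-(b+1) over the bit positions; a
tiny C check of the arithmetic (ignoring the r0 = 0 case, which this loop
never terminates on):

#include <stdio.h>

int main(void)
{
    double expected = 0.0, p = 0.5;     /* P(lowest set bit is bit 0) = 0.5 */
    for (int b = 0; b < 32; b++) {
        expected += 3.0 * (b + 1) * p;  /* bit b costs 3*(b+1) instructions */
        p *= 0.5;
    }
    printf("expected instructions = %.9f\n", expected);  /* prints ~5.999999976 */
    return 0;
}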

However, in practice (I tried it), my way is only about 20% faster. This is
for various reasons: e.g. the branches cost extra cycles, and a few of your
instructions are conditional.


Paul

David Seal

Feb 15, 1994, 12:46:16 PM
In article <1994Feb15.0...@infodev.cam.ac.uk> rw...@cl.cam.ac.uk
(Richard Earnshaw) writes:

>ffs (x) returns the bit position of the lowest bit in x that is not zero.
>
>As an example of its use, if x is known to have exactly one bit non-zero,
>then log2 (x) = ffs(x) - 1, and is significantly faster to calculate.
>
>Given that there are 32 possible positions for the lowest non-zero bit to be
>in, the trick is to try to find implementations that handle more than one bit
>per instruction.

I believe two clarifications are required:

(a) The "bit position" you refer to is 32 for the most significant bit and 1
for the least significant bit - i.e. one more than the standard ARM bit
numbering. This is necessary to make your identity work.

(b) ffs(0) = 0 (deduced from the code you gave for the function).

Anyway, here's my latest attempt:

; Operand in R0, register R1 is free, R2 addresses a byte table 'Table'
; (defined below)

RSB R1,R0,#0 ;Standard trick to isolate bottom bit in R1,
AND R1,R1,R0 ; or produce zero in R1 if R0 = 0.

ORR R0,R1,R1,LSL #7 ;If R1=X with 0 or 1 bits set, R0 = X * &81
ORR R0,R0,R0,LSL #14 ;R0 = X * &204081
ORR R1,R1,R1,LSL #8 ;R1 = X * &101
ORR R1,R1,R1,LSL #16 ;R1 = X * &1010101
ORR R1,R1,R0,LSL #7 ;So R1 is now original R1 * &11214181

LDRB R1,[R2,R1,LSR #24] ;Look up table entry indexed by top 8 bits
; of R1

; Result in R1

'Table' is a 193-byte table with some entries defined as follows; all other
entries don't matter.

Entry Value
&00 &00
&02 &1A
&04 &1B
&06 &13
&08 &1C
&0A &0C
&0C &14
&10 &1D
&11 &01
&12 &05
&14 &0D
&18 &15
&20 &1E
&21 &09
&22 &02
&24 &06
&28 &0E
&30 &16
&40 &1F
&41 &11
&42 &0A
&44 &03
&48 &07
&50 &0F
&60 &17
&80 &20
&81 &19
&83 &12
&85 &0B
&88 &04
&90 &08
&A0 &10
&C0 &18

This takes 7 S-cycles plus the time for the LDRB. Assuming the ratio of (N
cycle time):(I-cycle time):(S-cycle time) is 2:1:1, this is effectively 11
S-cycles.

Overheads on top of this are the time to address the byte table (which
should be taken outside any critical loop) and cache misses for the byte
table access. Note that with a cache, the ratios drop to 1:1:1, for a time
of 10 S-cycles plus a little bit for the average miss penalty on the byte
table.

The key to this technique is that the top byte of R1 * &11214181 is
different for each of the values (zero and the powers of two) that R1 can
take after the first two instructions. Note that I haven't tried very hard
to optimise the constant &11214181, either from the point of view of finding
one I can multiply R1 by quickly or from the point of view of minimising the
table size (note that a smaller table will generally improve the cache hit
ratio for cached processors). It's quite possible that a cycle or two can be
shaved off the above without greatly increasing the table size.
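
The table itself need not be typed in by hand: a short C sketch like the one
below (verification scaffolding only, not part of the posted routine) rebuilds
it and confirms that the 33 indices really are distinct:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char table[256];   /* the post only needs indices up to &C0, i.e. 193 bytes */
    unsigned char used[256];
    unsigned max_index = 0;

    memset(table, 0, sizeof table);
    memset(used, 0, sizeof used);

    for (int bit = 0; bit <= 32; bit++) {
        unsigned x = (bit == 0) ? 0u : (1u << (bit - 1));   /* 0, 1, 2, 4, ..., 2^31 */
        unsigned idx = (x * 0x11214181u) >> 24;             /* top byte of the product */
        if (used[idx]) {
            printf("collision at bit %d - constant unsuitable\n", bit);
            return 1;
        }
        used[idx] = 1;
        table[idx] = (unsigned char)bit;                    /* ffs() result for this input */
        if (idx > max_index)
            max_index = idx;
    }
    printf("all 33 indices distinct, largest index = &%02X\n", max_index);
    return 0;
}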

David Seal
ds...@armltd.co.uk

All opinions are mine only...

Richard Earnshaw

Feb 15, 1994, 3:17:21 PM
In article <Paul_Leb...@equinox.gen.nz>, Paul_...@equinox.gen.nz (Moeroa) writes:
|> Richard Earnshaw (r...@cs.utwente.nl) wrote:
|>
|> >Does anyone know of a faster implementation of ffs () for the ARM than the
|> >one below (15 instructions).
|> >
|> > rsb r1, r0, #0 @ 2's complement of r0
|> > and r1, r0, r1 @ r1 = lowest non-zero bit in r0, or 0
|> > movs r2, r1, lsr #16 @ r2 = 0 if top 16 bits zero
|> > ..etc..
|> > [routine to return the number of the lowest bit set in r0]
|>
|>
|> Yep. I can do it in 6 instructions!
|> If bit 0 is set, the routine only takes 3 instructions.
|>
|> My routine is:
|>
|> ffs
|> MOVS r1, r0, ASR #1
|> MOV r2, #0
|> BCS out
|> loop
|> ADD r2, r2,#1
|> MOVS r1, r1,ASR#1
|> BCC loop
|> out
|>
|>
|> Simple eh! But on average it's faster than yours..
|>
|> Why?

Not necessarily; you have to take into account the places where it might be
used. If I fed the output of a random number generator into your routine
then it would win hands down. But consider that when ffs is normally called
there will rarely be more than one bit set, and your analysis (deleted)
becomes inaccurate. A common use of ffs is when you need to store a set of
event-triggered flags efficiently and fire off an action when a particular
event is set; in these circumstances the lowest bit that is non-zero is as
likely to be the top one as the bottom one, so my routine then wins hands
down. Eg:

switch (ffs (event_msk))
{
  case EVENT1:
    ...
  case EVENT2:
    ...
  ...
}

|>
|> Paul
|>

Clive Jones

Feb 15, 1994, 4:45:43 PM
In article <Paul_Leb...@equinox.gen.nz> Paul_...@equinox.gen.nz (Moeroa) writes:
>Richard Earnshaw (r...@cs.utwente.nl) wrote:
>
> >Does anyone know of a faster implementation of ffs () for the ARM than the
> >one below (15 instructions).
>
>Yep. I can do it in 6 instructions!
>If bit 0 is set, the routine only takes 3 instructions.

I'm sure it must be possible to do this quite well with a hideous
hack. Once you've reduced the field to a byte value, take the pattern
with all bits above the one you want set, and do:

STRB r8,[PC]
...some instruction...
LDMDB r9!,{r8}

...which will subtract 4+4*(number of set bits in the bottom byte of
r8) from r9. It would help if r9 pointed at some low memory that was
certain to be readable, of course. This basically drops the bit
pattern in question into the r0-7 bitfield of an LDM, so that you end
up using it as a bit-counter. r8 is there so that you don't end up
ever doing an LDM with an empty register set (and, indeed, so that
your assembler won't complain!).

Unfortunately, I haven't got the energy to try fitting this, or a
similar technique, into a complete routine at the moment. )-8
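
For comparison, the underlying bit-counting idea can be written portably
without the store-into-code trick: for non-zero x, x ^ (x - 1) sets the lowest
set bit and everything below it, so a population count of that value gives the
1-based position. A sketch of the principle only - it does not use the LDM
mechanism:

static unsigned popcount32(unsigned x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;            /* clear the lowest set bit */
        n++;
    }
    return n;
}

unsigned ffs_popcount(unsigned x)
{
    return x ? popcount32(x ^ (x - 1)) : 0;
}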

--Clive.

p.a.harris, CAD Centre

Feb 16, 1994, 6:26:01 AM
In article M...@cs.utwente.nl, r...@cs.utwente.nl (Richard Earnshaw) writes:
>

[ stuff about why his ffs() routine is efficient ... I agree in the case given]

>routine then wins hands down. Eg:
>
> switch (ffs (event_msk))
> {
> case EVENT1:
> ...
> case EVENT2:
> ...
> ...
> }
>
>|>
>|> Paul
>|>

Well, if you're going to use a switch statement to process the result, then
you might as well do:

not_ffs:
RSB r1,r0,#0
AND r0,r0,r1
MOV pc,link

That will give you 0, or a number with ONLY the least significant set
bit of R0 left set.

Then use a switch statement as above, with EVENT1 defined as 1,
EVENT2 defined as 2, EVENT3 defined as 4, etc.
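
In C that might look something like the sketch below (the EVENT values are
purely illustrative; the point is just that the case labels are the single-bit
masks themselves):

#define EVENT1  0x01
#define EVENT2  0x02
#define EVENT3  0x04

void dispatch(unsigned event_msk)
{
    switch (event_msk & (0u - event_msk)) {   /* the not_ffs value: x & -x */
    case EVENT1:
        /* handle event 1 */
        break;
    case EVENT2:
        /* handle event 2 */
        break;
    case EVENT3:
        /* handle event 3 */
        break;
    default:
        /* 0: no event pending */
        break;
    }
}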

Of course if your C compiler will turn a set of case labels into a
jump-table then the original method is faster, but if it uses a branch-and-test
chain then you are probably wasting your time trying to squeeze a few extra
cycles of speed out of your ffs() function anyway.

___________________________________________________________
Peter Harris
Strathclyde University CAD Centre
cll...@strath.ac.uk


David Seal

Feb 16, 1994, 7:13:12 AM
In article <32...@armltd.uucp> I wrote:

> ; Operand in R0, register R1 is free, R2 addresses a byte table 'Table'
> ; (defined below)
>
> RSB R1,R0,#0 ;Standard trick to isolate bottom bit in R1,
> AND R1,R1,R0 ; or produce zero in R1 if R0 = 0.
>
> ORR R0,R1,R1,LSL #7 ;If R1=X with 0 or 1 bits set, R0 = X * &81
> ORR R0,R0,R0,LSL #14 ;R0 = X * &204081
> ORR R1,R1,R1,LSL #8 ;R1 = X * &101
> ORR R1,R1,R1,LSL #16 ;R1 = X * &1010101
> ORR R1,R1,R0,LSL #7 ;So R1 is now original R1 * &11214181
>
> LDRB R1,[R2,R1,LSR #24] ;Look up table entry indexed by top 8 bits
> ; of R1
>
> ; Result in R1
>
>'Table' is a 193-byte table with some entries defined as follows; all other

>entries don't matter. ...

I produced a better variant of the same theme last night:

; Operand in R0, register R1 is free, R2 addresses a byte table 'Table'

RSB R1,R0,#0 ;Standard trick to isolate bottom bit in R1,
AND R1,R1,R0 ; or produce zero in R1 if R0 = 0.

ORR R1,R1,R1,LSL #4 ;If R1=X with 0 or 1 bits set, R1 = X * &11
ORR R1,R1,R1,LSL #6 ;R1 = X * &451
RSB R1,R1,R1,LSL #16 ;R1 = X * &0450FBAF

LDRB R1,[R2,R1,LSR #26] ;Look up table entry indexed by top 6 bits
; of R1

; Result in R1

Timing: 6 S-cycles, 1 N-cycle, 1 I-cycle. The overhead of addressing the
byte table should be put outside the critical loop. Alternatively, put the
table close enough to allow it to be addressed with a single ADR
instruction, increasing the cycle count by 1 S-cycle.

Again, the key to this is the fact that the top 6 bits of X * &0450FBAF are
different for each of X = 0, 1, 2, 4, 8, ..., 2^31. So all you need to do is
fill in the corresponding entries of the 64-byte table 'Table' with the
correct result values. In fact, you can get away with a 63-byte table, since
the index &3F is never used.
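
As with the first version, the table can be generated (and the uniqueness of
the top-6-bit indices checked) by a few lines of C. The sketch below is built
around the posted constant rather than taken from the post; the builder
verifies the property instead of assuming it:

#include <stdio.h>

static unsigned char Table[64];

static int build_table(void)
{
    unsigned char used[64] = {0};
    for (int bit = 0; bit <= 32; bit++) {
        unsigned x = (bit == 0) ? 0u : (1u << (bit - 1));   /* 0 and the powers of two */
        unsigned idx = (x * 0x0450FBAFu) >> 26;             /* top 6 bits of the product */
        if (used[idx])
            return -1;                                      /* constant would be unsuitable */
        used[idx] = 1;
        Table[idx] = (unsigned char)bit;                    /* ffs() result for this input */
    }
    return 0;
}

static unsigned ffs_table(unsigned x)
{
    return Table[((x & (0u - x)) * 0x0450FBAFu) >> 26];
}

int main(void)
{
    if (build_table() != 0) {
        printf("collision - constant does not give unique indices\n");
        return 1;
    }
    printf("%u %u %u %u\n", ffs_table(0), ffs_table(1), ffs_table(12), ffs_table(0x80000000u));
    /* expected output: 0 1 3 32 */
    return 0;
}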

Also, in article <2jrfq7$5...@taki.nsict.org> cl...@nsict.org (Clive Jones)
writes:

>I'm sure it must be possible to do this quite well with a hideous
>hack. Once you've reduced the field to a byte value, take the pattern
>with all bits above the one you want set, and do:
>
> STRB r8,[PC]
> ...some instruction...
> LDMDB r9!,{r8}

I think this is the perfect illustration of why you shouldn't use
self-modifying code... Quite apart from it being a hideous hack, it doesn't
work. Why not? Because by the time the STRB actually stores the byte in the
LDMDB instruction, that instruction has already been prefetched into the
ARM. So the ARM executes the old version of the instruction - unless you are
"lucky" enough to have an interrupt occur between the STRB and the LDMDB, in
which case that prefetched LDMDB will be discarded and the new LDMDB will be
prefetched again when the interrupt returns. So the behaviour of this code
even changes according to whether an interrupt has occurred between the STRB
and the LDMDB!

You might think you can get around this by putting another instruction
between them - e.g.:

STRB r8,[PC,#4]
...some instruction...
...some instruction...
LDMDB r9!,{r8}

But don't do this either! While it will work for current ARMs, there is no
guarantee that it will continue to do so in future. The amount of
prefetching which is done might change, the order of the various memory
accesses might become undefined (especially on asynchronous implementations
like Amulet), etc.

John Kortink

Feb 16, 1994, 7:17:02 AM
> ffs (x) returns the bit position of the lowest bit in
> x that is not zero.

It's a rather weird concept. But here's a shot at more elegance (I don't
think it can be any shorter than 15 instructions, because you need to fill in
the 16/8/4/2/1 bits of the count individually):

r0 = value to check
r1 = bit position

no scratch registers needed

MOV r1,#0
TEQ r1,r0,LSL#16
ORREQ r1,r1,#16
MOVEQ r0,r0,LSR#16
TST r0,#&FF
ORREQ r1,r1,#8
MOVEQ r0,r0,LSR#8
TST r0,#&0F
ORREQ r1,r1,#4
MOVEQ r0,r0,LSR#4
TST r0,#&03
ORREQ r1,r1,#2
MOVEQ r0,r0,LSR#2
TST r0,#&01
ORREQ r1,r1,#1

John Kortink
kor...@cs.utwente.nl | jo...@dialis.hacktic.nl Bjork for goddess !

Michael Williams

Feb 16, 1994, 9:34:18 AM
In article <32...@armltd.uucp> ds...@armltd.co.uk (David Seal) writes:
>In article <32...@armltd.uucp> I wrote:
>I produced a better variant of the same theme last night:
>
> ; Operand in R0, register R1 is free, R2 addresses a byte table 'Table'
>
> RSB R1,R0,#0 ;Standard trick to isolate bottom bit in R1,
> AND R1,R1,R0 ; or produce zero in R1 if R0 = 0.
>
> ORR R1,R1,R1,LSL #4 ;If R1=X with 0 or 1 bits set, R1 = X * &11
> ORR R1,R1,R1,LSL #6 ;R1 = X * &451
> RSB R1,R1,R1,LSL #16 ;R1 = X * &0450FBAF
>
> LDRB R1,[R2,R1,LSR #26] ;Look up table entry indexed by top 6 bits
> ; of R1
>
> ; Result in R1

If you intend to use ffs() in a switch then, of course, you can use the
meta-value "R1>>26" to switch on instead. This reduces the code to just
5 S-cycles plus the branch table lookup that the C compiler will generate
for you.

The code:

switch (ffs (event_msk))
{
  case EVENT1:
    ...
  case EVENT2:
    ...
  ...
}

changed to:

#define meta_ffs(X) (((X) & (0-(X)))*0x0450fbaf)

switch (meta_ffs (event_msk))
{
  case meta_EVENT1:
    ...
  case meta_EVENT2:
    ...
  ...
}

will generate better code. (Oddly, the compiler generates different code
for X*0x0450fbaf than David's.)
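
In case it is not obvious where the meta labels come from: assuming each event
is a single-bit mask (the names below are made up for illustration), they can
simply be the macro applied to those masks, which is still a compile-time
constant and therefore a valid case label. A sketch:

#include <stdio.h>

#define meta_ffs(X)  (((X) & (0-(X))) * 0x0450fbaf)

#define EVENT1       0x01               /* illustrative single-bit event masks */
#define EVENT2       0x02

#define meta_EVENT1  meta_ffs(EVENT1)
#define meta_EVENT2  meta_ffs(EVENT2)

static const char *describe(unsigned event_msk)
{
    switch (meta_ffs(event_msk)) {
    case meta_EVENT1: return "event 1";
    case meta_EVENT2: return "event 2";
    default:          return "nothing pending";
    }
}

int main(void)
{
    printf("%s\n", describe(EVENT1 | EVENT2));   /* lowest set bit wins: "event 1" */
    printf("%s\n", describe(0));                 /* "nothing pending" */
    return 0;
}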

If each of the "..."s is a tail-recursive function call/return etc. and there
are only a few events, then there isn't much point in any of this - it would
be faster to have a tower of ifs. The same is true if you know a particular
event occurs more frequently than another (optimise the common case at the
cost of the less common).

Mike.
____________________________________________________________________________
\ x / Michael Williams Advanced RISC Machines Limited
|\/|\/\ mwil...@armltd.co.uk Swaffham Bulbeck, Cambridge, UK
| |(__)"I might well think that Matti, ARM Ltd. couldn't possibly comment."

Michael Williams

Feb 16, 1994, 9:35:58 AM
In article <2jsvs9$a...@loch2.cc.strath.ac.uk> cll...@cad.strath.ac.uk writes:
>Of course if your C compiler will turn a set of case labels into a
>jump-table then the original method is faster, but if it uses a branch-and-test
>chain then you are probably wasting your time trying to squeeze a few extra
>cycles of speed out of your ffs() function anyway.

...which is what the Norcroft compiler will do, if the table isn't too
sparse.

thRob

Feb 16, 1994, 7:32:39 PM
> > ffs (x) returns the bit position of the lowest bit in
> > x that is not zero.

The way I would do it is to set up a table of 256 bytes with 0, 1,
2,2, 3,3,3,3, 4,4,4,4,4,4,4,4, etc. ,7,7,7. Then point r2 at it (with
"adr r2, tab" or whatever) and do summat like this in a loop:

mov r1, #0
cmp r0, #1<<16
movhs r0, r0, lsr #16
addhs r1, r1, #16
cmp r0, #1<<8
movhs r0, r0, lsr #8
addhs r1, r1, #8
ldrb r0, [r2, r0]
add r0, r0, r1

This should take 13 cycles or so on an ARM2 (versus 15 for the
original).

Rob ,-------- Humans
/`-------- Chimpanzees
________________________________________/`--------- Slugs
\
`--------- rob...@vlsi.cs.caltech.edu

Oliver Betts

Feb 17, 1994, 9:23:37 AM
In article <76141825...@dialis.hacktic.nl>,

John Kortink <jo...@dialis.hacktic.nl> wrote:
>It's a rather weird concept. But here's a shot for more elegance (I don't
>think it can be any shorter than 15 instructions, because you need to fill the
>16/8/4/2/1 bits of the count individually) :
>
>[routine deleted]

From the man page for ffs on the Sun here:

ffs() finds the first bit set in the argument passed it and
returns the index of that bit. Bits are numbered starting
at 1 from the right. A return value of zero indicates that
the value passed is zero.

So ffs(0)=0, ffs(1)=1, ..., ffs(1<<31)=32

John's routine returns 31 for ffs(0) and one too low for other values.

Paul Lebeau's routine goes into an infinite loop for ffs(0) [so much
for a lower average execution time!].

rob...@vlsi.cs.caltech.edu's routine is also incorrect eg ffs(1+(1<<16))
returns 17 (or something >= 16 anyway). I think his routine is trying
to find the highest set bit rather than the lowest, and may be fixable.

Apologies to John, Paul and Rob if I've mis-followed their code (I don't
have an ARM handy).

Olly

thRob

Feb 19, 1994, 8:31:54 PM
ol...@mantis.co.uk (Oliver Betts) writes:
>[...]

>From the man page for ffs on the Sun here:
>
> ffs() finds the first bit set in the argument passed it and
>[...]

>rob...@vlsi.cs.caltech.edu's routine is also incorrect eg ffs(1+(1<<16))
>returns 17 (or something >= 16 anyway). I think his routine is trying
>to find the highest set bit rather than the lowest, and may be fixable.
>
>Apologies to John, Paul and Rob if I've mis-followed their code (I don't
>have an ARM handy).

Gulp! You're right. I also read the Sun man page (in spite of the
quote in my reply). I understood "first bit" as "highest bit" and
didn't read carefully enough.

To get the index of the lowest bit I could change the table entries
and use "sub r1,r0,#1: bic r0,r0,r1" initially but then the run-time
would be no better than that of the original code...

Adam Goodfellow

Feb 22, 1994, 5:49:58 PM
ds...@armltd.co.uk (David Seal) writes:

It's a very handy hack to stop people single-stepping with !DDT :-)


Elatar

+------------------------------------+
| email: ela...@comptech.demon.co.uk |
+------------------------------------+

David Seal

Feb 24, 1994, 8:19:08 AM
In article <6QB08...@comptech.demon.co.uk> ela...@comptech.demon.co.uk
(Adam Goodfellow) writes:

>ds...@armltd.co.uk (David Seal) writes:
>
>> >
>> > STRB r8,[PC]
>> > ...some instruction...
>> > LDMDB r9!,{r8}
>>
>> I think this is the perfect illustration of why you shouldn't use
>> self-modifying code... Quite apart from it being a hideous hack, it doesn't
>> work. Why not? Because by the time the STRB actually stores the byte in the
>> LDMDB instruction, that instruction has already been prefetched into the
>> ARM. ...
>
>It's a very handy hack to stop people single-stepping with !DDT :-)

:-)

But unfortunately it will occasionally produce the "incorrect" results when
not being single-stepped, since an interrupt might occur between the STRB
and the LDMDB... I suppose you might be able to deal with this by making it
loop back and repeat the test if it gets the incorrect results, on the basis
that it will loop forever when being run under !DDT, and only loop
occasionally if run properly.

However, be warned: while this might well work on current ARMs, there are no
guarantees about future ones. I said in my previous posting that ARM
implementations are possible which do more prefetching, which would also
invalidate STRB r8,[PC,#4]: something: something: LDMDB r9!,{r8}. I should
also say that there are possible ARM implementations which do *less*
prefetching! (It's not likely that they would do less prefetching all the
time, which would badly hit performance. But it's perfectly possible that
one might happen always to have very few instructions prefetched at the
point where it encountered your code.)

Just say no to self-modifying code! :-)

Michael Williams

Feb 25, 1994, 5:25:45 AM
In article <6QMyA...@tquest.demon.co.uk> g...@tquest.demon.co.uk (General Mail) writes:
>I wos thinking ;-), maybe some anal-retentive out there might like to
>collate all these ARM - specific asm frags. and put out a cookbook.

Since ARM do in fact have such a cookbook, I think we may take that as a bit
of an insult. Mind you, it doesn't contain most of the "recipes" that have
been given here, though I'm sure David Seal could be shouldered into writing
a couple 8-).

Mike.

FYI: The "Cookbook" forms part of the ARM Software Development Toolkit.
