Brent Mashburn
email: (take out the XX) is brent.XX...@intellon.com.
I assume you are looking for a computer arithmetic text, the most
up-to-date I know of is "Computer Arithmetic" from B. Parhami.
You can look for it on http://www.amazon.com
--
Lorenzo Di Gregorio
Try "The PowerPC Compiler Writers Guide", (ISBN 0-9649654-0-2)
particularly the chapters on "Clever Examples" & "Optimal Code
Sequences"
Mayan
--
------------------------------------------------------------------------------
| Mayan Moudgill | These are _MY_ opinions. Any resemblance |
| ma...@watson.ibm.com | to IBM's opinions is purely coincidental |
------------------------------------------------------------------------------
: I assume you are looking for a computer arithmetic text, the most
: up-to-date I know of is "Computer Arithmetic" from B. Parhami.
: You can look for it on http://www.amazon.com
Another possibility is to look for rather old books. I understand that
back when memory was expensive, processors slow, programming was done
in assembler or hex, and kids walked five miles to school uphill both
ways, there were many bit manipulating algorithms published. Actually,
it might be a good idea to recover this work before someone else does
and patents it all.
Hugh---
Or maybe we could share our own personal favorites. Here's mine:
swapping two values without using a temporary location. Requires the
use of exclusive-or. Using C's ^ operator:
a = a ^ b;
b = a ^ b;
a = a ^ b;
I could post the math that shows how it works, but it's more fun to
figure it out yourself...
Regards,
-=Dave
--
Just my (10-010) cents
I can barely speak for myself, so I certainly can't speak for B-Tree.
Change is inevitable. Progress is not.
Personally, I don't think that this rates as the kind of trick
the original poster was looking for. First, it doesn't decrease
the total number of instructions needed for the operation from
the canonical case. Second, it doesn't achieve any arithmetic
goal. Finally, assuming that it is a MEMORY location with which
you are concerned, and assuming that your target architectures
are not hopelessly register starved, the following code satisfies
the same goals in a far less obscure manner:
register int t;
...
t=a;
a=b;
b=t;
More on topic, however, aren't most of these 'binary tricks'
implemented in modern compilers? At one time you couldn't
rely on your compiler to recognize multiplication or division
by two and convert them to shifts, but I am under the impression
that this is de rigueur today. Similarly, in the days when most
microprocessors didn't implement multiplication or division as
native instructions, the win from using shifts for powers of 2
was much bigger than it is today when multiplies and divides
only take 5 and 20 cycles.
- Jeff Dutky
>In article <87uhc9$5ek$1...@news.gate.net>,
>Brent Mashburn <xbrent....@intellon.com> wrote:
>>We all know that shifting a binary register left or right is a quick way to
>>multiply or divide by 2. Certainly there is a book out there that has this
>>as well as a bunch of other more advanced "binary tricks" for computation.
>>I've tried searching for such a book but haven't had any luck. Does such a
>>book exist? Thanks.
>>
>>Brent Mashburn
>>
>
>Try "The PowerPC Compiler Writers Guide", (ISBN 0-9649654-0-2)
>particularly the chapters on "Clever Examples" & "Optimal Code
>Sequences"
>
The best collection of these I know of is probably the infamous Hakmem
memo from the MIT AI labs of (roughly) 1970 or so. It is available
somewhere on the MIT web site as a GIF; I don't know if the textual
version is online. It had some truly twisted stuff in there.
--
**********************************************
* Allen J. Baum tel. (650)853-6626 *
* Compaq Computer Corp. fax (650)853-6513 *
* 181 Lytton Ave. *
* Palo Alto, CA 95306 ab...@pa.dec.com *
**********************************************
IMO, such "tricks" are no longer useful with modern processors (unless
you are using a truly tiny machine, like an 8 bit embedded proc). On
Pentium processors, for example, a floating point multiply takes less
time than an integer shift. In fact, on the Pentium machines, the actual
number of cycles used by an instruction is pretty much irrelevant. Much
more time is spent reading the instruction from memory than in executing
it. I recently did performance estimates for a new switch we are
designing, and found this to be absolutely true. When I did the model
for the cache, I found that the processor is bus bound, not ALU bound.
So I suggest that you write your code to be *readable* and *modifiable*
and forget about "tricks".
Mike
--
----
char *p="char *p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
This message made from 100% recycled bits.
I don't speak for Alcatel <- They make me say that.
>Dave Hansen wrote:
[...swapping w/o temporary...]
>
>Personally, I don't think that this rates as the kind of trick
>the original poster was looking for. First, it doesn't decrease
I probably should have cancelled that article, because it's so
off-topic for comp.arch. Followups adjusted.
>the total number of instructions needed for the operation from
>the canonical case. Second, it doesn't achieve any arithmetic
>goal. Finally, assuming that it is a MEMORY location with which
>you are concerned, and assuming that your target architectures
>are not hopelessly register starved, the following code satisfies
>the same goals in a far less obscure manner:
[...traditional swap w/temporary...]
Well, I certainly don't recommend the xor algorithm for general use.
I'd probably put it under the category of "stupid computer tricks."
But it does have its uses. Ignoring your assumption above, it lets
you swap the values in two registers without using another register
and without having to resort to even a stack temporary. I have used
it in assembly code (many years ago) for just that purpose.
ObArch: with the nifty tricks modern processors do these days,
something like
push a
mov a,b
pop b
would probably be just as fast (if not faster).
I also saw (not that long ago) a C macro for changing the endianness
of a 16-bit value that used this trick.
A related trick is the doubly-linked list that uses a single "pointer"
which is the xor of the pointers to the previous and next nodes.
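A minimal C sketch of such a node and a forward traversal, assuming
pointers can be round-tripped through an unsigned integer type such as
C99's uintptr_t (the names here are just illustrative):

#include <stdint.h>

struct xnode {
    int       value;
    uintptr_t link;    /* (uintptr_t)prev ^ (uintptr_t)next */
};

void walk(struct xnode *head)
{
    struct xnode *prev = 0, *cur = head, *next;
    while (cur) {
        /* ...use cur->value here... */
        next = (struct xnode *)(cur->link ^ (uintptr_t)prev);
        prev = cur;
        cur  = next;
    }
}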
>
>More on topic, however, aren't most of these 'binary tricks'
>implemented in modern compilers? At one time you couldn't
I suspect the ones that improve speed would be. The ones that save
space are probably less likely.
>rely on your compiler to recognize multiplication or division
>by two and convert them to shifts, but I am under the impression
>that this is de rigueur today. Similarly, in the days when most
>microprocessors didn't implement multiplication or division as
>native instructions, the win from using shifts for powers of 2
>was much bigger than it is today when multiplies and divides
>only take 5 and 20 cycles.
I believe that's true. But I still think these clever binary tricks
are useful -- Not for optimizing code, but for getting people to
think. Much the way reading a lot can improve one's spelling.
> The best collection of these I know of is probably the infamous Hakmem
> memo from the MIT AI labs of (roughly) 1970 or so. It is available
> somewhere on the MIT web site as a GIF; I don't know if the textual
> version is online. It had some truly twisted stuff in there.
>
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
--
Pablo Bleyer Kocik |"And all this science I don't understand
pbleyer | It's just my job five days a week
@dgf.uchile.cl | A rocket man, a rocket man..." - Elton John
from sys import*;from string import*;a=argv;[s,p,q]=filter(lambda x:x[:1]!=
'-',a);d='-d'in a;e,n=atol(p,16),atol(q,16);l=(len(q)+1)/2;o,inb=l-d,l-1+d
while s:s=stdin.read(inb);s and map(stdout.write,map(lambda i,b=pow(reduce(
lambda x,y:(x<<8L)+y,map(ord,s)),e,n):chr(b>>8*i&255),range(o-1,-1,-1)))
I've put a few of my favorites on this web page:
http://www.caam.rice.edu/~dougm/twiddle
and welcome contributions of the bit twiddling type, if anyone wants to contribute to it.
Doug Moore
(do...@rice.edu)
For the case discussed above (shift by 1), you are wrong. It takes 1
cycle on the Pentium when the operand is a register. An FMUL takes 3
cycles (and one operand has to be in a register). Moreover, the speed
of the FMUL is irrelevant for this example; an IMUL takes 10 cycles,
and a 32-bit divide takes 41 (unsigned) or 46 (signed) cycles.
> In fact, on the Pentium machines, the actual
>number of cycles used by an instruction is pretty much irrelevant. Much
>more time is spent reading the instruction from memory than in executing
>it. I recently did performance estimates for a new switch we are
>designing, and found this to be absolutely true. When I did the model
>for the cache, I found that the processor is bus bound, not ALU bound.
Some programs are memory-limited (although usually from the data, not
the instructions), but most of the programs I use are not (and
this is true for most people I know).
Followups to comp.arch.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
There's much truth in what you say, but some of us don't use modern processors.
Some of us use FPGAs and the like and find such "tricks" useful.
tony
This is wrong for all Intel x86 processors, starting with the 8088/8087
combination and going all the way to the PentiumIII.
> In fact, on the Pentium machines, the actual
> number of cycles used by an instruction is pretty much irrelevant. Much
> more time is spent reading the instruction from memory than in executing
This stopped being true with the 286, only the 8088 was so severely
memory-limited.
> it. I recently did performance estimates for a new switch we are
> designing, and found this to be absolutely true. When I did the model
> for the cache, I found that the processor is bus bound, not ALU bound.
Lots of code is data cache bound, some code (TPC benchmarks?) is code
cache limited, but this is much rarer. If your inner loop switching code
is larger than 8 or 16K, then you can get into this situation.
>
> So I suggest that you write your code to be *readable* and *modifiable*
> and forget about "tricks".
Readable, modifiable: Fine.
Tricks are just that: Something you use (and document properly!) when
you need the speed.
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
Summer of '97, I was working in a CS lab doing some coding as a high school
student participating in an NSF-sponsored program to introduce kids to
science. So anyways, the end product was a faster matrix multiply
implementation (using a hybrid algorithm that was still untried at the
time) with the intention to beat the SGI hand-tuned BLAS3 mmult.
To start it all off, the faculty member I was working with introduced
me to IEEE format floats so I could see how the computer does it and why an
algorithm that uses a bunch of adds isn't necessarily any better than one
that uses a bunch of multiplies. Now to get to the topic at hand: next he
had me implement base2 logarithm as a loop with a shift, basically
counting the number of used bits. Wow, what a neat idea (and I think it
qualifies as a binary trick)! The actual loop was unreadable, but the
function name "lg" made it okay. The logarithm code wasn't used in any
tight loops so I never bothered to time it that summer.
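Not the original code, but the loop in question was roughly this (a C
sketch, valid for x > 0):

static unsigned lg(unsigned x)
{
    unsigned r = 0;
    while (x >>= 1)    /* shift right until x runs out of bits */
        r++;
    return r;          /* floor(log2(x)) */
}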
Later in life while optimizing some code that used sqrt() I realized
I could do a similar trick, but I was doing serious performance tuning on a
very tiny loop at that point and I discovered that counting the bits with a
for loop is MUCH slower than using a modern FPU for the task. However,
because of the thought I had to put into the meaning of lg(), I was better
able to solve the problem.
So I guess it's useful just because it was neat to think about it
even if the code was unreadable and it was even slower than just using the
FPU.
However, a year or so later I returned to the project and the main
build that the whole group was working on used the libmath logarithm
function that used the FPU...we had floor_lg() and ceil_lg(), both
implemented in terms of floor()/ceil() and log() (and a floating point
division). I was doing some debugging on unusually small cases and
discovered that the floored base 2 logarithm of 4 was 1?!? Well it turns
out that log(4.0)/log(2.0) is 1.999999 on (I think) my Intel Pentium FPU,
but it's 2.000001 on the big SGI that we were doing all the timing on.
floor(1.9999999) -> 1.0, so that's where that problem came from. So I
emailed the rest of the group and we decided to switch to the loop because
it was by nature integral (and we were doing only integer math at that
point...most of the time we were just doing power(2.0,floor_lg(x)) or
similar) and we knew exactly how it worked. Speed wasn't the most
critical concern in that portion of the code (it was used in the parallel
dispatch), so it was a welcome change that eliminated quite a few
previously undebuggable problems, at least on my machine.
So I guess the moral of the story is that even stuff that looks
useless (why would you want to implement logarithm by hand unless it's
faster, hmm?) can save the day in terms of program correctness when used
correctly. And even in cases where it doesn't, KNOWING the stupid trick
makes you a lot smarter. A programmer who knows the answer to every
problem he will ever run into in the workforce is not a programmer, he is
programmed. A useful programmer knows the answer to a zillion problems
he'll never see in real life because finding those answers made him think.
In other words, if new programmers aren't exposed to stupid
useless binary tricks, they will be handicapped. It is imperative that
these things live on, or the whole field will die. It's like reading one
of Knuth's books...most people won't care about a lot of his examples, but
those who have the time and energy and ability to work out the examples
are usually a lot better for it. Nobody makes computers like MIX, but MIX
still teaches plenty.
PS, as per compilers doing it automatically...I spent some time
messing around with gcc-2.7.2.1 and discovered that writing x<<5 compiles
to faster code than x*32. I looked at the asm out and didn't know the
instruction it generated off the top of my head so I never figured out
exactly why. I would guess it has something to do with sign, though.
(somebody who actually knows about this should correct my hasty summary)
http://www.cs.uiowa.edu/~jones/bcd/
The name of the page comes from its original function, a repository for
some schemes for doing BCD arithmetic on machines with no BCD hardware
support. It's grown, and now contains extensive notes on efficient
division of large numbers by 10 with a narrow ALU that can only add and
subtract, and, based on that, notes on doing decimal conversion of binary numbers
under such constraints -- the latter are spinoffs from my use of a
Microchip PIC processor, a really marvelous little Harvard architecture
microcontroller, with all of 422 bytes of free RAM, an 8 bit ALU, and
8K by 14 bits of ROM for instructions.
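Not the code from that page, but a well-known shift-and-add sketch in the
same spirit (a Hacker's Delight-style approximate-reciprocal divide by 10,
assuming 32-bit unsigned int):

unsigned divu10(unsigned n)
{
    unsigned q, r;
    q = (n >> 1) + (n >> 2);        /* q ~= n * 0.75 */
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);              /* q ~= n * 0.8, slightly low */
    q = q >> 3;                     /* q ~= n/10, possibly one too small */
    r = n - (((q << 2) + q) << 1);  /* r = n - q*10 */
    return q + (r > 9);             /* fix the possible off-by-one */
}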
Doug Jones
jo...@cs.uiowa.edu
> Or maybe we could share our own personal favorites. Here's mine:
> swapping two values without using a temporary location. Requires the
> use of exclusive-or. Using C's ^ operator:
>
> a = a ^ b;
> b = a ^ b;
> a = a ^ b;
That's a totally useless trick, but it also doesn't need XOR:
a = b - a;
b = b - a;
a = a + b;
OR
a = a - b;
b = a + b;
a = b - a;
You *do* require either 3-address instructions or else both SUB and SUBF.
-- Bruce
>In article <38a2f0ad....@192.168.2.34>, dha...@btree.com (Dave
>Hansen) wrote:
>
>> Or maybe we could share our own personal favorites. Here's mine:
>> swapping two values without using a temporary location. Requires the
>> use of exclusive-or. Using C's ^ operator:
>>
>> a = a ^ b;
>> b = a ^ b;
>> a = a ^ b;
>
>That's a totally useless trick, but it also doesn't need XOR:
>
>a = b - a;
>b = b - a;
>a = a + b;
However in C that is only guaranteed to work for unsigned types. If a
and b are signed the addition and subtraction can overflow resulting
in undefined behaviour.
--
-----------------------------------------
Lawrence Kirby | fr...@genesis.demon.co.uk
Wilts, England | 7073...@compuserve.com
-----------------------------------------
>In article <87uhc9$5ek$1...@news.gate.net>,
>Brent Mashburn <xbrent....@intellon.com> wrote:
>)We all know that shifting a binary register left or right is a quick way to
>)multiply or divide by 2. Certainly there is a book out there that has this
>)as well as a bunch of other more advanced "binary tricks" for computation.
>)I've tried searching for such a book but haven't had any luck. Does such a
>)book exist? Thanks.
>
>IMO, such "tricks" are no longer useful with modern processors (unless
>you are using a truly tiny machine, like an 8 bit embedded proc). On
>Pentium processors, for example, a floating point multiply takes less
>time than an integer shift.
As others have noted this isn't true.
>In fact, on the Pentium machines, the actual
>number of cycles used by an instruction is pretty much irrelevant. Much
>more time is spent reading the instruction from memory than in executing
>it.
Only if the code isn't cached. On the whole code caches fairly well.
>I recently did performance estimates for a new switch we are
>designing, and found this to be absolutely true. When I did the model
>for the cache, I found that the processor is bus bound, not ALU bound.
Maybe for that application it is but that is not typical, at least
not as far as the instruction cache is concerned. A switch is probably
moving a lot of data around while only looking at it once so it may well
be bus bound as far as data access is concerned, but probably not for
instruction decoding.
>So I suggest that you write your code to be *readable* and *modifiable*
>and forget about "tricks".
Yes, tricks should be a last resort, at least where they are specific
to a particular piece of hardware.
May I suggest the following code instead:
(double) (((uint64) d) & 0xfff8000000000000);
Mask away all the fractional bits and keep the exponent and (implied)
leading 1 bit.
Does this qualify as a 'binary trick'?
A XOR B = A+B-2*(A AND B)
Then to calculate X%120 (for suitably small numbers)
result=x AND 0x3f+(x AND 0x1c0)>>3
if result>120
result=result-120
The derivation and extension of this to other numbers is left as an
exercise.
John West
>A XOR B = A+B-2*(A AND B)
If XOR is available, then this can be used to average
two unsigned variables A and B when the sum might overflow:
(A+B)/2 = (A AND B) + (A XOR B)/2
If you know A >= B (say) you can use B + (A-B)/2 .
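A one-line C sketch of that identity (unsigned a and b):

unsigned average(unsigned a, unsigned b)
{
    return (a & b) + ((a ^ b) >> 1);   /* (A AND B) + (A XOR B)/2, never overflows */
}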
--
E = m c^2. Einstein = Man of the Century. Why the squaring?
Peter-Lawren...@cwi.nl Home: San Rafael, California
Microsoft Research and CWI
Only in binary arithmetic :-)
Thanks for that one - I thought that I had come across most such
tricks by now, but had not seen that!
The earlier remark about all such things being very dependent on
the exact constraints is very true. I have used the XOR trick
to swap locations for real, but that was on the IBM 370 (very
short of registers) in an environment where I couldn't save one
anywhere! In general, it doesn't pay.
However, avoiding spurious overflow DOES pay, and I wish more
languages and compilers would pay attention to it, because the
frequency of spurious overflows is one of the most common
excuses for providing no overflow detection :-(
Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England.
Email: nm...@cam.ac.uk
Tel.: +44 1223 334761 Fax: +44 1223 334679
This trick allows you to determine if any byte of a larger
word is zero. Say like if you were processing zero terminated strings
64 bits at a time and you wanted to know if the 64 bit word you just
loaded contains the terminating null.
((a-0x0101010101010101)^a)&(~a)&(0x8080808080808080)
This expression is zero when 'a' contains no zero byte and nonzero
when it does. The idea is to subtract 1 from each of the bytes and
then look for bytes where the borrow propagated all the way to the
most significant bit. This isn't original with me, but I can't
remember where I read it.
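Wrapped up as a C sketch, assuming some 64-bit unsigned type is available
(here just a typedef; the constants need ULL suffixes in C):

typedef unsigned long long u64;

int has_zero_byte(u64 a)
{
    return (((a - 0x0101010101010101ULL) ^ a)
            & ~a
            & 0x8080808080808080ULL) != 0;
}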
The other thing I have is a way to eliminate some conditional jumps in
some situations:
if (a>0) a=0
can be replaced with:
a&=a>>63 (again assuming 64 bit words)
and also,
if (a<0) a=0
can be changed to
a&=(-a)>>63
this last probably isn't much of a win. But it does make your code
look a lot cooler.
Finally, I wouldn't mention this last one at all, but there might be
some kids in the audience that don't know it. And it's pretty
important:
For any n = a power of 2: a%n == (a&(n-1))
And the one on the right is probably at least ten times as fast.
I think my favorite hack is for isolating the least-significant non-zero
bit of a number x
(x-1)^x
I've used this or variants thereof several times. Also if you want the
index of the least-significant non-zero bit you can do it by some constant
multiply (or a clever sequence of shifts and adds), followed by a table lookup
on the most-significant byte.
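One well-known way to do that multiply-and-lookup in C is with a de Bruijn
constant (a 32-bit sketch using the standard published table; the poster's
version, keyed on the most-significant byte, may differ):

static const int lsb_index_table[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

int lsb_index(unsigned x)   /* x nonzero, 32-bit unsigned assumed */
{
    return lsb_index_table[((x & -x) * 0x077CB531U) >> 27];
}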
I think JACM circa 1958 had a large section of suggestions sent in by
programmers for saving 1 instruction code on the IBM 650. Or something like
that.
: >In comp.arch Lorenzo Di Gregorio <lorenzo.d...@hl.siemens.de> wrote:
: >: Brent Mashburn wrote:
: >:> Certainly there is a book out there that has this as well as a bunch
: >:> of other more advanced "binary tricks" for computation. I've tried
: >:> searching for such a book but haven't had any luck. Does such a
: >:> book exist? Thanks.
: >
: >: I assume you are looking for a computer arithmetic text, the most
: >: up-to-date I know of is "Computer Arithmetic" from B. Parhami.
: >: You can look for it on http://www.amazon.com
: >
: > Another possibility is to look for rather old books. I understand that
: > back when memory was expensive, processors slow, programming was done
: > in assembler or hex, and kids walked five miles to school uphill both
: > ways, there were many bit manipulating algorithms published. Actually,
: > it might be a good idea to recover this work before someone else does
: > and patents it all.
: Or maybe we could share our own personal favorites. Here's mine:
: swapping two values without using a temporary location. Requires the
: use of exclusive-or. Using C's ^ operator:
: a = a ^ b;
: b = a ^ b;
: a = a ^ b;
: I could post the math that shows how it works, but it's more fun to
: figure it out yourself...
Great trick!!!!
Haven't seen this in any textbooks.
--
Mike,
mik...@whiterose.net
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
>Dave Hansen wrote:
>> Or maybe we could share our own personal favorites. Here's mine:
>> swapping two values without using a temporary location. Requires the
>> use of exclusive-or. Using C's ^ operator:
>>
>> a = a ^ b;
>> b = a ^ b;
>> a = a ^ b;
>While I wouldn't ever use this instruction sequence, you can use the same
>trick to good effect if you want to swap two long blocks of memory.
>I think the IBM 360 instruction set (or one of its later variants) had an
>"XOR block of memory A onto block of memory B", which I suspect must have
>been designed for this purpose.
>
>I think my favorite hack is for isolating the least-significant non-zero
>bit of a number x
> (x-1)^x
1010100 x
1010011 x-1
0000111 (x-1)^x
Which is interesting in its own right but doesn't do exactly what you say.
There is x&(x-1) which sets the lowest order 1 bit to zero
1010100 x
1010011 x-1
1010000 x&(x-1)
and x&-x which zeros all bits other than the lowest order bit (in C
when promoted x has an unsigned type)
1010100 x
0101100 -x
0000100 x&-x
A "pure" bitwise form of x&-x is x&(~x+1), because "negation" of an
unsigned value in C performs the bitwise equivalent of a 2's complement.
>Greg Alexander wrote:
>> point...most of the time we were just doing power(2.0,floor_lg(x)) or
>> similar) and we knew exactly how it worked. Speed wasn't the most
>
>May I suggest the following code instead:
>
>(double) (((uint64) d) & 0xfff8000000000000);
This will convert the value of d to uint64, mask that integer value,
and convert back to double, which I don't think is what you wanted.
That would probably be something like
double d;
d = ???; /* Set the value */
*(uint64 *)&d &= 0xfff8000000000000;
>Mask away all the fractional bits and keep the exponent and (implied)
>leading 1 bit.
>
>Does this qualify as a 'binary trick'?
Of a most unpleasant variety. :-) This sort of type punning has undefined
behaviour as far as C is concerned, and there are no guarantees in general
that floating point formats are what you expect. Also C has no uint64 type.
This is a good demonstration of just how dangerous a bit of assembly
knowledge can be to a C programmer. ;-)
...
>The other thing I have is a way to eliminate some conditional jumps in
>some situations:
>
>if (a>0) a=0
>can be replaced with:
>a&=a>>63 (again assuming 64 bit words)
This also assumes that right shifts "sign extend" which C doesn't
guarantee.
> if (a<0) a=0
> can be changed to
> a&=(-a)>>63
>
> this last probably isn't much of a win. But it does make your code
> look a lot cooler.
Change it to:
a &= ~(a>>63);
This is better for CPUs with a bit clear instruction.
I'd say modern compilers should do this automatically.
I still see people write:
if (x < 0) x = 0;
if (x > max) x = max;
Replace with:
if ((unsigned) x > max) x = max & ~(x >> 31);
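Spelled out as a commented C sketch of why that works (assuming 32-bit
two's complement int, arithmetic right shift, and 0 <= max < 2^31):

int clamp(int x, int max)
{
    if ((unsigned)x > (unsigned)max)   /* catches x < 0 (huge when viewed unsigned) and x > max */
        x = max & ~(x >> 31);          /* x < 0: sign mask is ~0, so ~mask is 0 -> result 0   */
    return x;                          /* x > max: sign mask is 0, so ~mask is ~0 -> result max */
}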
This happens a lot in JPEG/MPEG color conversion:
x >>= SHIFT; // x now in range -128..383
if (x < 0) x = 0;
if (x > 255) x = 255;
word |= x << POS;
Replace with:
t = 0xFF800000 & (x << (23-SHIFT));
if (t < 0) t = 0x7F800000 & ~x;
word |= t >> (31 - POS);
This compiles on ARM to (SHIFT=8, POS = 24):
ANDS t, mask1, x, LSL #15
BICLT t, mask2, x
ORR word, word, t, LSR # 7
Wilco
We weren't using floats.
>IMO, such "tricks" are no longer useful with modern processors (unless
>you are using a truly tiny machine, like an 8 bit embedded proc). On
>Pentium processors, for example, a floating point multiply takes less
>time than an integer shift.
What you are saying is that the designers of the Pentium
ignored very useful integer instructions. Is this good?
A large part of this IS due to those who deliberately have
left out instructions from the languages, causing the
designers not to realize their use.
>In fact, on the Pentium machines, the actual
>number of cycles used by an instruction is pretty much irrelevant. Much
>more time is spent reading the instruction from memory than in executing
>it. I recently did performance estimates for a new switch we are
>designing, and found this to be absolutely true. When I did the model
>for the cache, I found that the processor is bus bound, not ALU bound.
But what if that useful hardware instruction now takes
a dozen or more instructions because people like you have
pushed for leaving it out? The x86 design was a poor one
from the beginning, and it was other things that kept it
from vanishing early. The integer part of that machine
needs total redesign, not greatly increasing hardware, but
enabling lots of simple instructions to be done quickly
which now are so hard to carry out that one often uses
alternate algorithms. IEEE only adds to the problem.
>So I suggest that you write your code to be *readable* and *modifiable*
>and forget about "tricks".
Further contributing to the demise of intelligent computing.
We will never get even moderately efficient computing by
designing things for people with room temperature IQs.
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558
: Mask away all the fractional bits and keep the exponent and (implied)
: leading 1 bit.
So, what is going to happen if you have a denormal in your d?
/Serge.P
--
home page: http://www.cobalt.chem.ucalgary.ca/ps/
why not simply write these as 'a=a>0;' or 'a=a<0;' ? The result
of a compare in C is just an integer so you can easily assign
it to another variable. Heck, I think I've used variations of
this in some programs, in the form 'v=expr*cond;' in place of
'if(cond)v=expr;' because I didn't want to incur a branch delay.
(probably not a win either)
- Jeff Dutky
Do any of the multimedia extensions (SSE/AltiVec/VIS/MVI etc)
provide good support for doing this kind of operation
(detecting end-of-string quickly)?
In general, do the multimedia extensions enable faster or
better "binary tricks"? Things like population count and
permutations could have some interesting uses.
Manoj
This attitude, by the way, is why so many people snicker when they
hear a programmer claim to be a "software engineer".
Engineering is not about "tricks" to improve "efficiency" at the cost of
maintainability. It's about applying well-understood methodology to clearly
defined problems, and it results in a specific solution that will fix the
problem without creating a worse one later.
Software written using "tricks" to get "moderately efficient" code tends to be
barely maintainable and semi-portable at best. And with the current crop of
optimizing compilers, it usually isn't significantly faster. I'll accept that
there are applications where every extraneous instruction needs to be weeded
out, but they are few and far between.
bull. engineering can be pure hackery just as much as programming can be.
The little bit of genius that allows you to eliminate one line of code or
one piece of molded plastic is always central to engineering. You don't
want to lose maintainability, but you also don't want to lose the little
tricks. Often it's enough to keep the tricks in functions that you could
rewrite from scratch if you had to (that's what I do, anyways, for all
code).
>Software written using "tricks" to get "moderately efficient" code tends to be
>barely maintainable and semi-portable at best. And with the current crop of
Only if the tricks are used unreasonably. Obviously if the trick is
used in every line of code you write, the compiler should do it, but at
the very least you should know that the tricks exist so when you do need
them you can use them.
>optimizing compilers, it usually isn't significantly faster. I'll accept that
I would never ever bet on this. Today's modern crop of optimizing
compilers are crap compared to where they could be. Seriously, run some
code on gcc someday, see where it gets you. Maybe egcs is better, but
gcc-2.7.2.1 is not something we should be bragging about in terms of
optimizing capacity and it is definitely one of the more widely used
compilers.
>there are applications where every extraneous instruction needs to be weeded
>out, but they are few and far between.
Of course, but that doesn't mean you shouldn't ever learn or revel in the
tricks.
>>Greg Alexander wrote:
>>> point...most of the time we were just doing power(2.0,floor_lg(x)) or
>>> similar) and we knew exactly how it worked. Speed wasn't the most
>>May I suggest the following code instead:
>>(double) (((uint64) d) & 0xfff8000000000000);
>If this will convert the value of d to uint64, mask that integer value
>and convert back to double which I don't think is what you wanted.
>That would probably be something like
> double d;
> d = ???; /* Set the value */
> *(uint64 *)d &= 0xfff8000000000000;
>>Mask away all the fractional bits and keep the exponent and (implied)
>>leading 1 bit.
>>Does this qualify as a 'binary trick'?
>Of a most unpleasant variety. :-) This sort of type punning has undefined
>behaviour as far as C is concerned, and there are no guarantees in general
>that floating point formats are what you expect. Also C has no uint64 type.
>This is a good demonstration of just how dangerous a bit of assembly
>knowledge can be to a C programmer. ;-)
Programmers who think that C is at least an approximation
to the ideal language might think this, but things like
this are almost necessary. The list of types should
include what an intelligent user would want, not what a
language designer thinks the user who cannot understand
problems in coding might want. One can do it (badly)
with C structs, or C++ classes, but it is not the same.
If you think that this type of operation is that rare,
every logarithm routine starts out with it. One could
instead use portable C, but the length of time to carry
out the loop introduced may greatly exceed the computing
time for the operation.
>...
>>The other thing I have is a way to eliminate some conditional jumps in
>>some situations:
>>if (a>0) a=0
>>can be replaced with:
>>a&=a>>63 (again assuming 64 bit words)
>This also assumes that right shifts "sign extend" which C doesn't
>guarantee.
This is another example to show that we need the variety
of instructions. This can even be done without complicating
the decoding of instructions, as the unit performing the
instruction can continue the decoding.
In fact, it often does this on the basis of the arguments.
Floating-point units take actions based on signs, whether
the argument is or is not a 0, whether it is "infinite".
or whether it is NaN.
This is not comp.lang.c. In any event, if it doesn't work with signed,
then cast it to unsigned and do it.
--
Paul Hsieh
http://www.pobox.com/~qed/
See: http://www.azillionmonkeys.com/qed/case3.html
I show a few tricks at:
http://www.pobox.com/~qed/optimize.html
http://www.pobox.com/~qed/asmexample.html
>> if (a>0) a=0
>> can be replaced with:
>> a&=a>>63 (again assuming 64 bit words)
>> and also,
>> if (a<0) a=0
>> can be changed to
>> a&=(-a)>>63
>> this last probably isn't much of a win. But it does make your
>> code look a lot cooler.
>why not simply write these as 'a=a>0;' or 'a=a<0;' ? The result
>of a compare in C is just an integer so you can easily assign
>it to another variable. Heck, I think I've used variations of
>this in some programs, in the form 'v=expr*cond;' in place of
>'if(cond)v=expr;' because I didn't want to incur a branch delay.
>(probably not a win either)
But what is the collection of machine instructions used to
carry out the C instruction? The hardware does not normally
generate the Boolean variable in one of the registers, and
in fact, it is done differently on different machines. It
is not always an easy job to get the condition codes as
directly usable integers.
Untrue. On a Pentium an integer shift has 1 clock latency and 1
clock throughput, floating point multiplies take 5 clocks latency and 2
clock throughput.
On the Athlon its 1 floating point multiply at 4 clock latency and 1
clock throughput. On the other hand, the athlon can perform 3
integer shifts with a latency and throughput of 1 clock.
> [...] In fact, on the Pentium machines, the actual
> number of cycles used by an instruction is pretty much irrelevant.
That's not true either. The schedulers are only so deep, and shorter
clock instructions will have better expected performance in typical
algorithms. You ignore these things at your peril.
> [...] Much
> more time is spent reading the instruction from memory than in executing
> it.
Absolutely untrue. Instruction prefetch has a typical impact of less
than a clock even on a CISC architecture like the x86.
> [...] I recently did performance estimates for a new switch we are
> designing, and found this to be absolutely true. When I did the model
> for the cache, I found that the processor is bus bound, not ALU bound.
Your application must be atypical, or your analysis flawed.
> So I suggest that you write your code to be *readable* and *modifiable*
> and forget about "tricks".
I suggest you use tricks as appropriate *and* make it readable and
modifiable.
You can find some heavy applications of this trick at:
http://www.pobox.com/~qed/poker.zip
>>>So I suggest that you write your code to be *readable* and *modifiable*
>>>and forget about "tricks".
>> Further contributing to the demise of intelligent computing.
>> We will never get even moderately efficient computing by
>> designing things for people with room temperature IQs.
>This attitude, by the way, is why so many people snicker when they
>hear a programmer claim to be a "software engineer".
>Engineering is not about "tricks" to improve "efficiency" at the cost of
>maintainability. It's about applying well-understood methodology to clearly
>defined problems, and it results in a specific solution that will fix the
>problem without creating a worse one later.
>Software written using "tricks" to get "moderately efficient" code tends to be
>barely maintainable and semi-portable at best. And with the current crop of
>optimizing compilers, it usually isn't significantly faster. I'll accept that
>there are applications where every extraneous instruction needs to be weeded
>out, but they are few and far between.
Optimizing compilers optimize only those things which the
compiler writer has thought about. In fact, many will
REFUSE to even use those if the programmer uses gotos
or assembler instructions, even if the code is much worse
without them. BTW, there is an article which shows that,
even in quite simple situations, "structured" programming
is much less clear and takes more instructions than the
simple goto code.
But back to the topic. The most efficient method I
know to generate random variables with density 6x(1-x)
on the unit interval, and it would take even more
complicated binary tricks to improve it, is the
following. Let M be the distance to the next 1
bit in a random bit stream, and N the additional
distance to the next 1 after this. Let K = (M+1)/2.
Replace the K+N-th bit of a uniform (0,1) random
number by the complement of the K-th bit.
This is not an invented problem. How would you
do this in hardware and software?
Nice!
This is identical to a part of the calendar code I posted a few weeks
ago, to verify compiler optimizations.
I used the exact same trick to calculate the number of centuries; it
works because 41/4096 is very slightly larger than 1/100.
Terje
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
I'd like to respond that my IQ has been tested on more than one
occasion to be greater than 160.
)This attitude, by the way, is why so many people snicker when they
)hear a programmer claim to be a "software engineer".
)
)Engineering is not about "tricks" to improve "efficiency" at the cost of
)maintainability. It's about applying well-understood methodology to clearly
)defined problems, and it results in a specific solution that will fix the
)problem without creating a worse one later.
Mostly, people who are enamored with "tricks" don't know beans about how
to make code efficient. Having been doing real-time programming for
about 15 years now, I think I might have some idea about what it takes
to make efficient code. "Tricks" are not the way.
)Software written using "tricks" to get "moderately efficient" code tends to be
)barely maintainable and semi-portable at best. And with the current crop of
)optimizing compilers, it usually isn't significantly faster. I'll accept that
)there are applications where every extraneous instruction needs to be weeded
)out, but they are few and far between.
Even in those, the part of the code which is time intensive is normally
very small.
The way to make efficient code is to pick an efficient algorithm. No
amount of assembly, or tricks, is going to make a bubble sort fast.
Once one has a good algorithm, one should implement it in the most
straightforward and understandable way. Only *after* this has been
shown not to meet time constraints, should one start "tweaking". The
first question to ask is "Did I really pick the right algorithm?" If
you really did, then *instrument* the code, and find where the time is
going. Look carefully at that little bit, and try to tighten up. Resort
to assembler or "tricks" only as a very last last last last LAST resort.
And then write copious comments about the trick explaining exactly *why*
it was used, what considerations entered into the decision, and *how*
the trick works.
Later, someone who needs to change the code will be glad.
Mike
--
----
char *p="char *p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
This message made from 100% recycled bits.
I don't speak for Alcatel <- They make me say that.
This only works if you know that your cpu/compiler combination is
capable of generating fast/branchless code for assigning the result of a
comparison to a variable.
This is unfortunately still not true in all cases.
> Heck, I think I've used variations of
> this in some programs, in the form 'v=expr*cond;' in place of
> 'if(cond)v=expr;' because I didn't want to incur a branc delay.
> (probably not a win either)
It would have been much better to make your condition codes be -1 or 0
instead of 1 or 0, because then you could replace that relatively
expensive integer multiplication with a simple mask:
v=expr & cond;
IMHO, this is one of the failings of C: it is possible to use unsigned ints
with guaranteed behaviour, but those guarantees are much less strict
for signed numbers.
Yeah, I know about the problems with ones'-complement or sign + magnitude
implementations, but today it would have been better to put the
onus on those systems to generate the same results as a 'normal'
twos-complement machine.
Anyway, in similar circumstances I have used unsigned shifts and
subtraction instead to get the same effect:
mask = 0 - (a >> 31);
With 32-bit unsigned variables, this will always be either 0 or
0xffffffff, but it does cost one extra cycle.
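For instance, a minimal sketch using such a mask as a branchless select
between two values (the function name and arguments are just illustrative):

unsigned select_on_sign(unsigned a, unsigned x, unsigned y)
{
    unsigned mask = 0 - (a >> 31);      /* 0 or all bits set, per the top bit of a */
    return (x & mask) | (y & ~mask);    /* x if the top bit of a was set, else y */
}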
: This attitude, by the way, is why so many people snicker when they
: hear a programmer claim to be a "software engineer".
: Engineering is not about "tricks" to improve "efficiency" at the cost of
: maintainability. It's about applying well-understood methodology to clearly
: defined problems, and it results in a specific solution that will fix the
: problem without creating a worse one later.
: Software written using "tricks" to get "moderately efficient" code tends to be
: barely maintainable and semi-portable at best. And with the current crop of
: optimizing compilers, it usually isn't significantly faster. I'll accept that
: there are applications where every extraneous instruction needs to be weeded
: out, but they are few and far between.
Maybe the correct "C" trick is to do inline assembly code. But then
it isn't portable. :)
--
Mike,
mik...@whiterose.net
Yes, they can all do 8/16/32-bit parallel compares, with a result mask,
but not all of them can also communicate the result quickly to the
branching unit:
I.e. Intel blew the chance with MMX by not having a version of the
packed compare opcodes which could set integer flags depending upon
having zero/some/all of the compares being true.
> In general, do the multimedia extensions enable faster or
> better "binary tricks"? Things like population count and
> permutations could have some interesting uses.
They mostly allow the same tricks to be used more simply, or with wider
data paths, both of which tend to make the resulting code faster.
Yes, you're right. This is what I intended to write; do note that
simply casting to int won't work either (it will probably call a (slow)
library routine instead). :-(
>
> >Mask away all the fractional bits and keep the exponent and (implied)
> >leading 1 bit.
> >
> >Does this qualify as a 'binary trick'?
>
> Of a most unpleasant variety. :-) This sort of type punning has undefined
> behaviour as far as C is concerned, and there are no guarantees in general
> that floating point formats are what you expect. Also C has no uint64 type.
Please!
I do know a little bit about C, I use either int64 or i64 as shorthand
for whatever typedef your compiler might need to define a 64-bit wide
integer.
'long long' is far from standard. :-(
> This is a good demonstration of just how dangerous a bit of assembly
> knowledge can be to a C programmer. ;-)
The good part is that if you have enough assembly knowledge, you'll
notice the error immediately when stepping through the machine code in a
debugger, or when reading the asm listing.
>
> --
> -----------------------------------------
> Lawrence Kirby | fr...@genesis.demon.co.uk
> Wilts, England | 7073...@compuserve.com
> -----------------------------------------
--
Ouch, ouch, ouch!
I really hope nobody would ever use this idea today, instead of using
just two temporary registers to load pairs of values from the two arrays
before storing them back to the other!
>
> I think my favorite hack is for isolating the least-significant non-zero
> bit of a number x
> (x-1)^x
> I've used this or variants thereof several times. Also if you want the
> index of the least-significant non-zero bit you can do it by some constant
> multiply (or a clever sequence of shifts and adds), followed by a table lookup
> on the most-significant byte.
I believe this one is in hakmem as well, a power-of-two sized lookup
table must be larger than the number of bits in the source register, so
either 64 or 128 entries would be OK.
Assuming you don't have a fast find_first_set_bit or similar opcode, the
best way today is almost certainly to use the IEEE fp part of the chip.
Load & convert the number to fp, store to memory, and load it back in
and extract the exponent part.
This trick is similar to the fp->int conversion by adding a magic value,
storing the (fp) result, and then retrieving the aligned mantissa.
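A rough C sketch of that idea (not Terje's actual code; it assumes
IEEE-754 doubles and C99's <stdint.h>, and goes through memcpy so the
type punning stays legal):

#include <stdint.h>
#include <string.h>

int floor_lg(uint32_t x)    /* x must be nonzero */
{
    double   d = (double)x; /* exponent field now holds floor(log2(x)) + 1023 */
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return (int)((u >> 52) & 0x7FF) - 1023;
}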
Terje
OK, I assume you couldn't use the (non-existing?) fp hardware to help
out either?
With no useful find_first_set_bit opcode, you'd be speed-limited by the
explicit code to locate that leading bit.
How did you do it? There are a few interesting alternatives!
You'll end up with zero, which is the correct answer in this particular
case (modulo my casting error), right?
You'll get this without suffering any horrible delays caused by software
denormal handling. :-)
This is sadly very true. If you want to write fast & maintainable code,
expect to spend quite a bit of time learning what kind of constructs your
different target compilers know how to handle well, and write (still
portable) code that uses those constructs.
> REFUSE to even use those if the programmer uses gotos
> or assembler instructions, even if the code is much worse
> without them. BTW, there is an article which shows that,
> even in quite simple situations, "structured" programming
> is much less clear and takes more instructions than the
> simple goto code.
>
> But back to the topic. The most efficient method I
> know to generate random variables with density 6x(1-x)
> on the unit interval, and it would take even more
> complicated binary tricks to improve it, is the
> following. Let M be the distance to the next 1
> bit in a random bit stream, and N the additional
> distance to the next 1 after this. Let K = (M+1)/2.
> Replace the K+N-th bit of a uniform (0,1) random
> number by the complement of the K-th bit.
>
> This is not an invented problem. How would you
> do this in hardware and software?
Herman, I solved this particular problem for you last year, writing C
code that I could prove would on average run faster than any possible
helper opcode (to locate the next 1 bit) could make your
straight-forward aproach.
This was the result of considering the full problem, not just the subset
problem of finding the M and N distances.
Yes, there is a definite need for some bit-twiddling opcodes, but your
favourite problem is not a good example.
No problem, just write
v = expr & -cond
BTW, Forth has all-bits-set (-1 on 2s-complement machines) as
canonical truth value, but that doesn't help much, because most
architectures provide just instructions that produce 0 or 1, so the
compiler still has to insert the negation (if the flag is used as a
number or mask, not just for branching).
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
...
>It would have been much better to make your condition codes be -1 or 0
>instead of 1 or 0, because then you could replace that relatively
>expensive integer multiplication with a simple mask:
>
> v=expr & cond;
Assuming a 2's complement architecture of course.
>ezr...@hotmail.com wrote:
>> The other thing I have is a way to eliminate some conditional
>> jumps in some situations:
>>
>> if (a>0) a=0
>> can be replaced with:
>> a&=a>>63 (again assuming 64 bit words)
>>
>> and also,
>> if (a<0) a=0
>> can be changed to
>> a&=(-a)>>63
>>
>> this last probably isn't much of a win. But it does make your
>> code look a lot cooler.
>>
>
>why not simply write these as 'a=a>0;' or 'a=a<0;' ?
Firstly those don't do the same things. if (a>0) a=0; leaves negative
values in a unchanged whereas a=a<0 sets them to 1. A table demonstrates
this better:
a -4 -3 -2 -1 0 1 2 3 4
if (a>0) a=0; -4 -3 -2 -1 0 0 0 0 0
a=a<0; 1 1 1 1 0 0 0 0 0
if (a<0) a=0; 0 0 0 0 0 1 2 3 4
a=a>0; 0 0 0 0 0 1 1 1 1
>The result
>of a compare in C is just an integer so you can easily assign
>it to another variable.
The issue here was tricks that produce efficient code. Using the
result of a comparison as a value often produces rather inefficient
object code.
>Heck, I think I've used variations of
>this in some programs, in the form 'v=expr*cond;' in place of
>'if(cond)v=expr;' because I didn't want to incur a branch delay.
You may well have introduced a branch delay in the evaluation of cond,
although good compilers will try to avoid it if they can and doing so
is beneficial. You may also find that the amortised cost of the branch
is less than the cost of the multiplication.
Actually, it _is_ officially standard now, as of late 1999.
- John Hauser
That's the great thing about standards. There are so many to choose
from.
No ... what makes code unmaintainable is if they are implemented
obtusely, or are insufficiently commented. By my experience, no
implementation trick has ever prevented me from understanding a piece of
code so long as it was commented.
> [...] And with the current crop of optimizing compilers, it usually
> isn't significantly faster.
I don't know how many times I've heard this crap. It wasn't true the
first time I heard it and it's no more true today. Only unskilled
assembly programmers are unable to beat their compiler.
> [...] I'll accept that there are applications where every extraneous
> instruction needs to be weeded out, but they are few and far between.
I claim there is no single multimedia application (video games, mpeg or
MP3 players, audio drivers etc., etc.) anywhere that does not
significantly benefit from hand coded assembly core routines, or
otherwise extraordinarily "tricky" code (I.e., 2x or greater).
Compilers are great for code that is disk/device limited, or written by
people who otherwise have not correctly isolated the essential
performance bottlenecks, or more importantly, when performance does not
matter (which is the most common case.)
It's a homework exercise in Patterson and Hennessy,
_Computer Organization and Design_, 2nd ed.
...Greg Byrd
myself:
> Actually, it _is_ officially standard now, as of late 1999.
Paul Hsieh:
> That's the great thing about standards. There are so many to choose
> from.
Well in this case, the choice is between old and new ISO standards. It
might be more appropriate to note that official standards can change
over time.
- John Hauser
If you want to be pedantic: zero for false and (~0) (all bits set) for
true.
According to previous posts here, ANSI C defines that (unsigned) (-1) is
indeed all bits set, even on a non-2's complement machine.
MMX is kind of useless for this, and 3DNow! does not address this. SSE
and AMD's MMX-extensions provide an instruction called MOVMSKB which is
perfect for this.
A few applications of this trick can be found at
http://www.pobox.com/~qed/asmexample.html
> The problem of finding out if any individual byte in a large value is zero
> is basically one of doing some simple arithmetic and eliminating the
> carries between bytes. Which is *exactly* what the SIMD extensions do.
Well, except that the first x86 one (MMX) was pretty useless for non-
predicated comparisons.
> BTW, there is an article which shows that,
> even in quite simple situations, "structured" programming
> is much less clear and takes more instructions than the
> simple goto code.
I'd love to see that. Clarity is in the mind of the beholder, but
compiled code is unassailable. Lambda expressions are about as
"structured" as you can get -- given lamda, if/then and macros you can
synthesise all other structured constructs -- and yet Steele showed as
long ago as 1976 how to compile them into code that any spaghetti
programmer would be proud of.
-- Bruce
> In comp.arch ezr...@hotmail.com wrote:
> > This trick allows you to determine if any byte of a larger
> > word is zero. Say like if you were processing zero terminated strings
> > 64 bits at a time and you wanted to know if the 64 bit word you just
> > loaded contains the terminating null.
> > ((a-0x0101010101010101)^a)&(~a)&(0x8080808080808080)
>
> Do any of the multimedia extensions (SSE/AltiVec/VIS/MVI etc)
> provide good support for doing this kind of operation
> (detecting end-of-string quickly)?
All of them, I expect.
The problem of finding out if any individual byte in a large value is zero
is basically one of doing some simple arithmetic and eliminating the
carries between bytes. Which is *exactly* what the SIMD extensions do.
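For example, a sketch with SSE2-style intrinsics (per Intel's
<emmintrin.h>; a later instruction set than the MMX/SSE discussed here):

#include <emmintrin.h>

int has_zero_byte16(const void *p)      /* test 16 bytes at once */
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i z = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* 0xFF in every zero byte */
    return _mm_movemask_epi8(z);        /* nonzero iff some byte of the block was zero */
}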
-- Bruce
Why do all the macho "software engineering" types always assume
that this isn't precisely the case? Of course all of that has
been done.
The simple fact is that the interesting problems are the ones
that push the technology. If you want to get your product "out
there" on "this chip", then even after you've done all of the
algorithm analysis and profiling you can, there will be a reason
to push harder somewhere.
--
Andrew
Nope. There is no way it will work at all. The results of the two
can be quite different.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
The problem is not only with machines that use ones-complement or
sign + mantissa, but also with machines that have only a logical
shift, not an arithmetic shift.
>Lawrence Kirby wrote:
>>
>> In article <38A474...@hda.hydro.com>
>> Terje.M...@hda.hydro.com "Terje Mathisen" writes:
>>
>> ...
>>
>> >It would have been much better to make your condition codes be -1 or 0
>> >instead of 1 or 0, because then you could replace that relatively
>> >expensive integer multiplication with a simple mask:
>> >
>> > v=expr & cond;
>>
>> Assuming a 2's complement architecture of course.
>
>If you want to be pedantic: zero for false and (~0) (all bits set) for
>true.
If I wanted to be pedantic I would note that ~0 could produce
a trap representation in C99. Even in C90 on 1's complement implementations
it would represent 0, which the implementation could convert straight
back to an all-bits-zero representation. If you want to do bitwise
operations safely then unsigned integers are it.
>According to previous posts here, ANSI C defines that (unsigned) (-1) is
>indeed all bits set, even on a non-2's complement machine.
Yes. 2's complement etc. are representations for signed integers so
are irrelevant to how unsigned integers are represented. The conversion
to unsigned is defined in terms of value, not underlying bit pattern.
As I noted, these architectures are quickly becoming (near-)extinct, so
_today_ it would IMHO have been better to require it to work as if
implemented with arithmetic shifts on a 2's complement cpu.
As noted elsewhere in this thread, this can be done almost as fast with
unsigned (logical) shifts when the shift count is (register_length - 1),
for other numbers it becomes a bit more complicated, but still possible,
which is the basis for my assertion that a new language defined today
should do so.
unsigned asr(unsigned n, unsigned shift_count)
{
    unsigned mask;
    if (shift_count == 0) return n;     // avoid the undefined shift by 32 below
    mask = 0 - (n >> 31);               // 0 or all bits set
    if (shift_count >= 31) return mask;
    // The previous line could be replaced with a simple masking of the shift count!
    mask <<= 32 - shift_count;          // mask contains all the shifted sign bits
    return (n >> shift_count) | mask;
}
>Assuming you don't have a fast find_first_set_bit or similar opcode, the
>best way today is almost certainly to use the IEEE fp part of the chip.
>Load & convert the number to fp, store to memory, and load it back in
>and extract the exponent part.
>This trick is similar to the fp->int conversion by adding a magic value,
>storing the (fp) reslt, and then retrieving the aligned mantissa.
Because this operation is so slow, one should consider revising
the algorithm to avoid this. With the lack of communication
between integer and floating registers, if a loop uses the
loop index as a numerical argument for floating, a separate
copy of the value should be kept in the floating registers.
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558
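For reference, here is a minimal sketch of the fp-exponent trick quoted above
(names are mine), assuming a 64-bit IEEE-754 double; it returns the index of
the highest set bit of a nonzero 32-bit value:

#include <stdint.h>
#include <string.h>

static int highest_set_bit(uint32_t n)     /* n must be nonzero */
{
    double d = (double)n;                  /* exact for 32-bit values */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);        /* "store it, load it back" */
    return (int)((bits >> 52) & 0x7FF) - 1023;  /* strip the exponent bias */
}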
The article I have seen is
Frank Rubin: "'GOTO Considered Harmful' Considered Harmful"
Communications of the ACM, vol. 30, no. 3 (March 1987), pp. 195-6.
It is short and simple.
Consider the following type of code; this is not artificial.
There is a block of code which exits with output to produce
a bit mask; the idea is that one wants to generate the minimum
of k uniform random numbers, and the bit mask approach uses few
random bits to produce a mask to be used to force 0's. This
mask is produced by using random input to find the first
place where the k numbers are not all equal, putting that bit
in the mask, reducing the set to those numbers which are 0 at
that point, and proceeding until only 1 is left, upon which
exit occurs. There is information from the first block which
can be used to speed up the process.
Now the code I use has the structure
Case_high:
Case_5:
Case_4:
Case_3:
Case_2:
Exit:
This is the control structure only; there is more code.
Now unless k>5, rather rare, the first block transfers
to the appropriate numbered case (Exit is k=1), and
whenever the decision process is performed within one
case, a goto is used to continue.
The particular order of the cases is also carefully
chosen. It is not unusual to have a fall-through
from a case to the next one.
With a structured version, the case index would always
have to be used as a variable, instead of merely a
location in the code, and the switch statement would
have to carry out a conditional transfer after the
conclusion of a case. This adds at least two transfers
at each change of case, as well as using a variable
which the goto version never has to look at. No matter
what machine is being used, this cannot match the code
using gotos.
So you're saying implementing a single function/algorithm could be better
in an unstructured fashion. But I assume even goto fiends would then put
that inside of a function because you want to write the whole program
functionally no matter who you are. I think GOTO considered harmful is
all about writing programs, not implementing every single independent
algorithm.
At any rate, I daresay that the example is artificial because it
is very difficult to follow. If it were such a common case that GOTO is
better than structure, Frank Rubin would have come up with a more trivial
example. I have seen a few other cases demonstrated where goto is
simpler, and in those cases even extreme function programmers use goto.
In other words, 'GOTO considered harmful' is harmless.
But even worse is the fact, which cannot be ignored, that
what is a good algorithm on one machine can be a poor one
on another. In fact, what is a good algorithm on one
machine using operations not covered by a HLL may be
a poor one if only HLL-understood operations are used.
It may even be that the compiler "optimizations" make
assumptions which cause even worse code to be generated
than without those attempts. The algorithm designer
knows what is going on, but if the language-compiler
combination cannot do it well, what is the result?
>>>> BTW, there is an article which shows that,
>>>> even in quite simple situations, "structured" programming
>>>> is much less clear and takes more instructions than the
>>>> simple goto code.
>>>I'd love to see that. Clarity is in the mind of the beholder, but
>>>compiled code is unassailable. Lambda expressions are about as
>>>"structured" as you can get -- given lamda, if/then and macros you can
>>>synthesise all other structured constructs -- and yet Steele showed as
>>>long ago as 1976 how to compile them into code that any spaghetti
>>>programmer would be proud of.
>>The article I have seen is
>>Frank Rubin: "'GOTO Considered Harmful' Considered Harmful"
>>Communications of the ACM, vol. 30, no. 3 (March 1987), pp. 195-6.
>>It is short and simple.
[MY example deleted.]
>So you're saying implementing a single function/algorithm could be better
>in an unstructured fashion. But I assume even goto fiends would then put
>that inside of a function because you want to write the whole program
>functionally no matter who you are. I think GOTO considered harmful is
>all about writing programs, not implementing every single independent
>algorithm.
> At any rate, I daresay that the example is artificial because it
>is very difficult to follow. If it were such a common case that GOTO is
>better than structure, Frank Rubin would have come up with a more trivial
>example. I have seen a few other cases demonstrated where goto is
>simpler, and in those cases even extreme function programmers use goto.
> In other words, 'GOTO considered harmful' is harmless.
This was my example, not Frank Rubin's. His examples
consisted of short simple Fortran segments, where there
would not be that great a performance hit using structured
programming, but where the structured versions were longer
and considerably harder to understand. I did not keep a
copy of his rather short note.
There are many other cases where one part of a program knows
where to go, and SHOULD just go there, without invoking
several unnecessary instructions and conditional transfers.
This may involve going into the "middle" of a block of code,
as in my example.
The Alpha's CMPBGE instruction can be used to do this, as well as some
other nifty string manipulation tricks, making strcmp(), strtolower(),
etc really damn fast -- not that it matters these days; memory access is
the bottleneck now for most string-manipulation-intensive applications,
not in-core manipulation of data.
-- TTK
>
>
> >optimizing compilers, it usually isn't significantly faster. I'll accept that
>
> I would never ever bet on this. Today's modern crop of optimizing
> compilers are crap compared to where they could be. Seriously, run some
> code on gcc someday, see where it gets you. Maybe egcs is better, but
> gcc-2.7.2.1 is not something we should be bragging about in terms of
> optimizing capacity and it is definitely one of the more widely used
> compilers.
Well most should be using gcc 2.95 at least now. So if you're using 2.7.2 you
probably should upgrade (unless you're using some platform that isn't supported by
this version?)... 2.95 is much better.
IBM, DEC/Compaq, and SGI have vendor compilers which are probably closer to what
people consider "optimizing" compilers.
Interestingly at a talk at PACT last year some guy from IBM was commenting that
upon analysis most users of their compilers don't even use the -O option. What good is
using an optimizing compiler when optimizations aren't turned on?
KSG
(Really should have stuck to Guinness last night...)
What he's saying isn't even close to true so perhaps it isn't a good place
from which to commence your argument.
[...]
: Further contributing to the demise of intelligent computing.
: We will never get even moderately efficient computing by
: designing things for people with room temperature IQs.
I am reminded why your arguments were never very well received in
comp.arch. You start with untruth and misunderstanding and proceed to tell
two or three entire disciplines that they are working against
"intelligence."
On a more constructive note, there has been a resurgence in research on
domain specific languages and languages which are extensible to support
domain specific techniques. (Where in this context, domain specific means
that there is a lot of detailed knowledge about how to do things that
derives from the problem being solved rather than general programming
practice.) I like Todd Veldhuizen's expression template work a
lot. (http://extreme.indiana.edu/~tveldhui/index.html) Open C++ is an
extensible C++ compiler. There is equivalent work for Java. Etc.
Hey, Intel has even done a higher level assembler for IA64. Things are
looking up!
-Z-
I was just pointing out that if something as popular and, yes, modern as
gcc-2.7.2.1 doesn't optimize these common things, it's absurd to assume
that "oh, every compiler worth its salt does that." Maybe every compiler
worth its salt does, but you shouldn't assume too much of the compiler.
Of course I should upgrade, but 2.7.2.1 works and is installed. :)
>IBM, DEC/Compaq, and SGI have vendor compilers which are probably closer to what
>people consider "optimizing" compilers.
I've seen good evidence that SGI compilers are great and I haven't
personally used anything from IBM or DEC, but I can inform you for certain
that Sun's compiler sucks worse than gcc in this department (at least on
some code) and it is also extraordinarily popular. Maybe optimization-
concerned types shouldn't worry about what's popular but rather with what's
good, but often these optimization-concerned types are giving tips about
general programmer style. While I would like to assume that the compiler
will automatically compile *2 into a shift, I'm never going to bet on it on
a PC-targeted compiler, at least.
In other words, this isn't a call for action or even a bash, just
a gentle reminder that everyone who ever said "Oh any compiler does that,
you don't need to do that by hand" is likely wrong even though, for
readability, it's still usually best to write code assuming that the
compiler will optimize it (in other words, I usually write *2 even on
systems where I know the compiler won't optimize it to a shift). I sure
wish we could assume that all compilers on the market were up to date in
terms of the research that's been put into various optimizations, but many
(most?) are lacking.
(This one is not on my home page yet -- just saving myself some
embarrassment.)
// This is part of a chess program that I wrote for my own entertainment.
// There are some nice little goto's in it. The basic idea is to
// canonicalize the chess move order so that from any one given move
// this routine will produce the next move in the list, without actually
// generating a list. This is important for performance considerations
// since the Alpha-Beta algorithm has a common "early out" condition.
// There are three goto's in this one. Tell me how to get rid of them
// while retaining the performance and readability of the code (some of
// you might not consider this readable -- it's a work in progress, leave
// me alone.)
/************************ Generating moves ****************************/
int NextMoveGen( int * sqFrom, int * sqTo, int * pprom ) {
int side, frm, to, p, i, j, k, l, n, d, prom, pawn;
side = Board.ToMove;
frm = *sqFrom;
to = *sqTo;
p = Board.playfield[ frm ];
/* Precalculate promotion conditions */
prom=sqA2&(~7);
pawn=BP;
if( side==0 ) {
prom = sqA7&(~7);
pawn = WP;
}
/* If the from square is unoccupied, search for a square that
actually contains a player's piece */
while( p == NP || ((p&side) != side) ) {
NextFromSquare:;
frm++;
if( frm >= BoardSize ) return 0;
*sqFrom = frm;
*pprom = ((WQ & ~1) | side) + 2;
to = -1;
p = Board.playfield[ frm ];
}
if( p==pawn ) {
d = frm & (~7);
if( d == prom ) { // promotion.
l = 8;
if( side==1 ) l = BoardSize;
if( to<l-8 ) to = l-8;
while( to<l ) {
if( PlausibleMove[p][frm][to] && TestLegal(frm,to) ) {
*sqTo = to;
break;
}
NextPromoteTargetSquare:;
to++;
*pprom = ((WQ & ~1) | side) + 2;
}
if( to>=l ) goto NextFromSquare;
*pprom -= 2;
if( (*pprom)<(WN & BN) ) goto NextPromoteTargetSquare;
*sqTo = to;
return 1;
}
}
if( to==-1 ) to=frm;
d = PlausibleMove[p][frm][to];
d = ((d+32) & ~SpecialMove) - 32;
/* Inverse function converts move dir back into dir index */
l = ((byte *)(&DirBaseGrid[2][2]))[-d];
while(1) {
/* This loop right here has been flagged by VTUNE */
while( d==0 ) {
do {
l++;
if( l>16 ) goto NextFromSquare;
to = frm;
/* If (p,d) is illegal combination then cycle through the
next Dirs combination */
} while( PieceDirCheck[p][l]==0 );
d = Dirs[l];
}
/* Check for next move in current direction. */
k = (to+d) & (BoardSize-1);
/* This comparison has been flagged by VTUNE */
if( ((PlausibleMove[p][to][k] - d) & ~SpecialMove)==0 ) {
/* If piece is not King this should always be legal, and
therefore calling TestLegal should be redundant. */
if( TestLegal(frm,k) ) {
*sqTo = k;
return 1;
}
}
d = 0;
---------------------
/* If the from square is unoccupied, search for a square that
actually contains a player's piece */
newflag = 0;
while(1) {
while( p == NP || ((p&side) != side || newflag) ) {
NextFromSquare:;
newflag = 0;
frm++;
if( frm >= BoardSize ) return 0;
*sqFrom = frm;
*pprom = ((WQ & ~1) | side) + 2;
to = -1;
p = Board.playfield[ frm ];
}
if( p==pawn ) {
d = frm & (~7);
if( d == prom ) { // promotion.
l = 8;
if( side==1 ) l = BoardSize;
if( to<l-8 ) to = l-8;
while( to<l ) {
if( PlausibleMove[p][frm][to] && TestLegal(frm,to) ) {
*sqTo = to;
break;
}
NextPromoteTargetSquare:;
to++;
*pprom = ((WQ & ~1) | side) + 2;
}
/** The goto here is removed **/
if( to>=l )
{
newflag = 1;
continue;
}
*pprom -= 2;
if( (*pprom)<(WN & BN) ) goto NextPromoteTargetSquare;
*sqTo = to;
return 1;
}
}
}
--
Peace,
KSG
Droppin' Science in REAL AUDIO on KSDT Sundays 8pm-10pm (PST)
ARCHIVED SHOWS: http://scw.ucsd.edu/droppinscience
Personal: http://www.cs.ucsd.edu/~kgatlin
The testing I did last month (using the calendar code) indicates that
all the architectures we tried (x86, MIPS, Alpha, Sun, HP, i860, ARM,
IBM/360) will optimize unsigned mul & div of powers of two into shifts.
Except for the Sun Sparc compilers, all the systems had at least one
(i.e. modern) compiler which would also replace constant division (by a
non-power-of-two) with the corresponding reciprocal multiplication &
shift.
OTOH, every single compiler did require something like a -O flag to
actually do this. :-)
I've finally taken a look at those pages, after figuring out that the
URL was bad: .edu.edu was not a good domain. :-(
http://www.cs.uiowa.edu/~jones/bcd/
works much better.
>
> The name of the page comes from its original function, a repository for
> some schemes for doing BCD arithmetic on machines with no BCD hardware
> support. It's grown, and now contains extensive notes on efficient
> division of large numbers by 10 with a narrow ALU that can only add and
I like the nibble-based conversion between base16 and base10 numbers!
> subtract, and, based on that, notes on doing decimal conversion of binary numbers
> under such constraints -- the latter are spinoffs from my use of a
> Microchip PIC processor, a really marvelous little Harvard architecture
> microcontroller, with all of 422 bytes of free RAM, an 8 bit ALU, and
> 8K by 14 bits of ROM for instructions.
In similar circumstances, I'm really partial to using the 1/100
approximation 41 / 4096.
With inputs of more than 3 digits, this can result in a wrong answer,
but since it will be too large, the subsequent back-multiplication and
subtraction will detect it easily, and allow a fixup based on the carry
flag.
The intermediate radix-100 numbers can then be split into 2 digits using
103 / 1024 as an 'exact approximation' to 1/10 over the 0..99 range.
This approximation is accurate enough that there is no need for
back-multiplication and subtraction, the remainder can instead be
calculated like this:
// Get leading digit:
n *= 103; // n = (((((n << 1) + n) << 2) + n) << 3) - n;
d1 = n >> 10;
// Mask away the first digit
n &= 1023;
// Multiply the remainder by 10/2, and shift down:
n += n << 2; // n = n * 5;
d2 = n >> 9;
Similar ideas to this allows me to convert any 32-bit unsigned number to
ascii in about 50 cycles on a P6, a cpu on which a single DIV
instruction would take 40 cycles.
This is possible by using just 3 or 4 MUL operations, the rest is very
affordable shift/mask/lea opcodes.
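To make the two approximations concrete, here is a minimal sketch (function
name mine) that splits a value below 10000 into four ASCII digits with no DIV
at all, using 41/4096 for 1/100 (with the back-multiplication fixup) and
103/1024 for 1/10:

static void to_4_digits(unsigned n, char out[4])   /* assumes n < 10000 */
{
    int hi, lo;
    unsigned t;

    hi = (n * 41) >> 12;            /* approx n / 100, may be one too big */
    lo = (int)n - hi * 100;         /* back-multiply and subtract */
    if (lo < 0) {                   /* fix up the occasional overshoot */
        hi--;
        lo += 100;
    }
    t = hi * 103;                   /* 103/1024 is exact 1/10 over 0..99 */
    out[0] = (char)('0' + (t >> 10));
    out[1] = (char)('0' + (((t & 1023) * 5) >> 9));
    t = lo * 103;
    out[2] = (char)('0' + (t >> 10));
    out[3] = (char)('0' + (((t & 1023) * 5) >> 9));
}

For example, to_4_digits(1999, buf) first gets hi = 20, detects the negative
remainder, corrects to hi = 19, lo = 99, and produces '1','9','9','9'.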
This is a really interesting approach!
It makes it possible to do sw MPEG2 averaging (really motion
compensation) exactly, instead of effectively giving away half a bit
when working on 8 bytes simultaneously.
The only complication is that AFAIR MPEG2 specifies that this averaging
should round up instead of truncating, so I would have to modify the
algorithm a bit:
a = A & B; // Identical bits end up unmodified
x = A ^ B; // Different bits get shifted down
i = x & 1; // Was the bottom bit unequal?
mpeg2average = a + (x >> 1) + i;
I don't think it is possible to simplify this any further, unless you
can work with negated numbers all the way:
A' = 255 - A; B' = 255 - B;
Ave' = (A' & B') + ((A' ^ B') >> 1);
Ave = 255 - Ave';
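Here is a minimal SWAR sketch of the rounding-up average on eight packed bytes
(names mine). Masking (x >> 1) per byte keeps bits from leaking across byte
boundaries, and since the per-byte result never exceeds 255, a plain 64-bit
add is carry-safe:

#include <stdint.h>

static uint64_t mpeg2_avg8(uint64_t A, uint64_t B)
{
    uint64_t a = A & B;                             /* identical bits */
    uint64_t x = A ^ B;                             /* differing bits */
    uint64_t h = (x >> 1) & 0x7F7F7F7F7F7F7F7FULL;  /* per-byte x >> 1 */
    uint64_t i = x & 0x0101010101010101ULL;         /* per-byte round-up bit */
    return a + h + i;
}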
Really?
Ok, the jump at the end is a bit wasteful and the register allocation could
be better, but the multiplication looks fine. No -O.
fred:~% gcc -v
Reading specs from /usr/lib/gcc-lib/i386-unknown-linux/egcs-2.91.66/specs
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
fred:~% cat t5.c
int f(int i)
{
return i * 9;
}
fred:~% gcc -S t5.c
fred:~% cat t5.s
.file "t5.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.globl f
.type f,@function
f:
pushl %ebp
movl %esp,%ebp
movl 8(%ebp),%eax
movl %eax,%edx
sall $3,%edx
leal (%eax,%edx),%ecx
movl %ecx,%eax
jmp .L1
.p2align 4,,7
.L1:
leave
ret
.Lfe1:
.size f,.Lfe1-f
.ident "GCC: (GNU) egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)"
fred:~%
--
This is like TV. I don't like TV.
>Jeffrey S. Dutky wrote:
>> Heck, I think I've used variations of
>> this in some programs, in the form 'v=expr*cond;' in place of
>> 'if(cond)v=expr;' because I didn't want to incur a branch delay.
>> (probably not a win either)
>It would have been much better to make your condition codes be -1 or 0
>instead of 1 or 0, because then you could replace that relatively
>expensive integer multiplication with a simple mask:
> v=expr & cond;
Which I did all the time in BBC BASIC. I have always felt that -1/0
was superior to 1/0 for booleans.
Torben Mogensen (tor...@diku.dk)
Well, with this version you have made it re-evaluate the
while( p == NP || ((p&side) != side || newflag) ) {
condition. I don't remember if this is correct, but there is a very
large risk here that it isn't. Furthermore, no matter how smart the
compiler is, it would not be able to transform your code back to my code
because this comparison cannot be determined (p is indeed modified in the
code.)
> [...] It's probably as readable, but not as efficient. Of course a famous
> president once said of the type of change I've made, "This is the sort of
> 'C' up with which I will not put."
Exactly. You can try to get rid of the gotos but what is the point? It
doesn't help any.
For a language that is inherently 2's complement based that is fine.
For a language that isn't that means you get different bit patterns
depending on what platform the code happens to be being run on. To
evaluate a feature or design decision you have to do so in context,
not in isolation. For example many people criticise C's string format
for being inefficient (for strlen) and not being able to represent
null characters as part of the string data. However they work well
generally and fit in extremely well with C's arrays and general
type system.
> >It would have been much better to make your condition codes be -1 or 0
> >instead of 1 or 0, because then you could replace that relatively
> >expensive integer multiplication with a simple mask:
> > v=expr & cond;
> Which I did all the time in BBC BASIC. I have always felt that -1/0
> was superior to 1/0 for booleans.
v = expr & ~(cond-1);
Given that evaluating (cond) already involves opcodes, on most machines, that
set flags and require either a branch or something like "~(((reg>>n)&1)-1)"
to convert from the flag into -1/0... I'm not sure that's going to be a win
in a compiled language.
--
In hoc signo hack, Peter da Silva <pe...@baileynm.com>
`-_-' Ar rug tú barróg ar do mhactíre inniu?
'U`
"I *am* $PHB" -- Skud.
Sorry, I guess I wasn't clear enough:
The relevant optimization was to replace a division with the
corresponding reciprocal multiplication, i.e. if you had code like this:
unsigned div_11(unsigned i)
{
return i / 11;
}
it should generate code effectively similar to this:
div_11:
mov eax,[esp+4]
mov edx,3123612579 ; ceil(2^35 / 11)
mul edx
mov eax,edx
shr eax,3
ret
gcc 2.7.2.3, as installed on a brand-new FreeBSD-stable system, is not
capable of figuring this out.
It will generate a DIV instruction instead, even though this results in
code that takes about an order of magnitude longer.
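In C terms, the identity behind that asm is simply the following (a sketch,
names mine); the multiply has to be done in 64 bits so the high half survives,
which is exactly what the MUL/EDX sequence above provides:

#include <stdint.h>

static uint32_t div_11(uint32_t i)
{
    /* ceil(2^35 / 11) = 3123612579; exact for every 32-bit i */
    return (uint32_t)(((uint64_t)i * 3123612579u) >> 35);
}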
egcs 1.1 generates the mul/shr combination, regardless of -O. Without
-O it does some scary stack/register shuffling though. gcc 2.7.2.3 is
over 4 years old BTW. The reason why I found your original statement
hard to believe is that such optimizations (div -> shift, simple constant
elimination etc.) usually[1] run on the expression tree directly after the
parsing. Because it is not useful to compile C without simple constant
elimination on the statement level this pass is usually always run.
-Andi (who is looking forward to gcc 2.96, because that will be the
first gcc to do scheduling for x86[2])
[1] At least on all C compilers I've source for so far.
[2] The gcc snapshots generate ~30% faster code for some of my programs
with heavy integer inner loops, and it constantly beats the Fujitsu
C compiler.
Paul Hsieh wrote:
> KSG wrote:
> > I think getting rid of one of them can be done the way I've done here. I'm
> > not 100% sure it works though; actually there's a good chance that it
> > doesn't (as with all my code).
>
> Well, with this version you have made it re-evaluate the
>
> while( p == NP || ((p&side) != side || newflag) ) {
>
> condition. I don't remember if this is correct, but there is a very
> large risk here that it isn't. Furthermore, no matter how smart the
> compiler is, it would not be able to transform your code back to my code
> because this comparison cannot be determined (p is indeed modified in the
> code.)
I threw the 'newflag' in there. The rest of the expression looked side-effect
free so I thought "no foul, no harm".
KSG
> > [...] And with the current crop of optimizing compilers, it usually
> > isn't significantly faster.
>
> I don't know how many times I've heard this crap. It wasn't true the
> first time I heard it and its no more true today. Only unskilled
> assembly programmers are unable to beat their compiler.
When G*d's Optimizing Compiler is finally released, there will therefore
be no more skilled assembly programmers, since none will be able to beat
it.
In the absence of GOC your self-fertilizing[1] assertion is irrefutable,
in that it's a sideways definition of "skilled assembly programmer". Who
are scarcer than hen's teeth. I know - I am one.
[1] For suitable values of "fertilizer"
Ed "will spot tautologies for food" Kaulakis
>Greg Alexander wrote:
>> In other words, this isn't a call for action or even a bash, just
>> a gentle reminder that everyone who ever said "Oh any compiler does that,
>> you don't need to do that by hand" is likely wrong even though, for
>> readability, it's still usually best to write code assuming that the
>> compiler will optimize it (in other words, I usually write *2 even on
>> systems where I know the compiler won't optimize it to a shift). I sure
>> wish we could assume that all compilers on the market were up to date in
>> terms of the research that's been put into various optimizations, but many
>> (most?) are lacking.
>
>The testing I did last month (using the calendar code) indicates that
>all the architectures we tried (x86, MIPS, Alpha, Sun, HP, i860, ARM,
>IBM/360) will optimize unsigned mul & div of powers of two into shifts.
>
>Except for the Sun Sparc compilers, all the systems had at least one
>(i.e. modern) compiler which would also replace constant division (by a
>non-power-of-two) with the corresponding reciprocal multiplication &
>shift.
>
>OTOH, every single compiler did require something like a -O flag to
>actually do this. :-)
>
If we want to talk "most compilers" by usage, Metrowerks on the mac, MrC
on the mac and MS VC all do this sort of thing, to the extent that it is
simply considered obviously part of what is offered, not something to make
a song and dance about. In addition to the obvious things like the above,
certainly MW and MrC handle integer division by constants through
multiplication by magic numbers, and convert certain types of "if" type
operations into bit twiddling
(eg f=(g==0)). In all three cases, when they consider optimization they
are considering substantially more sophisticated transformations.
As for only doing this stuff with the -O flag on, MW is largely that way
for the very good reason that when you are debugging (especially when you are
walking through MacsBug assembly on one machine while looking at the code
on another machine) you damn well want as few transformations on the
code as possible to complicate it. If the code is written as a multiply or
a divide, you want to be able to see a multiply or a divide in the
assembly to make sure you're still in sync.
Maynard