/* * A while back I posted a solution to the "count of bits set in a * long" question, suggesting that HACKMEM 169 was a very fast way to do it. * This program demonstrates the technique. * * The HACKMEM 169 method uses a modulo to sum the digits. * A number of people have asked me "how can a division be faster than a loop?" * This program demonstrates how. The modulo is a special case, * and can be unwound into a tight loop if your machine does not * have a fast divide opcode (although many modern machines have a 1 cycle * divide opcode, except for we-really-believe-our-press-hype risc * manufacturers). * * This program was run on a 386/20 under unix with the following results: * Func sec/iter scaled * nbits 0.00018310547 1.000 hackmem 169 * nbits1 0.0006980896 3.812 hackmem 169 no % operator * nbits2 0.0016822815 9.188 count lsb's loop * nbits3 0.0037384033 20.417 * nbits4 0.0034370422 18.771 * nbits5 0.0037765503 20.625 * nbits6 0.0035247803 19.250 */
/* * This function is used to calibrate the timing mechanism. * This way we can subtract the loop and call overheads. */ int calibrate(n) register unsigned long n; { return 0;
}
/* * This function counts the number of bits in a long. * It is limited to 63 bit longs, but a minor mod can cope with 511 bits. * * It is so magic, an explanation is required: * Consider a 3 bit number as being * 4a+2b+c * if we shift it right 1 bit, we have * 2a+b * subtracting this from the original gives * 2a+b+c * if we shift the original 2 bits right we get * a * and so with another subtraction we have * a+b+c * which is the number of bits in the original number. * Suitable masking allows the sums of the octal digits in a * 32 bit bumber to appear in each octal digit. This isn't much help * unless we can get all of them summed together. * This can be done by modulo arithmetic (sum the digits in a number by molulo * the base of the number minus one) the old "cating out nines" trick they * taught in school before calculators were invented. * Now, using mod 7 wont help us, because our number will very likely have * more than 7 bits set. So add the octal digits together to get base64 * digits, and use modulo 63. * (Those of you with 64 bit machines need to add 3 octal digits together to * get base512 digits, and use mod 511.) * * This is HACKMEM 169, as used in X11 sources. */
int nbits(n) register unsigned long n; { register unsigned long tmp;
/* * This is the same as the above, but does not use the % operator. * Most modern machines have clockless division, and so the modulo is as * expensive as, say, an addition. */ int nbits1(n) register unsigned long n; { register unsigned long tmp;
/* * This function counts the bits in a long. * * It removes the lsb and counting the number of times round the loop. * The expression (n & -n) yields the lsb of a number, * but it only works on 2's compliment machines. */ int nbits2(n) register unsigned long n; { register int count;
for (count = 0; n; n -= (n & -n)) count++; return count;
}
/* * This function counts the bits in a long. * * It works by shifting the number left and testing the top bit. * On many machines shift is expensive, so it uses a cheap addition instead. */ int nbits3(n) register unsigned long n; { register int count;
for (count = 0; n; n += n) if (n & ~(~(unsigned long)0 >> 1)) count++; return count;
}
/* * This function counts the bits in a long. * * It works by shifting the number down and testing the bottom bit. */ int nbits4(n) register unsigned long n; { register int count;
for (count = 0; n; n >>= 1) if (n & 1) count++; return count;
}
/* * This function counts the bits in a long. * * It works by masking each bit. * This is the second most intuitively obvious method, * and is independent of the number of bits in the long. */ int nbits5(n) register unsigned long n; { register int count; register unsigned long mask;
count = 0; for (mask = 1; mask; mask += mask) if (n & mask) count++; return count;
}
/* * This function counts the bits in a long. * * It works by masking each bit. * This is the most intuitively obvious method, * but how do you a priori know how many bits in the long? * (except for ''sizeof(long) * CHAR_BITS'' expression) */ int nbits6(n) register unsigned long n; { register int count; register int bit;
count = 0; for (bit = 0; bit < 32; ++bit) if (n & ((unsigned long)1 << bit)) count++; return count;
}
/* * build a long random number. * The standard rand() returns a 15bit number. */ unsigned long bigrand() { return ( ((unsigned long)(rand() & 0xFFF) << 24) | ((unsigned long)(rand() & 0xFFF) << 12) | (unsigned long)(rand() & 0xFFF) );
}
/* * Run the test many times. * You will need a running time in the 10s of seconds * for an accurate result. * * To give a fair comparison, the random number generator * is seeded the same for each measurement. * * Return value is seconds per iteration. */ #define REPEAT (1L<<18)
double measure(func) int (*func)(); { long j; struct tms start; struct tms finish;
main() { double calibration; double best; int j, k, m;
/* * run a few tests to make sure they all agree */ srand(getpid()); for (j = 0; j < 10; ++j) { unsigned long n; int bad;
n = bigrand(); for (k = 0; k < SIZEOF(table); ++k) table[k].trial = table[k].func(n); bad = 0; for (k = 0; k < SIZEOF(table); ++k) for (m = 0; m < SIZEOF(table); ++m) if (table[k].trial != table[m].trial) bad = 1; if (bad) { printf("wrong:\t%08lX", n); for (k = 0; k < SIZEOF(table); ++k) printf("\t%d", table[k].trial); printf("\n"); } }
/* * calibrate the timing mechanism */ calibration = measure(calibrate);
/* * time them all, keeping track of the best (smallest) */ for (j = 0; j < SIZEOF(table); ++j) { table[j].measurement = measure(table[j].func) - calibration; if (!j || table[j].measurement < best) best = table[j].measurement; } for (j = 0; j < SIZEOF(table); ++j) { printf ( "%s\t%10.8g\t%6.3f\n", table[j].name, table[j].measurement, table[j].measurement / best ); } exit(0);
}
Regards Peter Miller UUCP uunet!munnari!pdact.pd.necisa.oz!pmiller ARPA pmiller%pdact.pd.necisa.oz@australia /\/\* CSNET pmiller%pdact.pd.necisa.oz@au ACSnet pmil...@pdact.pd.necisa.oz Disclaimer: The views expressed here are personal and do not necessarily reflect the view of my employer or the views or my colleagues. D
In article <8...@neccan.oz> pe...@neccan.oz (Peter Miller) sends out a program to time various bit-counting functions. I decided to `improve' the program a bit :-) and add a number of additional functions. I have run the resulting program (which I will append below) on a number of architectures. The `mod' version is never the fastest of the hackmem variants. The fastest function is always table lookup, except on the Sun-4.
(I find the last two results particularly interesting since his code as posted uses functions nbits3 and nbits4 to calculate the times printed for nbits5 and nbits6, never evaluating 5 and 6 at all.)
Here is a short key to names: my name (Peter's name) technique used ---------------------- -------------- hackmemmod (nbits) HACKMEM 169, using % operator hackmemloop (nbits1) HACKMEM 169, using loop to implement % hackmemunrolled (-) HACKMEM 169, with 5-step % (loop unrolled) rmlsbsub (nbits2) remove lsb with `n -= (n & -n)' rmlsbmask (-) remove lsb with `n &= n - 1' testlsb (nbits4) test n&1, then n>>=1 testmsb (nbits3) test n&MSB, then n+=n (rather than n<<=1) testsignandshift (-) test n<0, then n<<=1 testeachbit (nbits5, untimed) test n&mask, then mask+=mask testeachbit1shl (nbits6, ") test n&(1<<bit) for bit in [0..32) tableshift nbits[n>>24] + nbits[(n>>16)&255] ... tableuchar set p=&n, nbits[p[0]] + nbits[p[1]] ...
-------------------------------------------------- SparcStation 1+ (SPARC, private memory bus) These times look a bit odd, and suggest that the compiler could use some work. function time ratio hackmemmod 1.4972687e-06 12.077 hackmemloop 7.3432922e-07 5.923 hackmemunrolled 1.1825562e-06 9.538 rmlsbsub 1.3351440e-07 1.077 rmlsbmask 1.4305115e-07 1.154 testlsb 1.4305115e-07 1.154 testmsb 1.3351440e-07 1.077 testsignandshift 1.2397766e-07 1.000 testeachbit 6.6280365e-06 53.462 testeachbit1shl 9.2029572e-06 74.231 tableshift 7.2479248e-07 5.846 tableuchar 1.1539459e-06 9.308
There are some noteworthy points above, particularly the example with the 8600, where the choice of compiler makes a great deal of difference. The `testlsb' loop runs much faster under gcc, which eliminated one `tstl' instruction, yet some of the other loops run more slowly.
So that others can run their own tests, here is the program. I added several functions, renamed them all, changed the timing mechanism on BSD machines, and reformatted things a bit. I changed the random number selection mechanism to be relatively machine independent (assuming they all have the same `rand' function) and to use values that are `more random'. I also changed the output format (among other things, it uses `%#g'; the Sun SparcStation printf does not handle this correctly, and I added trailing zeroes manually above). Oh yes, I also had to make it run longer on the MIPS and Sparc machines to get reliable times.
(You can tell which functions I added, as they are generally uncommented. :-) )
#ifdef FD_SETSIZE # define USE_GETRUSAGE # include <sys/time.h> # include <sys/resource.h> #else # include <sys/times.h> #endif
#define SIZEOF(a) (sizeof(a)/sizeof(a[0]))
/* * This function is used to calibrate the timing mechanism. * This way we can subtract the loop and call overheads. */ int calibrate(n) register unsigned long n; { return 0;
}
/* * This function counts the number of bits in a long. * It is limited to 63 bit longs, but a minor mod can cope with 511 bits. * * It is so magic, an explanation is required: * Consider a 3 bit number as being * 4a+2b+c * if we shift it right 1 bit, we have * 2a+b * subtracting this from the original gives * 2a+b+c * if we shift the original 2 bits right we get * a * and so with another subtraction we have * a+b+c * which is the number of bits in the original number. * Suitable masking allows the sums of the octal digits in a * 32 bit bumber to appear in each octal digit. This isn't much help * unless we can get all of them summed together. * This can be done by modulo arithmetic (sum the digits in a number by molulo * the base of the number minus one) the old "cating out nines" trick they * taught in school before calculators were invented. * Now, using mod 7 wont help us, because our number will very likely have * more than 7 bits set. So add the octal digits together to get base64 * digits, and use modulo 63. * (Those of you with 64 bit machines need to add 3 octal digits together to * get base512 digits, and use mod 511.) * * This is HACKMEM 169, as used in X11 sources. */ int t0_hackmemmod(n) register unsigned long n; { register unsigned long tmp;
/* * This is the same as the above, but does not use the % operator. * Most modern machines have clockless division, and so the modulo is as * expensive as, say, an addition. */ int t1_hackmemloop(n) register unsigned long n; { register unsigned long tmp;
/* * Above, without using while loop. It takes at most 5 iterations of the * loop, so we just do all 5 in-line. The final result is never 63 * (this is assumed above as well). */ int t2_hackmemunrolled(n) register unsigned long n; { register unsigned long tmp;
pe...@neccan.oz (Peter Miller) writes: >/* > * A while back I posted a solution to the "count of bits set in a > * long" question, suggesting that HACKMEM 169 was a very fast way to do it. > * This program demonstrates the technique. > * > * The HACKMEM 169 method uses a modulo to sum the digits. > * A number of people have asked me "how can a division be faster than a loop?" > * This program demonstrates how. The modulo is a special case, > * and can be unwound into a tight loop if your machine does not > * have a fast divide opcode (although many modern machines have a 1 cycle > * divide opcode, except for we-really-believe-our-press-hype risc > * manufacturers). > * > * This program was run on a 386/20 under unix with the following results: > * Func sec/iter scaled > * nbits 0.00018310547 1.000 hackmem 169 > * nbits1 0.0006980896 3.812 hackmem 169 no % operator > * nbits2 0.0016822815 9.188 count lsb's loop > * nbits3 0.0037384033 20.417 > * nbits4 0.0034370422 18.771 > * nbits5 0.0037765503 20.625 > * nbits6 0.0035247803 19.250 > */ > ..... A program follows this.
To this program I have added a bug fix (the function table was wrong) and another function (nbit7) which is MY FAVORITE WAY of counting bits. The patches are at the end of this posting....
Anyway, I got much better results on both the 386 (which has a multiply) and on a SPARC (which doesn't) than any of the functions in the original program.
The nbit7 technique is a simple routine which shifts and sums bits in place until all bits have summed in the lower 8 bits of the word. The implementation here requires 32 bit longs, but it is easily extensible to 64 bits with a mask change and the additional shift shown.
Enjoy,
------------- Remainder of Posting is a Cdiff Patchfile ----------
Something is seriously wrong with either these numbers or that computer. I've tried a couple of these on my Mac II, using Think C. Note that 1) Apple is not noted for squeezing the last drop of performance out of the 68020, and 2) Think C has not stressed optimal code generation.
I get 36 usec for rmlsub on longs with all 32 bits set, 20 usec if 16 bits are set, and 5 usec of no bits are set. Note that the Sun is getting, in the average case for rmlsub, 20 times worse performance than I am getting for the worst case.
In article <26...@mimsy.umd.edu> I posted a modified version of Peter Miller's bit-count timing program. Unfortunately, my version had a bug that effectively makes the results worthless (depending on compiler vagaries; it `just happens' to compile to working code on some systems).
Remarkably few people noted the bug (or rather, sent me mail about it). Two people sent me additional functions, however. Arthur Olson added four (all variants on table lookup); Mike Weaver added one. Gene Olson has also posted a different modified version of Peter Miller's original program; this contains a better version of the latter.
So, here are new results for the new set of (now 17) functions on the same set of machines, along with the new program. Noteworthy facts:
- hackmem invariably loses to some other technique, always including Gene Olson's, though sometimes not by much; - the fastest is usually (but not always) one of the table lookups; - tiny differences in ratios (less than .1) are usually insignificant.
name technique used ---- -------------- hackmemmod HACKMEM 169, using % operator hackmemloop HACKMEM 169, using loop to implement % hackmemunrolled HACKMEM 169, with 5-step % (loop unrolled) rmlsbsub remove lsb with `n -= (n & -n)' rmlsbmask remove lsb with `n &= n - 1' testlsb test n&1, then n>>=1 testmsb test n&MSB, then n+=n (rather than n<<=1) testsignandshift test n<0, then n<<=1 testeachbit test n&mask, then mask+=mask testeachbit1shl test n&(1<<bit) for bit in [0..32) tableshift nbits[n>>24] + nbits[(n>>16)&255] ... tableuchar set p=&n, nbits[p[0]] + nbits[p[1]] ... tableshiftcast nbits[n>>24] + nbits[(u_char)(n>>16)] ... itableshift same as tableshift, but table datatype is int itableuchar ditto itableshiftcast ditto (note, nbits table is `char') sumbits Gene Olson's function (see code for comments)
/* * bct - bitcount timing * * Runs a bunch of different functions-to-count-bits-in-a-longword * (many assume 32 bits per long) with a timer around each loop, and * tries to calcuate the time used. */ #include <stdio.h> #include <sys/types.h>
#ifdef FD_SETSIZE # define USE_GETRUSAGE # include <sys/time.h> # include <sys/resource.h> #else # include <sys/times.h> #endif
#define SIZEOF(a) (sizeof(a)/sizeof(a[0]))
/* * This function is used to calibrate the timing mechanism. * This way we can subtract the loop and call overheads. */ int calibrate(n) register unsigned long n; { return 0;
}
/* * This function counts the number of bits in a long. * It is limited to 63 bit longs, but a minor mod can cope with 511 bits. * * It is so magic, an explanation is required: * Consider a 3 bit number as being * 4a+2b+c * if we shift it right 1 bit, we have * 2a+b * subtracting this from the original gives * 2a+b+c * if we shift the original 2 bits right we get * a * and so with another subtraction we have * a+b+c * which is the number of bits in the original number. * Suitable masking allows the sums of the octal digits in a * 32 bit number to appear in each octal digit. This isn't much help * unless we can get all of them summed together. * This can be done by modulo arithmetic (sum the digits in a number by molulo * the base of the number minus one) the old "casting out nines" trick they * taught in school before calculators were invented. * Now, using mod 7 wont help us, because our number will very likely have * more than 7 bits set. So add the octal digits together to get base64 * digits, and use modulo 63. * (Those of you with 64 bit machines need to add 3 octal digits together to * get base512 digits, and use mod 511.) * * This is HACKMEM 169, as used in X11 sources. */ int t0_hackmemmod(n) register unsigned long n; { register unsigned long tmp;
/* * This is the same as the above, but does not use the % operator. * Most modern machines have clockless division, and so the modulo is as * expensive as, say, an addition. */ int t1_hackmemloop(n) register unsigned long n; { register unsigned long tmp;
ch...@mimsy.umd.edu (Chris Torek) writes: >In article <8...@neccan.oz> pe...@neccan.oz (Peter Miller) sends out a >program to time various bit-counting functions. I decided to `improve' >the program a bit :-) and add a number of additional functions. I have >run the resulting program (which I will append below) on a number of >architectures. The `mod' version is never the fastest of the hackmem >variants. The fastest function is always table lookup, except on the >Sun-4.
I also posted an "improved" version of this program that got horribly erratic and odd results.
Looking at all this closely, it seems the erratic results on this "test" are due two 3 problems:
1) The original program had a bug in the procedure table which tested the wrong procedures. Both Chris and I fixed this.
2) The original program neglected to return a value in procedure bigrand(). On the Sun4, this means all the "random" data was zero. This kinda skewed the results.
I just fixed this by putting in the return().
3) The bigrand() procedure was called every invokation of the test procedure. Since for the fast routines it takes *MUCH* longer to generate these numbers than to count the bits, the noise level is extreme.
I just fixed this by creating an array once, and then referencing it. The test runs about 6 times faster now, and it is much more consistent.
So the bottom line is that most of the results posted so far are largely nonsense. So, in the spirit of throwing good net bandwidth after bad, I am posting the blasted thing again. I hope this discussion is useful to *someone*.
Earlier I had added a test now called "suminplace" to the list which I believed was faster than any other way of doing it. (Except maybe a 16-bit lookup table). It turns out, that an 8-bit lookup table beats it.
Blast the torpedos, I still like my method best. Maybe it will run faster on another machine or another compiler. It certainly is smaller.
Perhaps Chris will find it worth his time to run the thing on his list of machines.
-------------------- Yet another posting ----------------------
#include <stdio.h> #include <sys/types.h>
#ifdef FD_SETSIZE # define USE_GETRUSAGE # include <sys/time.h> # include <sys/resource.h> #else # include <sys/times.h> #endif
#define SIZEOF(a) (sizeof(a)/sizeof(a[0]))
#define NUMRAND 16384 unsigned long rr[NUMRAND] ;
/* * This function is used to calibrate the timing mechanism. * This way we can subtract the loop and call overheads. */ int calibrate(n) register unsigned long n; { return 0;
}
/* * This function counts the number of bits in a long. * It is limited to 63 bit longs, but a minor mod can cope with 511 bits. * * It is so magic, an explanation is required: * Consider a 3 bit number as being * 4a+2b+c * if we shift it right 1 bit, we have * 2a+b * subtracting this from the original gives * 2a+b+c * if we shift the original 2 bits right we get * a * and so with another subtraction we have * a+b+c * which is the number of bits in the original number. * Suitable masking allows the sums of the octal digits in a * 32 bit bumber to appear in each octal digit. This isn't much help * unless we can get all of them summed together. * This can be done by modulo arithmetic (sum the digits in a number by molulo * the base of the number minus one) the old "cating out nines" trick they * taught in school before calculators were invented. * Now, using mod 7 wont help us, because our number will very likely have * more than 7 bits set. So add the octal digits together to get base64 * digits, and use modulo 63. * (Those of you with 64 bit machines need to add 3 octal digits together to * get base512 digits, and use mod 511.) * * This is HACKMEM 169, as used in X11 sources. */ int t0_hackmemmod(n) register unsigned long n; { register unsigned long tmp;
/* * This is the same as the above, but does not use the % operator. * Most modern machines have clockless division, and so the modulo is as * expensive as, say, an addition. */ int t1_hackmemloop(n) register unsigned long n; { register unsigned long tmp;
/* * Above, without using while loop. It takes at most 5 iterations of the * loop, so we just do all 5 in-line. The final result is never 63 * (this is assumed above as well). */ int t2_hackmemunrolled(n) register unsigned long n; { register unsigned long tmp;
/* * This function counts the bits in a long. * * It removes the lsb and counting the number of times round the loop. * The expression (n & -n) yields the lsb of a number, * but it only works on 2's compliment machines. */ int t3_rmlsbsub(n) register unsigned long n; { register int count;
for (count = 0; n; n -= (n & -n)) count++; return count;
}
int t4_rmlsbmask(n) register unsigned long n; { register int count;
for (count = 0; n; count++) n &= n - 1; /* take away lsb */ return (count);
}
/* * This function counts the bits in a long. * * It works by shifting the number down and testing the bottom bit. */ int t5_testlsb(n) register unsigned long n; { register int count;
for (count = 0; n; n >>= 1) if (n & 1) count++; return count;
}
/* * This function counts the bits in a long. * * It works by shifting the number left and testing the top bit. * On many machines shift is expensive, so it uses a cheap addition instead. */ int t6_testmsb(n) register unsigned long n; { register int count;
for (count = 0; n; n += n) if (n & ~(~(unsigned long)0 >> 1)) count++; return count;
}
int t7_testsignandshift(n) register unsigned long n; { register int count;
for (count = 0; n; n <<= 1) if ((long)n < 0) count++; return (count);
}
/* * This function counts the bits in a long. * * It works by masking each bit. * This is the second most intuitively obvious method, * and is independent of the number of bits in the long. */ int t8_testeachbit(n) register unsigned long n; { register int count; register unsigned long mask;
count = 0; for (mask = 1; mask; mask += mask) if (n & mask) count++; return count;
}
/* * This function counts the bits in a long. * * It works by masking each bit. * This is the most intuitively obvious method, * but how do you a priori know how many bits in the long? * (except for ''sizeof(long) * CHAR_BITS'' expression) */ int t9_testeachbit1shl(n) register unsigned long n; { register int count; register int bit;
count = 0; for (bit = 0; bit < 32; ++bit) if (n & ((unsigned long)1 << bit)) count++; return count;
I took Gene Olson's test program and stuck Torek's routines into it, as well as adding routines to count bits by using a table with a 16-bit index, and a slightly sped-up version of HAKMEM 169 (`hackmemallfold'). The 16-bit table lookup appears to be fastest, at least on a Sun-3/50 and a Sparcstation 1. `hackmemallfold' and `sumbits' both use a total of 16 arithmetic, logical, and shift operations. However, if gcc is used `hackmemallfold' is apparently faster (because gcc refuses to put constants in registers?)
name technique used ---- -------------- hackmemmod HACKMEM 169, using % operator hackmemloop HACKMEM 169, using loop to implement % hackmemunrolled HACKMEM 169, with 5-step % (loop unrolled) hackmemallfold HACKMEM 169, no %, fold to the bitter end rmlsbsub remove lsb with `n -= (n & -n)' rmlsbmask remove lsb with `n &= n - 1' testlsb test n&1, then n>>=1 testmsb test n&MSB, then n+=n (rather than n<<=1) testsignandshift test n<0, then n<<=1 testeachbit test n&mask, then mask+=mask testeachbit1shl test n&(1<<bit) for bit in [0..32) tableshift nbits[n>>24] + nbits[(n>>16)&255] ... tableuchar set p=&n, nbits[p[0]] + nbits[p[1]] ... tableshiftcast nbits[n>>24] + nbits[(u_char)(n>>16)] ... itableshift same as tableshift, but table datatype is int itableuchar ditto itableshiftcast ditto (note, nbits table is `char') 16tableshift nbits16[n>>16] + nbits16[n&65535] 16tableushort set p=&n, nbits16[p[0]] + nbits16[p[1]] 16tableshiftcast nbits16[n>>16] + nbits16[(u_short)(n)] 16itableshift same as 16tableshift, but table datatype is int 16itableushort ditto 16itableshiftcast ditto sumbits Gene Olson's function (see code for comments)
#ifdef FD_SETSIZE # define USE_GETRUSAGE # include <sys/time.h> # include <sys/resource.h> #else # include <sys/times.h> #endif
#define SIZEOF(a) (sizeof(a)/sizeof(a[0]))
#define NUMRAND 16384 unsigned long rr[NUMRAND];
/* * This function is used to calibrate the timing mechanism. * This way we can subtract the loop and call overheads. */ int calibrate(n) register unsigned long n; { return 0;
}
/* * This function counts the number of bits in a long. * It is limited to 63 bit longs, but a minor mod can cope with 511 bits. * * It is so magic, an explanation is required: * Consider a 3 bit number as being * 4a+2b+c * if we shift it right 1 bit, we have * 2a+b * subtracting this from the original gives * 2a+b+c * if we shift the original 2 bits right we get * a * and so with another subtraction we have * a+b+c * which is the number of bits in the original number. * Suitable masking allows the sums of the octal digits in a * 32 bit number to appear in each octal digit. This isn't much help * unless we can get all of them summed together. * This can be done by modulo arithmetic (sum the digits in a number by molulo * the base of the number minus one) the old "casting out nines" trick they * taught in school before calculators were invented. * Now, using mod 7 wont help us, because our number will very likely have * more than 7 bits set. So add the octal digits together to get base64 * digits, and use modulo 63. * (Those of you with 64 bit machines need to add 3 octal digits together to * get base512 digits, and use mod 511.) * * This is HACKMEM 169, as used in X11 sources. */ int t0_hackmemmod(n) register unsigned long n; { register unsigned long tmp;
/* * This is the same as the above, but does not use the % operator. * Most modern machines have clockless division, and so the modulo is as * expensive as, say, [sic] an addition. */ int t1_hackmemloop(n) register unsigned long n; { register unsigned long tmp;
/* * Above, without using while loop. It takes at most 5 iterations of the * loop, so we just do all 5 in-line. The final result is never 63 * (this is assumed above as well). */ int t2_hackmemunrolled(n) register unsigned long n; { register unsigned long tmp;
/* * As above, but no modulus nor simulation thereof. Instead, * we continue shifting and masking to the bitter end. */ int t3_hackmemallfold(n) register unsigned long n; { register unsigned long tmp;
In article <34...@cup.portal.com> t...@cup.portal.com (Tim W Smith) writes:
>... I've tried a couple of these on my Mac II, using Think C. Note >that 1) Apple is not noted for squeezing the last drop of performance >out of the 68020, and 2) Think C has not stressed optimal code generation. >I get [... numeric value deleted, but] the Sun is getting ... >20 times worse performance. [Why?]
Two obvious possibilities: you may have been running the buggy version of the tests (no value returned from bigrand()), or---more likely--- you have a System V machine with no reasonable way of obtaining CPU resource usage. (This is not to say that no System V machine has a reasonable facility a la getrusage().) The `times' call used in the original bit-counting program returns time values in `clock ticks'. There is no reliable way to discover how long a `clock tick' is. It is often 1/60th of a second, but often 1/50th or 1/100th or some other value. On some system you can `#include <sys/param.h>' and get a manifest constant called `HZ' that tells you the default divisor for 1---that is, the number such that HZ ticks elapse every second---but this is merely the default; the actual running system might be using some other value (and might even change it `at whim', so to speak). On other systems you can fish the actual value of hz out of the system. The method is highly machine-dependent, and may give you the wrong answer (e.g., under Ultrix on a DECstation 3100, hz is normally 256, yet the values from times() are always in 1/60th of a second, because times() is actually implemented using getrusage() with a hardcoded `60 hz' inside the times() function.)
This is why I put the getrusage() call into the program. getrusage() fills in a machine- and system-independent structure with values in seconds and microseconds. The microseconds field is typically inaccurate (being a step function with a grain other than 1), but its units are known, unlike those for times().
If you have a BSD-based system, the values reported by the timing function in the bit-count timing program have some chance of being useful; if not, they are much more chancy. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750) Domain: ch...@cs.umd.edu Path: uunet!mimsy!chris
Something is seriously wrong with this bit-counting benchmark. I'm running on a Motorola Delta 3600 (25MHz 68030), and my time are 2 orders of magnitude slower than those reported for a Sun 3/50!!!!
In order to get this to compile on my SYSV machine, I had to add the line:
#undef FD_SETSIZE
because there's no resource.h on this machine. Oh yes, it's gcc -O.
-- _ Kevin D. Quitt demott!kdq k...@demott.com DeMott Electronics Co. 14707 Keswick St. Van Nuys, CA 91405-1266 VOICE (818) 988-4975 FAX (818) 997-1190 MODEM (818) 997-4496 PEP last
When benchmarking something like this, a good thing to do is to do a back-of-the-envelope estimate of what you expect. For example, the Mac II runs at 16 MHz. Assume a few wait states, and loop overhead, and stuff like that will limit us to 1 "useful" instruction every 8 clocks, so we get 2 "useful" MIPS. The operation x &= x-1 will take three operations. At 2 MIPS, an operation is 1/2 uSec, so we have 1.5 uSec per bit. The loop runs an average of 16 times, so we expect that we can do it in 24 uSec.
The measured time was something like 20 uSec, so we pass.
If you see numbers in the hundreds of microseconds, you can be almost sure that something is wrong with the program.