Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Bit magic

697 views
Skip to first unread message

Matt Taylor

unread,
Jun 25, 2003, 11:35:07 PM6/25/03
to
I've been thinking about the problem of quickly finding the
least-significant one bit. I've come up with a general formula (a big
guess) that seems to work in all cases:

int table[1 << bits] = {...};

int lsb(int x)
{
// Create mask of ones based on least significant one in x.
x ^= (x - 1);
return table[(x * magic) >> (bits - log2(bits))];
}

Thus far I have found solutions for magic when working in 16-bits,
32-bits, and 64-bits. Here are some misc. questions:
1) does every power of 2 length of integer (i.e. 8, 16, 32, 64, ...)
have a magic number that satisfies the above?
2) is there any way to find the magic number without using brute
force?
3) is there any pattern to the table that the high bits index?

I really don't understand how/why this works, so if anyone has any
insight, I would appreciate it.

-Matt

Ben Peddell

unread,
Jun 26, 2003, 11:42:39 AM6/26/03
to

ok.
below, ? means don't care, and '...' means repeat _x_ to int length

(right hand side in binary)
When (x & 1 == 1), x ^ (x - 1) = ?1 ^ (?1 - 1)
= ?1 ^ ?0
= 1b
When (x & 2 == 2), x ^ (x - 1) = ?10 ^ (?10 - 1)
= ?10 ^ ?01
= 11b
....
When (x & 256 == 256), x ^ (x - 1) = ?100000000 ^ (?100000000 - 1)
= ?100000000 ^ ?011111111
= 111111111b
....
When (x == 0), x ^ (x - 1) = 0 ^ 0 - 1
= 0 ^ _1_...
= _1_...
Unfortunately, this clashes with:
When (x & (2 ** (bits - 1)) != 0), x ^ (x - 1) = 1_0_... ^ (1_0_... - 1)
= 1_0_... ^ 0_1_...
= _1_...

To get the number of ones above, just do a population count.
BTW, your lsb() is the x86 BSF (Bit Scan Forward).

e.g. (pop count from amd 22007.pdf)

int lsb (int x){
x = ((x ^ (x - 1)) >> 1) ^ (x << (nbits - 1));
x -= ((x >> 1) & 0x_5_...;
x = (x & 0x_3_...) + ((x >> 2) & 0x_3_...);
x = (x + (x >> 4)) & 0x_0F_...;
x = (x * 0x_01_...) >> (nbits - 8);
return x;
}

i.e. for 32-bit:

int lsb (int x){
x = ((x ^ (x - 1)) >> 1) ^ (x << 31);
x -= ((x >> 1) & 0x55555555;
x = (x & 0x33333333) + (x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = (x * 0x01010101) >> 24;
return x;
}

Michael Brown

unread,
Jun 26, 2003, 5:11:04 PM6/26/03
to
"Ben Peddell" <killer.l...@bigpond.com> wrote in message
news:amEKa.1701$p8.7...@newsfeeds.bigpond.com...

<snip>

Yes, but he's doing a population count of ...000111... though a
multiplication (with an implied truncation of upper bits presumably) and a
right shift. How this works, I'm not too sure ...

--
Michael Brown
www.emboss.co.nz : OOS/RSI software and more :)
Add michael@ to emboss.co.nz - My inbox is always open


Ben Peddell

unread,
Jun 27, 2003, 1:01:52 AM6/27/03
to
Michael Brown wrote:

<snip>

>
> Yes, but he's doing a population count of ...000111... though a
> multiplication (with an implied truncation of upper bits presumably) and a
> right shift. How this works, I'm not too sure ...
>

Have a look at
<http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf>
"AMD Athlon Processor x86 Code Optimization Guide"
Chapter 8, page 136-139 (pages 156-159 in the viewer)
"Efficient Implementation of Population Count Function"

<snip>

| Function popcount() implements a branchless computation of
| the population count. It is based on a O(log(n)) algorithm that
| successively groups the bits into groups of 2, 4, 5, 16, and 32,
| while maintaining a count of set bits in each group The
| algorithms consist of the following steps:
|
| 1) Partition the integer into groups of two bits. Compute the
| population count for each 2-bit group and store the result in the
| 2-bit group. This calls for the following transformation to be
| performed for each 2-bit group:
| 00b -> 00b
| 01b -> 01b
| 10b -> 01b
| 11b -> 10b
| If the original value of a 2-bit group is v, then the new value will
| be v - (v >> 1).
<snip>
| 2) Add the population count of adjacent 2-bit group and store the
| sum to the 4-bit group resulting from merging these adjacent
| 2-bit groups.
<snip>
| Each 4-bit field now has value 0000b, 0001b, 0010b, 0011b, or
| 0100b.
|
| 3) For the first time, the value in each k-bit field is small enough
| that adding two k-bit fields results in a value that still fits in the
| k-bit field. Thus the following computation is performed:
| y = (x + (x >> 4)) & 0x0F0F0F0F
<snip>
| 4) The four 4-bit sums can now be rapidly accumulated by means
| of a multiply with a "magic" multiplier. This can be derived
| from looking at the following chart of partial products:
| 0p0q0r0s * 01010101 =
| :0p0q0r0s
| 0p:0q0r0s
| 0p0q:0r0s
| 0p0q0r:0s
| 000pxxww:vvuutt0s
| Here p, q, r, and s are the 4-bit sums from the previous step, and
| vv is the final result in which we are interested. Thus, the final
| result:
| z = (y * 0x01010101) >> 24
<snip>


Michael Brown

unread,
Jun 27, 2003, 4:15:09 AM6/27/03
to
"Ben Peddell" <killer.l...@bigpond.com> wrote in message
news:t3QKa.1897$p8.8...@newsfeeds.bigpond.com...

> Michael Brown wrote:
>
> <snip>
>
> >
> > Yes, but he's doing a population count of ...000111... though a
> > multiplication (with an implied truncation of upper bits presumably) and
a
> > right shift. How this works, I'm not too sure ...
> >
>
> Have a look at
> <http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf>
> "AMD Athlon Processor x86 Code Optimization Guide"
> Chapter 8, page 136-139 (pages 156-159 in the viewer)
> "Efficient Implementation of Population Count Function"
<snip>

That's a general purpose population count function. The code that you
snipped looked like:

> > > int table[1 << bits] = {...};
> > >
> > > int lsb(int x)
> > > {
> > > // Create mask of ones based on least significant one in x.
> > > x ^= (x - 1);
> > > return table[(x * magic) >> (bits - log2(bits))];
> > > }

He doing a *single* multiplication, a truncation of upper bits, and a left
shift, not multiple multiplies like the AMD docs showed. This is the bit I
don't fully get how it works.

Ben Peddell

unread,
Jun 27, 2003, 9:54:38 PM6/27/03
to

OK.
First, the x ^= (x - 1) does what I showed you - sets the lower n+1 bits
when the lower n bits are clear.

His magic number thing does work. With him going the slightly faster
way, x basically acts as a key to get a code out of the magic number.
Unfortunately, it is extremely difficult to create a magic number for
this, as the entire number is used in getting the code.

If he had x = (x ^ (x - 1)) + 1; then it would have been much easier to
calculate the magic number, as x would select an n-bit section of the
magic number. The magic number would need to have no two n-bit sections
the same.

As to his original post, I don't think that there is any general
algorithm to calculate this easier latter magic number, let alone the
more difficult former one. Also, there are likely many correct magic
numbers for both the former and the latter.


Ben Peddell

unread,
Jun 28, 2003, 10:29:25 AM6/28/03
to

If you change the "x ^= (x - 1)" to "x = (x ^ (x - 1)) + 1", then you
can put the magic numbers together quite easily by hand.

Here's a few that took me a couple of minutes for me to compile by hand:

00011101b (0x1B) for 8 bits
0000111100101101b (0x0F2D) for 16 bits.
00000111110001001100101011011101b (0x07C4CADD) for 32 bits.
0000001111110000100011000101001110011010101110111101001011011001b
(0x03F08C539ABBD2D9) for 64 bits

Notice:

000____111____0____1
0000___1111___00___10___110___1
00000__11111__000__100__1100__1010__11__0__11101
000000_111111_0000_1000_11000_10100_111_00_11010101110111101001011011001

OK. So, I broke the pattern with the 64-bit number. But you see what I'm
getting at, don't you?

When constructing them, ensure that with each bit you add that the
resulting n-bit pattern does not conflict with any other n-bit pattern
(where n is log2(bits))
A small sub-program (algorithm?) could be used to compile these
patterns, as basically no trial-and-error is required.
A tiny bit of thinking ahead may be required though.

Matt Taylor

unread,
Jun 29, 2003, 12:25:40 AM6/29/03
to
"Michael Brown" <s...@signature.below> wrote in message news:<84TKa.53119$JA5.9...@news.xtra.co.nz>...

The AMD docs give you a generic popcount function. I would not call
the algorithm I posted a popcount function. It maps 2^n - 1 to n for
n=0..(sizeof(x) * 8). The upper 5 bits (in the 32-bit version) end up
being unique for all 2^n - 1, so I use this to identify what n was
originally.

I should have also mentioned my starting assumption: assert(x != 0);

-Matt

Matt Taylor

unread,
Jun 29, 2003, 12:17:17 AM6/29/03
to
Ben Peddell <killer.l...@bigpond.com> wrote in message news:<Wp6La.2358$p8.1...@newsfeeds.bigpond.com>...

For mine, I have yet to find multiple keys which produce this unique
table. I am mostly curious as to whether there is even a guarantee
that such a key exists. I don't see how my multipliers can be derived
except trial & error.

I am aware that I'm doing the same thing as a bsf, but the case that I
am particularly interested in is 64-bits, and I can extract the LSB
faster in the average case using my method rather than 2 bsf
instructions.

For 64-bits, I do things a little differently simply because a 64-bit
multiply is slow. I start out with x ^= (x - 1) just like normal which
generates a key equal to 2^n - 1 (where n is the index of the LSB, 1
is index 0). Now fold the 64-bit key into 32-bits by xoring the high
32-bits with the low 32-bits. To convey the general idea, here's what
it looks like with 8 bits:

LSB Key
00000001 - 0000
00000010 - 0001
00000100 - 0011
00001000 - 0111
00010000 - 1111
00100000 - 1110
01000000 - 1100
10000000 - 1000

Perhaps informally you would describe it by saying that the first four
entries shift in ones and the second four shift in zeros.

Now that I have a 32-bit key, I proceed as normal for 32-bits:
multiply by my magic number, shift left by 26 places, and index my
table.

-Matt

Gerd Isenberg

unread,
Jul 11, 2003, 7:00:22 AM7/11/03
to
pa...@tampabay.rr.com (Matt Taylor) wrote in message news:<7a851118.03062...@posting.google.com>...

Hi Matt,

great idea - congrats.

I immediatly played around with your code.
Few remarks, using unsigned mul seems to be give some more magics.
One gets even nice results with single isolated lsb using x & -x.
And your table is a bit huge ;-)


int table[1 << bits] = {...};

I guess your routine is fastest on x86-64, specially if you reset the found bit.

bsf rax,[bb]
btr [bb],rax

are both still vector path ;-)
mul reg64 is double direct path.
About the math - i have some clue....

Regards,
Gerd


some code snippets for msvc:

typedef unsigned __int64 BitBoard;

// used to find magic
// if you pass single populated bitboards,
// the mask creation is not necessary!!
int lsb(BitBoard singlebit, BitBoard magic)
{
return (int)((singlebit * magic) >> (64 - 6));
}

void findMagic64()
{
int count[64];
int bitpos;
int foundcount = 0;
bool found;
BitBoard magicBase = 0x03f08c5392cd5dbd;
BitBoard magic;

for ( magic = magicBase + 2750736; magic >= magicBase; magic--)
{
found = true;
for (bitpos = 0; bitpos < 64; bitpos++)
count[bitpos] = -1;
for (bitpos = 0; bitpos < 64 && found; bitpos++)
{
int bit = lsb(bitmask[bitpos], magic);
found = ( count[bit] < 0 ); // already used ?
count[bit] = bitpos;
}
if ( found )
{
foundcount++;
printf("\n");
for (bitpos = 63; bitpos >= 0; bitpos--)
{
printf("%c", (magic & bitmask[bitpos])
? '1' : '0');
if ( (bitpos & 7) == 0 )
printf(" ");
}

printf("\nconst BitBoard magig = 0x%08x%08x;\nconst int table[64] =\n{",
(int)(magic>>32),
(int)(magic&0xffffffff));
for (bitpos = 0; bitpos < 64; bitpos++)
{
if ( (bitpos & 7) == 0 )
printf("\n ");
printf("%2d, ", count[bitpos]);
}
printf("\n};\n");
}
}
printf("\nmagics found %02d\n", foundcount);
}


int bitScanAndReset(BitBoard &bb)
{
BitBoard lsb = bb & -(__int64)bb;
bb ^= lsb; // reset the found bit
return table[(lsb * magic) >> (64 - 6)];
}


>
> -Matt

0 new messages