Re: Finding set bit in qword.

spam...@crayne.org

unread,

Feb 16, 2006, 3:11:57 AM2/16/06

to

>Hi Guys,
>
>Just wondering what the quickest algorithm is to find a set bit within
>a qword, and also that only 1 bit is set within the qword?
>
>The function, should return the number of the bit (minus 1), and if no
>or more than 1 bit is set, then return 0.
>
>
Do you mean return the index of the set bit when there is only one bit
is set?

<snip>

>Worst case, that the loop runs through 64 times (eg bit 63 set), but is
>there a way to improve the function.
>
>PS. r0 = rax, r1 = rbx. (target = AMD64)
>
>
>
I have no idea AMD64 can using BSF in qword or not?
You can try this algorithm

if r1 == 0
return 0 ;no bit set
else if( (r1 -1) & r1) != 0
return 0 ;more than one bit is set
else
BSF r0, r1

Hong

Grumble

unread,

Feb 16, 2006, 5:08:45 PM2/16/06

to

X wrote:
> Just wondering what the quickest algorithm is to find a set bit
> within a qword, and also that only 1 bit is set within the qword?

Consider the expression (n & (n-1))

BSF reg64, reg64 takes 9 cycles on Hammer.

BSR reg64, reg64 takes 11 cycles.

spam...@crayne.org

unread,

Feb 16, 2006, 4:40:35 PM2/16/06

to

Y...@t.Hon9 wrote:
> spam...@crayne.org wrote:
>
> >Hi Guys,
> >
> >Just wondering what the quickest algorithm is to find a set bit within
> >a qword, and also that only 1 bit is set within the qword?

To the OP: You might Google this newsgroup. That question has been
asked a couple of times in the past. I don't remember the answer, but
it involves subtracting one, which should yield a set of one bits all
the up to, but not including the one set bit. A few other operations,
and you've got your answer (is there only one set bit?).

> I have no idea AMD64 can using BSF in qword or not?

Sure it does. Indeed, in memory, the lowly 386 can do this. But BSF on
memory is *not* fast.

> You can try this algorithm
>
> if r1 == 0
> return 0 ;no bit set
> else if( (r1 -1) & r1) != 0
> return 0 ;more than one bit is set
> else
> BSF r0, r1

Past posts have been something like this. But they didn't use the BSF
instruction (which is slow). Instead, the bit mask was somehow used to
compute the index, IIRC. I may be wrong on that.
Cheers,
Randy Hyde

Maarten Kronenburg

unread,

Feb 16, 2006, 1:55:01 PM2/16/06

to

BSF and BSR should also work for 64-bit, see
Intel Instruction Set Reference on
http://developer.intel.com/design/Pentium4/documentation.htm
and
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf
The bit count of the first bit is 0 (zero).
When the argument is zero, the zero flag is set.
BSF and BSR are much faster than programming
it in whatever high level language.
But BSF and BSR are slower than MUL, so your
trick will in general not yield faster code, unless
your multiplication value is always the same constant.

<spam...@crayne.org> wrote in message
news:1140062288.4...@f14g2000cwb.googlegroups.com...

> Hi Guys,
>
> Just wondering what the quickest algorithm is to find a set bit within
> a qword, and also that only 1 bit is set within the qword?
>

> The function, should return the number of the bit (minus 1), and if no
> or more than 1 bit is set, then return 0.
>

> (This is for a quick test to do quick multiply function, so if a shift
> can replace multiply, then use the shift instead. So if I need to
> multiply by 16, then shift 4, instead of doing a mul. Part of of
> rewriting my compiler in asm).
>
> So far I have:
> v_size_is_p2:
> push r1
> xor r0, r0
> mov r1, qword [r6+_B0_v_size_is_p2__size]
> test r1, r1
> je .B0_END_BLOCK_0000101
> .B0_END_BLOCK_0000100:
> shr r1, 1
> jnc .B0_END_BLOCK_0000102
> test r1, r1
> je .B0_END_BLOCK_0000103
> pop r1
> xor r0, r0
> ret
> jmp .B0_END_BLOCK_0000104
> .B0_END_BLOCK_0000103:
> sub r0, 1
> .B0_END_BLOCK_0000104:
> .B0_END_BLOCK_0000102:
> add r0, 1
> test r1, r1
> jne .B0_END_BLOCK_0000100
> .B0_END_BLOCK_0000101:
> pop r1
> ret
>
> or in b0:
> proc v_size_is_p2(_size){
> push r1;
> r0 = r0 ^ r0; // Zero our count
> r1 = _size; // Get our variable size
> while(r1){ // While our size is still above 0
> r1 = r1 >> 1; // Shift it to the right 1 bit
> if(%CARRY){ // If the carry flag is set then we encountered
> // a bit
> if(r1){ // If there are still other bits set, then our
> //size isn't power of 2
> pop r1;
> r0 = r0 ^ r0;
> return(r0); // Return 0 in that case
> } else {
> r0 = r0 - 1; // Since this is the only bit, do the 1 off
> // adjustment
> // This is becuase we want the number of 0's
> // before our 1,
> // not the position of the 1.
> }
> }
> r0 = r0 + 1; // Increase our loop count
> // if there are no more bits in our size, we exit
> // the loop
> }
> pop r1;
> return(r0); // Exit with our loop count.

spam...@crayne.org

unread,

Feb 16, 2006, 8:06:07 PM2/16/06

to

Thanks to everyone that posted the solution...

--
Darran (aka Chewy509) brought to you by Google Groups!

Gerd Isenberg

unread,

Feb 17, 2006, 6:18:03 AM2/17/06

to

Yes, multiplication of the isolated bit with a De Bruijn sequence looks
like a fast bsf-replacement for x86-64:

http://supertech.csail.mit.edu/papers/debruijn.pdf

A small collection of my bitScan-repository, sorry for posting C ;-)

typedef unsigned __int64 BitBoard; // unsigned long long

const BitBoard magic = 0x03f79d71b4cb0a89;
const unsigned int magictable[64] =
{
0, 1, 48, 2, 57, 49, 28, 3,
61, 58, 50, 42, 38, 29, 17, 4,
62, 55, 59, 36, 53, 51, 43, 22,
45, 39, 33, 30, 24, 18, 12, 5,
63, 47, 56, 27, 60, 41, 37, 16,
54, 35, 52, 21, 44, 32, 23, 11,
46, 26, 40, 15, 34, 20, 31, 10,
25, 14, 19, 9, 13, 8, 7, 6,
};

unsigned int bitScanForward (BitBoard b) {
return magictable[((b&-b)*magic) >> 58];
}

Matt Taylor's folding trick is also quite nice, specially for 32-bit
mode:

// precondition bb != 0
int bitScanForward(BitBoard bb)
{
bb ^= (bb - 1);
unsigned int foldedLSB = ((int)bb) ^ ((int)(bb>>32));
return lookUp[(foldedLSB * 0x78291ACF) >> (32-6)]; // range is 0..63
}

double conversion - i guess not that portable due to the bitfield
union:

// return index 0..63 of LS1B
// -1023 if passing zero
int bitScanForward(BitBoard bb)
{
union {
double d;
struct {
unsigned int mantissal : 32;
unsigned int mantissah : 20;
unsigned int exponent : 11;
unsigned int sign : 1;
};
} ud;
ud.d = (double)(bb & -bb); // isolated LS1B to double
return ud.exponent - 1023;
}

also for bsr:

// return index 0..63 of MS1B
// -1023 if passing zero
int bitScanReverse(BitBoard bb)
{
union {
double d;
struct {
unsigned int mantissal : 32;
unsigned int mantissah : 20;
unsigned int exponent : 11;
unsigned int sign : 1;
};
} ud;
ud.d = (double)(bb & ~(bb >> 32)); // avoid rounding error
return ud.exponent - 1023;
}

- Gerd

Rod Pemberton

unread,

Feb 18, 2006, 4:53:10 PM2/18/06

to

"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:1140175083.8...@z14g2000cwz.googlegroups.com...

This one comes from Larry Osterman's of Microsoft blog :

{
bitcount = 0;
while (value != 0)
{
bitcount += 1;
value = value & (value - 1);
}
return bitcount;
}

http://blogs.msdn.com/larryosterman/archive/2005/10/13/480758.aspx

Rod Pemberton

Gerd Isenberg

unread,

Feb 20, 2006, 4:05:01 AM2/20/06

to

Yes, the well known popcounting approach - only efficient for rarely
populated sets - the branchless SWAR-Approach is faster.
To become on topic, popcounting might be used for bitscanning as well
if counting leading (or trailing) ones:

bsf ::= 63 - popCount64 (x ^ -x);

- Gerd

Matt

unread,

Feb 20, 2006, 8:51:40 AM2/20/06

to

Gerd Isenberg wrote:
> Yes, multiplication of the isolated bit with a De Bruijn sequence
> looks like a fast bsf-replacement for x86-64:

On K8, the 32-bit BSF instruction requires 8 cycles, and the 64-bit BSF
instruction requires 9 cycles. The P4 family has similar timings in 32-bit,
though I am unsure about 64-bit. I do not know about the Pentium-M, but it
may well keep the BSF/BSR circuit that is to be found in other P6-core
chips. BSF/BSR ran in 2 cycles on those CPUs. Pentium-M does not support
64-bit, but a 64-bit BSF can be easily emulated using cmp+cmovcc+bsf.

Having looked at this problem extensively, I am not sure that it is possible
to beat the 64-bit BSF instruction. The best I ever did was match it. First
of all, on *any* recent Intel CPU the BSF/BSR instructions will be *much*
faster. Why? Because the multiply itself is slower than the BSF/BSR
instructions. Other approaches such as SWAR also break down because the P4
handles flags very poorly. Again, I have not tested Pentium-M, but I'm
pretty sure it has the BSF/BSR circuit, and in that case these instructions
are nearly as cheap as ALU operations. Therefore it only makes sense to try
and optimize this on an AMD CPU.

On Athlon, the timings work out such that the 32-bit De Bruijn sequence is a
hair slower than the BSF/BSR instructions assuming that you get a cache hit.
BSF runs in 8 cycles. The index takes 7 cycles to compute if you are lucky,
and the table look-up is an additional 3 cycles. Walter Faxon's table lookup
routine is different but has the same characteristic timing. When you go to
64-bits, the additional overhead makes the two approaches roughly identical.
The folding trick, which may simply happen to be a matter of coincidence
(?), saves the 64-bit multiply and has timing similar to that of the
microcode. If you inline any of these routines, you *may* realize a small
savings over the microcode simply because these routines have an IPC close
to 1.

On the Opteron and other K8-generation CPUs, multiply becomes cheaper, and
IIRC it works out that the folding trick is still preferable even though the
registers are 64-bits wide; a 32x32 multiply is still faster than a 64x64
multiply. With the folding trick, the timing is 9 cycles -- the same as the
microcoded instruction, but the code is much larger.

When we were brainstorming, I also tested two other techniques, but they
will never beat the aforementioned approaches. One was SWAR, and though it
was fast, it wasn't fast enough. IIRC it was about 50% slower. With
additional registers on x86-64, it may be possible to squeeze a couple more
cycles out, but it will still be really slow and really large in comparison.
Each iteration is an and+cmp+adc -- the bare minimum. At maximum efficiency,
this is 6 cycles throughput with a latency of 8, and there are also 2 cycles
of set-up time which may or may not be able to be hidden around a memory
access. The other technique is the popcounting technique. My memory fails,
but I don't recall it being particularly fast either.

-Matt