Find first set bit after specified index

ttt

unread,

Sep 23, 2009, 6:21:28 PM9/23/09

to

Hi!
I want to find the first set bit in a 64bit number, that has the index
greater or equal to a certain start value.
Example: for the number 0x8080 and start index = 10, the first set bit
is at index 15 (the one found at index 7 is before the start index,
thus invalid).
The program needs to run on a Intel 32bit CPU, so in order to handle
the 64 bit number I use two BSF and some conditional instructions to
get the index.

This is the code (C, not assembly):
static inline int SearchForward(__int64 mask, int start)
int index = 0;

if(start < 32) {
// Check the first 32 bits.
int data = *((int *)&mask) & (0xFFFFFFFF - ((1 << start) -
1));
if(_BitScanForward((unsigned long *)&index, data)) {
return index;
}
else if(_BitScanForward((unsigned long *)&index, *((int *)
&mask + 1))) {
return index + 32; // Add 32 because we are in the last 32
bits.
}
}
else {
// We need to check only the last 32 bits.
int data = *((int *)&mask + 1) & (0xFFFFFFFF - ((1 << (start -
32)) - 1));
if(_BitScanReverse((unsigned long *)&index, data)) {
return index + 32; // Found!
}
}

return -1; // Not found.
}

My question: Is there a way to make this code branchless? And would it
perform faster than my version?
I tried a branchless version, but it was more than twice as slow as
this one... maybe I've done something wrong.
Thanks in advance!

Rod Pemberton

unread,

Sep 24, 2009, 7:47:15 AM9/24/09

to

"ttt" <lgra...@MUNGED.microcosmotalk.com> wrote in message
news:4aba9f68$0$5665$9a6e...@unlimited.newshosting.com...

>
> I want to find the first set bit in a 64bit number, that has the index
> greater or equal to a certain start value.
> Example: for the number 0x8080 and start index = 10, the first set bit
> is at index 15 (the one found at index 7 is before the start index,
> thus invalid).
> The program needs to run on a Intel 32bit CPU, so in order to handle
> the 64 bit number I use two BSF and some conditional instructions to
> get the index.

Although you asked a specific question below, I think your code may have
some other serious problems. So, I'm going to ask some. I hope they don't
sound harsh. They aren't meant to be.

> static inline int SearchForward(__int64 mask, int start)

{

> int index = 0;
>
> if(start < 32) {
> // Check the first 32 bits.
> int data = *((int *)&mask) & (0xFFFFFFFF - ((1 << start) -
> 1));
> if(_BitScanForward((unsigned long *)&index, data)) {

Uh... You said you wanted the "first set bit in a 64bit number". By that,
do you mean the "most significant bit" which is set or the "least
significant bit"? The reason I ask is you're using BSF's (LSB) in this
(32-bit) section, but using a BSR (MSB) in the section below (64-bit ?).
Which is correct? It should be the same instruction to check the set bits
for both sections, yes?

> return index;
> }
> else if(_BitScanForward((unsigned long *)&index, *((int *)
> &mask + 1))) {

I'm unfamiliar with _BitScanForward() and what it returns. But, I'm fairly
sure your if-else is doing something unwanted. It looks to me like you need
to re-compute "data" with an adjusted start value for the mask. Yes?

I.e., the else-if only works if _BitScanForward() can return a zero
(true?) - indicating no set bits were found (yes?). If nothing is found in
the masked "lower" int, you then check the unmasked "upper" int, or
vice-versa (What's the int order in __int64?...). Is that really what you
want to do? If start is 12, you'll check only 12-bits out of the "lower"
int, but if nothing is found in those bits you'll *also* check the entire
"upper" int. Yes? So, you just want to skip checking for set bits in bits
12-31 of the 64-bit value? e.g., your checking these bits (in hex) for
start=12:

1) 0x0000000000000FFF - check if bit set in 0xFFF...
2) 0xFFFFFFFF00000000 - if no set bits in 0xFFF, check upper 32-bits...

> return index + 32; // Add 32 because we are in the last 32
> bits.
> }
> }
> else {

64-bit code section?...

> // We need to check only the last 32 bits.

Huh? Why? What if the "first set bit" is in the "upper" int? In the
32-bit section above, you check "lower" int, and then if no bit is set, you
check the "upper" int... But, here you're only checking *one* of the two
int's in an __int64? Why? Is the code above or code here incorrect?

> int data = *((int *)&mask + 1) & (0xFFFFFFFF - ((1 << (start -
> 32)) - 1));

I'll assume you've got the int order of the two int's in an __int64 correct.
I.e., isn't the second (32-bit) int, i.e., &mask[1], in an __int64 the upper
32-bits of an __int64? My brain is screaming that it's reversed. That
doesn't mean I'm correct... You just need to thoroughly check.

> if(_BitScanReverse((unsigned long *)&index, data)) {
> return index + 32; // Found!
> }
> }
>
> return -1; // Not found.
> }
>

...

> My question: Is there a way to make this code branchless?

You could remove the branch which selects between 32-bit code and 64-bit
code with #ifdef's. A simply done is example below. As for the branch in
the 32-bit section, you'll need to correct your code first to find out.

#define BITS32
/* define BITS 64 */

#ifdef BITS32
/* 32-bit code */
#endif
#ifdef BITS64
/* 64-bit code */
#endif

Your compiler may have defines which indicate which mode is under
compilation, already defined for you. If you don't know what they are, you
can probably find them here:
http://predef.sourceforge.net/precomp.html

HTH,

Rod Pemberton

Wolfgang Kern

unread,

Sep 24, 2009, 11:50:16 AM9/24/09

to

"ttt" wrote:

> Hi!

Hello,

> I want to find the first set bit in a 64bit number, that has the index
> greater or equal to a certain start value.
> Example: for the number 0x8080 and start index = 10, the first set bit
> is at index 15 (the one found at index 7 is before the start index,
> thus invalid).
> The program needs to run on a Intel 32bit CPU, so in order to handle
> the 64 bit number I use two BSF and some conditional instructions to
> get the index.

[snipped C ...]

> My question: Is there a way to make this code branchless?

Yeah, CMPXCHG, CMOVcc and even SETcc can save on conditional branches.

> And would it perform faster than my version?

Can't tell what your compiler makes out of your source, some
compilers may figure the idea behind and do it the smart way,
but I daubt this will work on intrisincs and ASM-parts.

> I tried a branchless version, but it was more than twice as slow as
> this one... maybe I've done something wrong.

You can always blame your compiler for things like this :)

> Thanks in advance!

BSF (even a bit slow) seem to be the rigth instruction for this
task if you don't have the latest SSE instructions available.
Not sure yet if fastest solution, but I'd use BTS mem,r
(works on more than just 32 bits) followed by two NEG mem or use
XMM instructions to create the 64 bit AND-mask from start-index.
__
wolfgang

Jerry Coffin

unread,

Sep 24, 2009, 2:27:41 PM9/24/09

to

In article <4aba9f68$0$5665$9a6e...@unlimited.newshosting.com>,
lgra...@MUNGED.microcosmotalk.com says...

>
> Hi!
> I want to find the first set bit in a 64bit number, that has the index
> greater or equal to a certain start value.

As Rod already pointed out, it's not clear whether you're looking for
the least significant or most significant bit that's set.

Personally, I'd probably start with an SHLD or SHRD (as appropriate)
to simply get rid of the part of the number you don't care about (if
you have a specified number of bits, this will probably be a bit
easier than generating a mask to AND out the bits you don't care
about).

That simplifies the scan, since you always scan from the "beginning"
of the result. Assuming you're starting from the LSB, a first attempt
at the code would look something like this:

; Warning: I just wrote this off the top of my head. It has not been
; tested at all!
;
; expects:
; high word to be scanned in edx
; low word in ebx
; number of bits to ignore in ecx
;
mov esi, ecx
shrd ebx, edx, cl

bsf eax, ebx
jnz done
bsf eax, edx
jz not_found
add eax, 32
done:
add eax, esi
ret
not_found:
mov eax, -1
ret

This obviously isn't branchless, but its two branches are still
substantially fewer than I'd expect from the C code you posted. I
suspect getting rid of all the branches will take quite a bit more
work without gaining a great deal of speed.

--
Later,
Jerry.

Terje Mathisen

unread,

Oct 4, 2009, 2:12:22 PM10/4/09

to

Jerry Coffin wrote:
> In article<4aba9f68$0$5665$9a6e...@unlimited.newshosting.com>,
> lgra...@MUNGED.microcosmotalk.com says...
>>
>> Hi!
>> I want to find the first set bit in a 64bit number, that has the index
>> greater or equal to a certain start value.
>
> As Rod already pointed out, it's not clear whether you're looking for
> the least significant or most significant bit that's set.
>
> Personally, I'd probably start with an SHLD or SHRD (as appropriate)
> to simply get rid of the part of the number you don't care about (if
> you have a specified number of bits, this will probably be a bit
> easier than generating a mask to AND out the bits you don't care
> about).
>
> That simplifies the scan, since you always scan from the "beginning"
> of the result. Assuming you're starting from the LSB, a first attempt
> at the code would look something like this:
>
> ; Warning: I just wrote this off the top of my head. It has not been
> ; tested at all!

BTDT, have lots of buggy code online to prove it. :-(

Anyway, the proper solution is of course to use 64-bit mode...

> ;
> ; expects:
> ; high word to be scanned in edx
> ; low word in ebx
> ; number of bits to ignore in ecx
> ;
> mov esi, ecx
> shrd ebx, edx, cl

This only works if the number of bits to ignore is less than 32!

If you replace the line above with

cmp ecx,32
mov eax,0
cmovae ebx,edx
cmovae edx,eax
shrd ebx,edx,cl

then it should work.

>
> bsf eax, ebx
> jnz done
> bsf eax, edx
> jz not_found
> add eax, 32
> done:
> add eax, esi
> ret
> not_found:
> mov eax, -1
> ret
>
> This obviously isn't branchless, but its two branches are still
> substantially fewer than I'd expect from the C code you posted. I
> suspect getting rid of all the branches will take quite a bit more
> work without gaining a great deal of speed.

Almost certainly true.

The cost of branches depend totally on how predictable they are, so you
have to check your input data!

You can handle the initial masking with a simple table lookup:

and ebx,keepmask[ecx*8]
and edx,keepmask[ecx*8+4]

If the first AND results in zero, then you know that there are no bits
in the first word, and so you don't need the second half at all.

Otherwise only the second word matters:

; expects:
; high word to be scanned in edx
; low word in ebx
; number of bits to ignore in ecx

xor esi,esi ;; Word offset, 0 or 32
and ebx,keepmask[ecx*8] ;; ECX has # of bits to ignore
jnz first_word

;; The first word is zero after masking, so we must test the second:
mov eax,-1 ;; EAX = 1, i.e. assume not found:
and edx,keepmask[ecx*8+4]
jz done

;; OK, we know that there is at least one one-zero bit in
;; the second word:
mov ebx,edx
mov esi,32
first_word:
bsf eax,ebx
add eax,esi
done:
ret

This code is probably close to optimal unless the 64-entry table will
cost too much L1 cache space.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Jerry Coffin

unread,

Oct 9, 2009, 6:20:53 PM10/9/09

to

In article <4ac8e586$0$5682$9a6e...@unlimited.newshosting.com>,
Terje.M...@MUNGED.microcosmotalk.com says...

[ ... ]

> > ; Warning: I just wrote this off the top of my head. It has not
> > ; been tested at all!
>
> BTDT, have lots of buggy code online to prove it. :-(

Ah, a nice diplomatic way of warning that my code was buggy! :-)

> Anyway, the proper solution is of course to use 64-bit mode...

Well, yes, that's clearly the preferred way to go at this point.

> This only works if the number of bits to ignore is less than 32!

Oops -- quite right.

> If you replace the line above with
>
> cmp ecx,32
> mov eax,0
> cmovae ebx,edx
> cmovae edx,eax
> shrd ebx,edx,cl
>
> then it should work.

I guess you could use:
xor eax, eax
cmp ecx, 32
cmovae ebx, edx
cmovae edx, eax
shrd ebx, edx, cl

but either way it kind of ruins what was a nice, clean approach. Sad
how reality ruins code that _would_ have been really nice...

[ ... ]

> > This obviously isn't branchless, but its two branches are still
> > substantially fewer than I'd expect from the C code you posted. I
> > suspect getting rid of all the branches will take quite a bit more
> > work without gaining a great deal of speed.
>
> Almost certainly true.
>
> The cost of branches depend totally on how predictable they are, so you
> have to check your input data!
>
> You can handle the initial masking with a simple table lookup:

Quite true, though with a good possibility of a cache miss unless
you're using this in a pretty tight loop. You can do a lot in the CPU
in the time it takes to read one item from main memory.

[ ... ]

> This code is probably close to optimal unless the 64-entry table
> will cost too much L1 cache space.

Yes. I'm afraid I still tend to think in terms of older processors,
when bsf and (especialy BSR) were quite slow, and the farther they
had to search, the slower they got. Those same processors had barrel
shifters though, so shifting instead of masking was a big win. With a
modern processor, however, you're almost certainly better off letting
bs[f|r] do its thing, with the simplest code around it you can
manage.

--
Later,
Jerry.

James Harris

unread,

Oct 10, 2009, 1:35:49 AM10/10/09

to

On 4 Oct, 19:12, Terje Mathisen

<Terje.Mathi...@MUNGED.microcosmotalk.com> wrote:
> Jerry Coffin wrote:

> > In article<4aba9f68$0$5665$9a6e1...@unlimited.newshosting.com>,
> > lgrat...@MUNGED.microcosmotalk.com says...

>
> >> Hi!
> >> I want to find the first set bit in a 64bit number, that has the index
> >> greater or equal to a certain start value.

...

> > ;
> > ; expects:
> > ; =A0high word to be scanned in edx
> > ; =A0low word in ebx
> > ; =A0number of bits to ignore in ecx
> > ;
> > =A0 =A0mov esi, ecx
> > =A0 =A0shrd ebx, edx, cl

>
> This only works if the number of bits to ignore is less than 32!
>
> If you replace the line above with
>

> =A0 =A0 =A0 =A0 cmp ecx,32
> =A0 =A0 =A0 =A0 mov eax,0
> =A0 =A0 =A0 =A0 cmovae ebx,edx
> =A0 =A0 =A0 =A0 cmovae edx,eax
> =A0 =A0 =A0 =A0 shrd ebx,edx,cl
>
> then it should work.

That sets up ebx but isn't a shift also needed to set up edx? If, for
example, ecx is 8 and the double shift clears ebx the code below will
still need edx to have been shifted.

> > =A0 =A0bsf eax, ebx
> > =A0 =A0jnz done
> > =A0 =A0bsf eax, edx
> > =A0 =A0jz not_found
> > =A0 =A0add eax, 32
> > done:
> > =A0 =A0add eax, esi
> > =A0 =A0ret
> > not_found:
> > =A0 =A0mov eax, -1
> > =A0 =A0ret

>
> > This obviously isn't branchless, but its two branches are still
> > substantially fewer than I'd expect from the C code you posted. I
> > suspect getting rid of all the branches will take quite a bit more
> > work without gaining a great deal of speed.
>
> Almost certainly true.

And more... IIRC from another discussion either here or in
alt.lang.asm branches were found to be faster than the conditional
instructions. In the case in point the OP found that although he
thought his data were random the version of the code with branches ran
quicker than that with cmov etc. Of course, it may depend on both the
data and the model of CPU but it seems from the discussion that
there's a distinction between non-organised and truly random.

Bottom line: don't be surprised if code with branches is faster than
code with conditional moves and sets, even if the input data looks
random.

>
> The cost of branches depend totally on how predictable they are, so you
> have to check your input data!

Come back branches, all is forgiven!

James