efficient approach to check a set bit

sanjib

unread,

Dec 4, 2009, 2:12:44 AM12/4/09

to

Friends,
I need a help to clear about following query. Seeking the help from
the group to solve and clear my doubts.

1. What might be an efficient and fastest way to check set bits of an
integer ? Suppose I have one bit set out of 32 bits of an integer.
(I can think of K&R approach i.e. based on iterations which is
equal to the no of set bits in an integer.)

Thanks in advanced

Regards
Sanjib

pete

unread,

Dec 4, 2009, 2:48:15 AM12/4/09

to

For unsigned int U:

#define READ_UBIT(U, N) (1u & (U) >> (N))

--
pete

Paul

unread,

Dec 4, 2009, 4:16:16 AM12/4/09

to

"sanjib" <sanji...@gmail.com> wrote in message
news:46ccbb20-9bed-4afa...@u36g2000prn.googlegroups.com...

If you definitely only have one bit set, you could try using a switch/
case, and hope that the compiler makes it a fast one.

Otherwise, you could try some sort of binary search, or perhaps
cast to an array of four bytes and check each one is not zero,
before you loop through it.

I'm not sure this saves much time for a 32 bit integer though.

P.

Riccardo Manfrin

unread,

Dec 4, 2009, 8:17:28 AM12/4/09

to

>> 1. What might be an efficient and fastest way to check set bits of an
>> integer ? Suppose I have one bit set out of 32 bits of an integer.
>> (I can think of K&R approach i.e. based on iterations which is
>> equal to the no of set bits in an integer.)

From a theoretical point of view:
you're searching withing a sparse space (an array in this case). Lots of
methods exist to do the job.
What comes to my mind is adaptations of sorting algorithms: e.g.:

0) N = 16
1) take your int and shift it by N bits right
2) is the value > 0 ?
3) YES => (N = N/2; get back to 1) )
4) (NO) shift you int by N bits [left]
5) is the value > 0 ?
6) YES => (N = N/2; get back to 4) )

Pseudocode errors aside, summing up all the shifts you get to the
position of the set bit within less than log2(32) = 5 checks...
Or maybe I'm just saying bullshit as usual.. at least I tried ;)
R

Ben Bacarisse

unread,

Dec 4, 2009, 8:48:53 AM12/4/09

to

"Paul" <-> writes:

> "sanjib" <sanji...@gmail.com> wrote in message
> news:46ccbb20-9bed-4afa...@u36g2000prn.googlegroups.com...
>> Friends,
>> I need a help to clear about following query. Seeking the help from
>> the group to solve and clear my doubts.
>>
>> 1. What might be an efficient and fastest way to check set bits of an
>> integer ? Suppose I have one bit set out of 32 bits of an integer.
>> (I can think of K&R approach i.e. based on iterations which is
>> equal to the no of set bits in an integer.)
>
> If you definitely only have one bit set, you could try using a switch/
> case, and hope that the compiler makes it a fast one.

If there really is only one bit set another possibility is:

int bit_set(uint32_t x)
{
return !(0x55555555 & x) +
(!(0x33333333 & x) << 1) +
(!(0x0f0f0f0f & x) << 2) +
(!(0x00ff00ff & x) << 3) +
(!(0x0000ffff & x) << 4);
}

but i don't think that is what the OP wants. I think they want to
iterate through all the 1s in an unsigned int (well, lets hope it's
unsigned). I think the K&R reference might be to using x & (~x + 1).
The two can be combined of course to print the numbers corresponding
to bits set:

while (x) {
uint32_t low_bit = x & (~x + 1);
printf(" %d", bit_set(low_bit));
x &= ~low_bit;
}

--
Ben.

Mark Dickinson

unread,

Dec 4, 2009, 9:17:16 AM12/4/09

to

On Dec 4, 9:16 am, "Paul" <-> wrote:
> "sanjib" <sanjibkd...@gmail.com> wrote in message

> > 1. What might be an efficient and fastest way to check set bits of an
> > integer ? Suppose I have one bit set out of 32 bits of an integer.
> > (I can think of K&R approach i.e. based on iterations which is
> > equal to the no of set bits in an integer.)
>
> If you definitely only have one bit set, you could try using a switch/
> case, and hope that the compiler makes it a fast one.

From the evil tricks department, assuming you really do have exactly
one bit set: the values 1, 2, 4, ..., 1<<31 are all distinct modulo
37,
so simply reduce modulo 37 and then use a lookup table.

--
Mark

mohangupta13

unread,

Dec 4, 2009, 10:48:56 AM12/4/09

to

On Dec 4, 6:48 pm, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
> "Paul" <-> writes:
> > "sanjib" <sanjibkd...@gmail.com> wrote in message

Ben can you kindly explain how all this stuff
is working, it looks quite arcane to me (still a young programmer ).
Thanks
Mohan

sanjib

unread,

Dec 4, 2009, 11:20:15 AM12/4/09

to

On Dec 4, 2:16 pm, "Paul" <-> wrote:
> "sanjib" <sanjibkd...@gmail.com> wrote in message

> > Sanjib- Hide quoted text -
>
> - Show quoted text -

Hi Paul
Will you please kindly explain about the use of binary search
technique for this.(I am Still a learner)
Thanks in advanced.

sanjib

Richard Bos

unread,

Dec 4, 2009, 11:30:59 AM12/4/09

to

Mark Dickinson <dick...@gmail.com> wrote:

> On Dec 4, 9:16=A0am, "Paul" <-> wrote:
> > "sanjib" <sanjibkd...@gmail.com> wrote in message
> > > 1. What might be an efficient and fastest way to check set bits of an
> > > integer ? Suppose I have one bit set out of 32 bits of an integer.

> > > =A0 (I can think of K&R approach i.e. based on iterations which is

> > > equal to the no of set bits in an integer.)
> >
> > If you definitely only have one bit set, you could try using a switch/
> > case, and hope that the compiler makes it a fast one.
>
> From the evil tricks department, assuming you really do have exactly
> one bit set: the values 1, 2, 4, ..., 1<<31 are all distinct modulo
> 37, so simply reduce modulo 37 and then use a lookup table.

That's a _very_ evil trick. I like it.

Richard

bartc

unread,

Dec 4, 2009, 11:31:32 AM12/4/09

to

"sanjib" <sanji...@gmail.com> wrote in message
news:46ccbb20-9bed-4afa...@u36g2000prn.googlegroups.com...

This is another approach if you have an integer that you know has exactly 1
bit set:

#include <math.h>
int findsetbit(unsigned int a) {
static double ilg2=1.0/log(2);
return round(log(a)*ilg2);
}

I doubt it's fast or efficient though.

Will the number in general have any number of bits set?

And how do you want the results presented? Anything involving an array will
probably require more work in setting up and traversing, let alone doing
anything with the information, than simply shifting and masking the integer
to get at the positions.

Probably the most compact way of recording the bit positions is to use a map
of '1' bits; in other words, do nothing!

--
Bartc

Antoninus Twink

unread,

Dec 4, 2009, 12:40:25 PM12/4/09

to

On 4 Dec 2009 at 16:30, Richard Bos wrote:

> Mark Dickinson <dick...@gmail.com> wrote:
>> From the evil tricks department, assuming you really do have exactly
>> one bit set: the values 1, 2, 4, ..., 1<<31 are all distinct modulo
>> 37, so simply reduce modulo 37 and then use a lookup table.
>
> That's a _very_ evil trick. I like it.

It's all very clever - great if you're looking to impress geeky chicks
- but in practice I'd be extremely surprised if it beats Ben
Becarisse's bit-twiddling solution, which just has a few ands, shifts
and adds.

Even if you're doing lots of these at once, so that the LUT will usually
be available in cache, an integer division and a memory access will
probably be expensive compared to a dozen or so basic arithmetic and
logical operations.

Ben Bacarisse

unread,

Dec 4, 2009, 12:46:26 PM12/4/09

to

ral...@xs4all.nl (Richard Bos) writes:

And, as it happens, if you need to do the same for 64 bits, the
smallest modulus is 67 -- the 7 making both 37 and 67 reasonably easy
to remember.

In case anyone cares, the array contents would be:

0, 0, 1, 39, 2, 15, 40, 23, 3, 12, 16, 59, 41, 19, 24, 54,
4, 0, 13, 10, 17, 62, 60, 28, 42, 30, 20, 51, 25, 44, 55, 47,
5, 32, 0, 38, 14, 22, 11, 58, 18, 53, 63, 9, 61, 27, 29, 50,
43, 46, 31, 37, 21, 57, 52, 8, 26, 49, 45, 36, 56, 7, 48, 35,
6, 34, 33

--
Ben.

Willem

unread,

Dec 4, 2009, 1:22:23 PM12/4/09

to

Ben Bacarisse wrote:
) And, as it happens, if you need to do the same for 64 bits, the
) smallest modulus is 67 -- the 7 making both 37 and 67 reasonably easy
) to remember.

Question for the mathematically inclined: Does it work iff the
modulus has no factors smaller than the number of bits you need ?

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Antoninus Twink

unread,

Dec 4, 2009, 1:30:22 PM12/4/09

to

On 4 Dec 2009 at 18:22, Willem wrote:
> Question for the mathematically inclined: Does it work iff the
> modulus has no factors smaller than the number of bits you need ?

What do you mean by "it"?

The "trick" is: given k, find a small integer m such that 1, 2, 4, ...,
2^k all have distinct reductions modulo m. Such an m always exists, of
course (e.g. m = 2^k is a not very helpful example).

Is your question when m can be chosen to be not much larger than k, as
in the k = 31 and k = 63 examples in this thread? I don't immediately
see a way to make a statement along those lines that's both precise and
meaningful.

Ben Bacarisse

unread,

Dec 4, 2009, 1:35:42 PM12/4/09

to

mohangupta13 <mohang...@gmail.com> writes:

[asking for an explanation...]

> On Dec 4, 6:48 pm, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
>> "Paul" <-> writes:
>> > "sanjib" <sanjibkd...@gmail.com> wrote in message
>> >news:46ccbb20-9bed-4afa...@u36g2000prn.googlegroups.com...
>> >> Friends,
>> >> I need a help to clear about following query. Seeking the help from
>> >> the group to solve and clear my doubts.
>>
>> >> 1. What might be an efficient and fastest way to check set bits of an
>> >> integer ? Suppose I have one bit set out of 32 bits of an integer.
>> >> (I can think of K&R approach i.e. based on iterations which is
>> >> equal to the no of set bits in an integer.)
>>
>> > If you definitely only have one bit set, you could try using a switch/
>> > case, and hope that the compiler makes it a fast one.
>>
>> If there really is only one bit set another possibility is:
>>
>> int bit_set(uint32_t x)
>> {
>> return !(0x55555555 & x) +
>> (!(0x33333333 & x) << 1) +
>> (!(0x0f0f0f0f & x) << 2) +
>> (!(0x00ff00ff & x) << 3) +
>> (!(0x0000ffff & x) << 4);
>> }

This is, in essence, the binary search that has been described in
words but with the loop unrolled.

Start with the last term. 0x0000ffff & x is non zero when the bit in
x is in the bottom half. The ! inverts this test and gives 1 when the
bit is in the top half and 0 otherwise. The << 4 turn this 1/0 into
16 or 0 and, lo, all this xs whose bit is in the top half should
return a number > 16.

The previous term does the same but looking for the bit in the top
half of either half. The result being to add 8 or 0 to the sum
depending on which quart of the bits the single set bit is in.

Each of the other terms adds another component to the sum depending on
finer and finder details about position of the bit we are looking for.

>> but i don't think that is what the OP wants. I think they want to
>> iterate through all the 1s in an unsigned int (well, lets hope it's
>> unsigned). I think the K&R reference might be to using x & (~x + 1).
>> The two can be combined of course to print the numbers corresponding
>> to bits set:
>>
>> while (x) {
>> uint32_t low_bit = x & (~x + 1);
>> printf(" %d", bit_set(low_bit));
>> x &= ~low_bit;
>> }

~x flips all the (value) bits in the unsigned int. A run on zeros at
the least significant end of x turns into a run of 1s. For example,
if x is 100011000, ~x is 011100111. Adding 1 to this will put the
zeros back again and leave a 1 where the least significant bit was
set: 011101000. The "and" of x and this value will have only one bit
set -- the least significant one: 000001000. After printing the
positional number of this bit (3) the x &= ~low_bit clears that bit
from x (~low_bit == 111110111 so x & ~low_bit is 100010000) so the
loop can then find the next least significant bit.

<snip>

> Ben can you kindly explain how all this stuff
> is working, it looks quite arcane to me (still a young programmer ).

I hope that helps.

--
Ben.

bartc

unread,

Dec 4, 2009, 2:38:52 PM12/4/09

to

"Antoninus Twink" <nos...@nospam.invalid> wrote in message
news:slrnhhiic9...@nospam.invalid...

On my machine the modulo 37 method was 2 to 5 times as slow as the shift and
mask one.

--
bartc

Mark Dickinson

unread,

Dec 4, 2009, 2:57:21 PM12/4/09

to

On Dec 4, 6:22 pm, Willem <wil...@stack.nl> wrote:
> Ben Bacarisse wrote:
>
> ) And, as it happens, if you need to do the same for 64 bits, the
> ) smallest modulus is 67 -- the 7 making both 37 and 67 reasonably easy
> ) to remember.
>
> Question for the mathematically inclined: Does it work iff the
> modulus has no factors smaller than the number of bits you need ?

No. As an example, there are only 20 distinct powers of 2 modulo 41,
so this wouldn't work if 37 were replaced with 41.

For 1, 2, 4, ..., 2^(n-1) to be distinct modulo p all you need is that
the order of 2 modulo p is at least n. A nice sufficient condition is
that p is a prime > n and 2 is a primitive root modulo p (in other
words, 1, 2, ..., 2^(p-1) are all distinct modulo p). Since
(conjecturally) around 37.4% of primes p have this property[1], you
generally don't have to look very far beyond n before finding one.

--
Mark

[1] http://en.wikipedia.org/wiki/Artin's_conjecture_on_primitive_roots

Mark Dickinson

unread,

Dec 4, 2009, 3:03:15 PM12/4/09

to

On Dec 4, 5:40 pm, Antoninus Twink <nos...@nospam.invalid> wrote:
> It's all very clever - great if you're looking to impress geeky chicks
> - but in practice I'd be extremely surprised if it beats Ben
> Becarisse's bit-twiddling solution, which just has a few ands, shifts
> and adds.
>
> Even if you're doing lots of these at once, so that the LUT will usually
> be available in cache, an integer division and a memory access will
> probably be expensive compared to a dozen or so basic arithmetic and
> logical operations.

Hmm, yes. Oh well. I'll file it in the 'fun but useless' folder,
then.

Mark

Beej Jorgensen

unread,

Dec 4, 2009, 4:33:54 PM12/4/09

to

Antoninus Twink <nos...@nospam.invalid> wrote:
>- but in practice I'd be extremely surprised if it beats Ben
>Becarisse's bit-twiddling solution, which just has a few ands, shifts
>and adds.

Relative runtimes on a quick-and-dirty benchmark on my Athlon64 (I'm
presuming the LUT was cached):

unoptimized -O2
Ben: 1.0 1.0
Mod37: 1.4 0.7

Looking at the assembly, gcc replaced the mod 37 with some hacker
goodies (not a div).

-Beej

Ben Bacarisse

unread,

Dec 4, 2009, 4:45:56 PM12/4/09

to

Mark Dickinson <dick...@gmail.com> writes:

I timed my bit-twiddling version against the mod 37 table look-up
solution and mine was (almost) a factor of 2 slower on a x86_64 laptop
(gcc 4.4.1). The look-up was faster by the same factor at various
optimisation levels.

--
Ben.

Antoninus Twink

unread,

Dec 4, 2009, 5:25:39 PM12/4/09

to

On 4 Dec 2009 at 21:33, Beej Jorgensen wrote:
> unoptimized -O2
> Ben: 1.0 1.0
> Mod37: 1.4 0.7
>
> Looking at the assembly, gcc replaced the mod 37 with some hacker
> goodies (not a div).

Clever old gcc!

If it was doing an integer division (as it surely is in the unoptimized
version), that's probably 50 cycles plus, which would easily outweigh
Ben's few masks and shifts.

Assuming your test consists of calling the routine thousands of times in
succession, the LUT will easily fit in L1 cache, so we're only talking a
couple of cycles for each memory access once it's been cached the first
time.

Beej Jorgensen

unread,

Dec 4, 2009, 7:53:11 PM12/4/09

to

Antoninus Twink <nos...@nospam.invalid> wrote:
>If it was doing an integer division (as it surely is in the unoptimized
>version)

There's a mult in the unoptimized version, but surprisingly no div.
My x86-fu is weak, so I can't really decode it, but I know there are no
DIVs in there. :) Lotsa moves, a mult, four shifts, two adds, and two
subtracts.

Most fun is the use of the constant 0xBACF914D in the mod calculation.
They multiply that by the value to lookup, and discard the low 32-bits
of the result. Then from that they subtract the value to lookup, and
shift the result right by one. From there, things start to get strange,
until, 11 instructions later, they magically end up with an index into
the lookup table.

In short, I'm very glad that people specialize in exactly this thing. :)

>Assuming your test consists of calling the routine thousands of times in
>succession, the LUT will easily fit in L1 cache, so we're only talking a
>couple of cycles for each memory access once it's been cached the first
>time.

Yup.

-Beej

Ian Collins

unread,

Dec 4, 2009, 8:59:30 PM12/4/09

to

Beej Jorgensen wrote:
> Antoninus Twink <nos...@nospam.invalid> wrote:
>> If it was doing an integer division (as it surely is in the unoptimized
>> version)
>
> There's a mult in the unoptimized version, but surprisingly no div.
> My x86-fu is weak, so I can't really decode it, but I know there are no
> DIVs in there. :) Lotsa moves, a mult, four shifts, two adds, and two
> subtracts.

Sun CC does much the same.

> Most fun is the use of the constant 0xBACF914D in the mod calculation.
> They multiply that by the value to lookup, and discard the low 32-bits
> of the result. Then from that they subtract the value to lookup, and
> shift the result right by one. From there, things start to get strange,
> until, 11 instructions later, they magically end up with an index into
> the lookup table.

Yes, same constant, similar jiggery pokery.

> In short, I'm very glad that people specialize in exactly this thing. :)

!

--
Ian Collins

James Dow Allen

unread,

Dec 5, 2009, 5:16:56 AM12/5/09

to

On Dec 4, 9:17 pm, Mark Dickinson <dicki...@gmail.com> wrote:
> From the evil tricks department, assuming you really do have exactly
> one bit set: the values 1, 2, 4, ..., 1<<31 are all distinct modulo
> 37, so simply reduce modulo 37 and then use a lookup table.

This is shown (in a slightly disguised form) in
Bit Twiddling Hacks
By Sean Eron Anderson
Anderson discovered it but "found it was invented earlier
by Reiser, according to Hacker's Delight."

Anderson also shows another way using multiply instead
of divide, though *not* the replace-divide-with-multiply
mentioned downthread.

Speaking of that replace-divide-with-multiply; it
doesn't work when the divisor is variable, but
what if it's variable at compile-time
but fixed early during initialization? Is any
comp.programmer as masochisticly perfectionistic
as I might once have been to handle *those* cases
with optimized constant division?

James Dow Allen

bartc

unread,

Dec 5, 2009, 6:10:23 AM12/5/09

to

"Beej Jorgensen" <be...@beej.us> wrote in message
news:hfbv82$se4$1...@news.eternal-september.org...

gcc is always giving problems when trying to benchmark things. Sometimes it
optimises out the very thing you're trying to measure the speed of.

In my own tests I had to discard the results that gave Ben's code a runtime
of 0.

I think if there exists some code that does the equivalent of the Mod37
method, but is faster, then perhaps that should become the evil trick
instead.

--
Bartc

James Dow Allen

unread,

Dec 5, 2009, 6:23:39 AM12/5/09

to

On Dec 5, 6:10 pm, "bartc" <ba...@freeuk.com> wrote:
> "Beej Jorgensen" <b...@beej.us> wrote in message

> > Looking at the assembly, gcc replaced the mod 37 with some hacker
> > goodies (not a div).
>
> gcc is always giving problems when trying to benchmark things. Sometimes it
> optimises out the very thing you're trying to measure the speed of.

But if there's a way to speed up mod 37, isn't that what
you'd want to measure the speed of?

Anyway, it's often not hard to defeat gcc's optimizations, either
by changing -O level, or just making a constancy less visible.

As I posted in this thread an hour ago, Anderson *does*
offer an alternative to the mod 37 trick; I don't know
about relative speed.

James

Mark Dickinson

unread,

Dec 5, 2009, 6:44:50 AM12/5/09

to

On Dec 5, 12:53 am, Beej Jorgensen <b...@beej.us> wrote:
> Most fun is the use of the constant 0xBACF914D in the mod calculation.
> They multiply that by the value to lookup, and discard the low 32-bits
> of the result. Then from that they subtract the value to lookup, and
> shift the result right by one. From there, things start to get strange,
> until, 11 instructions later, they magically end up with an index into
> the lookup table.

Very interesting. My first guess was that 0xBACF914D gives the most
significant 32 bits of the reciprocal of 37, so that the division by
37 can be replaced with a multiplication by its reciprocal. But it
turns out to be more interesting than that: the constant is ceiling
(2^32 * 27 / 37). With gcc-4.2 -O3 on x86-64, I get the following.
(The original value, call it 'x', starts out in register edi.)

movl $-1160801971, %edx
movl %edi, %eax
mull %edx

edx now contains something close to (x*2^32*27/37) >> 32 = x * 27 /
32. (-1160801971 + 2^32 = 0xBACF914D.)

movl %edi, %eax
subl %edx, %eax
shrl %eax
addl %eax, %edx
shrl $5, %edx

Put x - x*27/37 = x*10/37 in eax; shift 1 to get x*5/37; add to the
original x*27/37 to get x*32/37; shift 5 places to get x/37. There
seems to be a possibility that the result is out by 1 or 2 by this
stage; presumably a careful analysis would show that this always
works out exactly. We've now got the quotient floor(x/37) in edx.

leal (%rdx,%rdx,8), %eax
leal (%rdx,%rax,4), %eax
subl %eax, %edi

Addition chain to multiply the quotient by 37 (1st step puts 9*edx in
eax; second step puts edx + 4*eax = 37*edx in eax.) Then subtract to
get remainder in edi: x%37 = x-37*floor(x/37).

leaq _dlog(%rip), %rax
movzbl (%rdi,%rax), %eax

Do the lookup.

> In short, I'm very glad that people specialize in exactly this thing. :)

Me too!

--
Mark

Mark Dickinson

unread,

Dec 5, 2009, 6:59:00 AM12/5/09

to

On Dec 5, 10:16 am, James Dow Allen <jdallen2...@yahoo.com> wrote:
> On Dec 4, 9:17 pm, Mark Dickinson <dicki...@gmail.com> wrote:
>
> > From the evil tricks department, assuming you really do have exactly
> > one bit set: the values 1, 2, 4, ..., 1<<31 are all distinct modulo
> > 37, so simply reduce modulo 37 and then use a lookup table.
>
> This is shown (in a slightly disguised form) in
> Bit Twiddling Hacks
> By Sean Eron Anderson
> Anderson discovered it but "found it was invented earlier
> by Reiser, according to Hacker's Delight."

Thanks for the reference. I have no memory of where I got this from,
but I guess it must have been some source derived from one of the
above. I just know that the constant '37' has long been embedded
somewhere in my brain for this purpose.

> [...]

> Speaking of that replace-divide-with-multiply; it
> doesn't work when the divisor is variable, but
> what if it's variable at compile-time
> but fixed early during initialization? Is any
> comp.programmer as masochisticly perfectionistic
> as I might once have been to handle *those* cases
> with optimized constant division?

I believe GMP does something a bit like this: precalculating inverses
(at runtime) for divisors that are likely to be used more than once.
It makes sense for multi-limb divisions, where you end up repeatedly
dividing by (something related to) the top limb of the divisor.

--
Mark

Willem

unread,

Dec 5, 2009, 7:17:54 AM12/5/09

to

James Dow Allen wrote:
) On Dec 5, 6:10�pm, "bartc" <ba...@freeuk.com> wrote:
)> "Beej Jorgensen" <b...@beej.us> wrote in message
)> > Looking at the assembly, gcc replaced the mod 37 with some hacker
)> > goodies (not a div).
)>
)> gcc is always giving problems when trying to benchmark things. Sometimes it
)> optimises out the very thing you're trying to measure the speed of.
)
) But if there's a way to speed up mod 37, isn't that what
) you'd want to measure the speed of?

If the mod37 gets sped up to some weird combination of unsigned multiply
and some shifts and adds, perhaps there is a way to do it with just the
unsigned multiply ? Perhaps there is some multiplication factor f and
constant m so that ((2^x)*f (mod 2^m)) is distinct for x from 0 to 32 ?

Mark Dickinson

unread,

Dec 5, 2009, 7:51:31 AM12/5/09

to

On Dec 5, 10:16 am, James Dow Allen <jdallen2...@yahoo.com> wrote:

> Anderson also shows another way using multiply instead
> of divide, though *not* the replace-divide-with-multiply
> mentioned downthread.

I just checked this out: it's *much* nicer (simpler, faster,
neater, more satisfying) than the mod 37 hack!

http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup

It seems to depend only on the observation that in the 32-bit
binary number:

x = 00000111011111001011010100110001

the 32 5-bit cyclic subsequences are all distinct, so the
top 5 bits of x << i are different for each i in the range
0 <= i < 32.

--
Mark

Phil Carmody

unread,

Dec 5, 2009, 7:56:56 AM12/5/09

to

If you use it, try to use a variation which avoids UB.

Reduction modulo 37 may not be a fast operation, so a
multiply and shift version may be better. A 5-bit LFSR
is your friend.

Phil
--
Any true emperor never needs to wear clothes. -- Devany on r.a.s.f1

Phil Carmody

unread,

Dec 5, 2009, 7:58:55 AM12/5/09

to

Willem <wil...@stack.nl> writes:
> Ben Bacarisse wrote:
> ) And, as it happens, if you need to do the same for 64 bits, the
> ) smallest modulus is 67 -- the 7 making both 37 and 67 reasonably easy
> ) to remember.
>
> Question for the mathematically inclined: Does it work iff the
> modulus has no factors smaller than the number of bits you need ?

No.

Kaz Kylheku

unread,

Dec 5, 2009, 1:33:41 PM12/5/09

to

On 2009-12-05, bartc <ba...@freeuk.com> wrote:
>
> "Beej Jorgensen" <be...@beej.us> wrote in message
> news:hfbv82$se4$1...@news.eternal-september.org...
>> Antoninus Twink <nos...@nospam.invalid> wrote:
>>>- but in practice I'd be extremely surprised if it beats Ben
>>>Becarisse's bit-twiddling solution, which just has a few ands, shifts
>>>and adds.
>>
>> Relative runtimes on a quick-and-dirty benchmark on my Athlon64 (I'm
>> presuming the LUT was cached):
>>
>> unoptimized -O2
>> Ben: 1.0 1.0
>> Mod37: 1.4 0.7
>>
>> Looking at the assembly, gcc replaced the mod 37 with some hacker
>> goodies (not a div).
>
> gcc is always giving problems when trying to benchmark things. Sometimes it
> optimises out the very thing you're trying to measure the speed of.

So what is exactly that you are trying to benchmark? Your CPU's
instructions for computing a quotient and remainder?

If that's what you are after, then C is the wrong interface,
don't you think. Write the exact instructions you want and
benchmark that.

Write down a specification which describes as precisely as possible the
entity that you are trying to benchmark, and then ensure that the
benchmark program actually measures that entity.

Ian Collins

unread,

Dec 5, 2009, 2:06:48 PM12/5/09

to

or an optimiser that does the trick for you.

--
Ian Collins

bartc

unread,

Dec 5, 2009, 2:47:57 PM12/5/09

to

"Kaz Kylheku" <kkyl...@gmail.com> wrote in message
news:200912051...@gmail.com...

> On 2009-12-05, bartc <ba...@freeuk.com> wrote:

>> gcc is always giving problems when trying to benchmark things. Sometimes
>> it
>> optimises out the very thing you're trying to measure the speed of.
>
> So what is exactly that you are trying to benchmark? Your CPU's
> instructions for computing a quotient and remainder?

In this case, whether a solution involving a mod operation and a table
lookup is, in general, faster or slower than one using a bunch of shifts and
masks.

I want to find out the efficiency of my C compiler in generating code for
mods, lookups, shifts and masks, not how clever it is in eliminating those
operations!

Since a simple benchmark often involves repeating a simple operation
millions of times, gcc especially tries to avoid those millions of
iterations, often resulting in zero runtimes.

Sure I can turn off optimisation completely, but that gives misleading
results too.

(I work on my own language projects and I spend a lot of time comparing
runtimes against C. If I have a recursive benchmark, I want to see how my
call/return code stacks up the call/return code that a good C compiler will
generate. Having the C compiler remove call/return operations is not really
very helpful. So it does tail recursion elimination; I could do that too,
but that's not what I'm testing...)

> If that's what you are after, then C is the wrong interface,
> don't you think. Write the exact instructions you want and
> benchmark that.

On my cpu, there's an instruction BSF that does exactly this task, return
the bit index of the only (or lowest) set bit.

A good test is how well these C routines compare against that.

Using a benchmark which finds the set bit for 1,2,4,8,,,,42949672296 in
turn, executing the inner loop body ten times to reduce the effect of loop
overheads, and doing the whole thing 10 million times, took 670ms using
inline assembler.

Using C compilers other than gcc, took various times from 3500ms to 6600ms
(using the mask/shift code).

gcc 3.4.5 with -O0 took 7100ms.

With -O1, -O2 and -O3 it took just 62ms. That's incredible! gcc even
with -O1 outperforming tight inline assembler by ten to one. And an
optimiser which speeds things up by over 100 times!

That's why care has to be taken especially with gcc which can be too clever
for it's own good.

--
Bartc

bartc

unread,

Dec 5, 2009, 3:20:49 PM12/5/09

to

"bartc" <ba...@freeuk.com> wrote in message
news:NNySm.12248$Ym4....@text.news.virginmedia.com...

>
> "Kaz Kylheku" <kkyl...@gmail.com> wrote in message
> news:200912051...@gmail.com...
>> On 2009-12-05, bartc <ba...@freeuk.com> wrote:
>
>>> gcc is always giving problems when trying to benchmark things. Sometimes
>>> it
>>> optimises out the very thing you're trying to measure the speed of.
>>
>> So what is exactly that you are trying to benchmark? Your CPU's
>> instructions for computing a quotient and remainder?

>> If that's what you are after, then C is the wrong interface,

>> don't you think. Write the exact instructions you want and
>> benchmark that.
>
> On my cpu, there's an instruction BSF that does exactly this task, return
> the bit index of the only (or lowest) set bit.
>
> A good test is how well these C routines compare against that.
>
> Using a benchmark which finds the set bit for 1,2,4,8,,,,42949672296 in
> turn, executing the inner loop body ten times to reduce the effect of loop
> overheads, and doing the whole thing 10 million times, took 670ms using
> inline assembler.

I meant just one million times; my machine is certainly not that fast.

> Using C compilers other than gcc, took various times from 3500ms to 6600ms
> (using the mask/shift code).

(I have an interpreted language which has a built-in operator .firstbit. The
same test took 4000ms, comparable with a normal C compiler. This shows the
advantage of having these kinds of ops built-in and known to the
language/compiler. The implementation uses, of course, the x86 BSF
instruction.)

--
Bartc

Beej Jorgensen

unread,

Dec 5, 2009, 5:52:01 PM12/5/09

to

bartc <ba...@freeuk.com> wrote:
>gcc is always giving problems when trying to benchmark things.
>Sometimes it optimises out the very thing you're trying to measure the
>speed of.

Yeah--I defeated this by feeding an accumulated summary value out of the
program at the end of the run; I also checked the assembly output to
make the code was still there.

But it's hard to get C to give a legit benchmark of some approach or
another which I why I labeled it "quick and dirty". As usual, the best
you can really take away from those numbers is that "sometimes
source-level C optimizations don't work out the way you think they
will."

-Beej

Nick

unread,

Dec 6, 2009, 5:58:36 AM12/6/09

to

Beej Jorgensen <be...@beej.us> writes:

> Antoninus Twink <nos...@nospam.invalid> wrote:
>>If it was doing an integer division (as it surely is in the unoptimized
>>version)
>
> There's a mult in the unoptimized version, but surprisingly no div.
> My x86-fu is weak, so I can't really decode it, but I know there are no
> DIVs in there. :) Lotsa moves, a mult, four shifts, two adds, and two
> subtracts.
>
> Most fun is the use of the constant 0xBACF914D in the mod calculation.
> They multiply that by the value to lookup, and discard the low 32-bits
> of the result. Then from that they subtract the value to lookup, and
> shift the result right by one. From there, things start to get strange,
> until, 11 instructions later, they magically end up with an index into
> the lookup table.
>
> In short, I'm very glad that people specialize in exactly this thing. :)

So we have an arcane arithmetical hack at source code level which is
optimised by the compiler into a different and equally arcane hack at
machine code level.

Wonderful!
--
Online waterways route planner: http://canalplan.org.uk
development version: http://canalplan.eu

Phil Carmody

unread,

Dec 6, 2009, 9:50:26 AM12/6/09

to

Oh, I don't mean reduction modulo 37 using a multiply, I mean a
completely different { 2^i: i \in 0..32 } -> { 0 .. 31 } mapping.

For example, taking an 8-bit example, multiply by a magic constant
like 0b0001110100, which has every single possible 3-bit subsequence,
which can be extracted from a fixed location after the multiplication.

Ben Bacarisse

unread,

Dec 6, 2009, 12:28:35 PM12/6/09

to

Phil Carmody <thefatphi...@yahoo.co.uk> writes:

> Ian Collins <ian-...@hotmail.com> writes:
>> Phil Carmody wrote:
>> > ral...@xs4all.nl (Richard Bos) writes:
>> >> Mark Dickinson <dick...@gmail.com> wrote:
>> >>
>> >>> On Dec 4, 9:16=A0am, "Paul" <-> wrote:
>> >>>> "sanjib" <sanjibkd...@gmail.com> wrote in message
>> >>>>> 1. What might be an efficient and fastest way to check set bits of an
>> >>>>> integer ? Suppose I have one bit set out of 32 bits of an integer.
>> >>>>> =A0 (I can think of K&R approach i.e. based on iterations which is
>> >>>>> equal to the no of set bits in an integer.)
>> >>>> If you definitely only have one bit set, you could try using a switch/
>> >>>> case, and hope that the compiler makes it a fast one.
>> >>> From the evil tricks department, assuming you really do have exactly
>> >>> one bit set: the values 1, 2, 4, ..., 1<<31 are all distinct modulo
>> >>> 37, so simply reduce modulo 37 and then use a lookup table.
>> >> That's a _very_ evil trick. I like it.
>> > If you use it, try to use a variation which avoids UB.
>> > Reduction modulo 37 may not be a fast operation, so a
>> > multiply and shift version may be better. A 5-bit LFSR
>> > is your friend.
>>
>> or an optimiser that does the trick for you.
>
> Oh, I don't mean reduction modulo 37 using a multiply, I mean a
> completely different { 2^i: i \in 0..32 } -> { 0 .. 31 } mapping.

I think you mean i \in 0..31.

> For example, taking an 8-bit example, multiply by a magic constant
> like 0b0001110100, which has every single possible 3-bit subsequence,
> which can be extracted from a fixed location after the
> multiplication.

A solution using such a de Bruijn sequence has already been posted[1].
On my x86_64 machine it is the fastest solution so far:

mine & and shift: 84ns
mod 37 + lookup: 43ns
de Bruijn multiply plus lookup: 19ns

All are compiler with optimisation (the function ended up inlined and
%37 was done as previouslt descibed).

The de Bruijn method omits the x & -x from the bithacks site since we
don't need it here:

int bit_set_4(uint32_t x)
{
static const int MultiplyDeBruijnBitPosition[32] = {
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
return MultiplyDeBruijnBitPosition[x * 0x077CB531U >> 27];
}

[1] http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup

--
Ben.

Thad Smith

unread,

Dec 6, 2009, 11:35:49 PM12/6/09

to

Ben Bacarisse wrote:
> The de Bruijn method omits the x & -x from the bithacks site since we
> don't need it here:
>
> int bit_set_4(uint32_t x)
> {
> static const int MultiplyDeBruijnBitPosition[32] = {
> 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
> 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> };
> return MultiplyDeBruijnBitPosition[x * 0x077CB531U >> 27];
> }

Portable version, probably same size on common processors:

return MultiplyDeBruijnBitPosition[(x * 0x077CB531U & 0xffffffffU)
>> 27];

--
Thad

bartc

unread,

Dec 7, 2009, 6:37:27 AM12/7/09

to

"Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message

> A solution using such a de Bruijn sequence has already been posted[1].
> On my x86_64 machine it is the fastest solution so far:
>
> mine & and shift: 84ns
> mod 37 + lookup: 43ns
> de Bruijn multiply plus lookup: 19ns
>
> All are compiler with optimisation (the function ended up inlined and
> %37 was done as previouslt descibed).
>
> The de Bruijn method omits the x & -x from the bithacks site since we
> don't need it here:
>
> int bit_set_4(uint32_t x)
> {
> static const int MultiplyDeBruijnBitPosition[32] = {
> 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
> 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> };
> return MultiplyDeBruijnBitPosition[x * 0x077CB531U >> 27];
> }

That's pretty good, about as fast (on my machine) as a function containing
just a single BSF inline instruction.

But it might be worth pointing out that this code I think only works
properly if there is one and only one bit set. For x==0, it gives the same
result as x==1, and for x's with extra bits set, it seems to go wrong.

BSF (find lowest set bit) also doesn't deal with x = 0, but gives expected
results on the rest.

--
Bartc

Ben Bacarisse

unread,

Dec 7, 2009, 9:56:38 AM12/7/09

to

"bartc" <ba...@freeuk.com> writes:

You mean pointed out more explicitly than my saying that I removed the
x & -x from the linked bithack because we don't need it in this
example? More explicitly than all the text up thread about "if you
really have only one bit set then..."? :-)

> BSF (find lowest set bit) also doesn't deal with x = 0, but gives
> expected results on the rest.

To add to my timings:

gcc asm("bsf"): 14ns;

--
Ben.

bartc

unread,

Dec 7, 2009, 12:00:59 PM12/7/09

to

"Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message

news:0.32f863dbc48801b394a4.2009...@bsb.me.uk...

> "bartc" <ba...@freeuk.com> writes:
>> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
>>> A solution using such a de Bruijn sequence has already been posted[1].
>>> On my x86_64 machine it is the fastest solution so far:
>>>
>>> mine & and shift: 84ns
>>> mod 37 + lookup: 43ns
>>> de Bruijn multiply plus lookup: 19ns
>>>
>>> All are compiler with optimisation (the function ended up inlined and
>>> %37 was done as previouslt descibed).

>>> int bit_set_4(uint32_t x)

>>> {
>>> static const int MultiplyDeBruijnBitPosition[32] = {
>>> 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>>> 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
>>> };
>>> return MultiplyDeBruijnBitPosition[x * 0x077CB531U >> 27];
>>> }
>>
>> That's pretty good, about as fast (on my machine) as a function
>> containing just a single BSF inline instruction.
>>
>> But it might be worth pointing out that this code I think only works
>> properly if there is one and only one bit set. For x==0, it gives the
>> same result as x==1, and for x's with extra bits set, it seems to go
>> wrong.
>
> You mean pointed out more explicitly than my saying that I removed the
> x & -x from the linked bithack because we don't need it in this
> example? More explicitly than all the text up thread about "if you
> really have only one bit set then..."? :-)

Yeah but that was last week.

>> BSF (find lowest set bit) also doesn't deal with x = 0, but gives
>> expected results on the rest.
>
> To add to my timings:
>
> gcc asm("bsf"): 14ns;

I hadn't looked at your actual figures too closely before, but they don't
seem that fast.

On my gcc -O0 3.4.5 x86-32 setup, I'm getting about 1.5 secs for 320 million
calls, about 5ns each (summary of code below). With your original code, 3.5
to 7 secs without inlining (11 to 22ns). My interpreted language, even, does
11ns. Properly inlined assembler I think was some 0.6secs, around 2ns.

(With gcc -O1/2/3, the timings just go to zero.)

Obvuiously your tests must be very different or include extra overheads.

/* bit_set_4() defined with unsigned long (32-bit) parameter */
int main(void) {
int a;
int i,j,k,n;
#define N 1000000

//starttimer();

for (n=0; n<N; ++n){
j=1;
for (i=0; i<32; ++i) {
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
j<<=1;
}
}
//stoptimer();

}

I'm sure that's 320 million calls to bit_set_4() in there.

--
Bartc

Ben Bacarisse

unread,

Dec 7, 2009, 7:03:11 PM12/7/09

to

"bartc" <ba...@freeuk.com> writes:

> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
> news:0.32f863dbc48801b394a4.2009...@bsb.me.uk...
>> "bartc" <ba...@freeuk.com> writes:
>>> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
>>>> A solution using such a de Bruijn sequence has already been posted[1].
>>>> On my x86_64 machine it is the fastest solution so far:
>>>>
>>>> mine & and shift: 84ns
>>>> mod 37 + lookup: 43ns
>>>> de Bruijn multiply plus lookup: 19ns
>>>>
>>>> All are compiler with optimisation (the function ended up inlined and
>>>> %37 was done as previouslt descibed).

<snip>

>> To add to my timings:
>>
>> gcc asm("bsf"): 14ns;
>
> I hadn't looked at your actual figures too closely before, but they
> don't seem that fast.

Ah, yes. I used the wrong divisor when I divided to get ns per op
from total seconds. Double checking, I now get:

mask and shift: 5.22ns minus overheads: 4.99ns
mod 37: 2.71ns 2.48ns
de Bruijn: 1.18ns 0.95ns
asm(BSF): 0.89ns 0.66ns
overhead (estimate): 0.23ns

> On my gcc -O0 3.4.5 x86-32 setup, I'm getting about 1.5 secs for 320
> million calls, about 5ns each (summary of code below). With your
> original code, 3.5 to 7 secs without inlining (11 to 22ns). My
> interpreted language, even, does 11ns. Properly inlined assembler I
> think was some 0.6secs, around 2ns.

The above seem reasonable now. For example, 0.66ns is
16.5 cycles of my processor. The Intel manual suggests that BSF has a
latency of 16 cycles. The de Bruijn code is therefore about 1.4 times
the speed of the single instruction.

> (With gcc -O1/2/3, the timings just go to zero.)

My timing are for optimised code (-O2 seems to be as fast as it gets).

> Obvuiously your tests must be very different or include extra
> overheads.

No, just faulty arithmetic!

<snip>
--
Ben.

Moi

unread,

Dec 9, 2009, 7:00:50 AM12/9/09

to

On Mon, 07 Dec 2009 11:37:27 +0000, bartc wrote:

>
> But it might be worth pointing out that this code I think only works
> properly if there is one and only one bit set. For x==0, it gives the
> same result as x==1, and for x's with extra bits set, it seems to go
> wrong.
>
> BSF (find lowest set bit) also doesn't deal with x = 0, but gives
> expected results on the rest.

GNU's libc on linux has the ffs() ffsl() and ffsll() functions,
which are basically a BSF instruction (if present on the hardware),
; but with 1 added, so the rightmost bit will be returned as 1.
A return value of zero indicates "no bits were set".

Personally I would prefer -1 as a return value
for "no bits set", but that's a matter of taste.

HTH,
AvK