
The unbelievable slowness of larger int types (in my experiment)


Paul

Nov 15, 2018, 3:35:59 PM
I had fun timing the function below, which uses the most naive method
of computing a popcount. I ran it from i = 0 to 10 million in a for loop.
This took 2.5 seconds. However, 10 million is small enough that you can
replace 1ull by 1 with no problems. This cut the time from 2.5 seconds
to 0.98 seconds, and I did this experiment many times.
I am admittedly omitting the code for the for loop here, but if people
feel it's relevant, I'd be happy to supply it.
The form was for(int i = 0; i < 10 million; ++i)
basicPopcount(i);

I'm using the Codeblocks IDE with a recent version of gcc and a modern
Windows operating system. My laptop is short of memory so I wonder if that's got anything to do with it.
Is it really that slow to use non-int integer types and, if so, what can
be done about it? Using unsigned long long increases the runtime by 2.5 times.


int basicPopcount(int x)
{
    int count = 0;
    int place = 0;
    // An overflow bug can arise in this method for large values of x,
    // so make sure the types are large enough
    while(1ull << place <= x)
        if(1ull << place++ & x)
            ++count;

    return count;
}

Jorgen Grahn

Nov 15, 2018, 4:29:40 PM
On Thu, 2018-11-15, Paul wrote:
> I had fun timing the below function which uses the most naive method
> of doing a popcount. I ran it from i = 0 to 10 million in a for loop.
> This took 2.5 seconds. However, 10 million is small enough that you can
> replace 1ull by 1 with no problems. This cut the time from 2.5 seconds
> to 0.98 seconds, and I did this experiment many times.
> I am admittedly omitting the code for the for loop here, but if people
> feel it's relevant, I'd be happy to supply it.

It's extremely relevant.

> The form was for(int i = 0; i < 10 million; ++i)
> basicPopcount(i);

If the compiler had access to the basicPopcount() implementation
when it compiled that, I'm surprised it wasn't all optimized away.
It doesn't do anything! Did you enable optimization when compiling?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Paul

Nov 15, 2018, 5:12:04 PM
Thanks, Jorgen.
I just used the default gcc settings previously.
This explains why it wasn't all optimized away.

I changed to -O3 optimization. I'm not sure whether this is best
or what you would advise (?).
I changed the code to try to force the compiler to run through the loop
while doing as little extra work as possible, so now it
accumulates the results.

Full code is below. My times are derived from the "execution time" reported
on my console. This varies but is approximately 2 seconds for the 1ull
version and 0.8 seconds for the version where I use only ints.
For the sake of clarity, the full text of both versions is below.
I've learned a lot already.
Many thanks for your further thoughts,

Paul

/************************ With ull *************/
#include <iostream>

int basicPopcount(int x)
{
    int count = 0;
    int place = 0;
    // An overflow bug can arise in this method for large values of x,
    // so make sure the types are large enough
    while(1ull << place <= x)
        if(1ull << place++ & x)
            ++count;

    return count;
}

int main()
{
    int count = 0;
    for(int i = 0; i < 10 * 1000 * 1000; ++i)
        count += basicPopcount(i);

    std::cout << count;
}

/************************ Without ull *************/
#include <iostream>

int basicPopcount(int x)
{
    int count = 0;
    int place = 0;
    // An overflow bug can arise in this method for large values of x,
    // so make sure the types are large enough
    while(1 << place <= x)
        if(1 << place++ & x)
            ++count;

    return count;
}

// Main function is unchanged but is repeated for total clarity.
int main()
{
    int count = 0;
    for(int i = 0; i < 10 * 1000 * 1000; ++i)
        count += basicPopcount(i);

    std::cout << count;
}

Robert Wessel

Nov 15, 2018, 10:38:25 PM
On Thu, 15 Nov 2018 12:35:46 -0800 (PST), Paul <peps...@gmail.com>
wrote:
I just ran a quick GCC test (not on Windows, though). The generated code
is nearly identical, except for using 64-bit instead of 32-bit
registers for some of the operations, and those shouldn't see a big
performance difference on most processors. Are you by any chance
building 32-bit code? That will show a significant slowdown when
using the 64-bit type.

bitrex

Nov 16, 2018, 1:10:00 AM
why not use Compiler Explorer:

<https://godbolt.org/>

Write your code there and you can see precisely what it generates,
for many compiler versions and many architectures.

Paul

Nov 16, 2018, 3:16:08 AM
I'm sure that building 32-bit code is the problem.
Thanks for pointing this out.
I'm using gcc with CLion.
I'm going to attempt to use these instructions to fix the issue:
https://dev.my-gate.net/2014/09/15/how-to-use-mingw-w64-with-clion/

Please let me know if you have anything else to add.

Paul

Paul

Nov 16, 2018, 3:17:28 AM
Thanks, good tip.

Paul

Juha Nieminen

Nov 16, 2018, 7:22:24 AM
Paul <peps...@gmail.com> wrote:
> I had fun timing the below function which uses the most naive method
> of doing a popcount. I ran it from i = 0 to 10 million in a for loop.
> This took 2.5 seconds. However, 10 million is small enough that you can
> replace 1ull by 1 with no problems. This cut the time from 2.5 seconds
> to 0.98 seconds, and I did this experiment many times.

I haven't checked, but when creating such benchmarks you should
*always* check that the compiler isn't optimizing the code away
because the result is unused.

It's a very typical mistake to make. Just recently I had a
conversation with someone in another forum who thought that his
O(n^2) linked list traversal was faster than traversing a
std::list with iterators... only to find out that the compiler
was optimizing his code away completely because it noticed that
the end result was not being used for anything.

(Not saying you have made such a mistake. Just pointing out that
you should make sure that's not what's happening.)

One way to check what the compiler is doing is to make it output
in asm and examine the resulting code. (Of course this requires
a modicum of understanding of asm.)

At the very least you should always do something with the end
result (such as printing it, even if you aren't interested in
what it actually is). Even then, the compiler can still be
overly smart and optimize away the very code you are trying
to benchmark.

Robert Wessel

Nov 17, 2018, 10:38:07 AM
On Fri, 16 Nov 2018 00:15:53 -0800 (PST), Paul <peps...@gmail.com>
If you want a short, bit-by-bit algorithm, why not just use the more
conventional basic approach of repeatedly shifting x right? Note that
I've changed the type to unsigned. Not tested, etc.:

int popcnt(unsigned x)
{
    int count = 0;
    while (x)
    {
        count += x & 1;
        x /= 2;
    }
    return count;
}

FWIW, GCC generates a rather shorter loop for that too.

Jorgen Grahn

Nov 17, 2018, 11:15:43 AM
On Thu, 2018-11-15, Paul wrote:
> On Thursday, November 15, 2018 at 9:29:40 PM UTC, Jorgen Grahn wrote:
>> On Thu, 2018-11-15, Paul wrote:
>> > I had fun timing the below function which uses the most naive method
>> > of doing a popcount. I ran it from i = 0 to 10 million in a for loop.
>> > This took 2.5 seconds. However, 10 million is small enough that you can
>> > replace 1ull by 1 with no problems. This cut the time from 2.5 seconds
>> > to 0.98 seconds, and I did this experiment many times.
>> > I am admittedly omitting the code for the for loop here, but if people
>> > feel it's relevant, I'd be happy to supply it.
>>
>> It's extremely relevant.
>>
>> > The form was for(int i = 0; i < 10 million; ++i)
>> > basicPopcount(i);
>>
>> If the compiler had access to the basicPopcount() implementation
>> when it compiled that, I'm surprised it wasn't all optimized away.
>> It doesn't do anything! Did you enable optimization when compiling?
>
> Thanks, Jorgen.
> I just used the default gcc settings previously.
> This explains why it wasn't all optimized away.
>
> I changed to -O3 optimization. I'm not sure whether this is best
> or what you would advise (?).

With gcc I tend to use -O2 or -Os. Sometimes -O3, if I can show that
it generates measurably faster code. But the main thing is enabling
the optimizer at all.

Ben Bacarisse

Nov 17, 2018, 4:17:31 PM
Robert Wessel <robert...@yahoo.com> writes:

>>> On Thu, 15 Nov 2018 12:35:46 -0800 (PST), Paul <peps...@gmail.com>
<snip>
>>> >int basicPopcount(int x)
>>> > {
>>> > int count = 0;
>>> > int place = 0;
>>> > // An overflow bug can arise in this method for large values of x
>>> > // So make sure types are large enough
>>> > while(1ull << place <= x)
>>> > if(1ull << place++ & x)
>>> > ++count;
>>> >
>>> > return count;
>>> > }
<snip>
> If you want a short, bit-by-bit, algorithm, why not just do the more
> conventional basic approach or repeatedly shifting x right? Note that
> I've changed the type to unsigned. Not tested, etc:
>
> int popcnt(unsigned x)
> {
> int count = 0;
> while (x)
> {
> count += x & 1;
> x /= 2;
> }
> return count;
> }
>
> FWIW, GCC generates a rather shorter loop for that too.

There's a neat way to clear the least significant set bit:

    int count = 0;
    for (; x; count++) x &= x - 1;
    return count;

It may of course be slower, but when you have only a few bits set it may
pay off.

--
Ben.

Paul

Nov 17, 2018, 4:20:12 PM
Your code is significantly faster than mine and avoids the problematic
comparison in my code.
The reason I coded it the way I did is that I was looking for the most
direct approach, so I asked "What's happening in the kth binary place?",
which led me to introduce 1 << place.
x /= 2; seems to execute faster than x >>= 1;, so maybe a problem with my
code is that bit shifting isn't as fast as the alternatives. I'm not sure
why my code is so much slower than yours -- it seems to take about 1.5
times as long to execute.

Paul

Paul

Nov 17, 2018, 4:25:13 PM
I think this trick, popularised (or maybe even invented) by Kernighan,
is known to be much faster in general than naive bit-by-bit methods.

It's a better big O. Your method is O(number of bits set) which is
better than O(log n). It doesn't need to have "only a few bits set" to
pay off. It pays off in general. It might be worse in cases where
almost all the bits are set -- I haven't tested.

Paul

Ben Bacarisse

Nov 17, 2018, 9:36:04 PM
You can't really say one is asymptotically better than the other unless
you start to talk about the distribution of data values, and most people
don't want to go there when using O(...) notation. When not otherwise
stated, O(...) usually denotes worst-case performance, and
here both algorithms are O(log n).

And if you switch to "average case" from "worst case", for uniformly
distributed data, O(number of bits set) = O((log n)/2) = O(log n).
You'd need some non-uniform assumption about the values to be able to
say that O(number of bits set) is better than O(log n).

--
Ben.

Paul

Nov 18, 2018, 4:19:10 AM
Ben,

Thanks. Some very valid and good points here.
For clarity, the two methods being discussed on these last few posts are
posted at the end of this thread.
The fact of the matter, which you will see if you test them, is that
(with the possible exception of some rare cases) the first implementation
is much faster. Your initial post on the first method doesn't
acknowledge that. Try it and see if you agree with me.
Naively, you'd expect it to take about half the time in general (because with random numbers you'd expect roughly half the bits to be set to 1).
I just tried testing the two with numbers up to 10 million.
The x &= x-1 trick reduced the time by 40%. The fact that this is not
quite 50% suggests that, if nearly all the original bits are set to 1,
iterating x by x /= 2 is narrowly faster. Obviously, these thoughts are
very tentative, because they depend on compiler settings and I haven't
run systematic trials.
Paul

int popcnt(unsigned x)
{
    int count = 0;
    for (; x; count++) x &= x - 1;
    return count;
}

and

int popcnt(unsigned x)
{
    int count = 0;
    while (x)
    {
        count += x & 1;
        x /= 2;
    }
    return count;
}

Ben Bacarisse

Nov 18, 2018, 7:41:25 AM
Yes, I was way too vague. It's going to take a bad compiler to make it
slower than the other method given here. Other constant-time methods
may beat it, depending on the architecture, compiler, etc.

> Try it and see if you agree with me. Naively, you'd expect it to take
> about half the time in general (because with random numbers you'd
> expect roughly half the bits to be set to 1). I just tried testing
> the two with numbers up to 10 million. The x &= x-1 trick reduced the
> time by 40%.

I got a slightly better reduction in time on uniform data. On worst
case data they take about the same time (as you would expect). But on
my machine __builtin_popcount beats all other methods hands-down by at
least a factor of 5.

By the way, the while (x) x /= 2 loop does not test all bits either but
it does test about twice the number that the other method tests. (I've
not done the exact computation. It would be a neat exercise for a maths
class.)

If I get time, I may try to time these against a table look-up method.
It's a long time since I've tried that.

> The fact that this is not quite 50% suggests that, if
> nearly all the original bits are set to 1, iterating x by x/=2 is
> narrowly faster. Obviously, these thoughts are very tentative because
> they depend on compiler settings, systematic trials etc. Paul
>
> int popcnt(unsigned x)
> {
> int count = 0;
> for (; x; count++) x &= x - 1;
> return count;
> }
>
> and
>
> int popcnt(unsigned x)
> {
> int count = 0;
> while (x)
> {
> count += x & 1;
> x /= 2;
> }
> return count;
> }

--
Ben.

Paul

Nov 18, 2018, 8:13:10 AM
On Sunday, November 18, 2018 at 12:41:25 PM UTC, Ben Bacarisse wrote:
... on
> my machine __builtin_popcount beats all other methods hands-down by at
> least a factor of 5.
...

Thanks for your thoughts. Could you clarify "all other methods" please?
How about the below?
My tests make it more or less an exact match for the builtin, and from
googling for discussions, that's what others report, too.
I don't think the below is unambiguously slower than the builtin.
Admittedly, SWAR is very difficult to explain to someone who hasn't seen
it before. Perhaps you mean "All other methods which are easily
explainable in a few minutes so long as someone knows a bit of C++".

Thanks again,
Paul

int swarPopcount(unsigned i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

Ben Bacarisse

Nov 18, 2018, 10:37:17 AM
Paul <peps...@gmail.com> writes:

> On Sunday, November 18, 2018 at 12:41:25 PM UTC, Ben Bacarisse wrote:
> ... on
>> my machine __builtin_popcount beats all other methods hands-down by at
>> least a factor of 5.
> ...
>
> Thanks for your thoughts. Could you clarify "all other methods"
> please?

A mistake. I should have said both other methods -- the two loops you
were comparing. I have not tested all the others, by any means!

> How about the below?
> My tests make it more or less an exact match for the builtin, and googling
> for discussions, that's what others are saying, too.

Optimised (-O2 and -O3), it's faster than the __builtin_popcount on my
system and takes about twice the time unoptimised.

This:

unsigned c;
__asm__ volatile ("POPCNT %1, %0;"
                  : "=r"(c)
                  : "r"(x));
return c;

is faster than the SWAR version (and therefore the builtin one and the
loops), but the difference becomes small when both are optimised. Of
course it's horribly non-portable. (I'd assumed that __builtin_popcount
would use this method, but it appears not.)

(BTW, I don't know much about __asm__ in gcc so if anyone wants to tell
me how I should have done it, I'd be grateful.)

> I don't think the below is unambiguously slower than the builtin.
> Admittedly, SWAR is very difficult to explain to someone who hasn't seen
> it before. Perhaps you mean "All other methods which are easily
> explainable in a few minutes so long as someone knows a bit of C++".

No, just a slip of the fingers. "All" should have been just the two
loops being compared in those recent posts.

> int swarPopcount(unsigned i)
> {
> i = i - ((i >> 1) & 0x55555555);
> i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
> return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
> }

--
Ben.

Jorgen Grahn

Nov 18, 2018, 11:44:21 AM
On Sat, 2018-11-17, Paul wrote:
> On Saturday, November 17, 2018 at 3:38:07 PM UTC, robert...@yahoo.com wrote:
...

> Your code is significantly faster than mine and avoids the
> problematic comparison in my code. The reason I coded it the way I
> did is that I was looking for the most direct approach so I said
> "What's happening in the kth binary place" So this led to me
> introducing 1 << place.

I was wondering about that (but I didn't ask). What you write makes
sense: scan through position 0, 1, 2, ... and look for set bits.

Yet my instinct is to do it like robert: draining all the bits out
through the bottom, as it were.

> x/=2; seems to execute faster than x >>= 1; so maybe a problem with
> my code is that bit shifting isn't as fast as the alternatives.

That seems unlikely. Back in the 1980s, people used to write n>>2
instead of n/4, because compilers were poor and DIV instructions very
expensive. But I soon learned that a half-decent optimizer would turn
both into the same assembly code (if n was unsigned, anyway).

David Brown

Nov 18, 2018, 1:37:07 PM
POPCNT was added with the "nehalem" generation of Intel chips, along
with the SSE4 instructions, as far as I can tell. In the AMD line, it
came a little later. __builtin_popcount on gcc (and presumably also
clang) will use POPCNT on x86, and equivalent instructions for other
processors, if it is told to generate code for a device family that
supports it.

Thus if you just ask for "x86-64" instructions, gcc won't use the POPCNT
instruction as it is not supported by all x86-64 devices. But if you
ask for "-march=nehalem", you /will/ get it for __builtin_popcount.

In particular, if you use "-march=native", gcc will generate code that
makes use of all the features of the cpu you are using at the time.
That is a good idea for programs that you will run on your own
system, but it limits the portability of the binary to other processors.
If you really want fast code /and/ portability, gcc supports generating
multiple copies of functions, with run-time selection of the actual
version based on the processor in use at the time.

> (BTW, I don't know much about __asm__ in gcc so if anyone wants to tell
> me how I should have done it, I'd be grateful.)
>

It looks fine to me. You can use names instead of numbered parameters,
which is useful for more advanced inline assembly, but numbered
parameters are okay here.

Ben Bacarisse

Nov 18, 2018, 2:48:27 PM
David Brown <david...@hesbynett.no> writes:

> On 18/11/2018 16:37, Ben Bacarisse wrote:
>> Paul <peps...@gmail.com> writes:
<snip>
>>> How about the below?
>>> My tests make it more or less an exact match for the builtin, and googling
>>> for discussions, that's what others are saying, too.
>>
>> Optimised (-O2 and -O3), it's faster than the __builtin_popcount on my
>> system and takes about twice the time unoptimised.
>>
>> This:
>>
>> unsigned c;
>> __asm__ volatile ("POPCNT %1, %0;"
>> :"=r"(c)
>> :"r"(x)
>> );
>> return c;
>>
>> is faster than the SWAR version (and therefore the builtin one and the
>> loops) but the difference becomes small when both are optimised. But of
>> course it's horribly non-portable. I'd assumed that __builtin_popcount
>> would use this method but it appear not.)
>>
>
> POPCNT was added with the "nehalem" generation of Intel chips, along
> with the SSE4 instructions, as far as I can tell. In the AMD line, it
> came a little later. __builtin_popcount on gcc (and presumably also
> clang) will use POPCNT on x86, and equivalent instructions for other
> processors, if it is told to generate code for a device family that
> supports it.
<snip>
> In particular, if you use "-march=native", gcc will generate code that
> makes use of all the features of the cpu you are using at the
> time.

Interesting. I had forgotten that I was just doing a generic compile.
I've tried -march=native and the timing is exactly as for the __asm__
version. Thanks for that tip.

(Yet another way in which benchmarking code like this is increasingly
complicated.)

<snip>
--
Ben.

David Brown

Nov 19, 2018, 2:56:30 AM
It is not just benchmarking code like this that is complicated - it is
writing maximally efficient code that is complicated. For this kind of
thing, there is no universal "best" method. My preference for such
cases is to use compiler-specific features like __builtin_popcount()
where possible (with fall-backs to generic code if you want more
portability). And then tell the compiler as much as you can about the
target - using -march flags with gcc, and similar flags on other
compilers.

The important point is not to try to figure out clever tricks in your
own code, but to work /with/ the compiler as best you can. "Clever"
tricks have a tendency to work against the compiler, perhaps relying on
undefined behaviour (though not in this case, as far as I have noticed)
or limiting its ability to do wider optimisations such as constant
propagation. The place for clever tricks these days is in the compiler,
rather than in the user code.

(Of course, sometimes you play with clever tricks for fun.)



Ben Bacarisse

Nov 19, 2018, 6:33:04 AM
That probably matches your typical usage. For software that gets built
using a generic compile (because a moderately generic binary package is
being built) you'll end up with a slowish popcount function (at least
with the examples I tried).

> The important point is not to try to figure out clever tricks in your
> own code, but to work /with/ the compiler as best you can. "Clever"
> tricks have a tendency to work against the compiler, perhaps relying on
> undefined behaviour (though not in this case as far as I have noticed)
> or limit its ability to do wider optimisations such as constant
> propagation. The place for clever tricks these days is in the compiler,
> rather than in the user code.

That is generally true, and probably true in this case as well because a
factor of two (as here) is not much of a price to pay in most software
that will use a popcount function. It's fast even when it's not the
fastest.

There's also the problem that you might be writing code that can't
assume a particular compiler. It's then not going to be 100% clear what
working with the compiler really means.

> (Of course, sometimes you play with clever tricks for fun.)

Ack.

--
Ben.

David Brown

Nov 19, 2018, 6:57:39 AM
Yes, on my typical usage I know exactly what cpu the code will run on,
and can tune precisely.

For PC software that is delivered in source form, this will also
work - it is up to the user to have flags suitable for their system.
(You could put -march=native in makefiles, which is ideal for
people using the software on the same system they build it on, but
it makes the binary less portable to other systems.)

For software delivered as binaries, it's a different matter. The best
you can usually do is guess a minimal requirement for typical users.

And if you have code that is particularly performance sensitive, and
particularly affected by the exact processor capabilities (such as
POPCNT instruction or SIMD versions), then you can use gcc "function
multiversioning" and the "target_clones" function attribute to
automatically generate several versions of a function, tuned to
different processors. This adds a layer of indirection to the call,
which spoils some of the benefits of having a fast popcnt instruction -
it is best used on a function that does a bit more work, such as one
that calls popcount() in a loop. But it is simple to use:

__attribute__((target_clones("default,popcnt")))
int popcount(unsigned int x) {
    return __builtin_popcount(x);
}

Then you just use "popcount()" as a normal function in your code, and
you'll get either a generic version (using a lookup table, I think) or a
popcnt instruction if your cpu supports it.

>
>> The important point is not to try to figure out clever tricks in your
>> own code, but to work /with/ the compiler as best you can. "Clever"
>> tricks have a tendency to work against the compiler, perhaps relying on
>> undefined behaviour (though not in this case as far as I have noticed)
>> or limit its ability to do wider optimisations such as constant
>> propagation. The place for clever tricks these days is in the compiler,
>> rather than in the user code.
>
> That is generally true, and probably true in this case as well because a
> factor of two (as here) is not much of a price to pay in most software
> that will use a popcount function. It's fast even when it's not the
> fastest.
>
> There's also the problem that you might be writing code that can't
> assume a particular compiler. It's then not going to be 100% clear what
> working with the compiler really means.
>

I would say "working with the compiler" means something like :

#if defined(__GNUC__)
// Optimised version using gcc extensions
#elif defined(_MSC_VER)
// Optimised version using MSVC extensions
#elif ...
#else
// Fallback generic version
#endif

You can't tune for every compiler, but you might want to tune for (or
"work with") specific compilers.