Using #define in the construction of an array

Paul

Nov 9, 2018, 12:12:28 PM
Below is part of some code for counting how many bits of an integer are set,
using a lookup table.
I don't understand at all the way that #define is being used.
I'm fine with the static const unsigned char BitsSetTable256[256] declaration,
but I don't get how we can populate an array with #define statements.
It does compile with gcc, though.
I'd be very grateful if anyone can explain it to me.

Thank you,

Paul

static const unsigned char BitsSetTable256[256] =
{
# define B2(n) n, n+1, n+1, n+2
# define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
# define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
B6(0), B6(1), B6(1), B6(2)
};

Thiago Adams

Nov 9, 2018, 12:43:17 PM
The preprocessor is a separate step, and it doesn't matter
where you place the #define directives.
The expansion happens here:
B6(0), B6(1), B6(1), B6(2)

After expansion, what the compiler will see is this:

static const unsigned char BitsSetTable256[256] =
{
0, 0+1, 0+1, 0+2, 0+1, 0+1+1, 0+1+1, 0+1+2, 0+1, 0+1+1, 0+1+1, 0+1+2, 0+2, 0+2+1, 0+2+1, 0+2+2,
0+1, 0+1+1, 0+1+1, 0+1+2, 0+1+1, 0+1+1+1, 0+1+1+1, 0+1+1+2, 0+1+1, 0+1+1+1, 0+1+1+1, 0+1+1+2, 0+1+2, 0+1+2+1, 0+1+2+1, 0+1+2+2,
0+1, 0+1+1, 0+1+1, 0+1+2, 0+1+1, 0+1+1+1, 0+1+1+1, 0+1+1+2, 0+1+1, 0+1+1+1, 0+1+1+1, 0+1+1+2, 0+1+2, 0+1+2+1, 0+1+2+1, 0+1+2+2,
0+2, 0+2+1, 0+2+1, 0+2+2, 0+2+1, 0+2+1+1, 0+2+1+1, 0+2+1+2, 0+2+1, 0+2+1+1, 0+2+1+1, 0+2+1+2, 0+2+2, 0+2+2+1, 0+2+2+1, 0+2+2+2,
1, 1+1, 1+1, 1+2, 1+1, 1+1+1, 1+1+1, 1+1+2, 1+1, 1+1+1, 1+1+1, 1+1+2, 1+2, 1+2+1, 1+2+1, 1+2+2,
1+1, 1+1+1, 1+1+1, 1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+2, 1+1+2+1, 1+1+2+1, 1+1+2+2,
1+1, 1+1+1, 1+1+1, 1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+2, 1+1+2+1, 1+1+2+1, 1+1+2+2,
1+2, 1+2+1, 1+2+1, 1+2+2, 1+2+1, 1+2+1+1, 1+2+1+1, 1+2+1+2, 1+2+1, 1+2+1+1, 1+2+1+1, 1+2+1+2, 1+2+2, 1+2+2+1, 1+2+2+1, 1+2+2+2,
1, 1+1, 1+1, 1+2, 1+1, 1+1+1, 1+1+1, 1+1+2, 1+1, 1+1+1, 1+1+1, 1+1+2, 1+2, 1+2+1, 1+2+1, 1+2+2,
1+1, 1+1+1, 1+1+1, 1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+2, 1+1+2+1, 1+1+2+1, 1+1+2+2,
1+1, 1+1+1, 1+1+1, 1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2, 1+1+2, 1+1+2+1, 1+1+2+1, 1+1+2+2,
1+2, 1+2+1, 1+2+1, 1+2+2, 1+2+1, 1+2+1+1, 1+2+1+1, 1+2+1+2, 1+2+1, 1+2+1+1, 1+2+1+1, 1+2+1+2, 1+2+2, 1+2+2+1, 1+2+2+1, 1+2+2+2,
2, 2+1, 2+1, 2+2, 2+1, 2+1+1, 2+1+1, 2+1+2, 2+1, 2+1+1, 2+1+1, 2+1+2, 2+2, 2+2+1, 2+2+1, 2+2+2,
2+1, 2+1+1, 2+1+1, 2+1+2, 2+1+1, 2+1+1+1, 2+1+1+1, 2+1+1+2, 2+1+1, 2+1+1+1, 2+1+1+1, 2+1+1+2, 2+1+2, 2+1+2+1, 2+1+2+1, 2+1+2+2,
2+1, 2+1+1, 2+1+1, 2+1+2, 2+1+1, 2+1+1+1, 2+1+1+1, 2+1+1+2, 2+1+1, 2+1+1+1, 2+1+1+1, 2+1+1+2, 2+1+2, 2+1+2+1, 2+1+2+1, 2+1+2+2,
2+2, 2+2+1, 2+2+1, 2+2+2, 2+2+1, 2+2+1+1, 2+2+1+1, 2+2+1+2, 2+2+1, 2+2+1+1, 2+2+1+1, 2+2+1+2, 2+2+2, 2+2+2+1, 2+2+2+1, 2+2+2+2
};
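
(You can see this for yourself by running only the preprocessor, e.g. gcc -E
on the source file; the output is the translation unit after all macros have
been expanded.)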

james...@alumni.caltech.edu

Nov 9, 2018, 12:57:30 PM
It's not populating the array with #define statements. That code would
work exactly the same if the #defines were moved well ahead of the array
declaration:

# define B2(n) n, n+1, n+1, n+2
# define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
# define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)

static const unsigned char BitsSetTable256[256] =
{
B6(0), B6(1), B6(1), B6(2)
};

The key thing you need to understand is that the #define of a macro
causes subsequent occurrences of that macro's name to be replaced by its
replacement list. This can be a multi-pass process, and that is the case
here. The first pass causes the declaration to be re-written as follows:

static const unsigned char BitsSetTable256[256] =
{
B4(0), B4(0+1), B4(0+1), B4(0+2), B4(1), B4(1+1), B4(1+1), B4(1+2), B4(1), B4(1+1), B4(1+1), B4(1+2), B4(2), B4(2+1), B4(2+1), B4(2+2)
};

The next pass replaces every occurrence of B4() with four occurrences of
B2(). For instance, "B4(1+1)" gets replaced by "B2(1+1), B2(1+1+1),
B2(1+1+1), B2(1+1+2)". The final pass replaces every occurrence of B2()
with a list of four individual expressions. For instance, "B2(1+1+1)"
gets replaced by "1+1+1, 1+1+1+1, 1+1+1+1, 1+1+1+2".

Does that make it clearer?

Paul

Nov 9, 2018, 3:25:49 PM
Yes, my question has been answered. The result is apparently that, for 0 <= k <= 255,
BitsSetTable256[k] = the number of bits set in the binary expansion of k.

No clue why this works, though. It's neat.

Paul

Scott Lurndal

Nov 9, 2018, 4:12:19 PM
Look at the number of set bits in each of the values 0 through 0xf:

0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4

Note the pattern:

0112   N
1223   N+1
1223   N+1
2334   N+2

look familiar?
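
In other words (my gloss, not part of Scott's post): each pair of bits
contributes 0, 1, 1 or 2 to the count, so the table satisfies
popcount(k) = {0,1,1,2}[k & 3] + popcount(k >> 2), which is exactly the
recurrence the nested B2/B4/B6 macros unroll. A minimal sketch, assuming
C++11 or later:

constexpr int low2[4] = {0, 1, 1, 2};   // set bits in a 2-bit value

constexpr int popcount_rec(unsigned k)
{
    // The low two bits contribute low2[k & 3]; recurse on the remaining bits.
    return k == 0 ? 0 : low2[k & 3] + popcount_rec(k >> 2);
}
// popcount_rec(i) == BitsSetTable256[i] for every i in 0..255.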


Alf P. Steinbach

Nov 9, 2018, 4:52:50 PM
Others have already explained about the preprocessor.

Below is one way to avoid using it for this task in C++.

I don't know whether it's worth it, but it may perhaps bring some
insight into the numbers (they're a mystery to me so far).


# define B2(n) n, n+1, n+1, n+2
# define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
# define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)

static const unsigned char BitsSetTable256[256] =
{
B6(0), B6(1), B6(1), B6(2)
};

//--------------------------------------------------------------
#include <array>
using Byte = unsigned char;

template< int level >
constexpr auto b( const int n, const int i )
    -> int
{
    const int sequence_length    = 1 << level;
    const int subsequence_length = sequence_length/4;
    const int subseq_i           = i/subsequence_length;

    return b<level-2>(
        n + subseq_i - (subseq_i >= 2),
        i%subsequence_length
        );
}

template<>
constexpr auto b<2>( const int n, const int i )
    -> int
{ return n + i - (i >= 2); }

constexpr auto bits_set_256( const int i )
    -> Byte
{ return b<8>( 0, i ); }

//--------------------------------------------------------------
#include <iostream>
#include <stdlib.h>
auto main()
    -> int
{
    using namespace std;
    for( int i = 0; i < 256; ++i )
    {
        const int a = BitsSetTable256[i];
        const int b = bits_set_256( i );
        if( a != b )
        {
            cout
                << "At index " << i << ": "
                << "BitsSetTable256 " << a << " != "
                << "bits_set_256 " << b << endl;
            return EXIT_FAILURE;
        }
    }
    cout << "Alles in Ordnung!" << endl;
    return EXIT_SUCCESS;
}


Cheers!,

- Alf


Jorgen Grahn

Nov 9, 2018, 6:22:13 PM
On Fri, 2018-11-09, Paul wrote:
> Below is part of some code for counting how many bits of an integer are set,
> using a lookup table.
...
> static const unsigned char BitsSetTable256[256] =
...

FWIW, I wouldn't do it like that: it's cryptic, and I suspect it's
slow. Lookup tables made more sense in the 1990s, before CPUs became
much faster than RAM.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Paavo Helde

Nov 10, 2018, 2:54:49 AM
On 10.11.2018 1:22, Jorgen Grahn wrote:
> On Fri, 2018-11-09, Paul wrote:
>> Below is part of some code for counting how many bits of an integer are set,
>> using a lookup table.
> ...
>> static const unsigned char BitsSetTable256[256] =
> ...
>
> FWIW, I wouldn't do it like that: it's cryptic, and I suspect it's
> slow. Lookup tables made more sense in the 1990s, before CPUs became
> much faster than RAM.

To be honest, for a 256-byte lookup table one should talk about CPU
cache speeds, not RAM speed. Also, if the lookup table is small enough
it can fit directly into CPU registers. Google found a link which claims
that the fastest way to count set bits is the built-in hardware
instruction (POPCNT in newer Intel/AMD), followed closely by a table
lookup for each 4 bits, all inside the CPU registers:

"https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer"


It also provides a high-level function, but in my eyes this is no less
cryptic than the macros ;-)

int numberOfSetBits(int i)
{
    // Java: use >>> instead of >>
    // C or C++: use uint32_t
    i = i - ((i >> 1) & 0x55555555);                 // sum bits in each 2-bit chunk
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // sum those into nibble counts
    // Sum nibbles into bytes, then the multiply adds all four byte counts
    // into the top byte, which the final shift extracts.
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}


Jorgen Grahn

Nov 10, 2018, 3:15:20 AM
On Sat, 2018-11-10, Paavo Helde wrote:
> On 10.11.2018 1:22, Jorgen Grahn wrote:
>> On Fri, 2018-11-09, Paul wrote:
>>> Below is part of some code for counting how many bits of an integer are set,
>>> using a lookup table.
>> ...
>>> static const unsigned char BitsSetTable256[256] =
>> ...
>>
>> FWIW, I wouldn't do it like that: it's cryptic, and I suspect it's
>> slow. Lookup tables made more sense in the 1990s, before CPUs became
>> much faster than RAM.
>
> To be honest, for a 256-byte lookup table one should talk about CPU
> cache speeds, not RAM speed.

Yes, but the difference between CPU and RAM speeds is the reason
caches exist. I didn't want to go too deep into all that, because
it's complex and I don't know it well.

Juha Nieminen

Nov 11, 2018, 7:39:37 AM
Jorgen Grahn <grahn...@snipabacken.se> wrote:
> FWIW, I wouldn't do it like that: it's cryptic, and I suspect it's
> slow. Lookup tables made more sense in the 1990s, before CPUs became
> much faster than RAM.

OK, show me a function that counts the number of 1-bits in a
bit array (represented, e.g., by an array of unsigned ints)
that does not use any lookup tables and is faster than a typical
implementation that does.

Öö Tiib

Nov 11, 2018, 8:01:08 AM
Perhaps using the POPCNT instruction of SSE4 is most efficient. I suspect
__popcnt in Visual Studio and __builtin_popcount in gcc do that,
but I've never needed to investigate it.

Jorgen Grahn

Nov 11, 2018, 8:01:22 AM
I wrote /suspect/, i.e. I don't claim to know. Anyway, realistic
benchmarks for cache-dependent code seem hard to produce. But surely
you agree that lookup tables are less useful today than 30 years ago?

FWIW, last time I implemented something like this, I used a lookup
table for nybbles. Using 16 bytes twice seemed much less of a waste
than using 256 bytes once, and that way readability didn't become an
issue.
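
A minimal sketch of such a nybble-table popcount (my reconstruction from the
description above, not Jorgen's actual code):

static const unsigned char nybble_bits[16] =
    { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };

// Use the 16-byte table twice to count the set bits of one octet.
static inline unsigned popcount8(unsigned char x)
{
    return nybble_bits[x & 0x0f] + nybble_bits[x >> 4];
}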

Scott Lurndal

Nov 11, 2018, 10:39:43 AM
Any use of the 1-cycle population count instruction would beat
any lookup table approach. gcc has __builtin_popcount().

size_t
countbits(const unsigned int * vector, size_t num_elements)
{
    size_t rval = 0ul;

    while (num_elements--) {
        rval += __builtin_popcount(*vector++);
    }
    return rval;
}

Even quicker if the vector consists of unsigned longs/long-longs.

Melzzzzz

Nov 11, 2018, 12:15:36 PM
Yeah, if you schedule 3-4 instructions it is as fast as an AVX2 version...


--
press any key to continue or any other to quit...

David Brown

Nov 12, 2018, 4:19:47 AM
> __builtin_popcount()

A number of processors have instructions for doing this - and a single
"popcnt" opcode will be faster than any manual implementation.

If you need a fast popcount function, it makes sense to see if your
compiler has a builtin or extension for it, or if there is a suitable
intrinsic function available for your target processor. Depending on
your portability needs, you will maybe need conditional compilation to
use gcc's intrinsic on gcc, MSVC's intrinsic on MSVC, etc., with a
fall-back to a generic algorithm.

And don't forget compiler flags that match the target processor families
- if your code doesn't have to run on an old device, don't limit the
compiler to the instruction set from the old devices.
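
(With gcc on x86, for example, compiling with -mpopcnt, or an -march= setting
that implies it, lets __builtin_popcount compile to a single popcnt
instruction; without such a flag gcc emits a call to a small library
fallback.)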

(Yes, I know this answer is cheating.)


I suspect, however, that when the processor does not have a dedicated
population count instruction, a lookup table will be the fastest
implementation. There is no point in bothering about a fast
implementation unless you are calling the function frequently, so the
table is likely to be in cache when it matters - and therefore very fast.

Juha Nieminen

Nov 12, 2018, 5:20:35 AM
David Brown <david...@hesbynett.no> wrote:
> (Yes, I know this answer is cheating.)

When I wrote the question, I actually expected many replies of that sort.
At least you were honest enough to admit that it's a non-standard
extension, not supported by most CPUs or compilers, and thus not
really an option in portable code.

> I suspect, however, that when the processor does not have a dedicated
> population count instruction, a lookup table will be the fastest
> implementation. There is no point in bothering about a fast
> implementation unless you are calling the function frequently, so the
> table is likely to be in cache when it matters - and therefore very fast.

Most (probably all?) implementations of std::bitset and
boost::dynamic_bitset use a lookup table algorithm for their count()
function, and it's really fast (and much faster than anything else
you could concoct, using standard C++.)

Even a simple 256-element lookup table is faster than anything else
not using lookup tables (because it essentially allows you to process
8 bits at a time, rather than having to check individual bits one by one).
And it only takes 256 bytes, so it ought to fit inside any CPU cache
quite comfortably.
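
For concreteness, here is a sketch of the byte-at-a-time lookup being
described, reusing the table from the start of the thread (my illustration,
not code from any actual std::bitset implementation):

inline int popcount32(unsigned int v)
{
    // One table lookup per byte of the 32-bit value.
    return BitsSetTable256[ v        & 0xff]
         + BitsSetTable256[(v >>  8) & 0xff]
         + BitsSetTable256[(v >> 16) & 0xff]
         + BitsSetTable256[(v >> 24) & 0xff];
}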

David Brown

Nov 12, 2018, 7:00:28 AM
On 12/11/18 11:20, Juha Nieminen wrote:
> David Brown <david...@hesbynett.no> wrote:
>> (Yes, I know this answer is cheating.)
>
> When I wrote the question, I actually expected many replies of that sort.
> At least you were honest enough to admit that it's a non-standard
> extension, not supported by most CPUs or compilers, and thus not
> really an option in portable code.
>

I am quite happy with the idea of code like:

#ifdef __GNUC__
inline int popcount(unsigned int x) {
    return __builtin_popcount(x);
}
#else
int popcount(unsigned int x) {
    // calculation
}
#endif

Often for this sort of thing, there is no "best" answer or best
algorithm. Conditional compilation like this can let you get the most
efficient code on platforms with the relevant extensions or opcodes, and
correct but slower code for wider portability. (If you have a lot of
such code, you might want to split things into different source files
than have lots of small conditional bits.)
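
For the MSVC intrinsic mentioned earlier, the analogous branch might look
like this (a sketch, not tested against any particular MSVC version;
__popcnt comes from <intrin.h> and assumes the CPU supports POPCNT):

#elif defined(_MSC_VER)
#include <intrin.h>
inline int popcount(unsigned int x) {
    return (int)__popcnt(x);   // MSVC intrinsic; no CPU-capability check here
}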

>> I suspect, however, that when the processor does not have a dedicated
>> population count instruction, a lookup table will be the fastest
>> implementation. There is no point in bothering about a fast
>> implementation unless you are calling the function frequently, so the
>> table is likely to be in cache when it matters - and therefore very fast.
>
> Most (probably all?) implementations of std::bitset and
> boost::dynamic_bitset use a lookup table algorithm for their count()
> function, and it's really fast (and much faster than anything else
> you could concoct, using standard C++.)
>
> Even a simple 256-element lookup table is faster than anything else
> not using lookup tables (because it essentially allows you to process
> 8 bits at a time, rather than having to check individual bits one by one).
> And it only takes 256 bytes, so it ought to fit inside any CPU cache
> quite comfortably.
>

Agreed.

Scott Lurndal

Nov 12, 2018, 9:06:57 AM
Juha Nieminen <nos...@thanks.invalid> writes:
>David Brown <david...@hesbynett.no> wrote:
>> (Yes, I know this answer is cheating.)
>
>When I wrote the question, I actually expected many replies of that sort.
>At least you were honest enough to admit that it's a non-standard
>extension, not supported by most CPUs or compilers, and thus not
>really an option in portable code.

The popcount instruction is supported in most _modern_ CPUs and
has been for quite some time. And on implementations of GCC for
processors that don't have a built-in population count instruction,
gcc will provide an appropriate fallback function.

>
>Even a simple 256-element lookup table is faster than anything else
>not using lookup tables (because it essentially allows you to process
>8 bits at a time, rather than having to check individual bits one by one).
>And it only takes 256 bytes, so it ought to fit inside any CPU cache
>quite comfortably.

Displacing something more useful in the cache...

__builtin_popcount is a viable solution for 99% of the code that
needs it - portable "pure" C/C++ code is a myth.


Jorgen Grahn

Nov 12, 2018, 4:08:29 PM
I wrote a silly benchmark, for (a) popcount using a small lookup table
for nybbles, and (b) using Paul's larger lookup table for octets.

galium:view/popcount% git remote -v
origin git://snipabacken.se/popcount (fetch)
origin git://snipabacken.se/popcount (push)

galium:view/popcount% make && time ./a ; time ./b
g++ -W -Wall -pedantic -std=c++11 -g -O3 -c -o a.o a.cc
g++ -W -Wall -pedantic -std=c++11 -g -O3 -o a a.o
g++ -W -Wall -pedantic -std=c++11 -g -O3 -c -o b.o b.cc
g++ -W -Wall -pedantic -std=c++11 -g -O3 -o b b.o
1962026240
./a 13.49s user 0.00s system 99% cpu 13.502 total
1962026240
./b 104.91s user 0.00s system 99% cpu 1:44.99 total

I expected (b) to be about as fast as (a) -- thus proving my theory
that you don't gain anything by keeping large and hard-to-read tables
-- but here it was an order of magnitude /slower/, even though no one is
fighting it for the cache.

Making the (b) table larger by keeping unsigned short instead of
unsigned char made it an additional 70% slower.

Note that I'm only interested in the "large lookup tables are bad"
idea I proposed earlier, not in fast popcounts with compiler
extensions. And to be honest, I'm not interested enough to figure out
why the wrong implementation is the slowest. Feel free to clone and
play with it.

Scott Lurndal

Nov 12, 2018, 4:41:57 PM
(a) inlined everything into main, including the constexpr vector
initialization.

(b) didn't inline the popcount function (which generated really ugly object
code using the legacy string instructions with a REP prefix), so you have
1e9 invocations of the popcount function to execute.

(b) is likely slower because of the local 'constexpr' in 'octet', which is
rebuilt on every invocation of popcount. Make it static instead and
you should get comparable performance.
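
A hypothetical illustration of that difference (made-up names, not the
actual contents of Jorgen's b.cc):

unsigned octet_local(unsigned char x)
{
    // A local constexpr table may be re-materialized on each call.
    constexpr unsigned char table[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    return table[x & 0x0f] + table[x >> 4];
}

unsigned octet_static(unsigned char x)
{
    // A static table is built once and lives in the binary's data section.
    static constexpr unsigned char table[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    return table[x & 0x0f] + table[x >> 4];
}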

your version:
$ time ./a
1962026240

real 0m4.22s
user 0m4.23s
sys 0m0.00s
$ time ./b
1962026240

real 0m54.10s
user 0m54.12s
sys 0m0.00s

with static in b.cc:

$ time ./b
1962026240

real 0m1.33s
user 0m1.33s
sys 0m0.01s

about 4x faster than (a).

Melzzzzz

Nov 12, 2018, 5:09:59 PM
Try it with clang.

Jorgen Grahn

Nov 12, 2018, 5:15:36 PM
Oh. I didn't expect the compiler (g++ 6.3 with -O3) to miss such
obvious opportunities for optimization -- I wonder why it did? I saw
an octet(unsigned) symbol in the object code of (b), but didn't
investigate.

> your version:
> $ time ./a
> 1962026240
>
> real 0m4.22s
> user 0m4.23s
> sys 0m0.00s
> $ time ./b
> 1962026240
>
> real 0m54.10s
> user 0m54.12s
> sys 0m0.00s
>
> with static in b.cc:
>
> $ time ./b
> 1962026240
>
> real 0m1.33s
> user 0m1.33s
> sys 0m0.01s
>
> about 4x faster than (a).

Hm, I would have expected 2x, or less of a difference.

Louis Krupp

Nov 12, 2018, 9:26:57 PM
On Fri, 9 Nov 2018 09:12:07 -0800 (PST), Paul <peps...@gmail.com>
wrote:
I just read Item 15, "Use constexpr unless someone on comp.lang.c++
gives you a good reason not to," in Scott Meyers's _Effective Modern
C++_, so here goes:

===
#include <bitset>
#include <iostream>

const int N = 256;

// I didn't invent this.
constexpr int ones(unsigned int n)
{
    auto t = 0;

    while (n) {
        t++;
        n &= (n - 1);   // clear the lowest set bit (Kernighan's trick)
    }

    return t;
}

class ones_array_obj {
public:
    constexpr ones_array_obj() noexcept;
    constexpr int operator[] (unsigned int n) const noexcept;
private:
    int ones_array[N];
};

inline constexpr ones_array_obj::ones_array_obj() noexcept :
    ones_array{0}
{
    for (int n = 0; n < N; n++) {
        ones_array[n] = ones(n);
    }
}

inline constexpr int ones_array_obj::operator[] (unsigned int n) const
    noexcept
{
    return ones_array[n];
}

constexpr ones_array_obj oa;

int main()
{
    for (unsigned int n = 0; n < N; n++) {
        std::cout << std::bitset<8>(n) << " " << oa[n] << "\n";
    }
}
===

g++ version 8.2.1 compiles this and (with no optimization options)
emits ones_array as precomputed data in the object file. (The loop in the
constexpr constructor needs C++14, which g++ has used by default since 6.1.)

Louis

Scott Lurndal

Nov 13, 2018, 8:46:59 AM
I didn't bother changing the constexpr to static in a.cc, which
may have made a small difference.

Melzzzzz

Nov 13, 2018, 11:51:34 AM
~/projects/popcount >>> clang -O2 -march=native a.cc -o a -lstdc++
~/projects/popcount >>> clang -O2 -march=native b.cc -o b -lstdc++
~/projects/popcount >>> time ./a
1962026240
./a 1.98s user 0.00s system 99% cpu 1.989 total
~/projects/popcount >>> time ./b
1962026240
./b 1.04s user 0.00s system 99% cpu 1.043 total

His version with clang...

Juha Nieminen

Nov 14, 2018, 3:11:09 AM
Scott Lurndal <sc...@slp53.sl.home> wrote:
> __builtin_popcount is a viable solution for 99% of the code that
> needs it - portable "pure" C/C++ code is a myth.

Not only are you making your code non-standard and non-portable, and tied
to pretty much one single compiler, but ironically that very compiler uses
a lookup table for its fallback implementation of that function. Which is
the whole point I was making.

Scott Lurndal

Nov 14, 2018, 8:46:24 AM
Juha Nieminen <nos...@thanks.invalid> writes:
>Scott Lurndal <sc...@slp53.sl.home> wrote:
>> __builtin_popcount is a viable solution for 99% of the code that
>> needs it - portable "pure" C/C++ code is a myth.
>
>Not only are you making your code non-standard and non-portable,

So what? My code relies heavily on POSIX interfaces anyway.

Really, I've never seen a useful "portable" "standard" C or C++
real-world application. That's not to say they don't exist, but
they're very rare.

Leaving aside classwork, of course.

Öö Tiib

Nov 14, 2018, 10:02:59 AM
Exactly. An ultimately portable program cannot exist in practice.
C and C++ just do not contain portable features to find out
implementation-specific details at compile time. So we have
to achieve portability with side-by-side platform-specific
code (like __popcnt, __builtin_popcount, or a fall-back to #error).

But in practice most of the most-used software (starting
with web browsers) is written in C or C++, and a lot of it runs
(again in practice) on the majority of the hardware that people
actually use.

David Brown

Nov 14, 2018, 10:22:16 AM
There is a distinction between portable code and portable programs. I
doubt if there are many fully portable C or C++ programs. But there is
a good deal of portable C and C++ code. It is not at all uncommon to
have most of the C implementation files and headers written in portable
code, with only a few specific parts written in an
implementation-specific or non-portable manner.

So you could have a header file that contained:

static inline int popcount(unsigned int x) {
    return __builtin_popcount(x);
}

That would be non-standard and non-portable.

The C file that used the code, #include'ing the header and calling the
function "popcount", would be portable.

So the program as a whole would be non-portable, but the C code that
called the popcount function would be portable.

Tim Rentsch

Nov 23, 2018, 5:33:01 PM
It isn't hard to write such a function, depending on the
particulars. One example:

#include <stdint.h>
typedef uint32_t U32;   /* assumed -- the post doesn't show these typedefs */
typedef uint64_t U64;

U64
count_array_simply_generic( U64 N, const U32 in[N] ){
    U64 total = 0;
    U64 i;
    for( i = 0; i < N; i++ ){
        U32 a = in[i];
        /* Sum adjacent bits, then 2-bit sums, then nibbles, then bytes. */
        U32 b = (a & 0x55555555) + ((a & 0xaaaaaaaa) >> 1);
        U32 c = (b & 0x33333333) + ((b & 0xcccccccc) >> 2);
        U32 d = c + (c>>4) & 0x0F0F0F0F;
        U32 e = d + (d>>8);
        total += e + (e>>16) & 255;
    }
    return total;
}

This method runs about 1.8 times the speed of a table lookup
method (using a similar loop over each 32-bit word, but with
shift/mask/lookup to get the count in each 8-bit byte). The
particulars: both compiled with -O3, with an 8-element array.
Larger arrays do better, and smaller arrays do worse. The table
lookup method is comparatively insensitive to optimization level;
the function above is very sensitive, never doing better than
table lookup for any size array at optimization levels -O2 or
less.

Analogous code for 64-bit array elements does better, at all
optimization levels, for arrays of 4 elements or more. A function
that is a bit more clever gives still better relative performance
and beats table lookup for arrays of 3 elements or more. But
that comes with some interesting caveats: sometimes it does
worse (than itself) at higher optimization levels, and for very
large arrays it starts losing to simpler methods (though it
still always beats table lookup for arrays of at least 3 elements).
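
A 64-bit analogue of the function above might look like this (my sketch
along the same lines, not Tim's measured code):

U64
count_array_64bit( U64 N, const U64 in[N] ){
    U64 total = 0;
    U64 i;
    for( i = 0; i < N; i++ ){
        U64 a = in[i];
        a = a - ((a >> 1) & 0x5555555555555555u);                         /* pair sums */
        a = (a & 0x3333333333333333u) + ((a >> 2) & 0x3333333333333333u); /* nibble sums */
        a = (a + (a >> 4)) & 0x0f0f0f0f0f0f0f0fu;                         /* byte sums */
        total += (a * 0x0101010101010101u) >> 56;   /* add the eight byte counts */
    }
    return total;
}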

Two additional comments: there can be a lot of variation from
run to run (another execution but no re-building), and in all
measurements the table lookup method had a warm cache so there
were no cache miss or page fault problems.

Given all the above, I am inclined to favor a method along the
lines of the function shown above, rather than using a table
lookup approach. I know there are cases where table lookup
might do better, but overall it seems like ROI favors ignoring
those possibilities.