Efficient search in std::bitset

jono...@googlemail.com

unread,

Jan 23, 2015, 5:24:28 PM1/23/15

to

Hi there,

how would you implement an efficient search for the first bit conforming to a particular bit_state (true or false) in a std::bitset<N> ?

Here's my desired prototype:

template<size_t N>
size_t find(bitset<N> bs, size_t idx, bool state)

// Search for index i such that i >= idx and bs[i] == state

/* if no such index is found, return i = Bitset::npos

namespace Bitset {
size_t npos = numeric_limits<size_t>::max();
};
*/

jono...@googlemail.com

unread,

Jan 23, 2015, 5:32:33 PM1/23/15

to

Here's my attempt:

//////////////////////////////////////
#include <bitset>
#include <stdexcept>
#include <limits>
#include <climits>

using namespace std;

namespace Bitset {
size_t npos = numeric_limits<size_t>::max();
};

unsigned find_in_ulong(unsigned long lon, bool state)
/* only call this function if lon does definately have a bit that is set to state */
{
constexpr unsigned char mask_byte_u = -1;

const unsigned long fail_byte(state
? 0U /* none set */
: mask_byte_u /* all set */);

unsigned shift = 0;
do {
if ((lon & mask_byte_u) != fail_byte) {
for (unsigned i = 0; i < CHAR_BIT; ++i) {
if ((lon & 1) == state)
return shift + i;
lon >>= 1U;
}
// should never reach this line
}
lon >>= CHAR_BIT;
shift += CHAR_BIT;
} while (lon);
throw runtime_error("state not found in find_in_ulong");

}

template<size_t N>
size_t find(bitset<N> bs, size_t idx, bool state)

{
if (idx >= N)
return Bitset::npos; // not found

bs >>= idx;

constexpr unsigned num_mask_bits = CHAR_BIT * sizeof(unsigned long);
constexpr unsigned long mask_long_u = -1UL;

const bitset<N> mask_long(mask_long_u);

const unsigned long fail_long(state
? 0U /* none set */
: mask_long_u /* all set */);
while (idx < N) {
unsigned long lon = (bs & mask_long).to_ulong();
if (lon != fail_long)
return idx + find_in_ulong(lon, state);
bs >>= num_mask_bits;
idx += num_mask_bits;
}
return Bitset::npos; // not found
}

//////////////////////////////////////

I reckon that the performance is fairly decent in terms of speed.

But I depend on a library implementation of std::bitset which has a clever and lightning-fast to_ulong() function. This is only possible if the design and implementation of std::bitset has a bool flag, which shows if high-bits outside of the unsigned long range are set (and the exception overflow_error is hence called); and if not shoots out that ulong as fast as possible. In other words: to_ulong() should not waste any processor cycles with checking bits!

jono...@googlemail.com

unread,

Jan 23, 2015, 5:40:04 PM1/23/15

to

A possible speed bottleneck, is the masking of the following: (bs & mask_long)

Why does std::bitset not have a way of chopping off unneeded higher bits, and just returning a u_long of the lower bits (without considering any higher bits that are set)? A function like: chop_ulong()

That would allow one to leverage more performance, wouldn't it ?

Pavel

unread,

Jan 24, 2015, 12:43:00 AM1/24/15

to

The performance of such an operation is too dependent on predominant usage and
data distribution to have a generic solution. To illustrated the point about
different data distributions (for simplicity, the below assumes the "particular
bit state" your are searching for is true and you will see that often the best
representation of the bit set is not std::bitset at all):

- the probability true and false is 0.5 at every position. The best algorithm is
probably the naive iteration (bitset representation should do)
- the first 'true' can appear at any position with equal probability; the
probability of true and false in any subsequent position is 0.5. You are
probably best off with never storing the useless beginning, as you hinted to
above. You could use some deque<bitset<sizeof(unsigned long) * CHAR_BIT> > and
store the initial position of 1 (which must be in the first bitset) with it.
- there are relatively very few "trues" and all you need is to iterate over
their indices in the bitset in order. You could even use std::set<size_t> that
stores indices of trues instead of the bitset. If you do not need to toggle bits
after creating bitset, you could also a sorted vector of indices of "true" in
this case.
- if your bitset fits in a single value of an integral type, but trues are
relatively rare there, there are a few binary-log algorithms you could use, some
of order of O(log2(sizeof(your-integral-type) * CHAR_BIT), see for example
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious. Or, you
could use a single platform-dependent instruction, e.g. for Intel:
http://web.itu.edu.tr/kesgin/mul06/intel/instr/bsf.html

etc.

On a side note, bitset is rarely good for efficiency as you do not have access
to the underlying memory; usually you are better off with your own
representation tailored for your needs.

std::vector<bool>, even though it has non-compile-time size, has IMHO a better
theoretical potential for out-of-the box optimization in the Library as the
Library could (I think) specialize algorithm such as std::find for its
const_iterator type using platform-specific tricks in implementation including
those mentioned above and your requested find (reformulated for vector<bool>)
can be implemented in a straightforward manner over std::find. I am not sure if
any implementation provides so optimized std::find() though.

Hope this will help,
-Pavel

Paavo Helde

unread,

Jan 24, 2015, 2:02:33 AM1/24/15

to

jono...@googlemail.com wrote in
news:565385f3-c97e-41c2...@googlegroups.com:

> On Friday, January 23, 2015 at 11:32:33 PM UTC+1,
> jono...@googlemail.com wrote:
>> On Friday, January 23, 2015 at 11:24:28 PM UTC+1,
>> jono...@googlemail.com
> wrote:
>> > Hi there,
>> >
>> > how would you implement an efficient search for the first bit
>> > conformin
> g to a particular bit_state (true or false) in a std::bitset<N> ?

The straightforward solution for this is to use a simple loop and
std::bitset::test(). The implementation of std::bitset is all inline so
the result could actually be pretty fast with a good optimizing compiler.
In the following I presume that you have profiled this and determined the
speed is not sufficient. If you are not done this, you should do it now.

A side note: if you have somehow obtained raw bits as an integer value
then if you need extreme speed you might consider to use the relevant CPU
instruction directly (exposed e.g. by gcc as __builtin_clz()).

>> But I depend on a library implementation of std::bitset which has a
>> cleve
> r and lightning-fast to_ulong() function. This is only possible if the
> design and implementation of std::bitset has a bool flag, which shows
> if high-bits outside of the unsigned long range are set (and the
> exception overflow_error is hence called); and if not shoots out that
> ulong as fast as possible. In other words: to_ulong() should not waste
> any processor cycles with checking bits!

This assumption probably does not hold. Maintaining such a bool flag
would mean extra unneeded overhead for all programs which do not need
this functionality.

>
> A possible speed bottleneck, is the masking of the following: (bs &
> mask_long)
>
> Why does std::bitset not have a way of chopping off unneeded higher
> bits, and just returning a u_long of the lower bits (without
> considering any higher bits that are set)? A function like:
> chop_ulong()
>
> That would allow one to leverage more performance, wouldn't it ?

And it would lose information. If you are certain that all bits fit in an
unsigned long long, why not just use std::bitset<N> with small enough N?
N is a compile-time constant, so the optimizer would probably optimize
to_ullong() to a single instruction in this case.

If I were to implement this as a general library function and the general
solution appears to be too slow, I would create a template function which
branches to two implementations, one a fast one for small N-s using
to_ullong(), and a general one using a loop and test() for larger N-s.
Then it would be up to the caller to use small enough N if it wants the
best speed.

Cheers
Paavo

jono...@googlemail.com

unread,

Jan 24, 2015, 5:22:02 AM1/24/15

to

For a flexible top-speed implementation, bitset would need a function get_ulong(i) which returns the i'th ulong of the inner bitset representation.

Why? Because then I can extract whatever part of the bitset I need, without messing with shifts.

e.g.
bitset<CHAR_BIT * sizeof(unsigned long) * m>
would return the appropriate ulong for get_ulong(0), get_ulong(1), ..., get_ulong(m-1); but throw an exception when calling get_ulong(idx) if idx >= m

In general for bitset<N>, I can then call get_ulong() for indices 0 to (N-1)/CHAR_BIT/sizeof(unsigned long)

jono...@googlemail.com

unread,

Jan 24, 2015, 5:26:54 AM1/24/15

to

So the standard should enlarge the bitset class for a mechanism like that! ;)

Oh for those people who like to copy-and-paste code:
WORD OF WARNING ABOUT THE CODE IN THE 2ND POST:
It will fail (i.e. return wrong result), if searching for bit-state false, if N is not a multiple of CHAR_BIT*sizeof(unsigned long).
So use that technique only when searching for a bit-state true.

jono...@googlemail.com

unread,

Jan 24, 2015, 1:04:31 PM1/24/15

to

Here's some highly illegal code, which is BLAZING FAST - this is the kind of stuff I want the C++ standard std::bitset to do for me... (because it's complex and should be kept firmly out of sight).

It even works perfectly on linux using gcc.

It is highly illegal because it uses this:

bitset<33> y(7);
cout << *reinterpret_cast<const ulong *>(&y) << '\n';

(So the libstdc++ implementation of bitset has the the internal bit representation at the right spot! Nice. https://github.com/gcc-mirror/gcc/blob/d353bf189d2bbaf4059f402ee4d2a5ea074c349f/libstdc%2B%2B-v3/include/std/bitset#L76 )

////////////////////////////////////////////////////////

#include <bitset>
#include <stdexcept>
#include <limits>
#include <climits>

#include <iostream>
#include <cassert>

#include <climits>

using namespace std;

namespace Bitset {
size_t npos = numeric_limits<size_t>::max();
};

/// some lookup tables

#if (ULONG_MAX == 4294967295UL)

// unsigned long has 32 bits
static_assert(sizeof(unsigned long) == 4, "unsigned long must have 4 bytes");

static const int MultiplyDeBruijnBitPosition[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

#else
#if (ULONG_MAX == 18446744073709551615UL)

// unsigned long has 64 bits
static_assert(sizeof(unsigned long) == 8, "unsigned long must have 8 bytes");

static const int MultiplyDeBruijnBitPosition[64] =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63
};

#else

static_assert(0, "strange machine architecture");

#endif
#endif

int find_lowest_idx_set_in_ulong(unsigned long val)
// only call this function if val does definately have a bit that is set
{
#if (ULONG_MAX == 4294967295UL)

// unsigned long has 32 bits

// Reference: http://stackoverflow.com/a/757266 ( http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup )
return MultiplyDeBruijnBitPosition[(((val & -val) * 0x077CB531UL)) >> 27];

#else
#if (ULONG_MAX == 18446744073709551615UL)

// unsigned long has 64 bits

// Reference: http://stackoverflow.com/a/24146182 ( http://chessprogramming.wikispaces.com/BitScan#KimWalisch )
return MultiplyDeBruijnBitPosition[(((val ^ (val-1)) * 0x03F79D71B4CB0A89UL)) >> 58];

#endif
#endif
}

template<size_t N>
size_t find(const bitset<N>& bs, size_t idx)

{
if (idx >= N)
return Bitset::npos; // not found

using ulong = unsigned long;
constexpr int num_bits_ulong = CHAR_BIT * sizeof(ulong);

size_t idx_quot = idx / num_bits_ulong; // index into ulong array
const size_t idx_rem = idx - idx_quot * num_bits_ulong;

constexpr size_t N_1 = N-1;
const size_t lim_quot = (N_1) / num_bits_ulong;
const size_t lim_rem = (N_1) - lim_quot * num_bits_ulong;

assert((reinterpret_cast<ulong>(reinterpret_cast<const ulong *>(&bs)) % sizeof(ulong)) == 0); /* check alignment */
const ulong *mem = reinterpret_cast<const ulong *>(&bs) + idx_quot;

ulong val = *mem & (-1UL << idx_rem);

if (idx_quot < lim_quot) {
if (val) {
return idx_quot * num_bits_ulong + find_lowest_idx_set_in_ulong(val);
}

while (++idx_quot < lim_quot) {
if (mem[idx_quot])
return idx_quot * num_bits_ulong + find_lowest_idx_set_in_ulong(mem[idx_quot]);
}

val = mem[idx_quot];
}

val &= (-1UL >> (num_bits_ulong - lim_rem - 1));
if (val) {
return idx_quot * num_bits_ulong + find_lowest_idx_set_in_ulong(val);

}

return Bitset::npos; // not found
}

int main()
{
bitset<33> y(7);
cout << *reinterpret_cast<const ulong *>(&y) << '\n';

bitset<1000> x;
x.set(0); /* this is the low bit we are looking for -- it gets shifted along in the for loop below */
x.set(7); /* (we're not looking for this bit) */
for (; x.any(); x <<= 1)
cout << find(x, 0) << ' ';
cout << '\n';

return 0;
}

////////////////////////////////////////////////////////

jono...@googlemail.com

unread,

Jan 24, 2015, 1:31:21 PM1/24/15

to

Slight error in the code above if find(bs, idx) is called with idx > 0.

Here's a fixup:

Explanation of the code:

template<size_t N>
size_t find(const bitset<N>& bs, size_t idx)

The code searches for the first set-bit (true), starting from the bit with index idx. Usually idx will be 0, to simply search from the rightmost lsb bit, until the first set-bit is found. Returns Bitset::npos if no set-bit was found

const ulong *mem = reinterpret_cast<const ulong *>(&bs);

ulong val = mem[idx_quot] & (-1UL << idx_rem);

if (idx_quot < lim_quot) {
if (val) {
return idx_quot * num_bits_ulong + find_lowest_idx_set_in_ulong(val);
}

while (++idx_quot < lim_quot) {
if (mem[idx_quot])
return idx_quot * num_bits_ulong + find_lowest_idx_set_in_ulong(mem[idx_quot]);
}

val = mem[idx_quot];
}

val &= (-1UL >> (num_bits_ulong - lim_rem - 1));
if (val) {
return idx_quot * num_bits_ulong + find_lowest_idx_set_in_ulong(val);
}

return Bitset::npos; // not found
}

int main()
{

bitset<1000> x;
x.set(0); /* this is the low bit we are looking for -- it gets shifted along in the for loop below */
x.set(7); /* (we're not looking for this bit) */
for (; x.any(); x <<= 1)
cout << find(x, 0) << ' ';

cout << "\n\n";

x.set(900);
x.set(907);

for (; x.any(); x <<= 1)

cout << find(x, 500) << ' ';
cout << '\n';

return 0;
}
////////////////////////////////////////////////

Öö Tiib

unread,

Jan 24, 2015, 3:12:37 PM1/24/15

to

On Saturday, 24 January 2015 20:31:21 UTC+2, jono...@googlemail.com wrote:
> On Saturday, January 24, 2015 at 7:04:31 PM UTC+1, jono...@googlemail.com wrote:
> > Here's some highly illegal code, which is BLAZING FAST - this is the
> > kind of stuff I want the C++ standard std::bitset to do for me...
> > (because it's complex and should be kept firmly out of sight).
> >
> > It even works perfectly on linux using gcc.

<snip>

> Slight error in the code above if find(bs, idx) is called with idx > 0.
>
> Here's a fixup:
>
> Explanation of the code:
>
> template<size_t N>
> size_t find(const bitset<N>& bs, size_t idx)
>
> The code searches for the first set-bit (true), starting from the bit
> with index idx. Usually idx will be 0, to simply search from the
> rightmost lsb bit, until the first set-bit is found. Returns
> Bitset::npos if no set-bit was found

<snip>

You write algorithm that does a work of single assembly
instruction on lot of processors. It is often supported
by tools and libraries. Read

http://en.wikipedia.org/wiki/Find_first_set

It does not still hurt to be inventive but always scan the manuals
first before inventing functions that feel mundane.
Even some (strangely less popular) answers to that stackoverflow
question did point it out.

jacob navia

unread,

Jan 24, 2015, 4:35:55 PM1/24/15

to

That is fairly trivial. A much more complex problem is to efficiently
search for a BIT PATTERN in a bit-string.

When writing the C containers library (CCL) I researched that problem
and to my surprise is absolutely not trivial.

The problem is:

Given: A vector of N bits
A search pattern of n bits where n < N.

Find the first (if any) occurrence of the pattern in the bit vector.

jono...@googlemail.com

unread,

Jan 24, 2015, 5:20:49 PM1/24/15

to

On Saturday, January 24, 2015 at 10:35:55 PM UTC+1, jacob navia wrote:

Yip, this sound difficult indeed.

One thing to try here, perhaps, it to realize that a search for a bit pattern, is similar to a search for a substring.
Maby some of the literature about "efficiently find substring" (Boyer-Moore string search algorithm, etc.) is adaptable ???

jono...@googlemail.com

unread,

Jan 24, 2015, 5:42:21 PM1/24/15

to

On Saturday, January 24, 2015 at 10:35:55 PM UTC+1, jacob navia wrote:

maby ...

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.8175&rep=rep1&type=pdf
http://www.busillis.com/o_kulekci/publications/mcs10.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5576123
http://security.riit.tsinghua.edu.cn/share/pattern.html

etc.
https://www.google.com/search?q=fast+search+for+bitpatterns&ie=utf-8
("bit-parallel", "search", "matching", "bitpattern", "subpattern", etc.)

asetof...@gmail.com

unread,

Jan 25, 2015, 3:19:47 AM1/25/15

to

Find a path bit of 8 bit in
an array of N bit
Example: N == 1000 bit m=1000/32
and n==8 bit

Return the index as u32
of array of bits a[]
U32 a[m], s=0, limit=m*4;
U8 b, *g=a
La:
for(i=0; i<limit; ++i)
if((g[i]&b==b)) return i*8+s;
if(s==7) return -1; // not find
ShiftDivision(a,1); ++s
goto La

The same for 16 or 32 or 64 bit
if the bit are for example 3 bit
111 one search the char
00000111

asetof...@gmail.com

unread,

Jan 25, 2015, 3:30:25 AM1/25/15

to

The above would find one occurrence
if exist
or modify a little all occurrences
Not the first occurrence...

asetof...@gmail.com

unread,

Jan 25, 2015, 5:24:04 AM1/25/15

to

The above code perhaps would

find one
occurrence if exist
or modify a little all occurrences

[as result one array of u32 index]

Not the first occurrence only
efficiently...

Ciao

Tobias Müller

unread,

Jan 26, 2015, 1:46:31 AM1/26/15

to

Paavo Helde <myfir...@osa.pri.ee> wrote:
> jono...@googlemail.com wrote in
> news:565385f3-c97e-41c2...@googlegroups.com:
>
>> On Friday, January 23, 2015 at 11:32:33 PM UTC+1,
>> jono...@googlemail.com wrote:
>>> On Friday, January 23, 2015 at 11:24:28 PM UTC+1,
>>> jono...@googlemail.com
>> wrote:
>>>> Hi there,
>>>>
>>>> how would you implement an efficient search for the first bit
>>>> conformin
>> g to a particular bit_state (true or false) in a std::bitset<N> ?
>
> The straightforward solution for this is to use a simple loop and
> std::bitset::test(). The implementation of std::bitset is all inline so
> the result could actually be pretty fast with a good optimizing compiler.

IME using operator[] is faster than test() because it omits bounds
checking.
It seems that the proxy object of operator[] can be optimized away easier
than bounds checking.

Tobi

Paavo Helde

unread,

Jan 26, 2015, 3:14:23 AM1/26/15

to

=?UTF-8?Q?Tobias=20M=C3=BCller?= <tro...@bluewin.ch> wrote in
news:1647449147443947321.18...@news.eternal-september.
org:

>
> IME using operator[] is faster than test() because it omits bounds
> checking.
> It seems that the proxy object of operator[] can be optimized away
> easier than bounds checking.

Yes, you are right, I somehow mixed them up.

Cheers
Paavo

Scott Lurndal

unread,

Jan 26, 2015, 10:10:13 AM1/26/15

to

jono...@googlemail.com writes:
>Hi there,
>
>how would you implement an efficient search for the first bit conforming to a particular bit_state (true or false) in a std::bitset<N> ?
>

In my projects that use gcc, I typically use __builtin_ffs or
__builtin_ctz depending on the sense of the search.

scott