Changes to bitset.pxi

Skip to first unread message

Sep 16, 2020, 5:59:57 AM9/16/20
to sage-devel
Dear all, I want to redesign the bitset structure of combinatorial polyhedron and move it to `data_structures/bitset.pxi`. This includes some changes to bitset.pxi. Please comment, whether the proposed design changes on the ticket are ok. Mostly they are the following:

1. Define most of the functions in bitset.pxi for fuzed types.

2. Move functions that can be optimized by intrinsics to a seperate file.

3. (Not yet done, but also important.) Optimize some functions in bitset.pxi by intrinsics. This includes an overalignment condition. One disadvantage of overalignment is that it makes realloc much more complicated. Small bitsets also need more memory.

4. Indirect typecast of functions in `bitset.pxi` will no longer be possible. E.g. you cannot call `bitset_add` with signature `(bitset_t, PyObject)`, but `(bitset_t, int)` or `(bitset_t, size_t)` or similar are ok. See

Eventually, this should go into smaller tickets. But until then it would be good to know, if the general direction is acceptable or which part is not.

Vincent Delecroix

Sep 16, 2020, 8:32:28 AM9/16/20
Dear Jonathan,

0. It would be good to benefit from extended CPU instructions
for having faster bitsets.

1. As mentioned in the Cython documentation [1], pxi files are
not the advised way to deal with Cython code. inline functions
are perfectly usable when written in pxd headers. See for
example the header file sage/libs/linbox/conversion.pxd which
has no pyx implementation since everything is inlined.

2. Many people care about efficient bitsets and I doubt that
there is no finely tuned library around. I would rather try
to find an actively maintained open source project focused
on bitsets rather making it part of Sage internals.

3. Depending on the application you have in mind you might
want to distinguish between sparse and dense bitsets. Sparse
could be implemented as sets via has tables or binary trees,


Sep 16, 2020, 11:18:43 AM9/16/20
to sage-devel
Dear Vincent, thanks for your reply.

To sum up my reply: Should I try to replace bitsets completely by RaoringBitmap and see how it performs? (I need some help with the benchmarking, as I don't know good benchmarks for other use cases.)

to 1. Okay. I wasn't aware of that. I just copied the behaviour of bitset.pxi/pyx/pxd. I didn't know that this kind of usage is deprecated.

to 2. I was looking at roaring bitmaps for a while now
They claim it's a simple pip install. So it might be suitable as a standard package in sage.

It only supports values up to 2**32. But seriously, our current implementation requests 512 MB continous memory in this case. So this performs awful anyway.

I was reluctant to use it, because I thought, you cannot avoid allocating memory over and over.
Only after you and Travis insisted to give up our private implemtation, I looked up, if they have all the features I want.
In fact the only thing I need is this pull request for exposing a definition.
(And we can always include this definition ourselves from the header.)

With this to perform `dest = A & B`, I need to copy `A` first and to do the operation inplace then. This is a bit slower than optimal, but avoids allocation, if possible.
So after a couple of initial reallocs there shouldn't be any more memory allocations.
The optimized subset check is much more important to me, which appears to be at least as good as my implementation.

Depending on what you and other people think of, I can try to make use of RoaringBitMap or any other library that likely performs well. is still a good step towards this, as it makes it a lot easier to just replace `bitset_t` by something else.

I could try to create a branch that completely rips out bitset.pxi and replaces it by a different implementation (with same names as to not change anything else). Then many people could try to come up with benchmarks in different parts of sage.

E.g. this might make dense graphs much better in comparison.

to 3. I didn't try many things yet. For my use case the only thing I need (at the moment) is fast intersection and fast subset check. Dense bitsets perform pretty good with that as it is hard to beat 256 bits per CPU-step (whatever this means). I could perform a bit better however,
by keeping an array of the significant chunks. I wanted to try different things. This is the main reason for, because my design choice is awful, if you want to change one implemtation detail. With this ticket I can pretty much tell the face-structure that it consists of a roaring bitmap of atoms and coatoms and adapt a few basic functions to the name scheme of that (I also need to take care of freeing the memory, but hey, that's just writing a couple of __dealloc__ methods).

Vincent Delecroix

Sep 17, 2020, 6:48:52 AM9/17/20
About 2: the C library CRoaring seems a good match but does not
seem to contain any assembler or advanced CPU instructions. Does it?
It is hard to figure out how well is implemented and maintained the
Python interface PyRoaringBitMap.

Sep 17, 2020, 7:50:00 AM9/17/20
to sage-devel
CRoaring definitely uses advanced CPU-instructions. It might not be used everywhere possible (e.g. I don't see a distinction between AVX and AVX2, which means that in one of those cases  something is missing). I got a prompt reply regarding my pull request (that the pull request needs work). So PyRoaringBitmap is maintained, I guess.  I guess it would be useful, if there was only one python wrapper for croaring. I don't know how suitable it would be as a base class for our bitset class. First I would like to see, if CRoaring is suitable at all. The problem with compiling pyroaring for me is that it hides the cython stuff away.

At the moment I'm replacing bitsets by croaring to see how it behaves.

What is extremely annoying is that many parts touch the bitset data structure. They need some special method and instead of implemnting it in bitset.pxi, they rely on implementation details.

Sep 17, 2020, 4:45:12 PM9/17/20
to sage-devel
Ok. I don't think CRoaring serves well for my purpose. I'm getting multiples of my benchmarks (something like a factor of 10).

Things are ok for small instances, but it seems that the bookkeeping explodes. I thought about it this afternoon and I don't think anything optimized for memory would work for me. I'm constantly creating and throwing away bitsets, so it's not worth optimizing for that. If you want some extra bookkeeping, it must pay of it speed, not in memory.

I guess for data science people those bitsets have a different meaning than for me.

Sep 18, 2020, 5:30:34 AM9/18/20
to sage-devel
To sum up this "discussion": There are no general objections to the suggested changes. I should however move away from `pxi` files.

In particular it is fine to overalign bitsets at some point, which will make realloc less efficient (?)!

At some point it would be nice to use some external library for the bitsets. It is not clear yet, which to use for this and if there exists one that fits our needs (not to mention that our needs might differ.)
But at least I shouldn't burn bridges towards this. Note that some files heavily rely on implementation details by calling attributes directly. IMO one should be able to replace attributes in `bitset_s` by others without breaking any top-level usage.
If we don't find an external library that fits our needs, we could make one out of bitset.pxi.

IMO the current state of bitsets is fine. With little effort, we can use better CPU-instructions. The largest part of the effort is to merge my own code duplication.

Sep 22, 2020, 8:14:49 AM9/22/20
to sage-devel
Actually  roaring performs pretty well, I just used an awful test case. Roaring bitmaps has container size 64k, so it performs awful for anything far below that. My test case had size 768.

Roaring is interesting however as a bitset implementation for anything beyond 64k and seems to perform much better than my private implementation.

I still think the approach with the fuzed types is good. This allows us to have a common approach and the application could switch types without changing the code.

So I would propose having different bitset implementations that all look the same regarding the interface. Something like:

- bitset_t: As before but maybe still using intrinsics (interesting for bitsets with up to ~50k bits).
- bitset_sparse_t: Behaving better for sparse bitsets than bitset_t
- bitset_roaring_t: A type exposing the roaring stuff (interesting for bitsets from ~50k bits).

I don't know yet, if having bitset_sparse_t is really worth it. But there are lots of non-trivial calculations with bitsets of size far less than 50k

One could think of using a different standard library for other cases, but I didn't find anything optimized with CPU-specific instructions for this.

Am I the only one using larger bitsets?


Sébastien Labbé

Sep 22, 2020, 10:07:15 AM9/22/20
to sage-devel
Am I the only one using larger bitsets?

Doing the following from SAGE_ROOT:

git grep --name-only bitset

shows that modules in sage using bitset are mostly graphs but also matroid, groups, combinat, data_structures, but also crypto, coding, misc and quivers. I don't know if those usage are for larger bitsets or not.

Sep 23, 2020, 2:46:33 AM9/23/20
to sage-devel
Thanks. I still can only guess, if people care about large bitsets like this.

I can try to make it available with the same syntax and then however likes it, can just use it.

David Coudert

Sep 24, 2020, 3:39:55 AM9/24/20
to sage-devel
Most of the bitsets used in the graph backend are to record sets of seen or active vertices. These sets are not sparse.
But I think we have several places in the graph module where we use sets instead of bitsets for small number of vertices. Sparse bitsets could be useful in this case if effectively faster.
Reply all
Reply to author
0 new messages