My use case involves lots of bit field expressions. The primary ingredient of bit fields is the expression (a>>b)&c.
Unfortunately, numexpr doesn't support bit field use cases. As the name suggests, the assumption is that expressions are about numbers, not bits. This surprises me, because I would expect plenty of use cases for bit field crunching.
To be more specific, there is no datatype that supports both >> and & in numexpr. You need both of those to do anything with bit fields.
There has been some old discussion about the absence of support for unsigned int, which is what one would naturally choose for bit field tasks. That would be nice, but it's beating a 10-year-old dead horse: it's not going to happen, and it wouldn't solve the problem anyway. The argument against implementing unsigned int is that it is technically challenging to implement all the operators with all combinations of types such that everything makes sense mathematically. Fortunately, signed int would be adequate for bit field processing if it supported bitwise &, which it currently doesn't. Two's complement is not a problem for bit fields: the only caveat is that >> fills in the left with the sign bit instead of 0, but the & erases that difference as long as the mask only covers bits that were present before the shift.
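To illustrate the sign-bit point, here is a minimal NumPy sketch (not numexpr) that extracts the same field from identical bit patterns viewed as signed and unsigned. The helper name `extract_field` and the sample values are made up for illustration.

```python
import numpy as np

def extract_field(words, shift, mask):
    """Extract a bit field: (words >> shift) & mask."""
    return (words >> shift) & mask

# The same 32-bit patterns, viewed as unsigned and as signed.
unsigned = np.array([0xF0000000, 0x8000ABCD, 0x7FFFFFFF, 0x00000001],
                    dtype=np.uint32)
signed = unsigned.view(np.int32)

# Top 4 bits: the mask fits within the 32 - 28 = 4 bits left after the shift.
shift, mask = 28, 0xF

raw_signed_shift = signed >> shift           # sign bit is smeared into the top
from_signed = extract_field(signed, shift, mask)
from_unsigned = extract_field(unsigned, shift, mask)
```

The raw signed shift produces negative values for the first two words, but after masking both views give the same field values. This only holds when the mask is no wider than the bits remaining after the shift.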
The easiest solution to all this would be to implement & for signed integers. I can't imagine any conflict with other use cases, since & doesn't have many mathematical use cases. I'm just talking about adding and_iii and and_lll. We don't have to deal with new users being surprised by unexpected results from bitwise & between 64-bit floating point and 32-bit int. We can safely assume that anyone performing bitwise & is working with bits. Currently, & is only supported for bytes (and_bbb), but there is no >> for bytes (rshift_bbb), and in any case I would not want to process one byte at a time.
There's a secondary problem in my use case: numexpr doesn't support conditional expressions such as "a if b else c if d else e". But I'll cross that bridge when I get to it.
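For reference, a chained conditional like this can be written element-wise as nested selects. The sketch below uses NumPy's `np.where`; as far as I know numexpr offers a similar `where(cond, x, y)` function, but I haven't checked whether it covers my case. The function name and sample arrays are made up for illustration.

```python
import numpy as np

def nested_conditional(a, b, c, d, e):
    """Element-wise equivalent of `a if b else (c if d else e)`."""
    return np.where(b, a, np.where(d, c, e))

a = np.array([1, 1, 1, 1])
c = np.array([2, 2, 2, 2])
e = np.array([3, 3, 3, 3])
b = np.array([True, True, False, False])
d = np.array([True, False, True, False])

result = nested_conditional(a, b, c, d, e)
```

Unlike a Python conditional expression, every branch is evaluated for every element, so this is only safe when all the branches can be computed for all rows.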
Until then, this means that I have to fall back on 'python' vm to evaluate my expressions. This works fine, but it's sad to miss out on the performance boost.
So I'd like comments on alternative solutions to the general problem of implementing bit field operations, i.e. "(a>>b)&c", using a column store.
Options I've considered:
- Add support for and_iii, and_lll to numexpr
  - Any downside to this?
  - It doesn't seem to open the can of worms that adding unsigned int would.
- Use numba to implement the inner loop, to be invoked per block
  - I like numba because it does exactly what I want it to do with my expressions
  - I can't imagine anything faster than compiled code running on each block
  - Unfortunately iterblocks() seems to have a lot of overhead that completely kills any performance gains
  - Also, iterblocks loads all columns. Is there a way to pull just the columns I need?
  - So I probably need something lower level than iterblocks to access the columns.
- Find a different column store
- Or maybe just settle for uncompressed flat binary data and roll my own solution with numba and numpy....
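
To make the last option concrete, here is a rough sketch of what I have in mind, using only NumPy. It assumes one flat binary file per column, so only the columns an expression needs are ever touched. The file name, block length, and the `bitfield_blocks` helper are all hypothetical, and the inner expression is plain NumPy where a numba-compiled kernel could go instead.

```python
import os
import tempfile
import numpy as np

def bitfield_blocks(column, shift, mask, block_len=1 << 16):
    """Yield (column >> shift) & mask one block at a time."""
    for start in range(0, len(column), block_len):
        block = column[start:start + block_len]
        # A numba-compiled kernel could replace this expression.
        yield (block >> shift) & mask

# Hypothetical flat file holding a single int32 column.
path = os.path.join(tempfile.mkdtemp(), "col_a.bin")
np.arange(200_000, dtype=np.int32).tofile(path)

# Memory-map the column so blocks are read lazily from disk.
column = np.memmap(path, dtype=np.int32, mode="r")
total = sum(int(block.sum())
            for block in bitfield_blocks(column, shift=4, mask=0xFF))
```

This gives up compression entirely, which is the main thing I'd lose compared with a real column store.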