Bitshuffle and BLOSCLZ

Luke Robison

Feb 21, 2020, 3:46:56 PM2/21/20
to blosc
All,

I have existing code that bit-packs unsigned integers into a uint8_t array.  In this way, if we only want to store 3-bit integers, we can pack 5 values into 2 bytes (with 1 bit left over).  While looking into blosc (and blosc2) I thought perhaps a compression algorithm with the bitshuffle filter might do this for me, so I gave it a try.
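
For reference, a minimal sketch of the kind of 3-bit packing I mean (a hypothetical helper, not part of any library):

#include <stdint.h>
#include <stddef.h>

/* Pack n 3-bit values (each in 0..7) into dst, LSB-first.
 * dst must hold at least (n * 3 + 7) / 8 bytes and be zeroed. */
static void pack3(const uint8_t *src, size_t n, uint8_t *dst) {
    for (size_t i = 0; i < n; i++) {
        size_t bit = i * 3;
        dst[bit / 8] |= (uint8_t)(src[i] << (bit % 8));
        if (bit % 8 > 5)  /* value straddles a byte boundary */
            dst[bit / 8 + 1] |= (uint8_t)(src[i] >> (8 - bit % 8));
    }
}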

I set up my context like so:
blosc2_cparams cparms = BLOSC2_CPARAMS_DEFAULTS;
cparms.compcode = BLOSC_LZ4;
cparms.clevel = 9;
memset(&cparms.filters, 0, sizeof(cparms.filters));
cparms.filters[0] = BLOSC_BITSHUFFLE;
cparms.typesize = sizeof(uint8_t);

and I filled my buffers with rand() % 8.
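
In full, my test looks roughly like this (a sketch assuming the current c-blosc2 context API; error handling omitted):

#include <blosc2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBYTES (1000 * 1000)
static uint8_t src[NBYTES];
static uint8_t dst[NBYTES + BLOSC2_MAX_OVERHEAD];

int main(void) {
    for (int i = 0; i < NBYTES; i++)
        src[i] = rand() % 8;  /* values 0..7: only 3 significant bits */

    blosc2_cparams cparms = BLOSC2_CPARAMS_DEFAULTS;  /* as above */
    cparms.compcode = BLOSC_LZ4;
    cparms.clevel = 9;
    memset(&cparms.filters, 0, sizeof(cparms.filters));
    cparms.filters[0] = BLOSC_BITSHUFFLE;
    cparms.typesize = sizeof(uint8_t);

    blosc2_context *cctx = blosc2_create_cctx(cparms);
    int csize = blosc2_compress_ctx(cctx, src, NBYTES, dst, sizeof(dst));
    printf("compressed %d -> %d bytes\n", NBYTES, csize);
    blosc2_free_ctx(cctx);
    return 0;
}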

As expected, I get output that is 3/8 the size of my input (yay!).

But when I switch my compcode to BLOSC_BLOSCLZ, my output is approximately the same size as my input.  The rest of the compcodes (_LZ4, _LZ4HC, _ZLIB, _ZSTD, _LIZARD) all achieve decent compression.

Is this an error in the BLOSC_BLOSCLZ compression code?

I'm using a version of blosc2 I compiled myself, downloaded from GitHub (~1 week old).  Hm, as I'm writing this I see that you, Francesc, committed some bitshuffle changes just a few hours ago.  I guess I need to update.

Luke

Francesc Alted

Feb 24, 2020, 3:34:47 AM2/24/20
to Blosc
Hi Luke,

Yes, the BloscLZ codec is mainly where I try to come up with good compression ratios and speed for highly redundant datasets (like sparse matrices, or series of numbers that can be predicted easily).  For data that is more random, LZ4 or Zstd are typically better bets.
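
A quick way to compare is to compress the same buffer with each codec in a loop.  A hypothetical sketch, reusing the cparms, src and dst from your snippet:

uint8_t codecs[] = {BLOSC_BLOSCLZ, BLOSC_LZ4, BLOSC_LZ4HC,
                    BLOSC_ZLIB, BLOSC_ZSTD};
for (size_t i = 0; i < sizeof(codecs); i++) {
    cparms.compcode = codecs[i];
    blosc2_context *cctx = blosc2_create_cctx(cparms);
    int csize = blosc2_compress_ctx(cctx, src, NBYTES, dst, sizeof(dst));
    printf("codec %d: %d -> %d bytes\n", codecs[i], NBYTES, csize);
    blosc2_free_ctx(cctx);
}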

And yes, you should check out my latest changes in c-blosc2, as bitshuffle could be more efficient in a number of scenarios.  Just a warning: c-blosc2 is still in beta, and the format could change before it receives the 'final' tag (probably 2.0.0-final or similar).

Cheers,
Francesc

--
Francesc Alted

Luke Robison

Feb 24, 2020, 12:21:09 PM2/24/20
to bl...@googlegroups.com
Francesc,

Thank you for your reply, and I appreciate the warning about the beta status.  I have observed BLOSCLZ to be really quite fast, so perhaps that shouldn't be a surprise to me.  I've been enjoying using this library, along with Caterva, to store multi-dimensional numeric data.  I wonder if you could answer a few follow-up questions too:

1) I had assumed bitshuffle on arrays with typesize=1 would take the first bit from 8 elements to fill the 1st shuffled byte, then the 2nd bit from the same 8 elements, and so on, working on blocks of 8*typesize bytes at a time.  When I tried doing this myself prior to blosc2, I found the resulting compression ratio was not as good as when I use blosc2 to bitshuffle.  Can you describe what bitshuffle is actually doing?  Does it operate on a whole page or chunk at once?  I didn't see it documented, and it seems important if I'm constructing a filter pipeline.  I'm also curious what happens when bitshuffle and byteshuffle are both enabled.

2) My real data is typically random-ish signed integers with a mean of 0 and fewer significant bits than the full bit depth.  This means the MSBs often switch between 1s and 0s as a result of typical two's-complement encoding, preventing any of the compression algorithms from achieving decent compression, even after a bitshuffle.  Looking at the filter pipeline options, I couldn't think of a way to make this data more attractive to the compressor.  If I convert my input data to a different signed representation (such as protobuf's zig-zag encoding, or a sign-magnitude encoding), then the compression is successful once again (since now many of the MSBs are the same bit).  I wonder if you would consider that out of scope for a type-unaware compression library, or whether, since blosc2 (and Caterva) seem to be targeted towards binary data, it might be acceptable to propose a new filter that accomplishes something like this.

Thank you,
Luke

Francesc Alted

Feb 24, 2020, 2:12:37 PM2/24/20
to Blosc
Hi Luke,

On Mon, Feb 24, 2020 at 6:21 PM Luke Robison <luker...@gmail.com> wrote:

1) I had assumed bitshuffle on arrays with typesize=1 would take the first bit from 8 elements to fill the 1st shuffled byte, then the 2nd bit from the same 8 elements, and so on, working on blocks of 8*typesize bytes at a time.  When I tried doing this myself prior to blosc2, I found the resulting compression ratio was not as good as when I use blosc2 to bitshuffle.  Can you describe what bitshuffle is actually doing?  Does it operate on a whole page or chunk at once?  I didn't see it documented, and it seems important if I'm constructing a filter pipeline.  I'm also curious what happens when bitshuffle and byteshuffle are both enabled.

bitshuffle has been implemented by Kiyo Masui and adapted to Blosc mainly by me.  I think your best bet is to check out the original paper by Kiyo at: https://arxiv.org/abs/1503.00638
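
For intuition only, a naive, unblocked version of the transform you describe for typesize == 1 (the real implementation gathers each bit plane across a much larger block, with SSE2/AVX2 kernels, which is part of why its output compresses differently from a per-8-bytes transpose):

#include <stdint.h>
#include <stddef.h>

/* Gather bit b of each of 8 consecutive input bytes into output
 * byte b of the same 8-byte block.  n must be a multiple of 8. */
static void bit_transpose8(const uint8_t *src, uint8_t *dst, size_t n) {
    for (size_t blk = 0; blk < n; blk += 8) {
        for (int b = 0; b < 8; b++) {
            uint8_t out = 0;
            for (int i = 0; i < 8; i++)
                out |= (uint8_t)(((src[blk + i] >> b) & 1u) << i);
            dst[blk + b] = out;
        }
    }
}

For rand() % 8 data, bit planes 3..7 are all zero, so 5 of every 8 output bytes are zero, which is what the codec then exploits.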

Regarding mixing bit and byte shuffle, I am curious too, so please send your feedback in case you try it.

 

2) My real data is typically random-ish signed integers with a mean of 0 and fewer significant bits than the full bit depth.  This means the MSBs often switch between 1s and 0s as a result of typical two's-complement encoding, preventing any of the compression algorithms from achieving decent compression, even after a bitshuffle.  Looking at the filter pipeline options, I couldn't think of a way to make this data more attractive to the compressor.  If I convert my input data to a different signed representation (such as protobuf's zig-zag encoding, or a sign-magnitude encoding), then the compression is successful once again (since now many of the MSBs are the same bit).  I wonder if you would consider that out of scope for a type-unaware compression library, or whether, since blosc2 (and Caterva) seem to be targeted towards binary data, it might be acceptable to propose a new filter that accomplishes something like this.

Blosc is type-unaware, yes, but it still offers the possibility to specify the size of the type (typesize) as a parameter.  So what you are proposing is just another filter, and perfectly possible inside Blosc indeed.
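
For instance, the zigzag transform you mention is tiny.  A hypothetical sketch for int32 data (not an existing Blosc filter):

#include <stdint.h>
#include <stddef.h>

/* Protobuf-style zigzag: maps 0,-1,1,-2,2,... to 0,1,2,3,4,...,
 * so small-magnitude values get mostly-zero MSBs.  Assumes
 * arithmetic right shift for signed ints (true on mainstream
 * compilers). */
static void zigzag_encode32(const int32_t *src, uint32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = ((uint32_t)src[i] << 1) ^ (uint32_t)(src[i] >> 31);
}

static void zigzag_decode32(const uint32_t *src, int32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (int32_t)(src[i] >> 1) ^ -(int32_t)(src[i] & 1);
}

In a pipeline it would presumably sit before bitshuffle, so that the now-constant MSB planes compress away.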

Cheers,
Francesc

 