C-Blosc2 regressions compared with C-Blosc

aso...@gmail.com

Apr 5, 2023, 7:25:12 PM

to blosc

I'm the author of the Rust Blosc crate [1], and I'm using it for what is
perhaps an uncommon use case. I'm writing a copy-on-write file system, and I
use Blosc for transparent compression. For user data, it's no better than
using LZ4 or ZSTD directly. But for metadata, it's so much better! Since the
metadata consists largely of arrays of fixed-size structures, the shuffle
filter is a big win. Compared to no shuffle filter, it can improve metadata
compression by up to 9x and speed by 3x. Plus, I prefer blosc's block-oriented
API to the stream-oriented APIs used by most compression libraries.
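
To make the shuffle idea concrete, here is a minimal sketch of a plain byte shuffle and its inverse. It is not Blosc's implementation (which, as far as I know, also has SIMD paths); the function names are made up for illustration. The point is that byte k of every element ends up in one contiguous run, so the mostly-zero high bytes of small integers in a fixed-size structure land next to each other where LZ4 or ZSTD can find them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Blosc: byte-shuffle nelems elements
 * of typesize bytes each. src and dst must not overlap and must both
 * hold typesize * nelems bytes. */
static void shuffle_bytes(const uint8_t *src, uint8_t *dst,
                          size_t typesize, size_t nelems)
{
    for (size_t k = 0; k < typesize; k++)        /* byte position in element */
        for (size_t i = 0; i < nelems; i++)      /* element index */
            dst[k * nelems + i] = src[i * typesize + k];
}

/* Inverse of shuffle_bytes: scatter each run back into its elements. */
static void unshuffle_bytes(const uint8_t *src, uint8_t *dst,
                            size_t typesize, size_t nelems)
{
    for (size_t k = 0; k < typesize; k++)
        for (size_t i = 0; i < nelems; i++)
            dst[i * typesize + k] = src[k * nelems + i];
}
```

For two 4-byte elements stored as 01 00 00 00 and 02 00 00 00, the shuffled buffer is 01 02 followed by six zero bytes.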

I was excited to try Blosc-2's new Delta filter, because I thought it might work
well for some of the metadata tables. So this week I got to work adapting the
Rust bindings to Blosc-2. Unfortunately I soon ran into some problems.
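
As an illustration of why a delta step looked attractive for metadata tables, here is a generic element-wise delta encoder for 64-bit values. This is only a sketch of the general technique with made-up function names; it is not necessarily what Blosc-2's Delta filter does internally. A column of slowly increasing values (block addresses, say) turns into a column of small, repetitive differences.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replace each value with its difference from the
 * previous one. The first value is stored relative to 0. Unsigned
 * subtraction wraps, so decreasing inputs still round-trip. */
static void delta_encode_u64(const uint64_t *src, uint64_t *dst, size_t n)
{
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] - prev;
        prev = src[i];
    }
}

/* Inverse of delta_encode_u64: a running sum restores the values. */
static void delta_decode_u64(const uint64_t *src, uint64_t *dst, size_t n)
{
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += src[i];
        dst[i] = prev;
    }
}
```

With the input 1000, 1008, 1016, 1024 the encoded output is 1000, 8, 8, 8, which shuffles and compresses far better than the original column.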

* Blosc-2 makes several getenv() calls every time it builds a context, even
when using the multithreaded API. That not only slows things down, but it
can also cause incorrect behavior. For example, typesize is fixed by the
layout of the data, so a user can't meaningfully override it, yet Blosc-2
lets an environment variable do just that. There's no way to turn off this
sensitivity to environment variables. In Blosc, the multithreaded API only
reads the BLOSC_WARN variable.

* Blosc-2 requires an allocation to generate a context. And the
blosc2_compress_ctx and blosc2_decompress_ctx functions need mutable access
to the context, which means that it can't be shared by multiple tasks or
threads. So basically every operation requires a new context, which means a
new allocation.

* blosc2_create_[cd]ctx takes its argument by value instead of by pointer. So
even the blosc2_[cd]params structure needs to be regenerated for every
operation.

* Every operation also mallocs a thread_context, even if it's single-threaded.
And that mallocs a tmp buffer for the thread context.
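
For contrast, here is a rough sketch of the call shape that would avoid these per-operation allocations for small blocks. Everything in it is hypothetical: the type and function names are invented and nothing like this exists in Blosc-2 today. Parameters are passed by const pointer, so one copy can be shared by every thread, and the scratch buffer is owned by the caller and reused across calls. The filter shown is just a byte shuffle standing in for the real work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-only parameters. No function below writes to them,
 * so a single instance can be shared between threads. */
typedef struct {
    size_t typesize;   /* bytes per element */
} filter_params;

/* Hypothetical caller-owned scratch space, one per thread or task,
 * set up once and reused for every operation. */
typedef struct {
    uint8_t *buf;
    size_t   cap;
} filter_scratch;

/* Shuffle src into the scratch buffer without allocating anything.
 * Returns the number of bytes written, or -1 if typesize is zero,
 * nbytes is not a multiple of typesize, or the scratch is too small. */
static long shuffle_into(const filter_params *p, filter_scratch *s,
                         const uint8_t *src, size_t nbytes)
{
    if (p->typesize == 0 || nbytes % p->typesize != 0 || nbytes > s->cap)
        return -1;
    size_t nelems = nbytes / p->typesize;
    for (size_t k = 0; k < p->typesize; k++)
        for (size_t i = 0; i < nelems; i++)
            s->buf[k * nelems + i] = src[i * p->typesize + k];
    return (long)nbytes;
}
```

A worker thread would create its filter_scratch once at startup and then call shuffle_into for every 4 kB block it handles, with no malloc and no getenv on the hot path.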

Maybe the getenv problem could be fixed. But the others would require API
changes, and they wouldn't be backwards compatible. Plus, I'm getting the
impression that Blosc's developers may just not be very interested in my
use case. I'm working with blocks from 4 kB to a few hundred kB, not multiple
GB. So unless somebody can disabuse me of that idea, I think I'll have to
leave Blosc behind, and instead implement my own shuffle filter on top of LZ4
and ZSTD. So long, and thanks for all the fish!

-Alan

[1] https://crates.io/crates/blosc

Francesc Alted

Apr 6, 2023, 12:15:00 PM

to bl...@googlegroups.com

Hi Alan,

Thank you for sharing your thoughts. It is true that Blosc2 has evolved to compress large datasets, but I agree that addressing the issues you've detected is important. The getenv issue should be easily fixable. Additionally, the context being mutable could be addressed by creating another API that is more friendly towards already multithreaded applications (as other people have suggested).

However, I understand that you may prefer a simpler solution that is easier to tweak. :-)

Francesc
