On 5/29/12 9:07 PM, Valentin Haenel wrote:
> Hi,
>
> * Francesc Alted <fal...@gmail.com> [2012-05-29]:
>> On 5/27/12 7:31 PM, Valentin Haenel wrote:
>>> Hi,
>>>
>>> for the last couple of weeks I have been working on a command-line
>>> interface to Blosc. Today, I am pre-releasing v0.1.0 as a release
>>> candidate:
>>>
>>> https://github.com/esc/bloscpack/tree/v0.1.0-rc1
>>>
>>> Maybe someone out there will find it useful. If you do try it out, I
>>> would be very, very happy to hear from you! Feedback is welcome!
>> Hey this looks great! I like your implementation. A couple of suggestions:
> Thanks! I am glad you like it!
>
>> - As you say in the README, bloscpack has to load the whole file into
>> memory to operate. This can actually defeat the usefulness of the
>> compressor in large-file scenarios. It would be fantastic if
>> bloscpack could compress/decompress in small blocks.
> You can try setting --nchunks or --chunk-size (see ./blpk c -h).
> Note that --chunk-size currently only works in bytes; allowing for a
> human-readable spec (MB, G, etc.) would be nice to have. If you use
> either option, the file will be read, compressed and written chunk-wise.
Yes, specifying the chunksize would help. Also using human readable
specs would be nice.
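
To make the human-readable idea concrete, here is a minimal sketch of how such a spec could be turned into a byte count. The helper name `parse_size` and the choice of binary multiples (1 MB = 2**20 bytes) are my assumptions for illustration, not something bloscpack provides today:

```python
import re

# Assumed suffix table: binary multiples (1M = 2**20 bytes).
_UNITS = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def parse_size(spec):
    """Convert a spec such as '8M', '128MB' or '1024' to a byte count.

    Hypothetical helper, shown only to illustrate the suggestion.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)B?\s*", spec.upper())
    if match is None:
        raise ValueError("cannot parse size: %r" % spec)
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

With something like this, `--chunk-size 8MB` and `--chunk-size 8388608` could be treated as the same request.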
>
> I just chose the maximum chunk size to be the maximum that Blosc
> can handle (2G). I thought this is maybe the best choice for an on-disk
> compressed buffer, since the larger the buffer, the better the
> compression (or did I miss something?)
Nope. Blosc always compresses small blocks, typically on the order of
the L1 cache size for low compression levels and of the L2 cache size
for higher ones. So even if you pass very large chunks of memory, Blosc
will split them into small blocks for compression.
I'd say that a default chunksize of 8 MB would be a far more sensible
figure that 1) allows for maximum compression, 2) consumes far less
memory and 3) allows compressing files up to 2**32 * 2**23 = 2**55
bytes = 32 PB, which is large enough for a default.
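
Just to illustrate what I mean by working chunk-wise, here is a rough sketch that keeps only one chunk in memory at a time. Note the assumptions: zlib from the standard library stands in for Blosc so the snippet is self-contained, and the 4-byte length prefix per chunk is a made-up framing, not the bloscpack file format:

```python
import zlib

CHUNK_SIZE = 8 * 2**20  # 8 MB

def compress_chunkwise(src, dst, chunk_size=CHUNK_SIZE):
    """Read src chunk by chunk, compress each chunk on its own and write
    it to dst behind a 4-byte length prefix. Returns the chunk count."""
    nchunks = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        compressed = zlib.compress(chunk)  # stand-in for Blosc
        dst.write(len(compressed).to_bytes(4, "little"))
        dst.write(compressed)
        nchunks += 1
    return nchunks

def decompress_chunkwise(src, dst):
    """Inverse of compress_chunkwise: read one framed chunk at a time."""
    while True:
        header = src.read(4)
        if not header:
            break
        size = int.from_bytes(header, "little")
        dst.write(zlib.decompress(src.read(size)))
```

Both functions take ordinary binary file objects, so peak memory use is roughly one chunk plus its compressed form, whatever the file size.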
> As for the compression level, it
> is 7 by default. I chose this because the synthetic benchmarks seemed to
> indicate that there isn't much to be gained by setting it higher and it
> might actually be faster. I.e. it gives a good compression ratio/speed
> trade-off.
>
>> - The default benchmark creates a very large file (1.5 GB), and that
>> can be too much for many machines out there. I'd suggest reducing
>> it to something more sensible (128 MB, 256 MB). In the future, when
>> block-wise compression is implemented, you could raise this value
>> again.
> OK. How about this:
>
> https://github.com/esc/bloscpack/commit/dc89b4c
> https://github.com/esc/bloscpack/commit/3f8ad42
Yes, much better. However, your benchmark updates to the README.md say
that compression is actually affected by using a block size as large as
128 MB (contrary to what I was saying above). Hmm, I need to think
about what is happening here.
BTW, the output for the benchmark on my Mac OSX box is:
create the test data
testfile is:
-n enlarge the testfile
-n .
-n .
-n .
-n .
-n .
-n .
-n .
-n .
-n .
done.
testfile is:
do compression with bloscpack, chunk-size: 128MB
real 18.43
user 9.79
sys 5.32
testfile.blp is:
do compression with gzip
real 185.21
user 176.09
sys 3.06
testfile.gz is:
Apparently the -n option is not supported by echo in /bin/sh. From the man page:
"""
Some shells may provide a builtin echo command which is similar or
identical to this utility. Most notably, the builtin echo in sh(1) does
not accept the -n option. Consult the builtin(1) manual page.
"""
Also, the `cut -f5 -d' '` does not seem to work as intended on Mac OSX.
>
> I changed the way the array is created too, to be a bit nicer on mem.
>
> Also, the memory problem is exacerbated by:
>
> https://github.com/FrancescAlted/python-blosc/blob/master/blosc/blosc_extension.c#L94
>
> So for a 2G file of non-compressible data, for example coming from a
> random number generator, we get a 2G input buffer, a 2G buffer for the
> compressed data and then another 2G for the result string. ;) I was
> actually able to bring my work machine with 6G to its knees! ;) I have
> perhaps a solution using bytearray, which you can pre-allocate and then
> shrink (and maybe also memoryview), but it's still in progress and
> perhaps the topic of a later discussion.
Yes, this is an ugly inefficiency in the Python wrapper, and I'd be glad
to discuss how to solve it.
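
For the sake of discussion, here is a rough sketch of the pre-allocate-and-shrink idea at the Python level. Everything here is an assumption for illustration: `compress_into` is a hypothetical stand-in for a C routine that writes straight into a caller-supplied buffer (this pure-Python version still copies once through zlib), and the size bound is the formula I believe zlib's compressBound uses:

```python
import zlib

def compress_into(src, dst):
    """Hypothetical stand-in for a C routine that writes compressed bytes
    directly into dst and returns the number of bytes written. This
    version copies via zlib; only the buffer handling is the point."""
    compressed = zlib.compress(src)
    dst[:len(compressed)] = compressed
    return len(compressed)

def compress_shrinking(data):
    """Compress data into a pre-allocated bytearray, then shrink it."""
    n = len(data)
    # Worst-case output size (assumed to mirror zlib's compressBound).
    bound = n + (n >> 12) + (n >> 14) + (n >> 25) + 13
    out = bytearray(bound)  # allocate the output buffer once
    # The view must be released before the bytearray can be resized.
    with memoryview(out) as view:
        written = compress_into(data, view)
    del out[written:]  # shrink in place instead of copying to a new string
    return out
```

The result stays a bytearray, so the extra copy into a separate result string is avoided; whether this is practical inside the extension is exactly what would need discussing.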
--
Francesc Alted