compress/gzip concurrency


Vasiliy Tolstov

Oct 14, 2014, 7:46:11 AM
to golang-nuts
Hi. I need to compress a large amount of data with gzip (about 2-5 GB each
time), and as far as I can see in top, only one CPU is used. Can
compress/gzip use more than one CPU for compression?

--
Vasiliy Tolstov,
e-mail: v.to...@selfip.ru
jabber: va...@selfip.ru

Tamás Gulácsi

Oct 14, 2014, 9:27:12 AM
to golan...@googlegroups.com
Nope. But as gzip understands multiple concatenated streams, you may pipeline the compression: compress and send the first half, compress and buffer the second half, then send it once the first half has been sent, and so on.
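
For illustration (a sketch, not code from the thread): split the input into chunks, compress each chunk into its own gzip member concurrently, and write the members out in order. Standard gzip readers - including Go's compress/gzip - treat concatenated members as a single stream. The chunk size, file name and all-in-memory split below are arbitrary choices.

package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"log"
	"os"
	"sync"
)

// compressConcat splits data into chunkSize pieces, compresses each piece
// as an independent gzip member concurrently, then writes the members to w
// in order. gzip treats concatenated members as one stream, so the result
// decompresses back to the original data with any standard gzip reader.
func compressConcat(w io.Writer, data []byte, chunkSize int) error {
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}

	bufs := make([]bytes.Buffer, len(chunks))
	errs := make([]error, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c []byte) {
			defer wg.Done()
			gw := gzip.NewWriter(&bufs[i])
			if _, err := gw.Write(c); err != nil {
				errs[i] = err
				return
			}
			errs[i] = gw.Close()
		}(i, c)
	}
	wg.Wait()

	for i := range bufs {
		if errs[i] != nil {
			return errs[i]
		}
		if _, err := w.Write(bufs[i].Bytes()); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	data := bytes.Repeat([]byte("fairly compressible data\n"), 200000) // ~5 MB of sample input
	f, err := os.Create("out.gz") // hypothetical output path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := compressConcat(f, data, 1<<20); err != nil { // 1 MB chunks
		log.Fatal(err)
	}
}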

Klaus Post

Oct 14, 2014, 10:40:46 AM
to golan...@googlegroups.com
Deflate (which is used as the compression method in gzip) is not easy to parallelize efficiently.

However, there is a way to do it: chop the input into blocks and insert a "sync flush" between them. I looked a bit at the gzip implementation, and it may be doable with a few modifications. The main sacrifice will be compression efficiency, but with, say, 1 MB segments the loss should be well below a percent.
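
As a small illustration of what a "sync flush" is (a sketch, not Klaus's code): compress/flate's Flush is the deflate sync flush. It ends the output so far with an empty stored block, which leaves the stream on a byte boundary so independently compressed segments can simply be appended one after another.

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
)

func main() {
	var buf bytes.Buffer
	fw, err := flate.NewWriter(&buf, flate.BestCompression)
	if err != nil {
		log.Fatal(err)
	}
	fw.Write(bytes.Repeat([]byte("a segment of data "), 1000))

	// Flush is flate's sync flush: it terminates the output so far with an
	// empty stored block, leaving the stream at a byte boundary so the next
	// (independently compressed) segment can be appended directly.
	fw.Flush()

	// The sync-flush marker is always the four bytes 00 00 FF FF.
	fmt.Printf("segment: %d bytes, ends with % X\n", buf.Len(), buf.Bytes()[buf.Len()-4:])
}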

I may give this a shot when I get back to my computer (currently on the road).

Rui Ueyama

Oct 14, 2014, 12:55:36 PM
to Klaus Post, golang-nuts
I don't think Go's deflate is as optimized as the gzip command. If it's okay to run an external command, you may want to compare Go's performance with gzip's. It might be fast enough for your needs.
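
If going that route, here is a rough sketch (not from the thread) of piping a file through the external command with os/exec. It assumes a gzip binary on the PATH, and the file names are made up.

package main

import (
	"log"
	"os"
	"os/exec"
)

// gzipFile compresses src into dst by piping the data through the external
// gzip command ("gzip -c -9" writes best-compression output to stdout).
func gzipFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	cmd := exec.Command("gzip", "-c", "-9")
	cmd.Stdin = in
	cmd.Stdout = out
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// "data.bin" is just a placeholder input file for the example.
	if err := gzipFile("data.bin", "data.bin.gz"); err != nil {
		log.Fatal(err)
	}
}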



Dan Kortschak

Oct 14, 2014, 2:55:55 PM
to Klaus Post, golan...@googlegroups.com
I have a blocked gzip writer in code.google.com/p/bopgo.bam/bgzf that does concurrent compression of gzip member blocks. The block size is ~64kB and members have additional information stored in the headers (for random member access), but it could be easily modified to change both of these.

Vasiliy Tolstov

Oct 14, 2014, 5:06:20 PM
to Dan Kortschak, Klaus Post, golan...@googlegroups.com
2014-10-14 22:55 GMT+04:00 Dan Kortschak <dan.ko...@adelaide.edu.au>:
> I have a blocked gzip writer in code.google.com/p/bopgo.bam/bgzf that does concurrent compression of gzip member blocks. The block size is ~64kB and members have additional information stored in the headers (for random member access), but it could be easily modified to change both of these.


The link for the package doesn't work =(
Can you send me a working link?

andrey mirtchovski

Oct 14, 2014, 5:14:19 PM
to Vasiliy Tolstov, Dan Kortschak, Klaus Post, golan...@googlegroups.com
> The link for the package doesn't work =(
> Can you send me a working link?

http://godoc.org/code.google.com/p/biogo.bam/bgzf

Dan Kortschak

Oct 14, 2014, 5:47:22 PM
to andrey mirtchovski, Vasiliy Tolstov, Klaus Post, golan...@googlegroups.com
Thanks Andrey.

Vasiliy Tolstov

Oct 14, 2014, 6:08:24 PM
to Dan Kortschak, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
2014-10-15 1:47 GMT+04:00 Dan Kortschak <dan.ko...@adelaide.edu.au>:
> Thanks Andrey.


Thanks for the link =). I have a question about NewWriterLevel and
NewWriter - what is level, and are the minimum and maximum the same as
in gzip? And what is wc?
I need the best compression (slow compression is fine) and the best decompression speed =)

Dan Kortschak

Oct 14, 2014, 6:42:22 PM
to Vasiliy Tolstov, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
level is as described in compress/gzip.
wc is the writer concurrency - the package writes to a block buffer on each call to Write, and when the block buffer is full it compresses the block and writes it to the underlying Writer. So wc is the number of available buffers (the minimum is 2, which is used if the wc param is less than 2).

At some stage (when the Reader handles concurrent reads) there will be documentation.

hog...@sajari.com

Oct 14, 2014, 8:11:16 PM
to golan...@googlegroups.com
Does it need to be gzip? LZ4 is much faster, but doesn't compress quite as much. The Go lib doesn't support streaming though; it's labelled as experimental.

Some testing around compression:
https://github.com/sajari/talks/tree/master/201409/code

Vasiliy Tolstov

Oct 14, 2014, 11:34:08 PM
to Dan Kortschak, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
2014-10-15 2:42 GMT+04:00 Dan Kortschak <dan.ko...@adelaide.edu.au>:
> level is as described in compress/gzip.
> wc is the writer concurrency - the package writes to a block buffer on each call to Write, and when the block buffer is full it compresses the block and writes it to the underlying Writer. So wc is the number of available buffers (the minimum is 2, which is used if the wc param is less than 2).
>
> At some stage (when the Reader handles concurrent reads) there will be documentation.


Thanks

Vasiliy Tolstov

Oct 15, 2014, 12:57:10 AM
to Vasiliy Tolstov, Dan Kortschak, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
2014-10-15 7:33 GMT+04:00 Vasiliy Tolstov <v.to...@selfip.ru>:
> Thanks


Hm, I tried to create a compressed file and I get the error "short write".
What does it mean?
Code: http://play.golang.org/p/m_36pfuVHR

Dan Kortschak

Oct 15, 2014, 1:19:07 AM
to Vasiliy Tolstov, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
On Wed, 2014-10-15 at 08:56 +0400, Vasiliy Tolstov wrote:
> Hm, I tried to create a compressed file and I get the error "short write".
> What does it mean?
> Code: http://play.golang.org/p/m_36pfuVHR

It looks like I'm not returning the correct number of bytes on return
from Write. Would you mind making a complete self-contained reproducer
and submitting an issue to the biogo issue tracker?


Dan Kortschak

Oct 15, 2014, 1:51:02 AM
to Vasiliy Tolstov, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
On Wed, 2014-10-15 at 15:48 +1030, Dan Kortschak wrote:
> It looks like I'm not returning the correct number of bytes on return
> from Write. Would you mind making a complete self-contained reproducer
> and submitting an issue to the biogo issue tracker?
>
I have a reproducer now, but would you mind filing an issue.

Vasiliy Tolstov

Oct 15, 2014, 1:57:46 AM
to Dan Kortschak, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
2014-10-15 9:50 GMT+04:00 Dan Kortschak <dan.ko...@adelaide.edu.au>:
> I have a reproducer now, but would you mind filing an issue.


Where can I find the issue tracker? https://code.google.com/p/biogo/issues/list ?

Dan Kortschak

Oct 15, 2014, 2:02:02 AM
to Vasiliy Tolstov, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
On Wed, 2014-10-15 at 09:57 +0400, Vasiliy Tolstov wrote:
> Where can I find the issue tracker?
> https://code.google.com/p/biogo/issues/list ?
>
Yup. I have a fix you can try now if you want - but please file the issue anyway.

diff --git a/bgzf/writer.go b/bgzf/writer.go
index a13e7d0..dac1db2 100644
--- a/bgzf/writer.go
+++ b/bgzf/writer.go
@@ -178,10 +178,10 @@ func (bg *Writer) Write(b []byte) (int, error) {
 		_n = copy(c.block[c.next:], b)
 		b = b[_n:]
 		c.next += _n
+		n += _n
 	}

 	if c.next == len(c.block) || _n == 0 {
-		n += c.buf.Len()
 		bg.queue <- c
 		bg.qwg.Add(1)
 		go c.writeBlock()



Yes. I'm an idiot.

Dan

Vasiliy Tolstov

Oct 15, 2014, 8:45:07 AM
to hog...@sajari.com, golan...@googlegroups.com
I need streaming support

Dan Kortschak

Oct 15, 2014, 7:07:28 PM
to Vasiliy Tolstov, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
Well, fixed anyway. Please try at tip.

Matt Harden

Oct 15, 2014, 10:38:32 PM
to Dan Kortschak, Vasiliy Tolstov, andrey mirtchovski, Klaus Post, golan...@googlegroups.com
pigz (Pig-Zee), by one of the creators of gzip / deflate, does gzip using all CPUs, losing very little in compression performance, and generating archives that can be decompressed by standard gzip tools. It can't be decompressed in parallel, unfortunately; that's in the nature of how it achieves its performance and compatibility.

On Wed, Oct 15, 2014 at 6:07 PM, Dan Kortschak <dan.ko...@adelaide.edu.au> wrote:
Well, fixed anyway. Please try at tip.

Klaus Post

Oct 16, 2014, 9:06:20 AM
to golan...@googlegroups.com, dan.ko...@adelaide.edu.au, v.to...@selfip.ru, mirtc...@gmail.com, klau...@gmail.com
Hi!

I created a modified version of "compress/gzip" that compresses blocks in parallel, but otherwise should be a drop-in replacement for the built-in compressor. It has a much lower size overhead than bgzf, since it doesn't have complete restarts between blocks.

https://github.com/klauspost/pgzip

To use, replace import "compress/gzip" --->  import gzip "github.com/klauspost/pgzip"

(it works just as pigz, by adding sync markers and retaining the history for the decoder).
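
A minimal usage sketch with that aliased import (assuming the same NewWriterLevel and level constants as compress/gzip, per the drop-in claim; file names are made up):

package main

import (
	"io"
	"log"
	"os"

	gzip "github.com/klauspost/pgzip" // drop-in for "compress/gzip"
)

func main() {
	in, err := os.Open("data.bin") // placeholder input file
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("data.bin.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Same API as compress/gzip, but blocks are compressed on multiple cores.
	gw, err := gzip.NewWriterLevel(out, gzip.BestCompression)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(gw, in); err != nil {
		log.Fatal(err)
	}
	if err := gw.Close(); err != nil {
		log.Fatal(err)
	}
}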


/Klaus

Vasiliy Tolstov

Oct 16, 2014, 2:17:03 PM
to Klaus Post, golan...@googlegroups.com, dan.ko...@adelaide.edu.au, mirtc...@gmail.com
2014-10-16 17:06 GMT+04:00 Klaus Post <klau...@gmail.com>:
> be a drop-in replacement for the built-in compressor. It has a much lower
> size overhead than bgzf, since it doesn't have complete restarts between
> blocks.
>
> https://github.com/klauspost/pgzip
>
> To use, replace import "compress/gzip" ---> import gzip
> "github.com/klauspost/pgzip"
>
> (it works just as pigz, by adding sync markers and retaining the history for
> the decoder).

Thanks. For decompression, do I need to use this package, or can I
use the original compress/gzip? (This is a compatibility question,
because I have some software that does not understand pgzip...)

Klaus Post

Oct 16, 2014, 2:58:08 PM
to golan...@googlegroups.com, klau...@gmail.com, dan.ko...@adelaide.edu.au, mirtc...@gmail.com
Hi

It should be fully gzip compatible. The included decoder is actually just a copy of the built-in "compress/gzip" functions, just there for convenience.

It also passes all the built-in tests, plus some tests I added for larger files.

Vasiliy Tolstov

Oct 16, 2014, 4:23:05 PM
to Klaus Post, golan...@googlegroups.com, dan.ko...@adelaide.edu.au, mirtc...@gmail.com
2014-10-16 22:58 GMT+04:00 Klaus Post <klau...@gmail.com>:
> Hi
>
> It should be fully gzip compatible. The included decoder is actually just a
> copy of the built-in "compress/gzip" functions, just there for convenience.
>
> It also passes all the built-in tests, plus some tests I added for
> larger files.


Thank you. I'll add bgzf and pgzip and test them =).