snappy.Writer internal buffering

628 views
Skip to first unread message

Nigel Tao

unread,
Feb 6, 2016, 11:14:45 PM2/6/16
to golang-dev
github.com/golang/snappy implements the Snappy compression format. The
snappy.Writer type is:

----

type Writer struct {
// Has unexported fields.
}

func NewWriter(w io.Writer) *Writer
func (w *Writer) Reset(writer io.Writer)
func (w *Writer) Write(p []byte) (n int, errRet error)

----

This API admits for no internal buffering; there is no Flush or Close
method. This means that, out of the box, a Writer is not suitable for
many small writes, only for few large writes. In fact, due to the
framing overhead, the subsequent bytes on the wire can easily be
longer than the input bytes, defeating the whole purpose of using
compression.

Instead, just like when using an os.File, if you're making many small
writes, you should use the standard bufio package to wrap the
snappy.Writer.

At the very least, this is not obvious from the snappy package godoc,
and the easiest 'fix' is to add a note to those docs to suggest using
bufio if needed. Call this Option 0.

In the standard library, the compress/{flate,gzip,zlib} writers'
method sets are all Close + Flush + Reset + Write. The compress/lzw
writer's method set is Close + Write - it is literally an
io.WriteCloser. These writers all have their own internal buffering,
so are good for many small writes, and users are required to call
Close when done.

https://github.com/golang/snappy/pull/21 adds an internal buffer (and
Flush and Close methods) so that a snappy.Writer is good to use out of
the box, regardless of the frequency and shape of the writes. Call
this Option 1. However, this will silently break any existing users of
that package, and so I dismissed that suggestion.

On further thought, though, it is possible to make this internal
buffering opt-in instead of mandatory, so there is no such breakage.
Call this Option 2. There are a number of API possibilities, but one
concrete proposal is:

----

// NewWriter returns a new Writer that compresses to w.
//
// The Writer returned does not buffer writes. There is no need to Flush such a
// Writer.
//
// Deprecated: the Writer returned is not suitable for many small writes, only
// for few large writes. Use NewBufferedWriter instead, which is efficient
// regardless of the frequency and shape of the writes.
func NewWriter(w io.Writer) *Writer

// NewBufferedWriter returns a new Writer that compresses to w, using the
// framing format described at
// https://github.com/google/snappy/blob/master/framing_format.txt
//
// The Writer returned buffers writes. Users must call Flush to guarantee all
// data has been forwarded to the underlying io.Writer.
func NewBufferedWriter(w io.Writer) *Writer

// Flush flushes the Writer to its underlying io.Writer.
func (w *Writer) Flush() error

// Reset and Write are as before.

----

Note that there is no Close method, only Flush. This is similar to a
bufio.Writer but unlike the other compress/* writers. The Snappy wire
format does not need an explicit EOF marker or trailing checksum, so I
don't think a snappy.Writer needs a Close method in addition to a
Flush method, but I am open to being convinced otherwise. Admittedly,
a Close method would give us a place to hook into a sync.Pool of byte
buffers, if we wanted such a place, but I don't think we do.

For completeness, an alternative to the NewBufferedWriter function is
adding a ... argument to NewWriter so that it would become
sw := snappy.NewWriter(w, snappy.Buffered)
or maybe
sw := snappy.NewWriter(w, snappy.BufferSize(8192))
while existing code continued to work (without Flush'ing):
sw := snappy.NewWriter(w)
but I prefer the separate NewBufferedWriter function. I don't
anticipate wanting any other options, including being able to choose
the internal buffer size. The other compress/* writers in the standard
library don't offer anything similar.

Long story short, I intend to implement Option 2 with the API proposed
above, but as I made the original API mistake, I am open to any
further suggestions.

Joe Tsai

unread,
Feb 7, 2016, 12:55:25 AM2/7/16
to golang-dev
Personally, I vote for option 0. (I didn't study the pull request in detail), but in PR#21, it seems that the main performance advantages comes from 1) larger chunks when snappy.Writer.Write is called 2) reusing buffers. As you mentioned, #1 is easily satisfied by using a bufio.NewReader. In that PR, I would like to see benchmark numbers be given, but if reusing buffers shows significant performance gains, we could always add a Reset method without breaking anyone. I think a global sync.Pool is overkill.

That said, I'm not opposed to option 2. I oppose a variadic version of NewWriter, but also feel uncomfortable with NewBufferedWriter. I agree that I don't foresee any more options, but I just hope it doesn't bloat to be like compress/zlib with NewWriter, NewWriterLevel, NewWriterLevelDict.

JT

Nigel Tao

unread,
Feb 7, 2016, 1:10:09 AM2/7/16
to Joe Tsai, golang-dev
On Sun, Feb 7, 2016 at 4:55 PM, Joe Tsai <thebroke...@gmail.com> wrote:
> if reusing buffers
> shows significant performance gains, we could always add a Reset method
> without breaking anyone.

We already have Reset methods.

That pull request makes a number of changes. Even without reusing
buffers via a sync.Pool, though, I can see an argument for internal
buffering to make it very rare instead of very common for the
'compressed' version of many small writes being longer than the
uncompressed version.

Uli Kunitz

unread,
Feb 7, 2016, 4:04:22 AM2/7/16
to golang-dev
The reasons for a data compressor to introduce buffering are:

(a) Encoding is not byte-oriented resulting in buffering.
(b) format is framed
(c) Prevention of match splits at Write borders for Lempel-Ziv methods. Since Writes cannot be incomplete this requires buffering.

Note that (a) may even prevent the use of a Flush function. LZMA cannot support Flush but LZMA2 used in xz by introducting framing. Snappy is neither affected by (a) or (b).

That leaves (c) for discussion. bufio.Writer can be used for mitigation. By choosing a larger buffer match splits can be significantly reduced. The question is whether a potential match split every buffer size bytes justifies the added complexity of introducing a Flush function and an additional "constructor" NewBufferedWriter for Snappy.

IMHO the answer is no, because Snappy's objective hasn't been to achieve the highest compression ratio. So I vote for option 0.

Uli

PS: I'm currently working on a pure Go implementation of the LZMA-based xz format.

Nigel Tao

unread,
Feb 7, 2016, 7:14:47 AM2/7/16
to Uli Kunitz, golang-dev
On Sun, Feb 7, 2016 at 8:04 PM, Uli Kunitz <uli.k...@gmail.com> wrote:
> (b) format is framed
>
> ... Snappy is neither affected by (a) or (b).

There's Snappy as in "compress this block of bytes" (what Go calls
Encode), but they later added a format for "compress this stream of
bytes" (what Go calls NewWriter). I don't know what you mean exactly
by a framed format, but the spec for the latter does say it's framed:
https://github.com/google/snappy/blob/master/framing_format.txt

Brad Fitzpatrick

unread,
Feb 7, 2016, 10:59:17 AM2/7/16
to Nigel Tao, golang-dev
I like option 2 too.

I think I'd make it a io.WriteCloser, anyway, though. That's what feels normal to me for all these sort of types. Close can just be the same as Flush if it's unflushed.



--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Uli Kunitz

unread,
Feb 7, 2016, 2:42:46 PM2/7/16
to golang-dev, uli.k...@gmail.com
You are right. I have looked at the Wikipedia entry, which doesn't mention the framing format. My post has been wrong. The current implementation converts each Write into one or more frames, called in the code chunks. Since a frame has a maximum size of 65536 kByte, one would need a buffered writer that only writes 65536 bytes to generate maximum frames.  The bufio.Writer doesn't guarantee this.

So I support Option 2 with a Flush method. A Close is not really required, snappy.Writer would act more like a bufio.Writer that also has only a Flush method and no Close. 

Travis Johnson

unread,
Feb 8, 2016, 2:26:28 PM2/8/16
to golang-dev, nige...@golang.org
Unless the close passes on to the underlying writer (after a flush if necessary), I don't see a reason to make this a closer.

That said being able to wrap a closeable writer and pass along the close would make some things much nicer, as you wouldn't need to hang on the original writer. Use a type assertion to do the closing and you don't even have to change the interface.

Russ Cox

unread,
Feb 8, 2016, 3:04:56 PM2/8/16
to Nigel Tao, golang-dev
Option 2 SGTM for the rationale you gave. Thanks.

Russ

569...@gmail.com

unread,
Feb 9, 2016, 3:51:46 AM2/9/16
to golang-dev
I wrote the folk and pull recommending the Close(), along with multiple other optimizations.

There is some back-and-forth here concerning whether we intend to stick close to the Snappy specification or idiomatic Go. Go uses Close(), that's what makes sense to a Go programmer even if Flush() is technically more correct. I'd say Close() makes more sense for Go usage and it works with io.WriteCloser, which is important. Basically I'd say that Go has already defined this for us and there is little point trying to keep to one principle when another is defeated. Snappy Go does not need to be compatible in principle with other ports of Snappy; it needs to be compatible in principle with Go.

To touch briefly on the sync.Pool usage, this is irrelevant to Close() because even though my folk returns the slice to the pool on Close() that is actually unnecessary. It can be Get() and Put() on every write with negligible overhead (and with more efficient memory in some applications.) I would still recommend this change because it saves a lot of memory allocation time, which is best to be avoided where possible and easily becomes the bottleneck.

Adding a NewBufferedWriter does solve the problem. It does not break any current usage of Snappy. It allows Snappy Go to work out-of-the-box for everyone in a foolproof manner. It allows Snappy Go to conform to the specification but also have additional features for better application. But it is not beautiful. As I said in the pull request comments: what are we trying to achieve here? Are we trying to stick to an ideal of programming or are we providing an out-of-the-box compression package? There is a compromise to be made here and I would suggest that we should aim for efficient usage and application. My port makes Snappy Go faster and more memory efficient for all users. As Nigel suggested, we can do that without requiring users to update their code anyway, so the compromise is only on the beauty and simplicity of the code in its code form (but makes the execution faster, more efficient and foolproof.)

I suggest option 2 as Nigel put forward to be implemented as follows:
NewWriter() without close or flush
+ NewBufferedWriter() with Close() - note that Flush() is more appropriate but idiomatic Go requires Close(), that's not our problem
+ sync.Pool on both NewWriter & NewBufferedWriter() with Get() and Put() on each write to remove dependency on Close() for the pool

I will make the changes to my port and run some benchmarks on these and the other optimizations individually to test their usefulness, and then submit a new pull request with the changes that proved effective. It'll take me a couple of weeks to do this as I'm very busy right now.

~ Alasdair

Nigel Tao

unread,
Feb 10, 2016, 11:49:11 PM2/10/16
to 569...@gmail.com, golang-dev
On Tue, Feb 9, 2016 at 7:51 PM, <569...@gmail.com> wrote:
> I suggest option 2 as Nigel put forward to be implemented as follows:
> NewWriter() without close or flush
> + NewBufferedWriter() with Close() - note that Flush() is more appropriate
> but idiomatic Go requires Close(), that's not our problem
> + sync.Pool on both NewWriter & NewBufferedWriter() with Get() and Put() on
> each write to remove dependency on Close() for the pool
>
> I will make the changes to my port and run some benchmarks on these and the
> other optimizations individually to test their usefulness, and then submit a
> new pull request with the changes that proved effective. It'll take me a
> couple of weeks to do this as I'm very busy right now.

I have implemented option 2, with both Close and Flush methods, in
https://github.com/golang/snappy/commit/0fd139378be7c52384b7191d82ad04eb93560627

Optimizations like using a sync.Pool belong in separate, follow-up
commits. We can take that discussion back to
https://github.com/golang/snappy/pull/21
Reply all
Reply to author
Forward
0 new messages