design: gzip member reader

Daniel Krech

Sep 25, 2013, 4:57:58 PM
to golan...@googlegroups.com, Ed Summers
Greetings,

At the Library of Congress we've recently been exploring rewriting a [Java web archiving tool][1] in Go. So far this has involved working with an existing body (~500TB) of data encoded using [ISO/DIS 28500][2] aka the WARC file format. One of the features of WARC is its use of [Gzip][3] as a packaging format, which allows individual WARC records to be represented as separate members in the larger Gzip file. Or as the spec says:

> Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed. Where possible, this property should be exploited to compress each record of a WARC file independently.  This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.  External indexes of WARC file content may then be used to record each record's starting position in the GZIP  file, allowing for random access of individual records without requiring decompression of all preceding records.
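The property the spec describes can be sketched with the standard library alone. This is a minimal illustration, not WARC code: `writeMembers` and `readMember` are hypothetical helper names, and the in-memory offset slice stands in for the external index the spec mentions.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// writeMembers (a hypothetical helper) compresses each record as its own gzip
// member and returns the concatenated stream together with the byte offset at
// which every member starts.
func writeMembers(records []string) ([]byte, []int, error) {
	var buf bytes.Buffer
	offsets := make([]int, 0, len(records))
	for _, rec := range records {
		offsets = append(offsets, buf.Len())
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(rec)); err != nil {
			return nil, nil, err
		}
		// Close flushes the compressed data and writes the member trailer.
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readMember decompresses the single member stored in data[start:end].
// The subrange is itself a valid gzip file.
func readMember(data []byte, start, end int) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[start:end]))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

func main() {
	data, offsets, err := writeMembers([]string{"record one\n", "record two\n", "record three\n"})
	if err != nil {
		log.Fatal(err)
	}
	// Random access: decode only the second record, using the index.
	rec, err := readMember(data, offsets[1], offsets[2])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("member offsets %v, second record %q\n", offsets, rec)
}
```

Slicing works here because both the start and the end of the member are known. When only the start is known, the reader has to stop by itself at the end of the member, which is the problem discussed in the rest of this thread.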

We ran into difficulty using gzip.Reader since it does not provide any insight into when a member has been read. It simply reads through all the members in the file. While fishing around for people with a similar problem we ran across a [go-nuts thread][4] initiated by Dan Kortschak who needed to access members in a gzip file in his [Biogo][5] for processing genomic and metagenomic data sets.
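A small sketch of the behaviour described above, using only the default `gzip.Reader`: two members go in, one undifferentiated stream comes out, and nothing tells the caller where the first member ended. The helper names `gzipMember` and `readThrough` are made up for this illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gzipMember returns s compressed as one complete gzip member.
func gzipMember(s string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readThrough decompresses data with a default gzip.Reader, which silently
// continues into every following member.
func readThrough(data []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

func main() {
	first, err := gzipMember("first record\n")
	if err != nil {
		log.Fatal(err)
	}
	second, err := gzipMember("second record\n")
	if err != nil {
		log.Fatal(err)
	}
	out, err := readThrough(append(first, second...))
	if err != nil {
		log.Fatal(err)
	}
	// Both records arrive as one stream; the member boundary is invisible.
	fmt.Printf("%q\n", out)
}
```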

We would like to propose a small addition for gzip that would introduce a MemberReader which would expose when the end of a member has been reached, as well as the byte offset position in the underlying compressed data.

For example, if you want to print out the header and the end position of each member:

```go
f, _ := os.Open("test.gz")
defer f.Close()
if gz, err := gzip.NewMemberReader(f); err == nil {
        for {
                if _, err := io.Copy(ioutil.Discard, gz); err == nil {
                        return nil
                } else if err == gzip.EndOfMember {
                        fmt.Printf("Header: %#v\n", gz.Header)
                        fmt.Print("End Position:", gz.EndPosition(), "\n")
                } else {
                        return err
                }
        }
} else {
        return err
}
```

Then to read one member at a known position:

```go
f, _ := os.Open("test.gz")
f.Seek(position, io.SeekStart) // position is the member's byte offset from the start of the file
gz, _ := gzip.NewMemberReader(f)
```
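For comparison, here is a sketch of the same random access written against calls that exist in `compress/gzip` today. It assumes `Reader.Multistream`, which was added to the package after this thread and is not part of the proposal above; `readMemberAt` and `buildMembers` are hypothetical names, and an in-memory `bytes.Reader` stands in for the file.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// readMemberAt seeks to pos and decompresses exactly one member from there.
func readMemberAt(rs io.ReadSeeker, pos int64) ([]byte, error) {
	if _, err := rs.Seek(pos, io.SeekStart); err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bufio.NewReader(rs))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	// Stop with io.EOF at the end of this member instead of reading on.
	zr.Multistream(false)
	return io.ReadAll(zr)
}

// buildMembers (a hypothetical helper) writes each record as its own member
// and returns the stream plus each member's starting offset.
func buildMembers(records ...string) (*bytes.Reader, []int64) {
	var buf bytes.Buffer
	var offsets []int64
	for _, rec := range records {
		offsets = append(offsets, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		zw.Write([]byte(rec)) // writes to a bytes.Buffer do not fail
		zw.Close()
	}
	return bytes.NewReader(buf.Bytes()), offsets
}

func main() {
	src, offsets := buildMembers("alpha", "beta", "gamma")
	got, err := readMemberAt(src, offsets[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("member at offset %d: %q\n", offsets[1], got)
}
```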

Thoughts? We are ready to work on an implementation once the design looks good.

[1]: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
[2]: http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
[3]: https://tools.ietf.org/html/rfc1952
[4]: https://groups.google.com/forum/#!searchin/golang-nuts/gzip/golang-nuts/VFfzYiI2rDc/EZkt6gguirwJ
[5]: https://code.google.com/p/biogo/

Dan Kortschak

Sep 25, 2013, 5:04:43 PM
to Daniel Krech, golan...@googlegroups.com, Ed Summers
I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.

When I suggested that this kind of thing be included in the std lib the proposal was not accepted. The diff between the std and my fork is not onerous to maintain.

Ed Summers

Sep 26, 2013, 10:29:34 AM
to golan...@googlegroups.com, Daniel Krech
Hi Dan,


On Wednesday, September 25, 2013 5:04:43 PM UTC-4, Dan Kortschak wrote:
> I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.
>
> When I suggested that this kind of thing be included in the std lib the proposal was not accepted. The diff between the std and my fork is not onerous to maintain.

Yes, if you read our email a bit closer you could see we cited your previous work on this :-)

One thing your implementation did not do for us was provide the byte offsets for where members began in the compressed file. Were those known to you out of band? We needed to make a small addition to compress/flate to get access to this.

Your implementation basically cloned all of bufio and gzip, and although the diffs were relatively modest, it seems like other golang users might potentially find this functionality useful. If two golang users from vastly different domains need it, and it is a feature of the gzip specification, it seems worthy of consideration.

Can you point to the previous design discussion? One thing Dan Krech didn't mention in his previous email is we have a working implementation if others are interested in seeing it.

//Ed

Russ Cox

Sep 26, 2013, 12:01:17 PM
to Daniel Krech, golang-dev, Ed Summers
On Wed, Sep 25, 2013 at 4:57 PM, Daniel Krech <eik...@eikeon.com> wrote:
> We would like to propose a small addition for gzip that would introduce a MemberReader which would expose when the end of a member has been reached, as well as the byte offset position in the underlying compressed data.

Hi. We are in a feature freeze for the upcoming Go 1.2 release, but now that this has come up in two different contexts I think it is worth considering for Go 1.3.

As one possible way to do this, I have included a patch to compress/gzip below that adds a StopAtBoundary method to instruct a reader to stop with EOF at every boundary. Given that patch, you can write a program that retrieves the information you want by tracking the offset in a custom reader. And Dan, the patch also saves the extra information on each header read in that mode.

I'm not saying it will be in this exact form, but it's something to use for now. 

I've created golang.org/issue/6486. If you star it you will get email updates about the issue.

Russ

g% hg diff .
diff -r 351b6fe0ae36 src/pkg/compress/gzip/gunzip.go
--- a/src/pkg/compress/gzip/gunzip.go Tue Sep 24 15:54:48 2013 -0400
+++ b/src/pkg/compress/gzip/gunzip.go Thu Sep 26 11:57:58 2013 -0400
@@ -74,6 +74,8 @@
 	flg          byte
 	buf          [512]byte
 	err          error
+	stop bool
+	hdr bool
 }
 
 // NewReader creates a new Reader reading the given reader.
@@ -89,6 +91,10 @@
 	return z, nil
 }
 
+func (z *Reader) StopAtBoundary(stop bool) {
+	z.stop = stop
+}
+
 // GZIP (RFC 1952) is little-endian, unlike ZLIB (RFC 1950).
 func get4(p []byte) uint32 {
 	return uint32(p[0]) | uint32(p[1])<<8 | uint32(p[2])<<16 | uint32(p[3])<<24
@@ -200,10 +206,15 @@
 	if z.err != nil {
 		return 0, z.err
 	}
+	if z.hdr {
+		if err := z.resetHeader(); err != nil {
+			return 0, err
+		}
+	}
 	if len(p) == 0 {
 		return 0, nil
 	}
-
+
 	n, err = z.decompressor.Read(p)
 	z.digest.Write(p[0:n])
 	z.size += uint32(n)
@@ -224,16 +235,25 @@
 		return 0, z.err
 	}
 
-	// File is ok; is there another?
-	if err = z.readHeader(false); err != nil {
+	z.hdr = true
+	if z.stop {
+		return 0, io.EOF
+	}
+	return z.Read(p)
+}
+
+func (z *Reader) resetHeader() error {
+	// Is there another header?
+	if err := z.readHeader(z.stop); err != nil {
 		z.err = err
-		return
+		return err
 	}
 
 	// Yes.  Reset and read from it.
 	z.digest.Reset()
 	z.size = 0
-	return z.Read(p)
+	z.hdr = false
+	return nil
 }
 
 // Close closes the Reader. It does not close the underlying io.Reader.
g% cat x.go
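package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"os"
)

// byteCounter counts the compressed bytes handed to gzip. Because it
// implements ReadByte as well as Read, gzip reads from it directly instead
// of wrapping it in another buffer, so offset is exact at member boundaries.
type byteCounter struct {
	r      *bufio.Reader
	offset int64
}

func (b *byteCounter) Read(p []byte) (int, error) {
	n, err := b.r.Read(p)
	b.offset += int64(n)
	return n, err
}

func (b *byteCounter) ReadByte() (byte, error) {
	c, err := b.r.ReadByte()
	if err == nil {
		b.offset++
	}
	return c, err
}

func main() {
	f, err := os.Open("test.gz")
	if err != nil {
		log.Fatal(err)
	}
	bc := &byteCounter{r: bufio.NewReader(f)}
	gz, err := gzip.NewReader(bc)
	if err != nil {
		log.Fatal(err)
	}
	// StopAtBoundary comes from the patch above, not from the released package.
	gz.StopAtBoundary(true)
	var off int64
	for {
		// With StopAtBoundary set, each Copy drains exactly one member.
		n, err := io.Copy(ioutil.Discard, gz)
		if err != nil {
			log.Fatalf("@%d: %d bytes + error: %v", off, n, err)
		}
		// No compressed bytes consumed since the last boundary: real EOF.
		if off == bc.offset {
			fmt.Printf("@%d: EOF\n", off)
			break
		}
		fmt.Printf("@%d: %d bytes uncompressed\n", off, n)
		off = bc.offset
	}
}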
g% go run x.go
@0: 4892 bytes uncompressed
@1556: 1989 bytes uncompressed
@2548: 1238 bytes uncompressed
@3174: EOF
g% 

Daniel Krech

Sep 26, 2013, 3:54:28 PM
to Russ Cox, golang-dev, Ed Summers

On Sep 26, 2013, at 12:01 PM, Russ Cox <r...@golang.org> wrote:

> As one possible way to do this, I have included a patch to compress/gzip below that adds a StopAtBoundary method to instruct a reader to stop with EOF at every boundary. Given that patch, you can write a program that retrieves the information you want by tracking the offset in a custom reader. And Dan, the patch also saves the extra information on each header read in that mode.
>
> I'm not saying it will be in this exact form, but it's something to use for now.
>
> I've created golang.org/issue/6486. If you star it you will get email updates about the issue.

We had been going down the path of exposing the compressed input offset in compress/flate, as you mention in the ticket. We thought we had to, so that buffering would not obscure the offsets of the boundaries. We have now switched to the custom reader approach and see that it sidesteps the buffering. However, I do not think this approach is readily apparent without digging into the implementation of gzip, and I think it depends on the implementation of gzip's makeReader and of compress/flate. It would be great if the offset functionality could also be pushed down into gzip (and flate, as you already mentioned), so that one could call gz.Offset() when stopping at boundaries.
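The buffering concern can be made concrete with a small experiment. This is only a sketch: `plainCounter`, `twoMembers`, `drainOne` and `compare` are names invented here, and it uses `Reader.Multistream(false)` from later releases of compress/gzip in place of the StopAtBoundary patch. A counter that implements only `io.Reader` gets wrapped in gzip's own buffered reader and reports how far the buffer has read ahead; a counter that also implements `io.ByteReader` reports the exact end of the member.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// plainCounter counts bytes read but implements only io.Reader, so gzip
// wraps it in its own buffered reader and reads ahead.
type plainCounter struct {
	r   io.Reader
	off int64
}

func (c *plainCounter) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.off += int64(n)
	return n, err
}

// byteCounter also implements io.ByteReader; gzip then reads from it
// directly, and off tracks exactly what the decompressor has consumed.
type byteCounter struct {
	r   *bufio.Reader
	off int64
}

func (c *byteCounter) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.off += int64(n)
	return n, err
}

func (c *byteCounter) ReadByte() (byte, error) {
	b, err := c.r.ReadByte()
	if err == nil {
		c.off++
	}
	return b, err
}

// twoMembers builds a two-member stream and reports where the first ends.
func twoMembers() (data []byte, firstEnd int64) {
	var buf bytes.Buffer
	for i, s := range []string{"first", "second"} {
		zw := gzip.NewWriter(&buf)
		zw.Write([]byte(s)) // writes to a bytes.Buffer do not fail
		zw.Close()
		if i == 0 {
			firstEnd = int64(buf.Len())
		}
	}
	return buf.Bytes(), firstEnd
}

// drainOne decompresses exactly one member from r.
func drainOne(r io.Reader) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	zr.Multistream(false)
	_, err = io.Copy(io.Discard, zr)
	return err
}

// compare reads one member through each kind of counter and returns the true
// end of that member, both reported offsets, and the total stream length.
func compare() (firstEnd, plain, exact, total int64, err error) {
	var data []byte
	data, firstEnd = twoMembers()
	total = int64(len(data))
	pc := &plainCounter{r: bytes.NewReader(data)}
	if err = drainOne(pc); err != nil {
		return
	}
	bc := &byteCounter{r: bufio.NewReader(bytes.NewReader(data))}
	if err = drainOne(bc); err != nil {
		return
	}
	return firstEnd, pc.off, bc.off, total, nil
}

func main() {
	firstEnd, plain, exact, total, err := compare()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("first member ends at %d of %d; plain counter says %d, byte counter says %d\n",
		firstEnd, total, plain, exact)
}
```

With a stream this small, the plain counter has already seen the whole input by the time the first member is finished, which is why the offset has to be counted on the gzip side of any buffering.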

Thank you for opening the issue and we look forward to helping see the issue through.


Dan Kortschak

Sep 26, 2013, 4:17:34 PM
to Ed Summers, golan...@googlegroups.com, Daniel Krech
On 27/09/2013, at 12:02 AM, "Ed Summers" <e...@pobox.com> wrote:

> Yes, if you read our email a bit closer you could see we cited your previous work on this :-)

Yes, sorry missed that until after I sent.

> One thing your implementation did not do for us was provide the byte offsets for where members began in the compressed file. Were those known to you out of band? We needed to make a small addition to compress/flate to get access to this.

Yes, it's interesting because at the moment I'm working on something that does exactly that, although it's not ready (this is necessary for concurrent bgzf access).

> Your implementation basically cloned all of bufio and gzip, and although the diffs were relatively modest, it seems like other golang users might potentially find this functionality useful. If two golang users from vastly different domains need it, and it is a feature of the gzip specification, it seems worthy of consideration.

Yes, I agree, and seeing that Russ is positive about this is a good thing.

> Can you point to the previous design discussion? One thing Dan Krech didn't mention in his previous email is we have a working implementation if others are interested in seeing it.

There was really very little discussion and it was at the thread that Daniel linked. I'd be interested to see your implementation.

Dan

Dan Kortschak

Sep 26, 2013, 4:49:05 PM
to Ed Summers, golan...@googlegroups.com, Daniel Krech
Just to extend on this: given the way the gzip Reader wraps its input in a bufio.Reader when it is given something that is not a flate.Reader, the offset is not really something the gzip Reader can know, unless you wrap the initial reader in something that keeps track of its position, which can already be done.

Dan

Dan Kortschak

Oct 8, 2013, 7:22:17 AM
to Russ Cox, Daniel Krech, golang-dev, Ed Summers
How can a caller know that an EOF is an EOF rather than an end of member with this approach? You expect the return from an end of file EOF to return a byte count of 0, but you will also see this with an empty member, so a (0, io.EOF) return is not definitive.

Dan

Russ Cox

Oct 14, 2013, 9:42:48 AM
to Dan Kortschak, Daniel Krech, golang-dev, Ed Summers
On Tue, Oct 8, 2013 at 7:22 AM, Dan Kortschak <dan.ko...@adelaide.edu.au> wrote:
> How can a caller know that an EOF is an EOF rather than an end of member with this approach? You expect the return from an end of file EOF to return a byte count of 0, but you will also see this with an empty member, so a (0, io.EOF) return is not definitive.

In the code I posted, the caller knows it reached EOF because the read offset did not advance.
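The empty-member case can be checked against the standard library as it exists in later Go releases. The sketch below does not use the StopAtBoundary patch; it assumes `Reader.Multistream` and `Reader.Reset`, both added to compress/gzip after this thread, and the names `countingReader`, `member`, `listMembers` and `concat` are invented for the example. An empty member reports zero uncompressed bytes but still advances the compressed offset past its header and trailer, while the real end of input shows up as `io.EOF` from `Reset`.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// countingReader counts the compressed bytes consumed by gzip. It implements
// io.ByteReader so that gzip reads from it without extra buffering.
type countingReader struct {
	r   *bufio.Reader
	off int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.off += int64(n)
	return n, err
}

func (c *countingReader) ReadByte() (byte, error) {
	b, err := c.r.ReadByte()
	if err == nil {
		c.off++
	}
	return b, err
}

// member is one gzip member: its compressed byte range [Start, End) and its
// uncompressed size.
type member struct {
	Start, End, Size int64
}

// listMembers walks r one member at a time and records each boundary.
func listMembers(r io.Reader) ([]member, error) {
	cr := &countingReader{r: bufio.NewReader(r)}
	zr, err := gzip.NewReader(cr) // reads the first member's header
	if err == io.EOF {
		return nil, nil // empty input: no members at all
	}
	if err != nil {
		return nil, err
	}
	var out []member
	var start int64
	for {
		zr.Multistream(false) // stop at the end of the current member
		n, err := io.Copy(io.Discard, zr)
		if err != nil {
			return out, err
		}
		out = append(out, member{Start: start, End: cr.off, Size: n})
		start = cr.off
		// Reset reads the next header; io.EOF here means end of input.
		if err := zr.Reset(cr); err == io.EOF {
			return out, nil
		} else if err != nil {
			return out, err
		}
	}
}

// concat (a hypothetical helper) writes each string as its own member.
func concat(records ...string) *bytes.Reader {
	var buf bytes.Buffer
	for _, rec := range records {
		zw := gzip.NewWriter(&buf)
		zw.Write([]byte(rec)) // writes to a bytes.Buffer do not fail
		zw.Close()
	}
	return bytes.NewReader(buf.Bytes())
}

func main() {
	// The middle member is empty.
	members, err := listMembers(concat("hello", "", "world"))
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members {
		fmt.Printf("@%d-%d: %d bytes uncompressed\n", m.Start, m.End, m.Size)
	}
}
```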

Dan Kortschak

Oct 14, 2013, 6:46:44 PM
to Russ Cox, Daniel Krech, golang-dev, Ed Summers
On Mon, 2013-10-14 at 09:42 -0400, Russ Cox wrote:
> In the code I posted, the caller knows it reached EOF because the read
> offset did not advance.
>
Missed that, sorry. Could also just keep the last n from the internal
Read(). Thanks.

Kamil Dziedzic

Aug 25, 2014, 2:26:01 PM
to golan...@googlegroups.com, e...@pobox.com
Hi all,

How did this end up? Is there still no way to read gzip/bzip2 chunks in parallel?

> I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.

I can't see this package. Is it no longer supported?


Kind Regards, Kamil Dziedzic

Dan Kortschak

Aug 25, 2014, 4:38:54 PM
to Kamil Dziedzic, golan...@googlegroups.com, e...@pobox.com
Not yet. I'm waiting on changes to the compress/gzip API that are being thought about by rsc.

> How did this end up? Is there still no way to read gzip/bzip2 chunks in parallel?
>
>> I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.
>
> I can't see this package. Is it no longer supported?

No, the changes that were in egzip to handle seeking are no longer necessary. Have a look in .../biogo.bam/bgzf to see how the standard gzip package can be used to seek (part of the requirement for parallel reading).
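As a rough illustration of why known member offsets make parallel reading possible (this is not the biogo bgzf code, just a sketch under the assumption that an index of member byte ranges already exists): each goroutine gets its own `io.SectionReader` over the shared `io.ReaderAt`, so no seek position is shared. The names `span`, `decodeParallel` and `buildIndexed` are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"sync"
)

// span is the compressed byte range [Start, End) of one member, as an
// external index might record it.
type span struct{ Start, End int64 }

// decodeParallel decompresses every indexed member in its own goroutine and
// returns the results in index order.
func decodeParallel(ra io.ReaderAt, spans []span) ([][]byte, error) {
	out := make([][]byte, len(spans))
	errs := make([]error, len(spans))
	var wg sync.WaitGroup
	for i, s := range spans {
		wg.Add(1)
		go func(i int, s span) {
			defer wg.Done()
			// A SectionReader is bounded to the member, so the gzip
			// reader sees a complete single-member file.
			sec := io.NewSectionReader(ra, s.Start, s.End-s.Start)
			zr, err := gzip.NewReader(sec)
			if err != nil {
				errs[i] = err
				return
			}
			defer zr.Close()
			out[i], errs[i] = io.ReadAll(zr)
		}(i, s)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

// buildIndexed (a hypothetical helper) writes each record as its own member
// and returns the stream together with the index of member ranges.
func buildIndexed(records ...string) (*bytes.Reader, []span) {
	var buf bytes.Buffer
	var spans []span
	for _, rec := range records {
		start := int64(buf.Len())
		zw := gzip.NewWriter(&buf)
		zw.Write([]byte(rec)) // writes to a bytes.Buffer do not fail
		zw.Close()
		spans = append(spans, span{Start: start, End: int64(buf.Len())})
	}
	return bytes.NewReader(buf.Bytes()), spans
}

func main() {
	src, spans := buildIndexed("alpha", "beta", "gamma")
	records, err := decodeParallel(src, spans)
	if err != nil {
		log.Fatal(err)
	}
	for i, rec := range records {
		fmt.Printf("member %d: %q\n", i, rec)
	}
}
```

The sketch relies on the end of each member being known. When only start offsets are available, each worker would instead need a reader that stops at the member boundary, as discussed earlier in the thread.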