Greetings,

At the Library of Congress we've recently been exploring rewriting a [Java web archiving tool][1] in Go. So far this has involved working with an existing body (~500TB) of data encoded using [ISO/DIS 28500][2], aka the WARC file format. One of the features of WARC is its use of [Gzip][3] as a packaging format, which allows individual WARC records to be represented as separate members in the larger Gzip file. Or as the spec says:

> Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed. Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files. External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

We ran into difficulty using gzip.Reader since it gives no indication of when a member has been fully read. It simply reads through all the members in the file. While fishing around for people with a similar problem we ran across a [go-nuts thread][4] initiated by Dan Kortschak, who needed to access members in a gzip file in his [Biogo][5] project for processing genomic and metagenomic data sets.

We would like to propose a small addition for gzip that would introduce a MemberReader which would expose when the end of a member has been reached, as well as the byte offset position in the underlying compressed data.
For example, if you want to print out the header and the end position of each member:

```go
f, err := os.Open("test.gz")
if err != nil {
	return err
}
defer f.Close()

// NewMemberReader, EndOfMember and EndPosition are the proposed additions.
gz, err := gzip.NewMemberReader(f)
if err != nil {
	return err
}
for {
	_, err := io.Copy(io.Discard, gz)
	if err == nil {
		// io.Copy swallows a plain io.EOF: the whole file has been read.
		return nil
	}
	if err != gzip.EndOfMember {
		return err
	}
	fmt.Printf("Header: %#v\n", gz.Header)
	fmt.Print("End Position:", gz.EndPosition(), "\n")
}
```

Then to read one member at a known position:

```go
f, _ := os.Open("test.gz")
f.Seek(position, io.SeekStart) // position is an offset from the start of the file
gz, _ := gzip.NewMemberReader(f)
```

Thoughts? We are ready to work on an implementation once the design looks good.

[1]: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
[2]: http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
[3]: https://tools.ietf.org/html/rfc1952
[4]: https://groups.google.com/forum/#!searchin/golang-nuts/gzip/golang-nuts/VFfzYiI2rDc/EZkt6gguirwJ
[5]: https://code.google.com/p/biogo/
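
To make the random-access property quoted from the WARC spec concrete, here is a minimal sketch that does not use the proposed MemberReader. It relies on `Reader.Multistream(false)`, which exists in the standard library from Go 1.4 onward. The helper names `writeMembers` and `readMemberAt` are made up for illustration, and the in-memory byte slice stands in for a WARC file plus its external index.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// writeMembers (hypothetical helper) compresses each record as its own gzip
// member and returns the concatenated bytes plus each member's start offset.
func writeMembers(records []string) ([]byte, []int64, error) {
	var buf bytes.Buffer
	offsets := make([]int64, 0, len(records))
	for _, rec := range records {
		offsets = append(offsets, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(rec)); err != nil {
			return nil, nil, err
		}
		// Close flushes the data and writes the member trailer.
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readMemberAt (hypothetical helper) decompresses only the member that
// starts at offset, without touching any preceding member.
func readMemberAt(data []byte, offset int64) (string, error) {
	r := bytes.NewReader(data)
	if _, err := r.Seek(offset, io.SeekStart); err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(r)
	if err != nil {
		return "", err
	}
	zr.Multistream(false) // report io.EOF at the end of this member
	out, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	data, offsets, err := writeMembers([]string{"record one", "record two", "record three"})
	if err != nil {
		panic(err)
	}
	rec, err := readMemberAt(data, offsets[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(rec) // prints "record two"
}
```

The offsets slice plays the role of the external index the spec mentions: only the requested member is decompressed.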
I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip. When I suggested that this kind of thing be included in the std lib, the proposal was not accepted. The diff between the std and my fork is not onerous to maintain.
> We would like to propose a small addition for gzip that would introduce a MemberReader which would expose when the end of a member has been reached, as well as the byte offset position in the underlying compressed data.
As one possible way to do this, I have included a patch to compress/gzip below that adds a StopAtBoundary method to instruct a reader to stop with EOF at every boundary. Given that patch, you can write a program that retrieves the information you want by tracking the offset in a custom reader. And Dan, the patch also saves the extra information on each header read in that mode. I'm not saying it will be in this exact form, but it's something to use for now. I've created golang.org/issue/6486. If you star it you will get email updates about the issue.
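
The "track the offset in a custom reader" idea can be sketched as follows. This is not the StopAtBoundary patch itself; as an assumption it swaps in `Reader.Multistream(false)` and `Reader.Reset`, which are in the standard library from Go 1.4 onward. `countingReader`, `memberEnds` and `sample` are hypothetical names used only for this illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// countingReader (hypothetical) counts the compressed bytes consumed from it.
// It implements io.ByteReader so gzip reads from it directly rather than
// wrapping it in a read-ahead buffer of its own.
type countingReader struct {
	r *bufio.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

func (c *countingReader) ReadByte() (byte, error) {
	b, err := c.r.ReadByte()
	if err == nil {
		c.n++
	}
	return b, err
}

// memberEnds returns the compressed offset just past each gzip member in r.
func memberEnds(r io.Reader) ([]int64, error) {
	cr := &countingReader{r: bufio.NewReader(r)}
	zr, err := gzip.NewReader(cr)
	if err == io.EOF {
		return nil, nil // empty input: no members
	}
	if err != nil {
		return nil, err
	}
	var ends []int64
	for {
		zr.Multistream(false) // Reset re-enables multistream, so set it every time
		if _, err := io.Copy(io.Discard, zr); err != nil {
			return ends, err
		}
		ends = append(ends, cr.n)
		if err := zr.Reset(cr); err == io.EOF {
			return ends, nil // no further member header: end of file
		} else if err != nil {
			return ends, err
		}
	}
}

// sample (hypothetical) builds an in-memory multi-member gzip stream.
func sample(records ...string) *bytes.Buffer {
	buf := new(bytes.Buffer)
	for _, rec := range records {
		zw := gzip.NewWriter(buf)
		zw.Write([]byte(rec))
		zw.Close()
	}
	return buf
}

func main() {
	buf := sample("alpha", "beta", "gamma")
	total := buf.Len()
	ends, err := memberEnds(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("member end offsets:", ends, "total bytes:", total)
}
```

The last reported offset should equal the total compressed length, which is a quick sanity check that the reader did not consume bytes beyond a member boundary.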
How can a caller know that an EOF is a true end-of-file rather than an end of member with this approach? You would expect an end-of-file EOF to come with a byte count of 0, but you will also see this with an empty member, so a (0, io.EOF) return is not definitive.
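
For comparison, here is a minimal sketch of how the ambiguity can be avoided with the API that exists in the standard library from Go 1.4 onward, which is an assumption swapped in for the patch discussed here. With `Multistream(false)`, a read returning io.EOF only ever means "end of member"; "end of file" is reported separately, by `Reset` returning io.EOF when no further member header is found. `memberSizes` and `sample` are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// memberSizes returns the uncompressed size of every member in data,
// including empty members.
func memberSizes(data []byte) ([]int64, error) {
	r := bytes.NewReader(data) // implements io.ByteReader, so gzip does not read ahead
	zr, err := gzip.NewReader(r)
	if err == io.EOF {
		return nil, nil // empty input: no members
	}
	if err != nil {
		return nil, err
	}
	var sizes []int64
	for {
		zr.Multistream(false)
		n, err := io.Copy(io.Discard, zr) // n is 0 for an empty member
		if err != nil {
			return sizes, err
		}
		sizes = append(sizes, n)
		switch err := zr.Reset(r); err {
		case nil:
			// Another member follows; keep going.
		case io.EOF:
			return sizes, nil // true end of file
		default:
			return sizes, err
		}
	}
}

// sample (hypothetical) builds an in-memory multi-member gzip stream.
func sample(records ...string) []byte {
	var buf bytes.Buffer
	for _, rec := range records {
		zw := gzip.NewWriter(&buf)
		zw.Write([]byte(rec))
		zw.Close()
	}
	return buf.Bytes()
}

func main() {
	sizes, err := memberSizes(sample("a", "", "b"))
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes) // prints [1 0 1]
}
```

The empty middle member shows up as a size of 0 rather than being mistaken for the end of the file.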
How did this end up? Is there still no way to read gzip/bzip2 chunks in parallel?
> I have a gzip reader that does this at code.google.com/p/biogo.bam/bgzf/egzip.
I can't see this package. Is it no longer supported?