reading multi-member gzip files


kortschak

Jun 23, 2012, 3:18:13 AM
to golan...@googlegroups.com
I'm reimplementing bgzf (described here http://samtools.sourceforge.net/SAM1.pdf, but essentially just a blocked gzip that allows faster random access to compressed data) in Go. It uses an RFC1952 extra field to specify the uncompressed size of the data block.

The Writer side has been no problem, but I've noticed that gzip.Reader does not give an indication that a member has been completed and it silently drops extra field data (actually all new member header data) when reading second and subsequent members.

Why was this behaviour chosen and is it something that is likely to be altered (or should I derive an alternative gzip.Reader)?

thanks
Dan

kortschak

Jun 23, 2012, 7:10:08 AM
to golan...@googlegroups.com
I've found a more significant issue. If you Seek into a gzip file to the start of a member, there is no way that you can use the gzip package to correctly read the file, since a header is only read either at the creation of the Reader or after the completion of a member.

Dan

Andrew Gerrand

Jun 24, 2012, 5:44:41 PM
to kortschak, golan...@googlegroups.com
On 23 June 2012 04:10, kortschak <dan.ko...@adelaide.edu.au> wrote:
> I've found a more significant issue. If you Seek into a gzip file to the
> start of a member, there is no way that you can use the gzip package to
> correctly read the file, since a header is only read either at the creation
> of the Reader or after the completion of a member.

Can't you do the seek before calling gzip.NewReader?

Andrew

Russ Cox

Jun 24, 2012, 6:39:12 PM
to kortschak, golan...@googlegroups.com
You're parsing files that are not quite gzip format, so I think it's
reasonable that compress/gzip is not much help. The hard part about
gzip is the compression algorithm, not the file format, and that's
encapsulated well in compress/flate. My suggestion would be to write
your own version of the compress/gzip package instead of trying to
reuse it. It's minuscule:

$ wc -l gzip.go gunzip.go
217 gzip.go
241 gunzip.go
$

Russ

Dan Kortschak

Jun 24, 2012, 6:54:20 PM
to Andrew Gerrand, golan...@googlegroups.com
Yes, you're right, I can do that, and that's similar to what I have to do on the writer side (each member gets its own gzip.Writer, though I plan to keep the writer and reinitialise the compressor in a later version; that requires forking gzip.Writer as well, but it isn't such an issue).

The issue is one of performance: bgzf is used for genomics work with files on the order of 20-100GB, but a random-access query typically returns only ca. 200 bytes of data. Remaking a gzip.Reader for each seek seems wasteful.

What I have now with my modified gzip.Reader only requires that the bufio.Reader be remade when a seek is performed (any chance of a (*bufio.Reader).Reset method to zero the buffer to its underlying io.Reader's current position?).

Dan

Dan Kortschak

Jun 24, 2012, 7:03:13 PM
to <rsc@golang.org>, golan...@googlegroups.com

Yep, that's what I've done. Though according to RFC1952 they are gzip format. The issue is that the RFC does not specify the exact behavior of the reader with regard to multi-member gzip file headers, and gzip.Reader does the minimum required by the standard.

thanks
Dan

kortschak

Jun 24, 2012, 10:30:22 PM
to golan...@googlegroups.com, <rsc@golang.org>
I'd like to add that, again, Go makes coding very enjoyable.

The best metaphor I can come up with is having the lights turned on in an unfamiliar room; all of a sudden you stop bumping your shins.

Dan

Ed Summers

Aug 8, 2013, 4:16:33 PM
to golan...@googlegroups.com, <rsc@golang.org>
Hi Dan,

I realize this thread was from a ways back, but I was wondering if you did end up coding up an alternate gzip reader. I need to be able to seek to particular members in a gzip file, and record byte offsets for the members while reading, and was wondering if what you came up with might help me out.

Figured it couldn't hurt to ask :-)
//Ed

Dan Kortschak

Aug 8, 2013, 8:51:03 PM
to Ed Summers, golan...@googlegroups.com, <rsc@golang.org>
Yup, it's at code.google.com/p/biogo.bam/bgzf/egzip. Please let me know
how it works for you.

Dan