gzip compression query - multi part gzip files

Dan Kortschak

Sep 11, 2012, 3:23:08 AM
to golan...@googlegroups.com
I have come back to a project I started a couple of months ago that
depends on multi-part gzip compression[1] and have got to the stage of
integrating the data handler with the compression back end.

I seem to be having an issue with reading members subsequent to the
first, although this passes in the tests for the blocked gzip
compressor.

I suspect that the issue is that the way I'm finishing members is
incorrect: I'm writing a whole gzip for each member[2]. This looks to me
like it should be correct according to RFC1952, but the problems in my
package occur at member boundaries, indicating that it's not.

The relevant sections of RFC1952[3] are below.

Can anyone see what I am doing wrong?

thanks
Dan

[1] https://groups.google.com/d/topic/golang-nuts/VFfzYiI2rDc
[2] http://code.google.com/p/biogo/source/browse/bgzf/bgzf.go?repo=bam#108
[3] http://www.ietf.org/rfc/rfc1952.txt


2.2. File format

A gzip file consists of a series of "members" (compressed data
sets). The format of each member is specified in the following
section. The members simply appear one after another in the file,
with no additional information before, between, or after them.

2.3. Member format

Each member has the following structure:

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+

(if FLG.FEXTRA set)

+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+

(if FLG.FNAME set)

+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+

(if FLG.FCOMMENT set)

+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+

(if FLG.FHCRC set)

+---+---+
| CRC16 |
+---+---+

+=======================+
|...compressed blocks...| (more-->)
+=======================+

  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
|     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+
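
The trailer layout above can be checked with a minimal Go sketch. It assumes a member with no bytes after it, so the last 8 bytes are CRC32 followed by ISIZE, both little-endian. The helper names `gzipMember`, `trailer` and `trailerMatches` are hypothetical, introduced only for this illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// gzipMember compresses data into a single, complete gzip member.
func gzipMember(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the compressed blocks and writes CRC32 and ISIZE.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// trailer returns the CRC32 and ISIZE stored in the last 8 bytes of a member.
func trailer(member []byte) (crc, isize uint32, err error) {
	if len(member) < 18 { // 10-byte fixed header plus 8-byte trailer
		return 0, 0, fmt.Errorf("member too short: %d bytes", len(member))
	}
	t := member[len(member)-8:]
	return binary.LittleEndian.Uint32(t[:4]), binary.LittleEndian.Uint32(t[4:]), nil
}

// trailerMatches reports whether the trailer agrees with the uncompressed data.
func trailerMatches(member, data []byte) bool {
	crc, isize, err := trailer(member)
	return err == nil && crc == crc32.ChecksumIEEE(data) && isize == uint32(len(data))
}

func main() {
	data := []byte("hello, gzip member")
	m, err := gzipMember(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(trailerMatches(m, data))
}
```

ISIZE is the uncompressed length modulo 2^32, so the comparison in `trailerMatches` only holds as written for inputs smaller than 4 GiB.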


2.3.1.1. Extra field

If the FLG.FEXTRA bit is set, an "extra field" is present in
the header, with total length XLEN bytes. It consists of a
series of subfields, each of the form:

+---+---+---+---+==================================+
|SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+

SI1 and SI2 provide a subfield ID, typically two ASCII letters
with some mnemonic value. Jean-Loup Gailly
<gz...@prep.ai.mit.edu> is maintaining a registry of subfield
IDs; please send him any subfield ID you wish to use. Subfield
IDs with SI2 = 0 are reserved for future use. The following
IDs are currently defined:

SI1         SI2         Data
----------  ----------  ----
0x41 ('A')  0x70 ('P')  Apollo file type information

LEN gives the length of the subfield data, excluding the 4
initial bytes.
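
As an illustration of the subfield layout quoted above, here is a minimal Go sketch that walks an extra field and splits it into subfields. The `Subfield` type and `parseExtra` function are hypothetical names for this example; in Go the raw extra bytes would typically come from the `Extra` field of `gzip.Header`. LEN is read as a little-endian 16-bit value, as for the other multi-byte fields in RFC 1952.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Subfield is one entry of a gzip extra field (RFC 1952, section 2.3.1.1).
type Subfield struct {
	SI1, SI2 byte
	Data     []byte
}

// parseExtra splits the XLEN bytes of an extra field into its subfields.
func parseExtra(extra []byte) ([]Subfield, error) {
	var subs []Subfield
	for len(extra) > 0 {
		if len(extra) < 4 {
			return nil, errors.New("truncated subfield header")
		}
		// LEN excludes the 4 initial bytes (SI1, SI2 and LEN itself).
		n := int(binary.LittleEndian.Uint16(extra[2:4]))
		if len(extra) < 4+n {
			return nil, errors.New("truncated subfield data")
		}
		subs = append(subs, Subfield{SI1: extra[0], SI2: extra[1], Data: extra[4 : 4+n]})
		extra = extra[4+n:]
	}
	return subs, nil
}

func main() {
	// Made-up extra field: "AP" with 3 data bytes, then "XY" with none.
	extra := []byte{'A', 'P', 3, 0, 1, 2, 3, 'X', 'Y', 0, 0}
	subs, err := parseExtra(extra)
	if err != nil {
		panic(err)
	}
	for _, s := range subs {
		fmt.Printf("%c%c len=%d\n", s.SI1, s.SI2, len(s.Data))
	}
}
```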


Russ Cox

Sep 13, 2012, 10:56:49 AM
to Dan Kortschak, golan...@googlegroups.com
In general the concatenation of two gzip files is itself a valid gzip
file. Perhaps the problem is that you expect the gzip reader to read
only the first in a concatenation and it is reading all of them?
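
A minimal sketch of that behaviour, assuming the current standard library `compress/gzip` (whose reader treats concatenated members as one stream by default) and two hypothetical helpers, `member` and `readAll`:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// member compresses s into one complete gzip member.
func member(s string) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readAll decompresses b with the default reader settings, which
// continue past member boundaries until the input is exhausted.
func readAll(b []byte) (string, error) {
	r, err := gzip.NewReader(bytes.NewReader(b))
	if err != nil {
		return "", err
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	a, err := member("hello, ")
	if err != nil {
		panic(err)
	}
	b, err := member("world")
	if err != nil {
		panic(err)
	}
	// Plain byte concatenation of two members is itself a valid gzip file.
	got, err := readAll(append(a, b...))
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints "hello, world"
}
```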

I tried running your test (go test code.google.com/p/biogo.bam/bgzf)
and it passes. Can you make it fail so I can understand better what is
going wrong?

Thanks.
Russ

Dan Kortschak

Sep 13, 2012, 5:52:47 PM
to Russ Cox, golan...@googlegroups.com
Thanks Russ,

Yes, the bgzf tests passed, but the failing test is not yet coded (it is a command-line interaction with another tool). It turns out I had misread the bgzf spec and encoded the uncompressed size in the extra field rather than the compressed block size. Since it was a spec translation error, the tests tested that behaviour and so passed. The application I was interacting with was less confused.
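
For illustration only, here is a sketch of storing the compressed block size in the extra field. It is not the biogo implementation. It assumes the layout described in the BGZF section of the SAM specification: a subfield with SI1 = 'B', SI2 = 'C' and a 2-byte little-endian BSIZE equal to the total compressed block size minus 1. The function name `bgzfBlock` and the constant `bsizeOffset` are hypothetical, and a real BGZF writer has further requirements not covered here.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"errors"
	"fmt"
)

// bsizeOffset is where BSIZE lands when "BC" is the only optional header
// field: 10 (fixed header) + 2 (XLEN) + 4 (SI1, SI2, LEN).
const bsizeOffset = 16

// bgzfBlock compresses data as one gzip member carrying a "BC" subfield
// whose value is the size of the whole compressed member minus 1.
func bgzfBlock(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	// Write a placeholder first: the compressed size is not known yet.
	w.Extra = []byte{'B', 'C', 2, 0, 0, 0}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	b := buf.Bytes()
	if len(b) > 1<<16 {
		return nil, errors.New("compressed block too large for 16-bit BSIZE")
	}
	// Patch in the compressed size, not the uncompressed size. The header
	// has no FHCRC here, so rewriting these bytes invalidates nothing.
	binary.LittleEndian.PutUint16(b[bsizeOffset:], uint16(len(b)-1))
	return b, nil
}

func main() {
	b, err := bgzfBlock([]byte("example payload"))
	if err != nil {
		panic(err)
	}
	bsize := binary.LittleEndian.Uint16(b[bsizeOffset:])
	fmt.Println(int(bsize)+1 == len(b)) // prints "true"
}
```

A test that decodes the value with the same wrong interpretation used to encode it will pass, which is why only interaction with an independent reader exposed the error described above.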

I have fixed this and now the system works (in current tip). Sorry to have not updated the post to indicate this, I didn't think anyone was interested in that problem.

BTW I use a modified gzip that allows you to specify whether to stop at a member end, though this is not currently used in my client.
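
In current Go releases a modified package is no longer needed for this: the standard `compress/gzip` reader has a `Multistream` method, and calling `Multistream(false)` makes it stop at the end of each member. The sketch below follows the pattern from the package documentation; `member` and `readMembers` are hypothetical helper names. The underlying reader must implement `io.ByteReader` (as `bytes.Reader` does) so that gzip does not read past the member boundary.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// member compresses s into one complete gzip member.
func member(s string) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readMembers decompresses each gzip member of b separately.
func readMembers(b []byte) ([]string, error) {
	br := bytes.NewReader(b)
	zr, err := gzip.NewReader(br)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var out []string
	for {
		// Must be set again after every Reset.
		zr.Multistream(false)
		data, err := io.ReadAll(zr)
		if err != nil {
			return nil, err
		}
		out = append(out, string(data))
		// Reset positions the reader at the next member, if there is one.
		err = zr.Reset(br)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

func main() {
	a, err := member("first")
	if err != nil {
		panic(err)
	}
	b, err := member("second")
	if err != nil {
		panic(err)
	}
	parts, err := readMembers(append(a, b...))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(parts), parts[0], parts[1]) // prints "2 first second"
}
```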

Dan