Question is, if the size of the data block is given *after* the data block
itself, then how am I supposed to know where to find it since the number of
bytes used by the data block is not yet known? Seems like the only option is
seek through individual bytes until I find a signature beginning a new
header, then back up 12 bytes to get the descriptor. But if I do that, then
by simple subtraction I already know what the size is, so there's little
point in retrieving that information. And of course this would be a fairly
slow operation. It can't be right... can it? Am I reading it wrong?
What is the purpose of setting bit 3, anyway? I don't understand why you
would want to move this data out of the local file header.
By decompressing the deflate data and seeing where it ends. The
deflate data is self-terminating.
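
A minimal sketch of what "self-terminating" means in practice, using Python's zlib module. The function name `deflate_length` is made up for illustration, and the sketch assumes the entry uses method 8 (deflate) and that the whole remainder of the file fits in memory:

```python
import zlib

def deflate_length(buf: bytes) -> int:
    """Return how many leading bytes of buf belong to one raw deflate stream."""
    d = zlib.decompressobj(-15)  # negative wbits selects raw deflate, as stored in zip
    d.decompress(buf)            # output is discarded; only the end position matters
    if not d.eof:
        raise ValueError("deflate stream is truncated")
    # Whatever the decompressor did not consume lies after the stream's end.
    return len(buf) - len(d.unused_data)
```

Called on the bytes that follow a local header, the return value is the compressed size, so the data descriptor starts at that offset.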
> What is the purpose of setting bit 3, anyway? I don't understand why you
> would want to move this data out of the local file header.
So that the zip file can be written as a stream without seeking.
mark
When writing:
The stream (or pipe) is read until EOF is reached. At that point you
know everything needed to terminate the entry: just append 12 bytes
(CRC, bytes written, bytes read; that is, CRC-32, compressed size,
uncompressed size, in that order).
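
As a rough sketch of the writing side, the following Python function streams one deflated entry to an output that is never seeked. The name `write_streamed_entry` is hypothetical, the time and date fields are left at zero, and a complete archive would still need a central directory after the last entry. The descriptor is written in the order CRC-32, compressed size, uncompressed size:

```python
import struct
import zlib

LOCAL_SIG = 0x04034B50
FLAG_DESCRIPTOR = 1 << 3  # general purpose bit 3

def write_streamed_entry(src, out, name: bytes):
    """Copy src to out as one deflated zip entry without seeking in out."""
    # Local header: CRC and both sizes are zero because they are not known yet.
    out.write(struct.pack("<IHHHHHIIIHH", LOCAL_SIG, 20, FLAG_DESCRIPTOR, 8,
                          0, 0, 0, 0, 0, len(name), 0))
    out.write(name)
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate
    crc = csize = usize = 0
    while chunk := src.read(65536):
        crc = zlib.crc32(chunk, crc)
        usize += len(chunk)
        data = comp.compress(chunk)
        csize += len(data)
        out.write(data)
    data = comp.flush()
    csize += len(data)
    out.write(data)
    # Data descriptor: CRC-32, compressed size, uncompressed size.
    out.write(struct.pack("<III", crc, csize, usize))
    return crc, csize, usize
```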
When reading:
Read the central directory entry. It gives you the offset of the local
header, the compressed size, the CRC and the uncompressed size. Just
seek past the local header to the data bytes, read the compressed data
and decompress it. Check the CRC and data size against those stored in
the central directory entry.
On the subject of central directory, is there a way to seek directly to it?
Since it's located at the end of the file, along with variable length
comments, it seems that there is no way to access it directly either from
the front or back of the file.
I just get the feeling that there is a lot of overhead in retrieving
metadata from a zipfile, having to do a lot of reading/seeking and even
dry-decompressing just to access the next header. Is this just how zip files
go, or am I missing a concept somewhere?
Specifically, say that I want to provide an explorer-like interface to the
zip file (as with Windows XP). Which means that I typically want to get a
file manifest without actually performing more cpu-intensive operations like
decoding. I can just read in the size of the compressed data (or actually
decompress it in the case that bit 3 is set) and seek past it until I find
the central directory signature. But in doing this, I basically reconstruct
the file manifest from the local headers, and the central directory seems
redundant in this case.
No. You should first seek to the very end and read backward until you
find the end of central directory record, which tells you where the
central directory begins. As you process the entries, keep them
somewhere you can access quickly (like std::map<std::string, CDEntry>
keyed by file name, or std::set<CDEntry>). If you search the web you
should find source code that shows how it is done.
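
The backward search described above might look roughly like this in Python. The names `find_eocd` and `read_central_directory` are illustrative, and the sketch assumes a single-disk archive with no zip64 records and no data prepended to the file. Requiring the comment length to reach exactly to the end of the file rejects some false matches, though a crafted comment could still pass:

```python
import struct

EOCD_SIG = b"PK\x05\x06"
CDH_SIG = 0x02014B50

def find_eocd(f):
    """Scan backward from the end for the end of central directory record."""
    f.seek(0, 2)
    size = f.tell()
    tail_len = min(size, 22 + 0xFFFF)  # fixed part plus the largest possible comment
    f.seek(size - tail_len)
    tail = f.read(tail_len)
    pos = tail.rfind(EOCD_SIG)
    while pos >= 0:
        if pos + 22 <= len(tail):
            fields = struct.unpack("<IHHHHIIH", tail[pos:pos + 22])
            # Accept only a record whose comment runs exactly to end of file.
            if pos + 22 + fields[7] == len(tail):
                return fields
        pos = tail.rfind(EOCD_SIG, 0, pos)
    raise ValueError("no end of central directory record found")

def read_central_directory(f):
    """Return a dict mapping each raw file name to its central directory fields."""
    _, _, _, _, total, cd_size, cd_offset, _ = find_eocd(f)
    f.seek(cd_offset)
    cd = f.read(cd_size)
    entries, pos = {}, 0
    for _ in range(total):
        h = struct.unpack("<IHHHHHHIIIHHHHHII", cd[pos:pos + 46])
        if h[0] != CDH_SIG:
            raise ValueError("bad central directory header")
        name_len, extra_len, comment_len = h[10], h[11], h[12]
        name = cd[pos + 46:pos + 46 + name_len]
        entries[name] = {"offset": h[16], "crc": h[7], "csize": h[8],
                         "usize": h[9], "flags": h[3], "method": h[4]}
        pos += 46 + name_len + extra_len + comment_len
    return entries
```

A dict keyed by name plays the role of the quick-access container here; no compressed data is read or decoded to build it.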
>
> I just get the feeling that there is a lot of overhead in retrieving
> metadata from a zipfile, having to do a lot of reading/seeking and even
> dry-decompressing just to access the next header. Is this just how zip files
> go, or am I missing a concept somewhere?
>
> Specifically, say that I want to provide an explorer-like interface to the
> zip file (as with Windows XP). Which means that I typically want to get a
> file manifest without actually performing more cpu-intensive operations like
> decoding. I can just read in the size of the compressed data (or actually
> decompress it in the case that bit 3 is set) and seek past it until I find
> the central directory signature. But in doing this, I basically reconstruct
> the file manifest from the local headers, and the central directory seems
> redundant in this case.
As I said above, once you have read the central directory and built
structures for easy access to the entries, extracting a file is just a
matter of finding its entry and seeking to the local data. You shouldn't
need to scan the local headers on their own, except in cases such as
repairing a corrupted zip file; otherwise read the central directory.
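
A sketch of that lookup-then-seek step, in Python. It leans on the standard zipfile module only to load the central directory (the `ZipInfo` fields `header_offset`, `compress_size`, `file_size`, `CRC` and `compress_type`), then does the seek and decompression by hand. The function name `read_member` is made up, and only stored and deflated entries are handled:

```python
import struct
import zipfile
import zlib

def read_member(f, info: zipfile.ZipInfo) -> bytes:
    """Extract one member using the sizes and CRC from its central directory entry."""
    f.seek(info.header_offset)
    sig, name_len, extra_len = struct.unpack("<I22xHH", f.read(30))
    if sig != 0x04034B50:
        raise ValueError("local header signature missing")
    # Skip the local name and extra field; their lengths come from the local header.
    f.seek(name_len + extra_len, 1)
    raw = f.read(info.compress_size)  # size taken from the central directory
    if info.compress_type == zipfile.ZIP_DEFLATED:
        data = zlib.decompress(raw, -15)
    elif info.compress_type == zipfile.ZIP_STORED:
        data = raw
    else:
        raise NotImplementedError("only stored and deflated members are handled")
    if len(data) != info.file_size or zlib.crc32(data) != info.CRC:
        raise ValueError("size or CRC mismatch")
    return data
```

Because the compressed size comes from the central directory, this works the same way whether or not bit 3 is set in the local header.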
Not reliably, no. This is a flaw in the zip file format, since the end
comment could in principle have contents that look just like a central
directory, fooling a program trying to search from the end.
The only absolutely guaranteed way to interpret a zip file is to read
it from beginning to end (or at least up to the central directory, if
the local headers have enough information for you). In the case where
bit 3 is set, that requires decompressing the data for that entry. I
don't know how common that is though, so that still may in general be a
quick way to scan a zip file if in fact it's not common.
mark
Mark Adler wrote:
> Kevin C. wrote:
> > On the subject of central directory, is there a way to seek directly to it?
> > Since it's located at the end of the file, along with variable length
> > comments, it seems that there is no way to access it directly either from
> > the front or back of the file.
>
> Not reliably, no. This is a flaw in the zip file format, since the end
> comment could in principle have contents that look just like a central
> directory, fooling a program trying to search from the end.
>
> The only absolutely guaranteed way to interpret a zip file is to read
> it from beginning to end (or at least up to the central directory, if
Yes, you need to do that when an entry in the central directory is
invalid, but it is not very efficient when the zip file has a lot of
entries. So I think the central directory should be read first. There
are ways to verify that an entry is valid (such as seeking to the local
header and checking it against the central directory entry when reading
the data bytes).
> the local headers have enough information for you). In the case where
> bit 3 is set, that requires decompressing the data for that entry. I
Or the next local header signature can be searched without
decompressing. When found, if the compressed size justifies the
difference between two offsets, the 12 bytes before the next local
header is supposed to be correct.
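
A sketch of this signature search in Python, under the assumption that the entry is followed either by another local header or by the central directory. The name `guess_descriptor` is hypothetical. It is a heuristic only: compressed data can contain the same byte patterns, so a match is probable rather than certain:

```python
import struct

SIGS = (b"PK\x03\x04", b"PK\x01\x02")  # next local header, or central directory

def guess_descriptor(buf: bytes, data_start: int):
    """Guess (crc, csize, usize) for the entry whose data begins at data_start.

    Returns None when no candidate passes the size check.
    """
    pos = data_start
    while True:
        hits = [p for p in (buf.find(s, pos) for s in SIGS) if p >= 0]
        if not hits:
            return None
        sig_at = min(hits)
        if sig_at - 12 >= data_start:
            crc, csize, usize = struct.unpack_from("<III", buf, sig_at - 12)
            gap = sig_at - data_start - csize
            # gap is 12 for a bare descriptor, 16 when it carries PK\x07\x08.
            if gap == 12 or (gap == 16 and
                             buf[sig_at - 16:sig_at - 12] == b"PK\x07\x08"):
                return crc, csize, usize
        pos = sig_at + 1  # size check failed; try the next signature match
```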
You can check for consistency, but a pathological file can meet those
checks and still lead you astray.
> Or the next local header signature can be searched without
> decompressing. When found, if the compressed size justifies the
> difference between two offsets, the 12 bytes before the next local
> header is supposed to be correct.
This too only works with some high probability, not with certainty. My
statement stands, which is that due to a flaw in the design of the zip
file format, the only absolutely guaranteed way to interpret a zip file
is to read it from the beginning to the end, including decompressing
where necessary to find the next entry.
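
A front-to-back walk of that kind could be sketched as follows in Python. The function name `walk_local_headers` is made up, the whole archive is assumed to be in memory, zip64 and encryption are ignored, and entries with bit 3 set are handled only for deflate. Re-slicing the buffer for every streamed entry is wasteful and is done here only to keep the example short:

```python
import struct
import zlib

def walk_local_headers(buf: bytes):
    """Yield (name, crc, csize, usize) by reading entries from the front."""
    pos = 0
    while buf[pos:pos + 4] == b"PK\x03\x04":
        (flags, method, crc, csize, usize,
         name_len, extra_len) = struct.unpack_from("<6xHH4xIIIHH", buf, pos)
        name = buf[pos + 30:pos + 30 + name_len]
        data_start = pos + 30 + name_len + extra_len
        if flags & 0x08:
            if method != 8:
                raise NotImplementedError("can only find the end of deflate data")
            # Decompress to learn where the data ends.
            d = zlib.decompressobj(-15)
            d.decompress(buf[data_start:])
            if not d.eof:
                raise ValueError("truncated deflate data")
            csize = len(buf) - data_start - len(d.unused_data)
            pos = data_start + csize
            if buf[pos:pos + 4] == b"PK\x07\x08":  # optional descriptor signature
                pos += 4
            crc, _, usize = struct.unpack_from("<III", buf, pos)
            pos += 12
        else:
            pos = data_start + csize  # sizes in the header are trustworthy here
        yield name, crc, csize, usize
```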
mark
>As in the zip format spec, when the general purpose bit flag has bit 3 set,
>the crc, compressed size, and uncompressed size are set to 0 in the local
>file header and the values are moved to a data descriptor, after the
>compressed data block.
>
>Question is, if the size of the data block is given *after* the data block
>itself, then how am I supposed to know where to find it since the number of
>bytes used by the data block is not yet known? Seems like the only option is
>seek through individual bytes until I find a signature beginning a new
>header, then back up 12 bytes to get the descriptor. But if I do that, then
>by simple subtraction I already know what the size is, so there's little
>point in retrieving that information. And of course this would be a fairly
>slow operation. It can't be right... can it? Am I reading it wrong?
Actually this data descriptor has its own 4-byte header (a signature),
similar to the other headers. If you force the creation of a zip file
that has these (e.g. by writing through a pipe) then you'll see what I
mean. The zip format spec doesn't mention this at all, which causes all
sorts of problems.
>What is the purpose of setting bit 3, anyway? I don't understand why you
>would want to move this data out of the local file header.
For when you can't seek in the output stream to go back & fill in the
values.
Errol Smith
errol <at> ros (dot) com [period] au
I didn't know this? What are those bytes?
'P' 'K' ? ?
Yes, bitten by this myself. From http://www.cs.tut.fi/~albert/Dev/puzip/
15.1.2000
Fooled by the docs again. Appnote.txt omitted the information that
the data descriptor also has a header: $50,$4b,$07,$08.
-Pasi
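
Since some writers emit the signature and others do not, a reader has to accept both forms. A minimal Python sketch, with `parse_data_descriptor` as an illustrative name and assuming the 32-bit (non-zip64) descriptor layout:

```python
import struct

DESCRIPTOR_SIG = b"PK\x07\x08"  # the bytes $50,$4b,$07,$08

def parse_data_descriptor(buf: bytes, pos: int):
    """Return (crc, csize, usize, length) for the descriptor starting at pos."""
    if buf[pos:pos + 4] == DESCRIPTOR_SIG:
        # Ambiguous in principle: a CRC could happen to equal these four bytes.
        crc, csize, usize = struct.unpack_from("<III", buf, pos + 4)
        return crc, csize, usize, 16
    crc, csize, usize = struct.unpack_from("<III", buf, pos)
    return crc, csize, usize, 12
```

The returned length (12 or 16) tells the caller how far to advance to reach the next header.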