
reading zip file


Eric Douglas

Aug 8, 2019, 10:12:59 AM
I'm working on a new client-server application which requires saving client files to disk for every client that runs it. All files required by the client are in a zip file, hosted by a web service on a local domain server.
I'm writing the unzipped files to the client machine using this loop.

ZipEntry zipEntry = null;
try (ByteArrayInputStream is = new ByteArrayInputStream(zipBytes); ZipInputStream zis = new ZipInputStream(is)) {
while ((zipEntry = zis.getNextEntry()) != null) {

zipBytes is a byte[] generated from org.apache.http.util.EntityUtils.toByteArray(). The zip file is currently over 400 MB, and I'm not sure how big it could eventually get. Do I have to get the entire zip file into memory before extracting? Should I be concerned about how much memory this could require? Is there a better way?
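For reference, a self-contained sketch of the extraction loop as described (the zip here is built in memory purely so the example runs standalone; in the real code the bytes come from the HTTP response, and the inner read loop would write to a FileOutputStream instead of just counting):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnzipSketch {
    // Walk every entry in zipBytes, returning the total uncompressed bytes read.
    static long extract(byte[] zipBytes) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry zipEntry;
            while ((zipEntry = zis.getNextEntry()) != null) {
                int n;
                while ((n = zis.read(buf)) > 0) {
                    total += n; // real code would write buf[0..n) to a FileOutputStream here
                }
                zis.closeEntry();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory so the sketch is runnable as-is.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello world".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        System.out.println(extract(bos.toByteArray())); // prints 11
    }
}
```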

Arne Vajhøj

Aug 8, 2019, 10:48:15 AM
If the binary ZIP file is URL encoded then it is a big mess.

If the binary zip file is present as binary data, then I think you can
change from:

ZipInputStream zis = new ZipInputStream(new
ByteArrayInputStream(EntityUtils.toByteArray(entity)));

to:

ZipInputStream zis = new ZipInputStream(entity.getContent());

(code simplified)

Arne

Eric Douglas

Aug 8, 2019, 11:00:21 AM
On Thursday, August 8, 2019 at 10:48:15 AM UTC-4, Arne Vajhøj wrote:
> If the binary ZIP file is URL encoded then it is a big mess.
>
> If the binary zip file is present as binary data, then I think you can
> change from:
>
> ZipInputStream zis = new ZipInputStream(new
> ByteArrayInputStream(EntityUtils.toByteArray(entity)));
>
> to:
>
> ZipInputStream zis = new ZipInputStream(entity.getContent());
>
> (code simplified)
>
> Arne

I'm not sure what you mean by "is present as binary data". I was wondering what the purpose of EntityUtils is there, if I could just use getContent().

If it matters, I am reading the stream twice: once to total up the sizes of the files in the zip for a progress bar, and once to extract them. Does that mean it's much more efficient to convert the HttpEntity to a byte[] array first? Would using getContent() as the input to the zip stream make the second read slower but use much less memory?

Arne Vajhøj

Aug 8, 2019, 11:30:40 AM
On 8/8/2019 11:00 AM, Eric Douglas wrote:
> On Thursday, August 8, 2019 at 10:48:15 AM UTC-4, Arne Vajhøj wrote:
>> If the binary ZIP file is URL encoded then it is a big mess.
>>
>> If the binary zip file is present as binary data, then I think you can
>> change from:
>>
>> ZipInputStream zis = new ZipInputStream(new
>> ByteArrayInputStream(EntityUtils.toByteArray(entity)));
>>
>> to:
>>
>> ZipInputStream zis = new ZipInputStream(entity.getContent());
>>
>> (code simplified)
>
> I'm not sure what you mean by "is present as binary data".

Binary data 0x01 0x02 0x03 is 3 bytes - in URL encoding it would be
"%01%02%03" 9 bytes.

> I was wondering what the purpose of EntityUtils is there, if I could just use getContent().
>
> If it matters, I am reading the stream twice, once to check the total
> size of the zip files for a progress bar and once to extract them.
> Should this mean it's much more efficient to convert the HttpEntity
> to a byte[] array first? Should using getContent() as the input to
> the zip stream make the second read slower but use much less memory?

You will need to find it in the documentation or do a test.

But my guess is that getContent() will work the first time
and fail the second time.

Meaning that toByteArray may make sense if you really need that
progress bar.

Question: can't you just use the HTTP content length instead?

Arne
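To illustrate the content-length idea: progress can be driven by how many bytes of the response body have been consumed so far, measured against entity.getContentLength(). The sketch below uses a hand-rolled counting wrapper (Commons IO's CountingInputStream would do the same job) and substitutes an in-memory stream for entity.getContent(), so the HTTP pieces are assumptions, not tested code:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CountingProgress {
    // Minimal counting wrapper; tracks how many bytes have passed through.
    static class CountingInputStream extends FilterInputStream {
        long count;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1000];     // stands in for the response body
        long contentLength = payload.length; // stands in for entity.getContentLength()
        CountingInputStream cis =
                new CountingInputStream(new ByteArrayInputStream(payload));
        byte[] buf = new byte[256];
        while (cis.read(buf) > 0) {
            int percent = (int) (100 * cis.count / contentLength);
            // a SwingWorker would publish(percent) here
        }
        System.out.println(cis.count); // prints 1000
    }
}
```

The ZipInputStream would then wrap the counting stream, so extraction and progress tracking happen in a single pass.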

Eric Douglas

Aug 8, 2019, 11:42:53 AM
On Thursday, August 8, 2019 at 11:30:40 AM UTC-4, Arne Vajhøj wrote:
> You will need to find it in the documentation or do a test.
>
> But my guess is that getContent() will work the first time
> and fail the second time.
>
> Meaning that toByteArray may make sense if you really need that
> progress bar.
>
> Question: can't you just use HTTP content length instead??
>
> Arne

If tying the zip stream to getContent() reads the data as needed, versus toByteArray() which of course pulls the entire zip file contents into memory, that may be something to consider. A second zip stream may just need another call to HttpResponse.getEntity(), or possibly to HttpClient.execute().

I expect HTTP content length would give me the size of the zip file. I want to know the total size of the files in the zip file, which is what the updates to the progress bar would be based on.
(within the doInBackground() of a SwingWorker)
zipProgress += zipEntry.getSize();
publish(zipProgress);

Daniele Futtorovic

Aug 8, 2019, 1:50:03 PM
Yes. Don't do it like that. Save the ZIP file locally. Then extract it.
Don't do `EntityUtils.toByteArray()`. Stream the stuff. Same for the
extraction: streams, streams, streams.

Incidentally, an IMHO cleaner and more modern approach to the problem
would be this:
One endpoint on the server that returns the list of files and the URL
*each* can be retrieved at. Another endpoint that returns each
individual file. You can add a GZIP filter at the transport level if
compression is required. But then it would be transparent to your code,
as well as to any tooling you might develop around this (think executing
requests manually in the browser for debugging purposes).

--
DF.
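A minimal sketch of the save-then-extract approach, assuming the response body arrives as a plain InputStream (replaced here by an in-memory zip so the example runs standalone). A side benefit: java.util.zip.ZipFile parses the central directory, so every entry's uncompressed size is known before extraction even starts, which gives the progress-bar total without a second download:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class SaveThenExtract {
    // Stream the body straight to a temp file, then read it with ZipFile,
    // which knows each entry's size up front from the central directory.
    static long totalUncompressedSize(InputStream body) throws IOException {
        Path tmp = Files.createTempFile("download", ".zip");
        try {
            Files.copy(body, tmp, StandardCopyOption.REPLACE_EXISTING);
            long total = 0;
            try (ZipFile zf = new ZipFile(tmp.toFile())) {
                Enumeration<? extends ZipEntry> en = zf.entries();
                while (en.hasMoreElements()) {
                    total += en.nextElement().getSize();
                    // extraction would use zf.getInputStream(entry) per entry
                }
            }
            return total;
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("f.txt"));
            zos.write("12345".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        System.out.println(
                totalUncompressedSize(new ByteArrayInputStream(bos.toByteArray()))); // prints 5
    }
}
```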

Andreas Leitgeb

Aug 8, 2019, 1:51:29 PM
Eric Douglas <e.d.pro...@gmail.com> wrote:
> I expect HTTP content length would give me the size of the zip file.
> I want to know the total size of the files in the zip file, which is
> what the updates to the progress bar would be based on.

Are you serious? You'd download the 400MB zip file twice - once
just for measuring, and then again for the contents?

Are you sure, you cannot cache it locally on disk?
Are you sure you cannot request some kind of digest first,
that would just have nr of files and maybe total uncompressed
size?

Do you expect the download to be faster than saving the extracted
files on disk?

Eric Douglas

Aug 8, 2019, 3:58:37 PM
On Thursday, August 8, 2019 at 1:50:03 PM UTC-4, Daniele Futtorovic wrote:
> Yes. Don't do it like that. Save the ZIP file locally. Then extract it.
> Don't do `EntityUtils.toByteArray()`. Stream the stuff. Same for the
> extraction: streams, streams, streams.
>
> Incidentally, a IMHO cleaner and more modern approach to the problem
> would be this:
> One endpoint on the server that returns the list of files and the URL
> *each* can be retrieved at. Another endpoint that returns each
> individual file. You can add a GZIP filter at the transport level if
> compression is required. But then it would be transparent to your code,
> as well as to any tooling you might develop around this (think executing
> requests manually in the browser for debugging purposes).
>
> --
> DF.

The download has to be a single zip file. I didn't decide that; it's how the files are hosted on a third-party application service. Using an index and downloading separate files without the compression could make sense if the index can tell you the complete size of the files for the progress bar, and if we're worried about a dropped download having to restart from the middle.

Reading the file from the 'web service' should be faster than saving it to disk, as it's a Jetty-based web service on a local domain server and most users still have 7200 rpm hard drives. I have recommended getting more SSDs. Writing the zip to disk only to stream it back in would only seem to make sense if the hard disk is definitely faster than the web service and there's likely not enough RAM to hold the entire zip (there should be plenty of disk space for it), or if they need to keep the whole zip file for later anyway (in this case they don't).

I know I grabbed the unzip logic from this example (https://www.baeldung.com/java-compress-and-uncompress); I'm not sure where I got the EntityUtils.toByteArray() call, as it's not something I recall finding myself.

Joerg Meier

Aug 8, 2019, 4:06:22 PM
On Thu, 8 Aug 2019 08:00:10 -0700 (PDT), Eric Douglas wrote:

> If it matters, I am reading the stream twice, once to check the total
> size of the zip files for a progress bar and once to extract them.

Without wanting to be rude, I really think you need to take a step back
here and consider what you are doing. This is insane. By the time you have
enough information to display the progress bar, you are already COMPLETELY
FINISHED with what you want to display the progress of. Then you throw it
all away to repeat the exact process, making people wait and watch a
progress bar for something that is already done before the bar shows up.

You are, effectively, making people watch a progress bar that goes from
101% to 200%.

Liebe Gruesse,
Joerg

--
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.

Daniele Futtorovic

Aug 8, 2019, 4:39:14 PM
On 2019-08-08 21:58, Eric Douglas wrote:
> On Thursday, August 8, 2019 at 1:50:03 PM UTC-4, Daniele Futtorovic wrote:
>> Yes. Don't do it like that. Save the ZIP file locally. Then extract it.
>> Don't do `EntityUtils.toByteArray()`. Stream the stuff. Same for the
>> extraction: streams, streams, streams.
>>
>> Incidentally, a IMHO cleaner and more modern approach to the problem
>> would be this:
>> One endpoint on the server that returns the list of files and the URL
>> *each* can be retrieved at. Another endpoint that returns each
>> individual file. You can add a GZIP filter at the transport level if
>> compression is required. But then it would be transparent to your code,
>> as well as to any tooling you might develop around this (think executing
>> requests manually in the browser for debugging purposes).
>
> The download has to be a single zip file. I didn't decide that. That is how it's hosted on a third party application service. Using an index and downloading separate files without the compression could make sense if the index can tell you the complete size of the files for the progress bar, and if we're worried about dropping the download in the middle and restarting.
>
> Reading the file from the 'web service' would be faster than saving it to disk, as it's a Jetty based web service on a local domain server, and most users still have 7200 rpm hard drives. I have recommended getting more ssds. Writing it to disk only to stream it back in would only seem to make sense if the hard disk is definitely faster than the web service and there's likely not enough RAM to store the entire zip (and there should be plenty of disk space to store the zip), or they need to save the whole zip file for later anyway (in this case they don't).
>
> I know I grabbed the unzip logic from the example (https://www.baeldung.com/java-compress-and-uncompress), not sure where I got the EntityUtils.toByteArray() method, not something I recall finding myself.

You're picking nickels in front of a steamroller.

--
DF.

Eric Douglas

Aug 8, 2019, 4:50:08 PM
We can't tell them how long it will take to 'download' the files, but that should be the fast part, since the web server is on the local domain network. If there were an easy way to see the total for that, I'd add a progress bar for that part as well. I'm iterating over the file entries in the zip stream to find the total size so I can show the progress of writing them to disk, which could be slow if they don't have an SSD. Sure, that iteration adds a few seconds, but it's fast as long as it can be done in memory. Even Windows, when you copy large files, always starts with 'calculating' the total time.

Arne Vajhøj

Aug 8, 2019, 5:17:21 PM
On 8/8/2019 11:42 AM, Eric Douglas wrote:
> On Thursday, August 8, 2019 at 11:30:40 AM UTC-4, Arne Vajhøj wrote:
>> Question: can't you just use HTTP content length instead??

> I expect HTTP content length would give me the size of the zip file. I want to know the total size of the files in the zip file, which is what the updates to the progress bar would be based on.
> (within the doInBackground() of a SwingWorker)
> zipProgress += zipEntry.getSize();
> publish(zipProgress);

What about using HTTP content length and entry.getCompressedSize()?

Arne
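A sketch of that combination, with one caveat: when a zip is read as a stream, the entry sizes often arrive in the trailing data descriptor, so getCompressedSize() should only be consulted after the entry's data has been fully read (and may still be -1 for some archives). The HTTP side is again replaced by an in-memory zip so the example is runnable:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class CompressedSizeProgress {
    // Tally compressed bytes per finished entry against the total content length.
    static long extractAndTally(byte[] zipBytes, long contentLength) throws IOException {
        long done = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry e;
            while ((e = zis.getNextEntry()) != null) {
                while (zis.read(buf) > 0) { /* write buf to disk here */ }
                long csize = e.getCompressedSize(); // known now that the entry is consumed
                if (csize > 0) {
                    done += csize;
                    // a SwingWorker would publish((int) (100 * done / contentLength)) here
                }
            }
        }
        return done;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("f.txt"));
            zos.write("aaaaaaaaaaaaaaaaaaaa".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        byte[] zip = bos.toByteArray();
        System.out.println(extractAndTally(zip, zip.length) > 0); // prints true
    }
}
```

The bar would then measure the download itself rather than the uncompressed output, which may be close enough given that extraction and download happen in the same pass.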

