Question about handling file compression variations


wendy...@gmail.com

Jun 17, 2024, 4:26:02 PM
to Dataverse Users Community
Hi everyone.

I'm working with some researchers who are uploading a series of compressed archives, each containing a large number of files, to the George Mason University Dataverse. We are currently running v. 5.14, build 1471-9f4ddbb.

I've told them that they need to double-compress their archives so that Dataverse doesn't unpack them (each archive has ~38,000 files). They are zipping and then zipping again. However, when they add the file to our Dataverse (I have also tested it), we get the spinning blue wheel that never resolves.

The file adds with no problem if I zip and then gzip, so I suggested that they do that.

The complication is that they have also been trying it out in the Demo Dataverse (which runs a different version), and their double-zipped file uploads there without any issue.

Do you all have an explanation for why the file won't add on the GMU Dataverse? I think our Dataverse is getting hung up on the file being zipped twice and stalling when it attempts to unpack the first zip. We've tried it with 2.5 GB and 1.5 GB files. It also seems to me that the version difference may matter. For context, our upload limit has been raised to 7 GB to accommodate their project.

They don't like my explanation or my suggestion (which works). Any additional information would help, and I would like to know if I'm on the wrong track.

I hope that all made sense.

Thanks so much,

Wendy Mann
George Mason University

James Myers

Jun 17, 2024, 5:38:01 PM
to dataverse...@googlegroups.com

Wendy,

I'm not aware of any specific code changes since 5.14 that would affect this. (There has been a lot of change to add storage quotas, etc., so it's a bit hard to tell, but I don't think the basic code that unzips once has changed. Perhaps others will remember something.)

 

One possibility would just be that you're running out of temporary space: a normal upload copies the file and then unzips it to get the inner zip, so you might need roughly 2x the file size to succeed. (There could be enough space in your persistent file store but not in the directories/on the volume assigned as temporary space.) I'm not sure that's consistent with your observation that a zip with a gzip inside works, unless that results in smaller files (or they have just been trying at times when other activity is using up temp space). Hopefully there is something in the server.log file that indicates what's causing things to hang, or you might be able to see disk usage hitting 100% on some volume.
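One rough way to check while a stuck upload is in progress (the paths below are assumptions; the temp directory location depends on how your installation is laid out):

# watch free space on the volume holding the Dataverse temp upload directory
watch -n 5 df -h /usr/local/payara5/glassfish/domains/domain1/files/temp

# and follow the server log while the upload hangs
tail -f /usr/local/payara5/glassfish/domains/domain1/logs/server.log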

 

In terms of recommendations, it probably is useful to have the inner file be a .zip, as there is a Zip Previewer available that would let you see the 38K files inside. (I don't think .gz is currently supported.) That may be less relevant if you're not running that previewer.

 

The other general recommendation, more at the installation-config level, would be to consider S3 storage, which allows direct upload. Direct upload doesn't unzip at all by default, and it avoids any temporary copies that could cause issues on the Dataverse server (not to mention giving faster/more robust uploads). I would say uploads beyond a GB are where that starts to be worth the effort, but it is obviously a significant technical change.
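For reference, direct upload is enabled per S3 store via JVM options. A rough sketch from memory of the configuration guide (the store id "s3" and the bucket name are placeholders; please verify the option names against the guide for your version):

./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.label=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=YOUR_BUCKET"
./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"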

 

Hopefully that helps. If you get more clues from the log, or have a sharable file that consistently fails on 5.14, let us know and we might be able to identify some other issue.

 

-- Jim


Paul Boon

Jun 18, 2024, 3:15:33 AM
to dataverse...@googlegroups.com
Hi Wendy,

It might be a good idea to check the Apache httpd config for a ProxyPass line. If I remember correctly, it has a default timeout of 300 seconds, so if the upload takes longer you get the never-ending spinner.
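For example, a ProxyPass line with a timeout might look like this (a sketch; the AJP backend address and the 600-second value are assumptions, so adjust both to your setup):

ProxyPass / ajp://localhost:8009/ timeout=600
ProxyPassReverse / ajp://localhost:8009/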
The guide is here: 
It states: "You may wish to also add a timeout directive to the ProxyPass line within ssl.conf. This is especially useful for larger file uploads as apache may prematurely kill the connection before the upload is processed."

Success,
Paul



wendy...@gmail.com

Jun 18, 2024, 9:37:14 AM
to Dataverse Users Community
Thank you for the responses - they're a big help!

Best,

Wendy

Sherry Lake

Jun 18, 2024, 9:53:37 AM
to dataverse...@googlegroups.com
Hi Wendy,

I'm adding some notes on how I have, in the past, successfully double-zipped files for upload to Dataverse (so they stay zipped). I'm not sure whether the zip method itself is part of the problem in your situation.

UVA now has S3 storage with direct upload, so we no longer need to double zip (as Jim says in his email, direct upload keeps zips zipped).

From my notes on double zipping:

 

On a Mac - a couple of ways:

After using right-click > Compress on a folder (which creates a zipped file of the folder contents),

I use the command line and gzip (which gzips the already-zipped file):

-   gzip   FileName.zip    =>  creates FileName.zip.gz 

-   Then upload the file Filename.zip.gz to Dataverse

 

You can also use the "zip" command (twice), where "originalFolder" is the unzipped folder:

-  zip  -r   new.zip   originalFolder    (the -r is needed so the folder's contents are included)

-  Creates a new file "new.zip"

-  Then zip that file:    zip    newdouble.zip   new.zip

-  Upload "newdouble.zip" to Dataverse - it will do one "unzip" and leave "new.zip" in its place
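-  A quick sanity check before uploading:   unzip -l newdouble.zip   =>  should list just "new.zip" (a single entry means the inner zip is intact)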

 

Here are instructions from the Australian Data Archive (which is also a Dataverse installation) on how to double zip via Windows. Note - I have not tried this personally:



wendy...@gmail.com

Jun 19, 2024, 9:40:32 AM
to Dataverse Users Community
Thanks, Sherry. The double-zip method itself is not a problem; the issue is that Dataverse stalls when uploading the double-zipped file.

Best,

Wendy

wendy...@gmail.com

Jun 24, 2024, 11:31:47 AM
to Dataverse Users Community
Hi everyone. Paul's suggestion resolved the problem for us. Thanks again!

Wendy
