Big data and DVUploader


Valentina Pasquale

Mar 10, 2022, 6:44:12 AM
to Dataverse Users Community
Dear Dataverse Users,

I have a question about the upload/download of a big dataset. A researcher from our institute has asked to upload to Dataverse a ~300 GB dataset, organized in approximately 180 files of 1.5 GB each, stored in a hierarchical folder tree. All of these files are compressed archives (.7z), so they would not be unzipped by Dataverse, and the authors are fine with that.

Unfortunately, we do not have direct upload enabled in our instance, and I was wondering whether I could accommodate this request anyway (without direct upload/download) or whether I should reject it while we enable and test direct upload.

In our instance, we have set :MaxFileUploadSizeInBytes to 2.5 GB, but we have already encountered problems when uploading more than 5 GB altogether in a single upload operation from the UI, because temporary copies of the files were created on the system disk and saturated it. We managed to upload up to 50 GB to the same dataset from the UI, but only in multiple 5-GB upload operations (saving after each upload). The same applies to download, because we have set :ZipDownloadLimit to 5 GB.
With 300 GB it would be impossible to follow the same strategy and we would be forced to use scripts or the DVUploader.

Therefore, I was wondering whether I would encounter the same temporary-space problems as in the UI by using the DVUploader. If I run the DVUploader on a folder that contains the full dataset, does the tool try to upload all files at once (saving only at the end and thus saturating the disk with temporary copies), or does it save the dataset after each file's upload, thus avoiding saturation?

For the download, we would provide a script based on the APIs to download the full dataset in steps.
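
For example, a minimal sketch of such a script could look like this (the server URL, DOI, and token are placeholders; it lists the files with the native API and then streams each one through the access API):

import os
import requests

SERVER = "https://dataverse.example.org"        # placeholder
DOI = "doi:10.5072/FK2/XXXXXX"                  # placeholder
HEADERS = {"X-Dataverse-key": os.environ["DATAVERSE_API_TOKEN"]}

# List all files in the latest version of the dataset.
r = requests.get(f"{SERVER}/api/datasets/:persistentId/versions/:latest/files",
                 params={"persistentId": DOI}, headers=HEADERS)
r.raise_for_status()

for entry in r.json()["data"]:
    datafile = entry["dataFile"]
    # Re-create the original folder hierarchy locally.
    subdir = entry.get("directoryLabel", "")
    os.makedirs(subdir or ".", exist_ok=True)
    target = os.path.join(subdir, datafile["filename"])
    # Stream one file at a time to disk to keep memory use low.
    with requests.get(f"{SERVER}/api/access/datafile/{datafile['id']}",
                      headers=HEADERS, stream=True) as resp:
        resp.raise_for_status()
        with open(target, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                out.write(chunk)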

Thank you very much for your help.

All the best,

Valentina Pasquale
IIT Dataverse

Philip Durbin

Mar 10, 2022, 7:34:29 AM
to dataverse...@googlegroups.com
Hi Valentina,


A variety of workarounds have been described, but I'll paste below the "placeholder file" workaround we've used at Harvard Dataverse. We use S3, but you can do the same with file system storage.

We use a manual process that involves uploading a placeholder file and replacing it with the real file. Something like this:

- Upload a small placeholder file
- Look up the placeholder file's info in the db
- Directly upload the large file to a front-end machine
- Use the Amazon command-line utility to copy the large file to the location where the placeholder file is
- Update the db info (md5, contenttype, filesize) to match the large file

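As a very rough illustration only, the last two steps could look something like this in Python (the bucket, keys, ids, and column names are placeholders written from memory and may differ by Dataverse version; it uses boto3 and psycopg2 instead of the aws CLI and psql):

import hashlib
import os

import boto3
import psycopg2

BUCKET = "my-dataverse-bucket"          # placeholder
# S3 key of the placeholder file, taken from its storageidentifier in the db.
PLACEHOLDER_KEY = "10.5072/FK2/XXXXXX/17fxxxxxxxx-placeholder.bin"   # placeholder
LARGE_FILE = "/scratch/bigdata.7z"      # placeholder
DATAFILE_ID = 12345                     # placeholder id of the placeholder datafile row

# Overwrite the placeholder object in the bucket with the real file.
boto3.client("s3").upload_file(LARGE_FILE, BUCKET, PLACEHOLDER_KEY)

# Compute the checksum Dataverse stores for the file.
md5 = hashlib.md5()
with open(LARGE_FILE, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        md5.update(chunk)

# Update checksum, content type, and size in the db to match the real file.
conn = psycopg2.connect("dbname=dvndb user=dvnapp")   # placeholder connection string
cur = conn.cursor()
cur.execute(
    "UPDATE datafile SET checksumvalue = %s, contenttype = %s, filesize = %s "
    "WHERE id = %s",
    (md5.hexdigest(), "application/x-7z-compressed",
     os.path.getsize(LARGE_FILE), DATAFILE_ID),
)
conn.commit()
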
I hope this helps,

Phil


Péter Király

Mar 10, 2022, 7:55:06 AM
to dataverse...@googlegroups.com
Dear Valentina,

According to our investigation, the problem cannot be fixed by increasing any Dataverse settings. The problem lies at the level of the Java application server (Glassfish, Payara) and is caused by the timeout limit, which applies to the whole service and is not specific to the download or upload process.

We follow the steps Phil described; however, we have created some scripts:
https://github.com/pkiraly/dataverse-largefile-insert/

Best,
Péter

--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly

James Myers

Mar 10, 2022, 8:28:50 AM
to dataverse...@googlegroups.com

Valentina,

Your expectation for how the DVUploader works is currently true: the original Dataverse API only allows one file to be uploaded per call, and the dataset is saved after each one, which releases your temp space. (pyDataverse or any other tool has to work the same way.) With a recent API addition from Scholars Portal, external apps (the pre-release DVUploader specifically) can now send several files at once, which is faster for many small files but not good for your case.

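For reference, one call to the existing add-file API looks roughly like this (the server, DOI, path, and token below are placeholders):

import json
import os
import requests

SERVER = "https://dataverse.example.org"    # placeholder
DOI = "doi:10.5072/FK2/XXXXXX"              # placeholder
HEADERS = {"X-Dataverse-key": os.environ["DATAVERSE_API_TOKEN"]}

# Optional metadata; directoryLabel preserves the folder hierarchy.
json_data = {"directoryLabel": "data/session01", "description": "Compressed raw data"}

with open("data/session01/raw.7z", "rb") as f:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": DOI},
        headers=HEADERS,
        data={"jsonData": json.dumps(json_data)},
        files={"file": ("raw.7z", f, "application/x-7z-compressed")},
    )
r.raise_for_status()
# The dataset is saved (and the temp copy released) as part of this single call.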
 

FWIW: DVUploader will wait for locks on the dataset before sending a new file, but the asynchronous indexing that runs after the save completes can sometimes interfere with a subsequent upload to the same dataset (e.g. with thousands of files). With DVUploader you can always restart to send any files missed on a first run, but a more conservative approach would be to use the -limit=1 flag together with the new "View the Timestamps on a Dataset" API call, and to script DVUploader (or pyDataverse) to send one file, loop until the last-indexed timestamp from the API call is after your update, then send the next file, and so on. (Hopefully I/someone will get this into DVUploader itself at some point.) This is a little less efficient (DVUploader starts by getting the list of files in the dataset to compare with the local directory structure, so restarting means calling this again), but it should be able to send all files in one run (barring network or other issues).

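A sketch of that loop (the jar name, flags, and the timestamps endpoint/field name are written from memory here, so please double-check them against the DVUploader readme and the API guide):

import os
import subprocess
import time

import requests

SERVER = "https://dataverse.example.org"    # placeholder
DOI = "doi:10.5072/FK2/XXXXXX"              # placeholder
API_TOKEN = os.environ["DATAVERSE_API_TOKEN"]
HEADERS = {"X-Dataverse-key": API_TOKEN}
N_FILES = 180                               # number of files to send, one per run

def last_index_time():
    # "View the Timestamps on a Dataset" call; field name may vary by version.
    r = requests.get(f"{SERVER}/api/datasets/:persistentId/timestamps",
                     params={"persistentId": DOI}, headers=HEADERS)
    r.raise_for_status()
    return r.json()["data"].get("lastIndexTime", "")

for _ in range(N_FILES):
    before = last_index_time()
    # Send at most one new file from the local directory tree.
    subprocess.run(
        ["java", "-jar", "DVUploader-v1.1.0.jar",     # placeholder jar name
         f"-server={SERVER}", f"-did={DOI}", f"-key={API_TOKEN}",
         "-recurse", "-limit=1", "dataset_dir"],      # placeholder local dir
        check=True,
    )
    # Wait for the post-save re-indexing to finish before the next run
    # (a real script would also want a timeout here).
    while last_index_time() == before:
        time.sleep(30)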
 

As Péter implies, having enough temp space and setting all timeouts long enough are important, but if you can already upload one file of the size you want, you should be able to handle the repeated uploads by scripting.

 

The only other upload option I can think of is, if you are on S3 and just haven't enabled direct upload, to configure a second S3 store pointing at the same bucket, with direct upload enabled, and assign it only to this dataset. The only difference from enabling direct upload globally is that it would not impact your other users.

 

(I can also imagine an interesting work-around scenario: the only change in the db when you use a different store is that the store name is what is prepended to the file storageidentifiers. So one could do something like create a temporary bucket at Amazon, configure a direct-upload store with it, assign it to this dataset, and transfer the files as normal; then, once done, use the aws command line to sync the bucket to your original bucket or file store and edit the db entries to swap from the new store to the original. This is basically a variant of the workarounds where people have uploaded a small file, swapped it manually/outside Dataverse for a bigger one, and then updated the size/hash in the db to match. It is somewhat simpler, perhaps, since you don't have to manage file sizes/hashes.)

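Purely as an illustration of that last step (the storageidentifier format, store labels, and table/column names below are assumptions you would need to verify against your own db before doing anything like this):

import psycopg2

DATASET_ID = 12345                          # placeholder dvobject id of the dataset

conn = psycopg2.connect("dbname=dvndb user=dvnapp")   # placeholder connection string
cur = conn.cursor()
# Rewrite e.g. "tempS3://..." to "s3://..." for the files in this dataset only.
cur.execute(
    "UPDATE dvobject SET storageidentifier = "
    "replace(storageidentifier, 'tempS3://', 's3://') "
    "WHERE owner_id = %s AND storageidentifier LIKE 'tempS3://%%'",
    (DATASET_ID,),
)
conn.commit()
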
 

For download, scripting sounds like a good way to go. Other options, which others can provide more detail about, would be to use the new directory listing page that lets standard download tools handle the multi-file download (rather than you having to write a script), or to try the standalone zip utility Harvard created, which avoids Dataverse itself being involved in creating the zipped download file (although I assume you'd still need the ~300 GB of temp space on some machine for that).

 

Hope that helps!

 

-- Jim


Valentina Pasquale

Mar 10, 2022, 10:44:15 AM
to Dataverse Users Community
Dear Jim, dear all,

thanks for all the useful insights; the situation is much clearer to me now!

Best wishes,

Valentina
