Uploading large datasets


Venkatachalam Kannadasan

Oct 24, 2017, 4:31:35 AM
to Dataverse Users Community
Hi

We are trying to upload a 3.5 GB zip file through the web interface, and the upload fails every time. I understand that from version 4.7 onwards we can use the DCM to upload larger datasets via rsync, but I am not able to understand the documentation on how to set it up. If anyone has set it up for their installation, could you please share the steps?

Also, the installation guide says it's experimental. If so, what are the potential problems that may arise if we use this method, such as incomplete uploads or server errors?

Thank you for your kind help.

Regards
Venki

Pete Meyer

Oct 24, 2017, 11:53:37 AM
to Dataverse Users Community
Hi Venki,

The potential problems (aka - why it's referred to as experimental) have to do primarily with some trade-offs that were made, which means that datasets whose files are uploaded through a DCM behave slightly differently from those whose files are uploaded through HTTP:

- Directory hierarchy and filenames are preserved, but not displayed in the UI.  For the specific user base this was targeting, this wasn't a problem (researchers would be interested in the files for the entire dataset, but much more rarely if ever individual files within that dataset).

- There is an assumption that there will only be a single version of the files within a dataset; which is a reflection of the fact that this was initially targeting primary data (aka - these are the files the depositor received from the instrument/detector/camera) instead of processed data.

- Datasets uploaded through a DCM require another component (RSAL) for downloads; one side effect of this is that all published datasets are public to everyone (in other words, datasets can't have "restricted files").  Guestbooks and their metrics are also unavailable for these types of downloads.

- A dataverse installation (currently) needs to be configured for either DCM/RSAL data transfers, or native/HTTP data transfers.  We're in the early design stages for how to allow these to play together in a single installation.

- The documentation needs some improvements, in terms of what functionality the DCM currently provides, the system requirements, and a more generalized installation process (at the moment, it assumes you're using the same automated provisioning system that was used for development).

In terms of your specific questions about failure modes: incomplete or corrupted uploads are detectable (client-side checksums are transferred along with the data files), as are cases where the depositor is uploading from unreliable storage (e.g., a failing external hard drive). Data transfers are resumable, but if the network drops out, the depositor has to resume them manually. Server errors are a possibility, but the precursor to the DCM has been in operation for a few years without any.
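The client-side checksum idea above can be sketched in general terms like this. This is only an illustration of the concept, not the DCM's actual protocol, and the file names are made up:

```shell
# Illustrative sketch only -- not the DCM's actual implementation.
# Compute a checksum on the depositor's side before transfer:
printf 'primary instrument data' > datafile.bin
sha256sum datafile.bin > datafile.bin.sha256

# ...transfer both files (e.g. via rsync), then verify on the server.
# A checksum mismatch flags an incomplete or corrupted upload:
sha256sum -c datafile.bin.sha256 && echo "upload verified"
```

Because the checksum is computed from what the depositor's machine actually read off disk, this also catches the failing-external-drive case: the corruption happens before or during the read, so verification on the server side fails.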

Please let me know if you have additional questions, or if my explanations confused things further.

Best,
Pete

Rebeca Barros

Oct 24, 2017, 2:08:46 PM
to Dataverse Users Community

Hi, Venki


I do not know if my experience will be of much help, but this topic is interesting to me since I'm facing the challenge of uploading a large dataset, and I am also a little confused about how to do this using the DCM approach. Thank you, Pete, for the clarification.

But regarding the topic, let me ask you:

a) Are you trying to upload this 3.5 GB file to Harvard Dataverse? If so, their settings limit uploads to 2 GB.

b) If you are using your own local installation, you may want to check this setting:


curl -X GET http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

By default this property is not set, meaning that you may be able to upload a file of any size. If it is set, you could increase the value and try again.
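If the GET above shows a value that is too small, the same admin endpoint accepts a PUT to raise (or create) the setting. This assumes your installation exposes the admin API on localhost:8080 as in the example above; the 5 GB value is just an example:

```shell
# Raise the upload ceiling to ~5 GB (the value is in bytes; pick your own).
curl -X PUT -d 5000000000 \
  http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

# Confirm the new value took effect:
curl -X GET http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes
```

Note that raising this setting only removes Dataverse's own check; very large HTTP uploads can still fail for other reasons (proxy timeouts, browser limits), which is where the DCM comes in.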


The reason I mention this is that I was able to upload a 17 GB file without any problems. I've run into upload errors when trying a bigger file, around 90 GB, which is why I'm going to need to use the DCM.


My local installation runs on a server with an Intel Xeon 2.0 GHz CPU and 256 GB of RAM.

danny...@g.harvard.edu

Oct 24, 2017, 2:45:43 PM
to Dataverse Users Community
Thanks, all, for the good discussion on this. I'd like to put some additional emphasis on Pete's fourth warning below:

"A dataverse installation (currently) needs to be configured for either DCM/RSAL data transfers, or native/HTTP data transfers.  We're in the early design stages for how to allow these to play together in a single installation."

I'd discourage hacking on this to allow rsync to be used in an installation that uses HTTP upload, or switching between the two. It may be possible, but it's really designed to be either/or for now.