Example(s) of "Large" dataset file in Harvard's DV?


Sherry Lake

Sep 2, 2020, 10:16:07 AM
to Dataverse Users Community
Is there an example of a "Large" file in Harvard's DV? Or any other DV?

I am looking for examples that use current solutions (in the base code, not forked versions) with either Swift or S3 for large files.

Thanks,
Sherry

Philip Durbin

Sep 2, 2020, 10:41:46 AM
to dataverse...@googlegroups.com
Hi Sherry,

I'm not sure what counts as large these days, but if I use the following search for files over 50 GB in Harvard Dataverse (which is on S3), I find a file that's 132.4 GB.

fileSizeInBytes:[53687091200 TO *]

I'll attach a screenshot of the search. Here's a link to the file: https://doi.org/10.7910/DVN/DXJIGA/VSKJWC
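
For anyone who wants to run the same filter programmatically, the Search API accepts fq parameters, so a query along these lines should work (the hostname is real, but the paging value is just an example):

# search Harvard Dataverse for files larger than ~50 GB via the Search API
curl "https://dataverse.harvard.edu/api/search?q=*&type=file&fq=fileSizeInBytes:%5B53687091200%20TO%20*%5D&per_page=10"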

I hope this helps,

Phil



[Attachment: Screen Shot 2020-09-02 at 10.35.18 AM.png, a screenshot of the search]

Sherry Lake

Sep 2, 2020, 11:05:23 AM
to Dataverse Users Community
Thanks, Phil.

I guess what I meant by "large" is "larger than the UI upload size limit". This example works, thanks. Do you have any idea how it got uploaded (or onto S3)?

I've had two inquiries this week about files over 300GB and Dataverse. I'm trying to think outside the box with our installation, as we are using local file storage at the moment. I'll report back with our solution, which seems to be headed toward a pointer (URL) on the dataset record to a Docker container that provides the access (landing page) to backend storage.

--
Sherry

Philip Durbin

Sep 2, 2020, 1:44:01 PM
to dataverse...@googlegroups.com
That 132 GB file was not uploaded through the GUI. Rather, we use a manual process that involves uploading a placeholder file and replacing it with the real file. Something like this:

- Upload a small placeholder file
- Look up the placeholder file's info in the db
- Directly upload the large file to a front-end machine
- Use the Amazon command line utility to copy the large file to the location where the placeholder file is
- Update the db info (md5, contenttype, filesize) to match the large file
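
A rough, hypothetical sketch of those steps for an S3 store follows. The database name, table, and column names are from memory of the Dataverse schema, and the ids, bucket, key, and file paths are made up, so verify everything against your own installation before running anything:

# 1. Find the placeholder's storage identifier (12345 is a made-up datafile id)
psql dvndb -c "SELECT id, storageidentifier FROM dvobject WHERE id = 12345;"

# 2. Copy the real file over the placeholder's S3 object (bucket and key are examples)
aws s3 cp /scratch/bigfile.tar s3://my-dataverse-bucket/10.7910/DVN/XXXXXX/<storageidentifier>

# 3. Update the datafile record so checksum, content type, and size match the real file
md5sum /scratch/bigfile.tar
psql dvndb -c "UPDATE datafile SET checksumvalue = '<md5>', contenttype = 'application/x-tar', filesize = 549755813888 WHERE id = 12345;"

The key point is that the S3 object is replaced in place, so the storage identifier recorded for the placeholder keeps working.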

The pointer you mentioned sounds fine to me. That's basically what's going on with https://doi.org/10.7910/DVN/TDOAPG and its 17 trillion cell values.

I hope this helps,

Phil

Sherry Lake

Sep 3, 2020, 8:56:21 AM
to Dataverse Users Community
Thanks, Phil. 

This worked! Is this written up anywhere? If not, I've got detailed notes (including Postgres commands) I can share. 

It will come in handy for us for other "larger" files (> 6GB). But the current "large" file, the one I'm going to use the pointer for, is 500GB and our local file store is only 2TB, so we will do pointers until we get S3 as an alternate store (now that we are at 4.20 and can do that).

--
Sherry

Don Sizemore

Sep 3, 2020, 9:13:14 AM
to dataverse...@googlegroups.com
Hi Sherry,

FWIW, I've had success uploading a ~20GB file by uploading it to temporary space on an NFS mount, then calling http://guides.dataverse.org/en/latest/api/native-api.html#add-a-file-to-a-dataset from there.
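
For reference, that add-file call looks roughly like the following; the server URL, API token, persistent ID, and file path are placeholders:

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://dataverse.example.edu
export PERSISTENT_ID=doi:10.5072/FK2/EXAMPLE

# add a file sitting on local/NFS storage to an existing dataset via the native API
curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@/nfs/tmp/bigfile.zip" \
  -F 'jsonData={"description":"Large file uploaded via the native API"}' \
  "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"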


We recently found out that a 3.6GB CSV exhausted a 36GB heap.

Dreaming of unlimited system resources and network throughput,
Don


Sherry Lake

Sep 3, 2020, 9:55:19 AM
to dataverse...@googlegroups.com
Thanks, Don - will try.

Now I have a question about upload file size: there is a setting to increase it, ":MaxFileUploadSizeInBytes", but is that for the UI only? Is there a max size for API upload (add to a dataset)? Or is that controlled by the system and Glassfish timeouts?

--
Sherry


Don Sizemore

Sep 3, 2020, 10:24:23 AM
to dataverse...@googlegroups.com

Philip Durbin

Sep 3, 2020, 10:51:10 AM
to dataverse...@googlegroups.com
I hadn't looked at the :MaxFileUploadSizeInBytes code in a while, but generally, yes, it applies everywhere, in both the UI and API. It can also now be configured per store, which is neat (thanks, Jim).
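
For anyone looking for the knobs: the global limit is a database setting you can change via the admin API, and the per-store variant (as I understand it) takes a JSON object keyed by store id. The values and the store name below are made up, so check the configuration guide for your version:

# global limit (~50 GB in this example)
curl -X PUT -d 53687091200 http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

# per-store limits as a JSON object (assumed syntax; verify against the guides)
curl -X PUT -d '{"default":"10737418240","minio1":"53687091200"}' \
  http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes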

The only caveat has to do with the SWORD API. If you set the :MaxFileUploadSizeInBytes limit higher than 2 GB (let's say 4 GB), the SWORD API will still be limited to 2 GB. This is because an integer (small) is used instead of a "long" (large) in the SWORD library we use: https://github.com/IQSS/dataverse/issues/2169 . If anyone is curious about the code: https://github.com/IQSS/dataverse/blob/v5.0/src/main/java/edu/harvard/iq/dataverse/api/datadeposit/SwordConfigurationImpl.java#L139

Sherry, if you've got notes, I'd say go ahead and paste them here. It's a hack, of course, but I'm glad you got it working!

Thanks,

Phil


Philipp at UiT

Sep 11, 2020, 10:46:19 AM
to Dataverse Users Community
Hi all,

Interesting discussion. We have been asked by a research group whether they can use our repository to publish files of approx. 80-200 GB. We have been testing this with DVUploader, but are struggling because of timeouts. This made me wonder about several things:
- What method would you recommend for files of that size?
- Are there any constraints on *down*loading files of that size? Is the size limit something you only have to configure for uploading files, not for download?
- How would a user download a 150 GB file? Via the UI? Are there any API-based tools one can use for downloading files? Maybe Globus?

Best, Philipp

Sherry Lake

Sep 11, 2020, 11:32:54 AM
to dataverse...@googlegroups.com
Hi Philipp,

Because of timeouts for "large" files (uploads and downloads), we have gone with putting the file(s) on another server and just including a link in the dataset record. Our IT department is putting a 1TB directory of files ("smallest" file 330GB) on a container cluster (DCOS) connected over a very fast 40GbE connection. They are then providing us with a URL (FTP-like interface) for direct download. In the dataset, we just include the link in the "Other ID" field.

And we did a little customization for the "Other ID" field: 1) changed its label to "Other Location for Dataset", 2) added the Other ID field (and the Notes field) to the custom dataset summary fields.
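
A hedged sketch of that summary-fields customization: the setting name is :CustomDatasetSummaryFields, but the dataset field names (otherId, notesText) are a guess at the citation-block names, so check your metadata block before using them:

# show the Other ID and Notes fields on the dataset summary (field names assumed)
curl -X PUT -d 'otherId,notesText' http://localhost:8080/api/admin/settings/:CustomDatasetSummaryFields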

Then we make sure the dataset has a readme file and that the description explains where the files live. In our case, which isn't published yet, we will also add links (highlights) to selected files in the "directory".

Here's an example of how we think it will work (look): 

[Attachment: Screen Shot 2020-09-10 at 11.20.20 AM.png, a screenshot of the example dataset record]

These workarounds, of course, will not record downloads, since access doesn't go through the UI, and they lack the other UI benefits, but the large files are discoverable AND accessible without bothering my researchers about sharing their files.

I am waiting for the TRSA (just for the handling of big data, not needing the "trusted" part) and/or Globus solutions. But while we wait for one of those, we will just link.

Best,
Sherry

James Myers

Sep 11, 2020, 11:48:09 AM
to dataverse...@googlegroups.com

Philipp,

 

(All – please add/edit/comment – as I mentioned I’d like to get this type of guidance organized in a wiki or the guides, so any additional info you have is helpful)

 

I think the current options for something of that size are:

 

1) Deal with the timeouts and the need for temporary storage space
2) Upload a small placeholder file and then manually swap the file and update the database entries
3) Use S3 on MinIO and use direct upload; this is still a single-part upload, but the only timeouts would be those for the S3 connection (you can't use AWS S3 here: it has a limit of 5 GB for single-part uploads, which is what is currently implemented for direct uploads in Dataverse). See the config sketch after this list.
4) Use a link (per Sherry's email)
5) Set up rsync/Data Capture Module (as discussed in http://guides.dataverse.org/en/latest/developers/big-data-support.html)
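
A hedged sketch of the JVM options behind option 3 (direct upload to an S3-compatible store such as MinIO); the store id "minio1", bucket, and endpoint are made up, the option names should be checked against the installation guide for your version, and the S3 credentials themselves are configured separately:

./asadmin create-jvm-options "-Ddataverse.files.minio1.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.minio1.label=MinIO"
./asadmin create-jvm-options "-Ddataverse.files.minio1.bucket-name=dataverse-files"
./asadmin create-jvm-options "-Ddataverse.files.minio1.custom-endpoint-url=https\://minio.example.edu\:9000"
./asadmin create-jvm-options "-Ddataverse.files.minio1.path-style-access=true"
./asadmin create-jvm-options "-Ddataverse.files.minio1.upload-redirect=true"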

 

Things that are in the works:

 

- ~v5.1, multipart direct S3 uploads: the UI and DVUploader are both capable of sending a large file to the S3 server in multiple parts, which allows files up to a theoretical limit of 5TB. The part size can be adjusted, so the site admin can use it to limit the time required for each part. DVUploader has some limited retry capabilities if some of the parts fail.
- Scholar's Portal is working on a Globus integration that provides dual access by S3 and Globus to an underlying store.
- Gerrick Teague, working for Kansas State University, is developing Synapse, a different take on Globus integration that would use Globus for transfer and then push the file to Dataverse locally (perhaps automating the idea in option 2 above).
- UNC/Odum's trusted remote data storage may also be able to address large data since it never transfers the file to Dataverse at all.

 

For downloads, options 1-3 above only support a normal, single-part download, though any S3 store also supports direct download from S3 if configured (potentially faster if the connection to AWS/the S3 store is faster than the connection to Dataverse itself). Using a link could point to a site that has alternate download methods. I think the rsync mechanism can be used in both directions.
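
For the direct-download case, the relevant per-store JVM option (as I understand it) is the download-redirect flag; the store id here is the same made-up "minio1" from the sketch above:

# have Dataverse redirect download requests straight to the S3 store
./asadmin create-jvm-options "-Ddataverse.files.minio1.download-redirect=true"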

 

For things in the works –

 

- With S3, it should be possible to do multipart downloads as well as multipart uploads. I haven't looked into that yet. Nominally, breaking into parts does three things: it limits the time for a given part; it enables restarts (a failure part way through only affects some parts of the file, so one doesn't have to restart the download from the beginning); and it allows parts to be downloaded in parallel, which can help use more of the available bandwidth. (This is one of the ways Globus gets more performance: parallel download of multiple file parts.)
- Of the Globus options, I think Scholar's Portal's design maintains Globus access for downloads. I'm not sure if Synapse addresses that.
- The trusted remote data store may also be able to support non-HTTP download methods.

 

 

-- Jim

Philipp at UiT

Sep 11, 2020, 12:37:07 PM
to Dataverse Users Community
Sherry and Jim, thanks for the useful advice!

I think for the time being we'll have another go at dealing with timeouts. 
Looking forward to solutions like Globus and TRSA!

Best, Philipp