Different Documentation for S3 Direct Upload


Sherry Lake

Jul 8, 2021, 1:10:25 PM
to Dataverse Users Community
What is the difference between the documentation for direct upload here:


And the 3 step process here?


Maybe the 1st link is how to set up the software (GUI) to do direct upload to S3 and the 2nd one for use with API calls?

And then what is used for direct download from S3? I'm assuming the 1st link has that covered. Then how would a direct download work via the API?

Thanks as always!
Sherry Lake
LibraData, UVa's Institutional Dataverse Repository

Philip Durbin

Jul 8, 2021, 2:52:49 PM
to dataverse...@googlegroups.com
Hi Sherry,

These docs about direct upload could probably use some clean up. (Please feel free to open an issue about this!)

You linked to two pages in the Developer Guide. You probably don't want either of these unless you're doing some development.


- dataverse.files.<id>.upload-redirect
- dataverse.files.<id>.url-expiration-minutes

My understanding is that while direct upload was originally an API-only feature, these days it "just works" in the GUI as well. I'll probably need Jim to clarify for me if all those steps you found are necessary for the API or not. (If they are, this content should probably be moved to the API Guide.)

Hope this helps,

Phil




James Myers

Jul 8, 2021, 3:46:22 PM
to dataverse...@googlegroups.com

Sherry,

Roughly, yes. The basic split is that there are parameters you need to set as an admin to enable direct upload/download, and ways you use the API to actually do direct upload/download.

 

The basic documentation of direct up/down started in the Big Data section, as it is primarily used for that. As direct download is both simple and generally useful even for smaller files, I think it got documented in the basic install guide as well. I added the JVM option needed for direct upload to the table there, but there’s no discussion of direct upload in the install guide.

 

For the install/config documentation, I think things could get rearranged/merged as Phil said, though I suspect that more people will want to enable direct download and that direct upload will continue to be specific to places that want to support larger data, so perhaps some way to indicate that direct upload is an ‘advanced option’ would be useful if we move discussion of it out of the dev/Big Data part.

 

For using them:

 

Both download and upload work in the UI and API.

 

Download is simpler. When you use the normal download API call and direct download is enabled, the response is an HTTP redirect rather than the bytes of the file. In the UI, your browser automatically follows the redirect and starts getting the file bytes from S3. For the API, your code has to be smart enough to follow the redirect (for example, curl has a ‘-L’ flag that means follow redirects, and some Java libraries can follow the redirect automatically as well). In either case, though, it’s one simple, web-standard, easy-to-automate step.
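
As a rough illustration of the API case, here is a minimal Python sketch that relies on urllib following HTTP redirects by default, much like curl -L. The function name download_file, the server name, and the file id are hypothetical, and the /api/access/datafile/{id} path should be checked against the API Guide for your version. It only covers a publicly downloadable file; a restricted file would also need an API token.

```python
import shutil
import urllib.request


def download_file(url: str, dest_path: str) -> str:
    """Save the resource at `url` to `dest_path` and return the final URL.

    urllib.request.urlopen follows HTTP redirects on its own, so when
    direct download is enabled the bytes end up coming from the S3 URL
    that Dataverse redirected to, with no extra code on the caller's side.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
        # geturl() reports where the bytes actually came from after redirects.
        return response.geturl()


# Hypothetical usage (server name and file id are made up):
#   final = download_file(
#       "https://dataverse.example.edu/api/access/datafile/42", "file.bin")
#   print("bytes were served from", final)
```

If the returned URL differs from the one you requested, the store redirected you, which is a quick way to confirm that direct download is active.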

 

Direct upload is a bit more complex (the 3 step process): basically you have to ask Dataverse to let you upload a file, do the direct upload to S3, and then tell Dataverse you did it. (That’s not counting the multi-part upload for even larger files, where the upload step itself is actually multiple calls.) As with direct upload overall, the API is somewhat ‘advanced’ and probably only worth dealing with if you have big data and/or are going to use a toolkit/app (like DVUploader, and hopefully pyDataverse at some point). So the documentation is still in the dev guide. It could move, but it would still be good to make it clear that it is ‘advanced’.
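
To make the three steps concrete, here is a hedged Python sketch of the single-part case (no multi-part upload). The endpoint paths, the x-amz-tagging header, and the jsonData fields are written from memory of the Developer Guide and are assumptions to verify against your Dataverse version; direct_upload, http_send, and the injectable send parameter are hypothetical names used only for illustration.

```python
import hashlib
import json
import urllib.parse
import urllib.request
import uuid


def http_send(method, url, headers, body):
    """Default transport: perform one HTTP request, return the response bytes."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return response.read()


def direct_upload(server, pid, api_token, file_name, mime_type, data, send=http_send):
    """Upload `data` (bytes) to dataset `pid` via the three-step process."""
    auth = {"X-Dataverse-key": api_token}

    # Step 1: ask Dataverse to let us upload. The reply is assumed to carry
    # a presigned S3 URL plus the storage identifier of the future object.
    query = urllib.parse.urlencode({"persistentId": pid, "size": len(data)})
    reply = send("GET", f"{server}/api/datasets/:persistentId/uploadurls?{query}", auth, None)
    ticket = json.loads(reply)["data"]

    # Step 2: send the bytes straight to S3. No Dataverse token is sent here;
    # the presigned URL itself carries the authorization.
    send("PUT", ticket["url"], {"x-amz-tagging": "dv-state=temp"}, data)

    # Step 3: tell Dataverse the upload happened so it registers the file.
    json_data = json.dumps({
        "storageIdentifier": ticket["storageIdentifier"],
        "fileName": file_name,
        "mimeType": mime_type,
        "checksum": {"@type": "SHA-1", "@value": hashlib.sha1(data).hexdigest()},
    })
    # The registration call is assumed to expect a multipart form whose
    # "jsonData" field holds the metadata, so build that body by hand.
    boundary = uuid.uuid4().hex
    form = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="jsonData"\r\n\r\n'
        f"{json_data}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = dict(auth)
    headers["Content-Type"] = f"multipart/form-data; boundary={boundary}"
    query = urllib.parse.urlencode({"persistentId": pid})
    return send("POST", f"{server}/api/datasets/:persistentId/add?{query}", headers, form)
```

Passing the transport in as send keeps the three calls easy to see and lets you try the flow against a fake before pointing it at a real installation. For anything beyond experimentation, a tool such as DVUploader is probably the better route.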

 

The UI works for direct upload because it is doing something similar to those three steps – it basically does the first two and then relies on the ‘save’ button to tell Dataverse everything is done – the same as with normal uploads.

 

Probably more info than you wanted. I guess the bottom line is that direct upload is probably advanced enough to keep at least the API info separate in an advanced/Big Data related section somewhere. The main docs could probably say more about how you set it up and point to any tools that support the API rather than going into details.

 

Hope that helps,

-- Jim

Sherry Lake

Jul 9, 2021, 8:54:06 AM
to dataverse...@googlegroups.com
Thanks Phil and Jim,

Now my question is...

If I configure the 2 JVM options in Phil's email below, then BOTH upload and download use the redirect for S3?

--
Sherry

James Myers

Jul 9, 2021, 10:03:56 AM
to dataverse...@googlegroups.com

Those two options are for a given store. If you set both to true, then any datasets in Dataverse collections using that store will do both direct upload and direct download. Note the CORS setting needed on your S3 bucket to do direct upload (and to enable previewers to work with direct download), which is described in the Big Data section you linked to. Also note that you can set up two S3 stores using the same bucket, one with direct upload (and probably a higher size limit) and one without. That’s a way to limit who gets to do direct upload (and use the higher limit) while not opening up big data for everyone.
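
The two-stores-on-one-bucket idea can be sketched with a small Python helper that prints the JVM options for each store. This is only an illustration: the store ids, labels, and bucket name are made up, the .type, .label, and .bucket-name options are recalled from the Installation Guide rather than from this thread, and store_jvm_options is a hypothetical helper, not part of Dataverse.

```python
def store_jvm_options(store_id, bucket, label, direct_upload=False,
                      direct_download=False, url_expiration_minutes=None):
    """Return the JVM option strings for one S3 store (illustrative only)."""
    prefix = f"-Ddataverse.files.{store_id}"
    options = [
        f"{prefix}.type=s3",
        f"{prefix}.label={label}",
        f"{prefix}.bucket-name={bucket}",
    ]
    if direct_upload:
        options.append(f"{prefix}.upload-redirect=true")
    if direct_download:
        options.append(f"{prefix}.download-redirect=true")
    if url_expiration_minutes is not None:
        # Only emitted when a value is given, so the option stays optional.
        options.append(f"{prefix}.url-expiration-minutes={url_expiration_minutes}")
    return options


if __name__ == "__main__":
    # Two stores that share one (hypothetical) bucket: only "s3big" allows
    # direct upload, so only collections assigned to it get that feature.
    for option in store_jvm_options("s3", "my-bucket", "S3", direct_download=True):
        print(option)
    for option in store_jvm_options("s3big", "my-bucket", "S3BigData",
                                    direct_upload=True, direct_download=True,
                                    url_expiration_minutes=120):
        print(option)
```

Each printed line is a JVM option that would still have to be applied to the application server in the usual way, and the CORS configuration on the bucket is a separate step.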

 

-- Jim

 


Philip Durbin

Jul 9, 2021, 2:45:28 PM
to dataverse...@googlegroups.com
There's actually a third JVM option I didn't mention:

- dataverse.files.<id>.download-redirect

So that is to say, I *think* this is the complete list (plus the CORS thing Jim mentioned):

- dataverse.files.<id>.download-redirect
- dataverse.files.<id>.upload-redirect
- dataverse.files.<id>.url-expiration-minutes

And the "expiration" one is optional.

I've never actually played with all this so I defer to Jim. :)

Again, the guides could probably use some clean up in this area so please feel free to open an issue about this.

Hope this helps,

Phil
