Enable direct S3 upload for files over a certain size

Michel Bamouni

Jun 8, 2020, 10:55:21 AM
to Dataverse Users Community
Hi,

During my update to Dataverse 4.20, I enabled the direct S3 upload/download option.
I noticed that regardless of the size of my files, the direct S3 upload mechanism is used, so file type detection and ingestion don't work.
I would therefore like to know if it's possible to enable the direct S3 upload mechanism only for files above a given limit. My goal is to get a normal (non-direct) upload when the file is small and trigger direct S3 upload when it's a big file.

Best regards,

Michel

James Myers

Jun 8, 2020, 4:26:47 PM
to dataverse...@googlegroups.com

Michel,

 

Switching the upload mechanism based on file size is not currently possible. Let me first describe what works today and some shorter-term plans, and then discuss other options and work-arounds.

 

Right now, direct upload does ingest tabular and FITS files (configurable to ingest only files below a given -Ddataverse.files.<id>.ingestsizelimit). You're right that it doesn't unzip zip files the way normal uploads do, and it doesn't run mime-type detection. There is already an issue (https://github.com/IQSS/dataverse/issues/6762) for fixing mime-type detection (doing it efficiently by retrieving only the bytes required rather than the whole file). Once that is resolved, the only functionality you'll miss with direct upload is unzipping.
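For reference, that limit is a JVM option on the app server. A minimal sketch, assuming a store id of "s3" and a 2 GB cap (both the id and the byte value are placeholders to adapt):

  # Ingest files uploaded to the "s3" store only if they are under 2 GB (value in bytes)
  ./asadmin create-jvm-options "-Ddataverse.files.s3.ingestsizelimit=2000000000"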

 

As you may have read, the current mechanism for deciding when to use direct upload is per store, with the ability to assign different stores to different Dataverses. That doesn't give you per-file control, but you can have two stores that point to the same bucket, with one using direct upload and the other using the normal mechanism. That could be a work-around that doesn't require new development.
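As a sketch, the two stores might be defined like this (the ids "s3" and "s3direct", the labels, and the bucket name are placeholders, and credentials/region options are omitted):

  # Store 1: normal (through-the-server) uploads to the bucket
  ./asadmin create-jvm-options "-Ddataverse.files.s3.type=s3"
  ./asadmin create-jvm-options "-Ddataverse.files.s3.label=NormalS3"
  ./asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=my-bucket"
  # Store 2: same bucket, but uploads go directly to S3 via presigned URLs
  ./asadmin create-jvm-options "-Ddataverse.files.s3direct.type=s3"
  ./asadmin create-jvm-options "-Ddataverse.files.s3direct.label=DirectS3"
  ./asadmin create-jvm-options "-Ddataverse.files.s3direct.bucket-name=my-bucket"
  ./asadmin create-jvm-options "-Ddataverse.files.s3direct.upload-redirect=true"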

 

In terms of potential developments:

 

It probably wouldn't be too hard to make the ingestsizelimit option also cover unzipping zip files. Using direct upload, then transferring zip files to the Dataverse server to unzip them, and then returning the contents to S3 isn't very efficient, but it could allow you to fully mirror normal upload processing (once the issue noted above is resolved), up to the size limit, when using direct upload. I suspect that might be OK with others in the community: it helps handle cases where you can't separate big data into separate Dataverses, and while it's not as efficient as a normal upload, being inefficient only with small files in order to gain efficiency with big data could be a reasonable compromise.

 

It might also be possible to have different ways of assigning stores, i.e. rather than assigning a store to a Dataverse, assigning it per user, or making it configurable per upload session, etc. Those types of changes may be harder to make general, since some community members are using direct upload for big data and using the access control over who can create Datasets within a given Dataverse as a way to limit who can submit large data (which could be costly to store). Making the choice user-configurable would make controlling access to larger stores difficult, as would a per-Dataset model. Per-user assignment might be OK as an alternate way to control who can upload large data.

 

Lastly, while I like the idea of just handling small files through the normal upload mechanism, I think it would be technically challenging to handle things per file during a single upload session. Normal uploads are handled by functionality in the PrimeFaces library that Dataverse uses, whereas direct uploads are managed by JavaScript I created with support from the Texas Digital Library. Trying to coordinate those two so that, when given a list of files by the browser, they agree on which mechanism should upload each file would not be easy. Someone else in the community might have better ideas of how that might be done, but I think this would be significantly more work than the options above.

 

If the work-around of using multiple stores, or having ingest work without the unzip capability, is good enough, you might be set. If you think something more is needed, I'd recommend opening an issue for this and perhaps commenting on which of the suggestions above would work for you. We could use that issue to get further community discussion about how these or other options would work at various installations, given their intended use. If it looks like there's consensus around some option, we could consider adding it to the development queue for IQSS/GDCC (or a pull request would be welcome if someone else is able to do the development).

 

Hope that’s helpful,

 

-- Jim


Michel Bamouni

Jun 9, 2020, 11:13:29 AM
to Dataverse Users Community
Hi Jim,

Thanks for this clear answer.
I think I will create two stores using the same S3 bucket: one for normal uploads and another for direct S3 uploads.

Best regards

Michel Bamouni

Jun 11, 2020, 5:39:02 AM
to Dataverse Users Community
Hi Jim,

I set up my two stores as I mentioned in my previous post, and that works well.
Now I would like to know: is it possible for a Dataverse admin to switch from one mode to the other (normal upload to direct S3 and back)?
And in principle, is that kind of switching a good idea?

James Myers

Jun 11, 2020, 9:28:39 AM
to dataverse...@googlegroups.com

Michel,

 

Short answer: yes, but other alternatives might be better.

 

There's an API to switch a Dataverse from using one store to another, and admins can also see a store option under Edit/General Information on the Dataverse page. Once that value is changed, any new file uploads to Datasets in that Dataverse (or in a child Dataverse that doesn't override its parent's store setting) go to the new store. So you could, as an admin, switch one Dataverse back and forth.
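If I remember the endpoint correctly, the switch looks something like this (the alias "bigdata" and the store label "DirectS3" are placeholders):

  # Assign the store labeled "DirectS3" to the Dataverse with alias "bigdata"
  curl -H "X-Dataverse-key: $API_TOKEN" -X PUT -d "DirectS3" \
    http://localhost:8080/api/admin/dataverse/bigdata/storageDriver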

 

With two stores configured to store files in the same bucket, just using different upload methods, the one issue I can think of is that switching while a user is uploading may break direct uploads: the upload page calls the Dataverse server repeatedly to get a presigned S3 URL for each file, and if the store is switched while this is happening, the server will refuse to hand out those URLs once the newly assigned store doesn't support direct upload.

 

An alternative I'd suggest considering would be to set up a separate Dataverse for the second store and then have someone with appropriate permissions move the Dataset to the new Dataverse after the uploads, as sketched below. (Nominally, the same issue I mentioned above could occur if you move the Dataset during an upload, but in that case only one Dataset could be affected, rather than a whole Dataverse and its included Datasets and Dataverses.) It might also be possible to use linking as a way to upload to different stores while organizing the Datasets together, but I haven't thought this through.
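The move itself can be done via the existing move API, something like this (dataset id 42 and the target alias are placeholders):

  # Move dataset 42 into the Dataverse with alias "big-data-archive"
  curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
    http://localhost:8080/api/datasets/42/move/big-data-archive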

 

(For the community, I'll note that switching stores on a Dataverse is also possible when the stores send files to different places. The outcome would be that files for different Datasets in the Dataverse, and even different files in the same Dataset uploaded at different times, would end up in different buckets (or some in your file store and some in S3). This wouldn't break anything in Dataverse (aside from the potential to disrupt uploads in progress, as discussed above), but there is no automated tool for moving files between stores at this point, so system admins should be aware that they may need to check multiple stores or look in the database to see where files are located. My sense is that the added complexity is probably not worth it, except perhaps for some specific use cases such as adding a new S3 store for all new files while leaving existing content in a file store.)
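If you do need to check the database, a query along these lines against the dvobject table should show which store holds each file (the database name is a placeholder; verify the schema against your version):

  # List each file's storage identifier, which is prefixed with the store id
  psql -d dvndb -c "SELECT id, storageidentifier FROM dvobject WHERE dtype = 'DataFile';"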

Michel Bamouni

Jun 19, 2020, 6:59:03 AM
to Dataverse Users Community
Hi Jim,

Thanks again for the answer.
I will discuss with my stakeholders how we can handle switching between the two stores, keeping in mind your alternatives.

Best regards,

Michel
