Direct S3 uploads



Jim Myers


Nov 5, 2019, 11:56:51 AM

to Dataverse Big Data

All,

As part of a project for TDL, I've implemented a proof-of-concept allowing direct upload of files from the command-line DVUploader to an S3 store, bypassing streaming the data through glassfish and avoiding a temporary local file. Nominally this should help in managing larger files, but may generally be useful in reducing server load. The intent is to eventually support this through the Dataverse web interface as well. There are limitations to the current implementation, but I thought it would be worth letting others know about the work and hopefully getting feedback regarding some of the open issues/design decisions that are still TBD. (And perhaps learning more about where there's overlap with other approaches.) I'm creating a wiki page @ TDL, but to start I'm going to cut/paste the initial content of that here:

Thanks in advance for any feedback!

-- Jim

The initial development/conceptual model validation work follows the design described below. It requires changes to both Dataverse and the DVUploader.

Detailed Event Sequence

From their local machine (where the data resides as files), the user runs the DVUploader.
DVUploader scans the directories/files specified (as it normally does) and, for each file, requests a pre-signed upload URL from Dataverse to upload that file for a given Dataset.
Dataverse, using the secret keys for its configured S3 storage, creates a short-lived URL that allows upload of one new file directly to the storage area in S3 specified for the Dataset.
DVUploader uses the URL to do an HTTP PUT of the data directly to S3 (avoiding streaming the data through glassfish and to a temporary file on the Dataverse server) with transfer speed governed by the network speed between the local machine and S3 store (not the bandwidth to/from the Dataverse server or the disk read/write speed at Dataverse).
DVUploader calls the existing Dataverse /api/datasets/{dataset id}/add call but, instead of sending the file bytes, it sends the ID of the file as stored in S3, along with its name, mimetype, and MD5 hash (and any directoryLabel (path) that would normally be sent).
Dataverse runs through its normal steps to add the file and its metadata to the Dataset, currently skipping steps that would require access to the file bytes (e.g. unzipping the file, inspecting it to infer a better mimetype, extracting metadata, creating derived files, etc.). The net result for a file that would not trigger such special processing is exactly the same as if the file had been uploaded via the web interface through Dataverse.
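
To make the sequence above concrete, here is a rough Python sketch of the client side (the DVUploader itself is a separate tool; this is only an illustration). The /add path comes from the description above. Everything else is an assumption made for the sketch: the `uploadurl` path, the `url` and `storageIdentifier` response fields, the JSON field names sent to /add, and the `transport` callable that stands in for a real HTTP client.

```python
import hashlib
import json
from typing import Callable, Optional

# Hypothetical transport: (method, url, headers, body) -> response body bytes.
# A real client would wrap an HTTP library here.
Transport = Callable[[str, str, dict, bytes], bytes]


def direct_upload(
    transport: Transport,
    server: str,
    dataset_id: int,
    api_key: str,
    name: str,
    mimetype: str,
    data: bytes,
    directory_label: Optional[str] = None,
) -> dict:
    """Upload one file directly to S3, then register it with Dataverse."""
    headers = {"X-Dataverse-key": api_key}

    # Ask Dataverse for a short-lived presigned URL for one new file.
    # The "uploadurl" path and the reply fields are assumptions of this sketch.
    reply = json.loads(
        transport("GET", f"{server}/api/datasets/{dataset_id}/uploadurl", headers, b"")
    )

    # PUT the bytes straight to the S3 store. No Dataverse key is sent:
    # the signature embedded in the URL is the only authorization.
    transport("PUT", reply["url"], {}, data)

    # Register the file with Dataverse, sending its storage ID and metadata
    # instead of the bytes. Field names are illustrative.
    payload = {
        "storageIdentifier": reply["storageIdentifier"],
        "fileName": name,
        "mimeType": mimetype,
        "md5Hash": hashlib.md5(data).hexdigest(),
    }
    if directory_label:
        payload["directoryLabel"] = directory_label
    transport(
        "POST",
        f"{server}/api/datasets/{dataset_id}/add",
        headers,
        json.dumps(payload).encode(),
    )
    return payload
```

The point of the shape is that the file bytes appear only in the PUT to S3; the two calls to Dataverse carry only small JSON messages.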

Proof-of-Concept (POC) Achievements:

The work so far shows that it is possible to upload data directly from a local machine to an S3 store without going through Glassfish, without using temporary local storage at Dataverse, and without using the network between the local machine and Dataverse or between Dataverse and the S3 store. Performance testing still needs to be done, but since previous testing showed that Glassfish and/or the temporary local storage add delays and server load, this design should make uploads faster. If the network between the data and the S3 store is faster (e.g. the data is local to the S3 store), additional performance gains would be expected.

The POC also shows that this design works with both Amazon’s S3 implementation and the Minio S3 implementation (which is in use at TACC). (There are minor differences that are handled in the Dataverse and DVUploader software).

The design itself was intended to allow direct upload without creating a security concern that a user could upload/edit/delete other files in S3. Unlike designs in which the S3 keys used by Dataverse, or derivative keys for a specific user, would have to be sent to the user’s machine, where they could potentially be misused or stolen, this design sends a presigned URL that only allows a PUT HTTP call to upload one file, with the location/id of that file specified by Dataverse. (The S3 keys at Dataverse are used to create a cryptographic signature that is included as a parameter in the URL. That signature can be used by the S3 implementation to verify that the PUT, for this specific file, was authorized by Dataverse. Any change made to try reading/deleting/editing this or any other file would invalidate the signature.) The signature is also set to be valid for a relatively short time (configurable, default is 60 minutes), further limiting opportunities for misuse. (Note that using the Dataverse API requires having the user’s Access Key (generated via the Dataverse GUI). That key allows the user to do anything via the API that they can do via the Dataverse GUI. For the discussion here, the important point is that this access key, which is already required for using the DVUploader with the standard upload mechanism, is more powerful/more important to keep safe than the presigned URLs added by the new design. (FWIW: There are discussions at IQSS/GDCC about how to provide more limited API keys from Dataverse that would mimic the presigned URL mechanism.))
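
For anyone who has not worked with presigned URLs, the toy Python sketch below illustrates why the signature restricts the URL to one method, one object, and a limited time. It is deliberately simplified and is NOT the AWS Signature Version 4 scheme that S3 actually uses; the function names, the query parameter names, and the message layout are all invented for the illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(secret: bytes, method: str, path: str, expires_at: int) -> str:
    # The signature covers the method, the object path, and the expiry time,
    # so changing any one of them yields a different value.
    message = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(secret: bytes, method: str, path: str, expires_at: int) -> str:
    """Server side (Dataverse's role): issue a URL for one method and path."""
    query = urlencode(
        {"expires": expires_at, "signature": _signature(secret, method, path, expires_at)}
    )
    return f"{path}?{query}"


def verify(secret: bytes, method: str, url: str, now: int) -> bool:
    """Store side (S3's role): recompute the signature and compare."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        presented = query["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(secret, method, parsed.path, expires_at)
    return hmac.compare_digest(presented, expected) and now <= expires_at
```

With a URL signed for PUT on one path and a 60-minute lifetime, `verify` accepts that PUT inside the window and rejects a GET, a PUT to a different path, or the same PUT after expiry. The secret never leaves the server, which is the property the design relies on.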

In addition to validating the design, the POC involved working through Dataverse’s 2-phase, ~10-step upload process and learning how to separate and, for now, turn off steps that involve reading the file itself while keeping the processing to add the file to the dataset, record its metadata, create a new dataset version if needed, etc. While this code will probably need further modification/clean-up, it’s a significant step to have the POC working.

Next Steps:

There is additional functionality that will be important to creating a production capability. Some are a ‘simple matter of programming’, where the functionality needed is probably not controversial, while others may need further requirements/design discussion.

While the API to add a file to a Dataset checks whether the user (as identified by their access token) has permission to add a file to the specified Dataset, the API call to retrieve a presigned S3 URL currently only checks that the user is a valid Dataverse user. It should deny the request unless the user has permission to add files to the dataset. (This is trivial to do, but until then, a valid Dataverse user could add files to S3 that would not be associated with any Dataverse entries.)
Dataverse was originally designed to use one (configurable) store for files (could be a local file system, S3, Swift, etc.). The POC works with Dataverse configured with S3. However, as is, all files must be in the same store. To support sending some files to a different store, Dataverse will need to be modified to work with multiple stores. This is potentially useful in general, e.g. to support sending new data to a new store without having to move existing files, but, for the remote storage case, if the use case is to send only some files to the new store (specific datasets, only files larger than a cut-off size, as decided by an admin/user based on preference or knowledge of where the data initially exists, etc.), then additional work would be needed to implement that policy. Dataverse does already keep track of the store used for a given dataset, so some of the code required to identify which store a file is in already exists.
To support upload via the Dataverse web interface, additional work will be needed. This could be significant in that the current Dataverse upload is managed via a third-party library and it may be difficult to replace just the upload step without impacting other aspects of the current upload process (e.g. showing previews, allowing editing of file names and metadata, providing warnings if/when files have the same content or colliding names). If this is too complex, it will be possible to create an alternate upload tab: Dataverse already provides a mechanism to add alternate upload mechanisms that has been used to support uploads from Dropbox, rsync, etc.
Depending on whether the normal processing that Dataverse does during upload (e.g. thumbnails, metadata extraction, mimetype analysis, derived file creation, unzipping, etc.) is desirable for large files, additional work will be needed to reinstate those steps. Simply turning everything back on, which would involve Dataverse retrieving the entire file from S3 one or more times, would be relatively simple, though it would have performance impacts. It may make sense to add configuration options that would allow any of these steps to be turned on/off per store, or up to a given file size limit, etc. It would also be possible to shift more of this processing to the background (e.g. creating a .tab file is already done after the HTTP call to upload the file returns), although doing steps like unzipping this way would mean the Dataverse web interface could not show the list of files inside the zip during upload. More complex options, such as moving such processing to a machine local to the S3 store, are also possible (e.g. an app that would inspect the remote file and only send a new mimetype or extracted metadata to Dataverse instead of Dataverse having to pull the entire file from S3 itself).
With the POC, an MD5 hash is created on the local machine as the file is streamed and this is sent to Dataverse to store as metadata (thus allowing the file contents to be compared with the original MD5 hash in the future to validate its integrity). Dataverse currently allows other algorithms (e.g. SHA-1, SHA-512). It should be possible to create an MD5 hash during upload through the Dataverse web interface as well. Allowing the hash algorithm to change would require adapting the DVUploader and new upload code for the Dataverse web interface to determine Dataverse’s selected algorithm and to generate the appropriate hash. (S3 does calculate a hash during upload as well, but it varies depending on whether the upload was done in multiple pieces. In theory, one could leverage that instead, but having a hash from the original machine seems like a stronger approach.) Dataverse also allows you to change the hash algorithm used and to then update the hash for existing files. This requires retrieving the file and computing the hash locally, so it may be something that should not be done for large files, for files in some stores, etc.
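
As an illustration of hashing while streaming (so the file is read only once), here is a minimal Python sketch. It is not the DVUploader code; the `HashingReader` name and interface are made up for the example. The algorithm is a parameter, which is one way a client could follow whatever algorithm Dataverse is configured to use.

```python
import hashlib
import io
from typing import BinaryIO


class HashingReader:
    """Wrap a binary stream and update a hash with every chunk that is read.

    Hypothetical helper: an upload loop reads from this wrapper instead of
    the file, and the digest is complete once the stream is exhausted.
    """

    def __init__(self, source: BinaryIO, algorithm: str = "md5") -> None:
        self._source = source
        # hashlib.new accepts names such as "md5", "sha1", or "sha512".
        self._hasher = hashlib.new(algorithm)

    def read(self, size: int = -1) -> bytes:
        chunk = self._source.read(size)
        self._hasher.update(chunk)
        return chunk

    def hexdigest(self) -> str:
        return self._hasher.hexdigest()


if __name__ == "__main__":
    reader = HashingReader(io.BytesIO(b"example file contents"), "md5")
    while reader.read(64 * 1024):
        pass  # a real client would send each chunk to S3 here
    print(reader.hexdigest())
```

Because the hash is computed from the bytes as they leave the local machine, it does not depend on how the store splits or reassembles the upload.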
