Valentina,
Your expectation for how DVUploader works is currently correct: the original Dataverse API only allows one file to be uploaded per call, and the dataset is saved after each one, which is what releases your temp space. (pyDataverse or any other tool has to work the same way.) With a recent API addition from Scholars Portal, external apps (the pre-release DVUploader specifically) can now send several files at once, which is faster for many small files but not a good fit for your case.
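For reference, here's roughly what that single add-file call looks like in Python with requests (the server URL, API token and DOI below are placeholders):

import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.1234/ABC/DEFGH"

with open("bigfile.dat", "rb") as fh:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",  # one file per call; the dataset is saved after each
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("bigfile.dat", fh)},
    )
r.raise_for_status()
print(r.json()["status"])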
FWIW: DVUploader will wait for locks on the dataset before sending a new file, but the asynchronous indexing that runs after the save completes can sometimes interfere with a subsequent upload to the same dataset (e.g. with thousands of files). With DVUploader you can always restart to send any files missed on a first run, but a more conservative approach would be to use the --limit=1 flag with DVUploader and the new View the Timestamps on a Dataset API call to script DVUploader (or pyDataverse) to send one file, loop until the last-indexed timestamp from the API call is later than your update, then send the next file, and so on. (Hopefully I/someone will get this into DVUploader itself at some point.) This is a little less efficient (DVUploader starts by getting the list of files in the dataset to compare with the local directory structure, so restarting means calling this again), but it should be able to send all files in one run (barring network or other issues).
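Here's a rough sketch of that loop in Python, in case it's useful. The server URL, token and DOI are placeholders, and both the timestamps endpoint path and the JSON key for the last-indexed time ("lastIndexTime" below) are from memory, so please check them against the API guide for your Dataverse version:

import time
from pathlib import Path

import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.1234/ABC/DEFGH"
HEADERS = {"X-Dataverse-key": API_TOKEN}

def add_file(path):
    # Upload a single file with the native add-file API (one file per call).
    with path.open("rb") as fh:
        r = requests.post(
            f"{SERVER}/api/datasets/:persistentId/add",
            params={"persistentId": PID},
            headers=HEADERS,
            files={"file": (path.name, fh)},
        )
    r.raise_for_status()

def last_index_time():
    # Read the last-indexed timestamp from the View the Timestamps on a Dataset call.
    r = requests.get(
        f"{SERVER}/api/datasets/:persistentId/timestamps",
        params={"persistentId": PID},
        headers=HEADERS,
    )
    r.raise_for_status()
    return r.json()["data"].get("lastIndexTime")  # key name is my best guess

for path in sorted(p for p in Path("to_upload").rglob("*") if p.is_file()):
    before = last_index_time()
    add_file(path)
    # Wait until the dataset has been re-indexed after this upload before sending the next file.
    while last_index_time() == before:
        time.sleep(15)
    print("uploaded and indexed:", path.name)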
As Peter implies, having enough temp space and setting all the timeouts long enough are important, but if you can upload one file of the size you want now, you should be able to handle the repeated uploads by scripting.
The only other upload option I can think of is, if you are on S3 and just haven't enabled direct upload, you could configure a second S3 store pointing at the same bucket, with direct upload enabled, and assign it only to this dataset. The only difference from enabling direct upload on your existing store is that this way it would not impact your other users.
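If it helps, here's a sketch of the kind of JVM options such a second store would need, printed as asadmin commands to run on the Payara host. The option names are the ones I remember from the Dataverse S3 storage docs, and the store id and bucket name are placeholders, so double-check everything against the installation guide for your version:

STORE_ID = "directs3"          # arbitrary id for the new store
BUCKET = "my-existing-bucket"  # same bucket your current store uses

options = [
    f"-Ddataverse.files.{STORE_ID}.type=s3",
    f"-Ddataverse.files.{STORE_ID}.label=DirectUpload",
    f"-Ddataverse.files.{STORE_ID}.bucket-name={BUCKET}",
    f"-Ddataverse.files.{STORE_ID}.upload-redirect=true",  # this is the direct-upload switch
]

for opt in options:
    # Print the asadmin commands to run on the Dataverse/Payara host (quotes are for the shell).
    print(f'./asadmin create-jvm-options "{opt}"')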
(I can also imagine an interesting work-around scenario: the only change in the db when you use a different store is that the store name is what gets prepended to the file storageidentifiers. So one could do something like create a temporary bucket at Amazon, configure a direct-upload store with it and assign it to this dataset, transfer the files as normal, and then, once done, use the aws command line to sync the bucket to your original bucket or file store and edit the db entries to swap from the new store to the original. This is basically a variant on the workarounds where people have uploaded a small file, swapped it manually/outside Dataverse for a bigger one, and then updated the size/hash in the db to match. It is somewhat simpler, perhaps, since you don't have to manage file sizes/hashes.)
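A very rough sketch of that swap, assuming (true in recent Dataverse versions, but check yours) that file storage identifiers live in dvobject.storageidentifier and start with "<store-id>://". The store ids, bucket names and db credentials below are placeholders; try it on a copy of the database first, and check whether a bucket name embedded in the identifier also needs changing in your setup:

import subprocess

import psycopg2

# 1) Copy the objects from the temporary bucket into the original one.
subprocess.run(
    ["aws", "s3", "sync", "s3://temp-upload-bucket", "s3://original-bucket"],
    check=True,
)

# 2) Point the database entries at the original store by swapping the prefix.
conn = psycopg2.connect(dbname="dvndb", user="dataverse", password="...", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE dvobject
           SET storageidentifier = replace(storageidentifier, 'tempdirect://', 'origstore://')
         WHERE storageidentifier LIKE 'tempdirect://%'
        """
    )
    print(f"rewrote {cur.rowcount} storage identifiers")
conn.close()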
For download, scripting sounds like a good way to go. Other options that others can provide more detail about would be the new directory listing page, which lets standard download tools handle the multi-file download (versus you having to write a script), or the standalone zip utility Harvard created, which avoids Dataverse being involved in creating the zipped download file (although I assume you'd still need the 180GB of temp space on some machine for that).
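If you do go the scripting route for download, something like this should work: list the files in the dataset via the native API and fetch them one at a time, so nothing ever has to be zipped. The server URL, token and DOI are placeholders again:

import requests

SERVER = "https://dataverse.example.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
PID = "doi:10.1234/ABC/DEFGH"
HEADERS = {"X-Dataverse-key": API_TOKEN}

# Get the file list for the latest version of the dataset.
files = requests.get(
    f"{SERVER}/api/datasets/:persistentId/versions/:latest/files",
    params={"persistentId": PID},
    headers=HEADERS,
).json()["data"]

for f in files:
    file_id = f["dataFile"]["id"]
    name = f["dataFile"]["filename"]
    # Stream each file to disk so large files don't have to fit in memory.
    with requests.get(f"{SERVER}/api/access/datafile/{file_id}", headers=HEADERS, stream=True) as r:
        r.raise_for_status()
        with open(name, "wb") as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)
    print("downloaded", name)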
Hope that helps!
-- Jim