Hi,
We have been trying to upload large files to Dataverse. In our current scenario, we feel that implementing direct upload to an S3-like solution (either a cloud service or emulation with MinIO) is a bit of an overkill, so we are trying to use DCM/rsync+ssh instead.
We have read the documentation at
https://guides.dataverse.org/en/latest/developers/big-data-support.html#id3 but haven't been able to get the process working end-to-end.
Here are the steps we took:
- Set up the mock DCM
- Configured :DataCaptureModuleUrl and :UploadMethods
- Downloaded the placeholder rsync scripts, after which the GUI showed the dataset as locked with the message "File Upload in Progress – This dataset is locked while the data files are being transferred and verified."
- Put the files in place (data and .sha files)
- Sent the checksum validation to Dataverse:
curl -H "X-Dataverse-key: <superuser_token>" -X POST -H 'Content-type: application/json' --upload-file checksumValidationSuccess.json "<dataverse_host_url>/api/datasets/:persistentId/dataCaptureModule/checksumValidation?persistentId=doi:10.5072/FK2/YFFVJP"
and got:
{}
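For anyone reproducing this step, here is a minimal sketch of how the request URL and JSON body can be assembled. It assumes the payload shape shown in the big data support guide (status, uploadFolder, totalSize); the function name and the example host are hypothetical, and the folder and size values are the ones we used in the import command further down.

```python
import json
from urllib.parse import urlencode


def build_checksum_validation_request(base_url, persistent_id, upload_folder, total_size):
    """Return (url, json_body) for the checksumValidation call.

    The payload keys are an assumption taken from the guide's example
    checksumValidationSuccess.json; adjust them if your version differs.
    """
    payload = {
        "status": "validation passed",
        "uploadFolder": upload_folder,
        "totalSize": total_size,
    }
    # Keep ':' and '/' unescaped so the DOI reads the same as in the curl call.
    query = urlencode({"persistentId": persistent_id}, safe=":/")
    url = (
        base_url.rstrip("/")
        + "/api/datasets/:persistentId/dataCaptureModule/checksumValidation?"
        + query
    )
    return url, json.dumps(payload)


if __name__ == "__main__":
    # "https://dataverse.example.org" is a placeholder host, not our real one.
    url, body = build_checksum_validation_request(
        "https://dataverse.example.org", "doi:10.5072/FK2/YFFVJP", "YFFVJP", 12
    )
    print(url)
    print(body)
```

The body printed here is what would go into checksumValidationSuccess.json, so it can be compared against the file actually being uploaded.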
The requests are received by the mock DCM and do not show any errors:
... [03/Feb/2022 11:53:07] "POST /ur.py HTTP/1.1" 200 -
debug: recieved script request for dataset "FK2/YFFVJP"
... [03/Feb/2022 11:53:07] "POST /sr.py HTTP/1.1" 200 -
After this step, we expected the files to be displayed in the dataset, but that was not the case.
We tried to troubleshoot the import code using the provided command and got a message that the import was in progress:
curl -H "X-Dataverse-key: <superuser_token>" -X POST "<dataverse_host_url>/api/batch/jobs/import/datasets/files/42?uploadFolder=YFFVJP&totalSize=12"
{"status":"OK","data":{"message":"FileSystemImportJob in progress","executionId":3}
However, the files never appear in the dataset GUI.
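One thing we can still check is whether the upload lock remains on the dataset. Below is a rough sketch, assuming the native API's dataset locks endpoint (GET /api/datasets/{id}/locks) and a "lockType" field in each returned entry. The helper names and the sample host are hypothetical, and "DcmUpload" is our assumption for the lock type used by the DCM.

```python
import json
import urllib.request


def locks_url(base_url, dataset_id):
    """Build the locks URL for a dataset database id (42 in our case)."""
    return f"{base_url.rstrip('/')}/api/datasets/{dataset_id}/locks"


def lock_types(response_body):
    """Extract the lock type names from a locks API response body."""
    body = json.loads(response_body)
    return [lock.get("lockType") for lock in body.get("data", [])]


def fetch_lock_types(base_url, dataset_id, api_token):
    """Call the locks endpoint and return the lock types currently set."""
    request = urllib.request.Request(
        locks_url(base_url, dataset_id),
        headers={"X-Dataverse-key": api_token},
    )
    with urllib.request.urlopen(request) as response:
        return lock_types(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical sample response; a real one carries more fields per lock.
    sample = '{"status":"OK","data":[{"lockType":"DcmUpload"}]}'
    print(lock_types(sample))
```

If the upload lock is still listed after the import job reports completion, that would suggest the job did not finish cleanly, and the server log would be the next place to look.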
Could you please share some advice on what we might be doing wrong, or what else we should check?
Also, as stated in the documentation, the DCM method is experimental. In your experience, should we invest time in getting direct upload to S3 working, or should rsync work seamlessly?
Thank you,
José