DCM rsync uploads


j-n-c

Feb 3, 2022, 8:40:45 AM
to Dataverse Big Data
Hi,

We have been trying to upload large files to Dataverse. In our current scenario, we feel that implementing direct upload to an S3-like solution (either a cloud service or an emulation with MinIO) is a bit of an overkill, so we are trying to use DCM/rsync+ssh.
We have read the documentation at https://guides.dataverse.org/en/latest/developers/big-data-support.html#id3 but haven't been able to get the process working end-to-end.

Here are the steps we took:
  • We set up the mock DCM
  • Configured :DataCaptureModuleUrl and :UploadMethods
  • Downloaded the placeholder rsync scripts, which caused the dataset to show "File Upload in Progress – This dataset is locked while the data files are being transferred and verified." in the GUI
  • Put the files in place (data and .sha files)
  • Sent the checksum validation to Dataverse:
curl -H "X-Dataverse-key: <superuser_token>" -X POST -H 'Content-type: application/json' --upload-file checksumValidationSuccess.json <dataverse_host_url>/api/datasets/:persistentId/dataCaptureModule/checksumValidation?persistentId=doi:10.5072/FK2/YFFVJP

and got:
{}
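For reference, the .sha companion files mentioned in the steps above were generated with a small script along these lines. This is only a sketch: the manifest name (`files.sha`) and the sha1sum-style line format (`<hex digest>  <filename>`) are our assumptions, so check the DCM documentation for the exact format the import job expects.

```python
import hashlib
from pathlib import Path

def write_sha_manifest(upload_folder: str, manifest_name: str = "files.sha") -> Path:
    """Compute a SHA-1 checksum for every data file in the upload folder and
    write a sha1sum-style manifest: one "<hex digest>  <filename>" line per file.
    NOTE: manifest name and line format are assumptions, not confirmed DCM spec."""
    folder = Path(upload_folder)
    manifest = folder / manifest_name
    lines = []
    for path in sorted(folder.iterdir()):
        # Skip the manifest itself and anything that is not a regular file
        if path.name == manifest_name or not path.is_file():
            continue
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```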

The requests are received by the mock DCM and do not show any errors:

... [03/Feb/2022 11:53:07] "POST /ur.py HTTP/1.1" 200 -
debug: recieved script request for dataset "FK2/YFFVJP"
... [03/Feb/2022 11:53:07] "POST /sr.py HTTP/1.1" 200 -

After this step, we expected the files to be displayed in the dataset, but that was not the case.

We tried to troubleshoot the import step using the command provided in the guides and got a message that the import was in progress:

curl -H "X-Dataverse-key: <superuser_token>" -X POST "<dataverse_host_url>/api/batch/jobs/import/datasets/files/42?uploadFolder=YFFVJP&totalSize=12"
{"status":"OK","data":{"message":"FileSystemImportJob in progress","executionId":3}}


However, the files never appear in the dataset GUI.
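In case it helps anyone reproducing this, the executionId in the batch response can be pulled out programmatically for follow-up checks. A minimal sketch (the response shape is exactly what the API returned above; we haven't confirmed any follow-up status endpoint, so this only does the parsing):

```python
import json

def parse_import_job(response_text: str) -> tuple[str, int]:
    """Extract the job message and executionId from a batch import response."""
    body = json.loads(response_text)
    data = body["data"]
    return data["message"], data["executionId"]

# Example using the response returned by the batch import API
message, execution_id = parse_import_job(
    '{"status":"OK","data":{"message":"FileSystemImportJob in progress","executionId":3}}'
)
```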

Could you please share some advice on what we might be doing wrong or what else we should check?
Also, as the documentation states, the DCM method is experimental. In your experience, should we invest time in getting direct upload to S3 working, or should rsync work seamlessly?

Thank you,

José

Philip Durbin

Feb 4, 2022, 10:09:04 AM
to Dataverse Big Data
Hi José,

First, a word of warning. We built the DCM/rsync solution years ago and it hasn't been widely adopted. It's possible that it simply doesn't work with new versions of Dataverse. (We aren't in the habit of testing it with each new release.) What version of Dataverse are you using?

I asked our main collaborator for thoughts on your situation and this is what he had to say:

"It's not immediately clear to me either :( If I was going to guess, though, I suspect that they've got a mix of DCM/posix and DCM/s3 that's confusing things. The only place where it looks (to me, for what that's worth) like empty JSON is returned from the checksumValidation API is in the S3 storage driver conditional; and the user mentioning "Put the files in place" makes me think they're not using the zip file on S3 that DCM/s3 produces (and that the API may be expecting). It might be worth asking for info on the Dataverse config and server log to see if that's consistent with this being the source of the problem."

I'm not sure if that's helpful or not. You are welcome to send any configs or logs to sup...@dataverse.org

Finally, one thing I'll add is that in development we used the Docker images described here: https://guides.dataverse.org/en/5.9/developers/big-data-support.html#steps-to-set-up-a-dcm-via-docker-for-development

If we were to dive into troubleshooting this (and I'm not sure that we'll have time to), we would probably start with those Docker images. So, you might want to see if you can get them running.

Thanks for trying the rsync/DCM feature! I'm sorry it didn't "just work". :(

Thanks,

Phil


j-n-c

Feb 8, 2022, 11:49:35 AM
to Dataverse Big Data
Hi Philip,

Thank you very much for your feedback.
We are currently running Dataverse v5.9.
Based on your comments and on what we have seen in community discussions, we have decided to abandon the rsync/DCM approach and will instead try to implement direct upload to S3 using either MinIO or Gluster.
Once we get this up and running, we could contribute documentation on how to set up this scenario (direct uploads to S3 using emulation on local storage). Do you think this would be useful?
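For anyone following along, the direct-upload setup we are planning looks roughly like this. It is only a sketch based on the S3 storage section of the configuration guides; the store id `minio`, the bucket name, the endpoint, and the credentials are placeholders for our environment, not values anyone should copy as-is.

```shell
# Run a local MinIO instance (placeholder credentials, for testing only)
docker run -d -p 9000:9000 --name minio \
  -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data

# Point a Dataverse S3 store at it (run from the Payara bin directory)
./asadmin create-jvm-options "-Ddataverse.files.minio.type=s3"
./asadmin create-jvm-options "-Ddataverse.files.minio.label=minio"
./asadmin create-jvm-options "-Ddataverse.files.minio.bucket-name=dataverse"
./asadmin create-jvm-options "-Ddataverse.files.minio.custom-endpoint-url=http://localhost:9000"
./asadmin create-jvm-options "-Ddataverse.files.minio.path-style-access=true"
./asadmin create-jvm-options "-Ddataverse.files.minio.upload-redirect=true"

# Access keys are read from the standard AWS credentials file (~/.aws/credentials)
```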

Best Regards,

José

Philip Durbin

Feb 8, 2022, 12:11:12 PM
to Dataverse Big Data
Yes! More documentation is absolutely appreciated!
