S3 direct upload with CEPH not working

Patrick Vranckx

Apr 21, 2021, 6:54:43 AM
to dataverse...@googlegroups.com
Hi,

We are using Dataverse v. 5.0 build 175-993d0a3. Data is stored in a
CEPH bucket, and uploads work fine without S3 direct upload.

On our dev server, I'm trying to configure direct upload and download
as explained in the Big Data Support documentation.

Here are the relevant lines in domain.xml:

<jvm-options>-Ddataverse.files.storage-driver-id=s3</jvm-options>
<jvm-options>-Ddataverse.files.s3.custom-endpoint-url=http://10.0.0.1:7480</jvm-options>
<jvm-options>-Ddataverse.files.s3.type=s3</jvm-options>
<jvm-options>-Ddataverse.files.s3.label=s3</jvm-options>
<jvm-options>-Ddataverse.files.s3.bucket-name=dataversetest</jvm-options>
<jvm-options>-Ddataverse.files.s3.upload-redirect=true</jvm-options>
<jvm-options>-Ddataverse.files.s3.download-redirect=true</jvm-options>

and the bucket info:

[glassfish@dataverse-poc ~]$ s3cmd info s3://dataversetest
s3://dataversetest/ (bucket):
Location: us-east-1
Payer: BucketOwner
Expiration Rule: none
Policy: none
CORS: <CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><CORSRule><AllowedMethod>GET</AllowedMethod><AllowedMethod>PUT</AllowedMethod><AllowedOrigin>*</AllowedOrigin><AllowedHeader>*</AllowedHeader><ExposeHeader>ETag</ExposeHeader></CORSRule></CORSConfiguration>
ACL: *anon*: READ
ACL: First User: FULL_CONTROL
URL: http://10.0.0.1:7480/dataversetest/

Using dataverse-uploader with the '-directupload' option, I get a 404
error for "GET /api/datasets/:persistentId/uploadurls?persistentId=doi:10.14428/DVN/LKRIAF...".
Without the option, uploading files works as before.

Using the web GUI, I get these lines in the httpd log:
[21/Apr/2021:12:46:10 +0200] "POST /editdatafiles.xhtml?datasetId=1643&mode=UPLOAD HTTP/1.1" 200 822
[21/Apr/2021:12:46:11 +0200] "POST /editdatafiles.xhtml?datasetId=1643&mode=UPLOAD HTTP/1.1" 200 5641

and the upload fails (Network error). There is nothing relevant in the Payara logs.

Any ideas?

Patrick Vranckx
https://dataverse.uclouvain.be

James Myers

Apr 21, 2021, 9:22:54 AM
to dataverse...@googlegroups.com
Patrick,

I don't have a specific fix, but some suggestions:

FWIW: There's more detail on the direct upload mechanism at https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html (valid for 5.0 as well), which is used in the UI as well as when you call the API directly.

It looks like your call is failing on the first call - to the Dataverse server - to get signed URLs and, looking in the code, it seems like it may be just in trying to set up the s3 client and prior to trying to sign URLs. At a minimum, failing on this call means it isn't CORS related as there hasn't been communication from the browser to s3 directly yet.

Assuming the s3 setup step is failing, it may be that CEPH requires some of the other optional S3 config parameters listed at: https://guides.dataverse.org/en/latest/installation/config.html#s3-storage-options (again, valid in 5.0 but may not have shown in the 5.0 docs due to some typos). Things like chunked-encoding, payload signing, and path style access had to be changed for other non-AWS S3 servers. It is a little odd if these would only affect direct upload and not also cause problems with normal uploads, but I don't know - payload signing in particular sounds like it could possibly affect direct upload only. (Same thing with issues such as accidentally running the Dataverse server as a user that doesn't have s3 credentials - both normal and direct should be affected.)
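
The optional settings mentioned above can be tried one at a time as JVM options. A minimal sketch, assuming the store id `s3` from the domain.xml earlier in the thread and the option names from the S3 storage section of the installation guide. The values are candidates to experiment with on a dev server, not known-good settings for CEPH, and the script only prints the `asadmin` commands rather than running them:

```shell
#!/bin/sh
# Print (do not run) asadmin commands for the optional S3 settings.
# The values are guesses to try on a dev server, not confirmed CEPH requirements.
print_s3_option_cmds() {
  store_id="$1"   # e.g. "s3", matching dataverse.files.s3.* in domain.xml
  for opt in path-style-access=true payload-signing=true chunked-encoding=false; do
    echo "./asadmin create-jvm-options \"-Ddataverse.files.${store_id}.${opt}\""
  done
}

print_s3_option_cmds s3
```

Payara typically needs a restart before changed JVM options take effect, so it is probably easiest to change one option, restart, and retest.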

If you find out that the problem is with some of these optional jvm options, please add an issue/PR so we can document what's needed.

Both direct upload and download require the s3 server to support url pre-signing. If I'm missing something in the code and your server is connecting to S3 OK in that call and there's just a failure at the url signing stage (where I would have expected a 500 error, not a 404), then it could be that CEPH doesn't allow/isn't configured to support url pre-signing.
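
One way to check pre-signing independently of Dataverse is to sign a URL with `s3cmd` and fetch it. This is only a sketch: the object name in the usage comment is hypothetical, `s3cmd signurl` produces GET URLs only (direct upload relies on presigned PUT, which this does not exercise), and it assumes `s3cmd` is already configured for the CEPH endpoint:

```shell
#!/bin/sh
# Sign a GET URL for an existing object and report the HTTP status it returns.
# Usage: check_presigned_get s3://dataversetest/probe.txt   (hypothetical object)
check_presigned_get() {
  object_uri="$1"
  # "+300" asks s3cmd for a URL that expires 300 seconds from now.
  signed_url="$(s3cmd signurl "$object_uri" +300)" || return 1
  # Print only the HTTP status code of the GET against the signed URL.
  curl -s -o /dev/null -w '%{http_code}\n' "$signed_url"
}
```

A 200 suggests the server accepts presigned URLs, while a 403 would point toward a signing or permission problem. The test is only meaningful for an object that is not publicly readable, since a public object would return 200 regardless of the signature.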

One other note, mostly for future reference: once that initial call to Dataverse completes, the uploads themselves are managed by JavaScript in the browser, and any status/error messages will show up in the browser dev console, network tab, etc. Those steps are the ones where CORS issues could occur, and you should see indications in the browser if that happens. The final calls, after the files/file parts are successfully on S3, involve Dataverse again, so any issues there could show up back in the Dataverse logs.
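
The CORS step can also be probed from the command line by imitating the browser's preflight request. A rough sketch, where the origin in the usage comment is a placeholder for whatever URL users load the Dataverse UI from, and where it is assumed that the S3 server answers OPTIONS requests on the bucket URL:

```shell
#!/bin/sh
# Send a CORS preflight (OPTIONS) request the way a browser would before a PUT.
# Usage: cors_preflight http://10.0.0.1:7480/dataversetest/ https://dataverse.example.org
cors_preflight() {
  bucket_url="$1"
  origin="$2"          # the URL users load the Dataverse UI from (placeholder above)
  method="${3:-PUT}"   # direct upload uses PUT against the presigned URL
  curl -s -i -X OPTIONS \
    -H "Origin: $origin" \
    -H "Access-Control-Request-Method: $method" \
    "$bucket_url"
}
```

If the response headers include `Access-Control-Allow-Origin`, the bucket's CORS rule is probably being applied. If that header is missing, the browser would likely block the upload and report it in the console.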

Hopefully something there is helpful,

-- Jim

Patrick Vranckx

Apr 21, 2021, 10:45:18 AM
to dataverse...@googlegroups.com
Jim,

With the environment variables set, this request:

curl -H "X-Dataverse-key:$API_TOKEN" $SERVER_URL/api/datasets/:persistentId/?persistentId=$PERSISTENT_IDENTIFIER

succeeds:

{"status":"OK","data":{"id":1643,"identifier":"DVN/LKRIAF","persist........

But this one fails:

curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/uploadurls?persistentId=$PERSISTENT_IDENTIFIER&size=$SIZE"

{"status":"ERROR","code":404,"message":"API endpoint does not exist on this server. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"https://my-server-url/api/v1/datasets/:persistentId/uploadurls?persistentId=doi:10.14428/DVN/LKRIAF&size=1000000000","requestMethod":"GET"}

Could it be that this API endpoint is missing? Or is it an upgrade problem? This installation is the result of multiple upgrades from v4.18.

On the other hand, I tried pre-signed url and it seems ok:


Looking at the configuration docs, I see no mandatory options that are missing or incorrectly set in domain.xml. Everything works fine without S3 direct upload.

Hope this helps !
Thank you for your help,

Patrick



--

"No one is safe from a good idea"

James Myers

Apr 21, 2021, 11:56:53 AM
to dataverse...@googlegroups.com

Ah, sorry. At least a partial answer: 5.0 predates the introduction of multipart uploads in 5.1, so the /uploadurls endpoint indeed doesn't exist there. However, there is a (now deprecated but still existing) /uploadsid endpoint, which should work up to your S3 server's partSize limit (configurable on MinIO and capped at 5 GB for AWS). The one difference is that uploadsid has no size param, so just:

curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/uploadsid?persistentId=$PERSISTENT_IDENTIFIER"
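
For scripts that need to work against both older and newer servers, the version dependence described above can be sketched as a small helper. The function name is made up for illustration, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils). The version string itself would have to come from the server, for example from the `/api/info/version` endpoint if your installation exposes it:

```shell
#!/bin/sh
# Pick the upload-URL endpoint for a given Dataverse version string.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
upload_endpoint_for_version() {
  version="$1"
  # If "5.1" sorts first (or ties), the given version is 5.1 or later.
  lowest="$(printf '%s\n%s\n' "5.1" "$version" | sort -V | head -n 1)"
  if [ "$lowest" = "5.1" ]; then
    echo "uploadurls"   # 5.1 and later: multipart-capable endpoint, takes size
  else
    echo "uploadsid"    # before 5.1: single presigned URL, no size param
  fi
}

upload_endpoint_for_version "5.0"   # prints: uploadsid
```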

The UI should be using that older method internally in 5.0, so that doesn’t explain why the UI upload didn’t work. It would be useful to confirm in the browser console/network pane that the error is actually in that first call back to Dataverse and not, for example, a separate issue when the presigned URL gets used with the S3 store (which could be CORS, etc.)

FWIW: I think you can do a complete API test if you just use the /uploadsid endpoint as the first step. That gives you a single signed URL which you can then use to upload to S3 (same as the single part case in the 5.4 docs) and then finish with the final call to add the file to your dataset (again the same as in the 5.4 docs).
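
The three steps above could be scripted roughly as follows. This is a sketch rather than a tested recipe for 5.0: the function names and the `DRY_RUN` switch are made up for illustration, the `x-amz-tagging` header and the `jsonData` form field follow the single-part example in the 5.4 guide (whether 5.0 or CEPH needs or accepts the tagging header is an assumption), and the JSON passed in the last step has to carry the storageIdentifier returned by the first call plus the file metadata described in the guide:

```shell
#!/bin/sh
# Expects API_TOKEN, SERVER_URL and PERSISTENT_IDENTIFIER to be set, as in the thread.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Step 1: ask Dataverse for a single presigned upload URL (pre-5.1 endpoint).
request_upload_url() {
  run curl -s -H "X-Dataverse-key:$API_TOKEN" \
    "$SERVER_URL/api/datasets/:persistentId/uploadsid?persistentId=$PERSISTENT_IDENTIFIER"
}

# Step 2: PUT the file straight to the S3 store using the presigned URL.
upload_to_s3() {
  file="$1"
  presigned_url="$2"
  run curl -s -H "x-amz-tagging:dv-state=temp" -X PUT -T "$file" "$presigned_url"
}

# Step 3: tell Dataverse about the uploaded file so it is added to the dataset.
register_file() {
  json_data="$1"   # must include the storageIdentifier from step 1
  run curl -s -X POST -H "X-Dataverse-key:$API_TOKEN" \
    "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_IDENTIFIER" \
    -F "jsonData=$json_data"
}
```

Running it with `DRY_RUN=1` first shows the exact curl commands, which makes it easier to see which of the three steps fails when they are run for real.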

Hopefully either checking in the console for UI errors or trying the /uploadsid endpoint and subsequent API calls per the docs will uncover some other issue that is preventing this from working.

-- Jim
