Dataverse and S3 Buckets

Zacarias Benta

May 4, 2022, 4:25:04 AM
to Dataverse Users Community
Cheers everyone,
We are currently deploying a Dataverse instance and are having some issues with the integration of S3 in our implementation.
The Dataverse web interface keeps hanging randomly for about 20 seconds, and it only does that when we deploy it with an S3 bucket as the default storage medium.
It also behaves strangely when we try to upload files to a dataset: sometimes when the upload finishes there are no files to be found anywhere, whether we look in the web interface or in the S3 bucket.
If we use only local storage, it works like a charm: no random hangs and no timeouts.
The weird part is that it sometimes works fine, and we can't seem to find a pattern that triggers the "freezing" of the web interface.
Have you ever experienced a similar situation?

j-n-c

May 4, 2022, 11:32:23 AM
to Dataverse Users Community
Hi,

That never happened to me.
Are you using S3 on AWS or emulated on the filesystem (MinIO, ...)?
If you are using S3 on AWS, here are some things you can test when it freezes:
  • Does the Dataverse server have any issues connecting to the internet?
  • Can you list your buckets and their contents from the Dataverse server using the AWS CLI? (See the example after this list.)
  • What is the server's CPU, RAM, and I/O consumption? Could it be that the server is too busy?
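For example, a quick check from the Dataverse server could look like this (the bucket name and endpoint below are placeholders, not taken from your setup):

# List all buckets visible to the configured credentials
aws s3 ls

# List the contents of one bucket
aws s3 ls s3://my-dataverse-bucket

# For a non-AWS S3 endpoint (Ceph, MinIO, ...), point the CLI at it explicitly
aws s3 ls s3://my-dataverse-bucket --endpoint-url https://my-s3-endpoint.example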
Regards,

José

Don Sizemore

May 4, 2022, 2:27:04 PM
to dataverse...@googlegroups.com
Have you looked for S3AccessIO in your Payara server.log?
Do you have upload-redirect enabled, and if so, do the bucket permissions allow CORS?
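If it helps, a minimal way to check both (assuming a standard Payara 5 layout; the domain path below is an assumption and may differ in your container):

# Look for S3AccessIO messages (errors, timeouts) in the Payara server log
grep S3AccessIO /usr/local/payara5/glassfish/domains/domain1/logs/server.log

# Check whether upload-redirect is set for your S3 store
./asadmin list-jvm-options | grep upload-redirect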

Don


Zacarias Benta

May 5, 2022, 4:20:04 AM
to Dataverse Users Community
Hi j-n-c,

Our S3 is a Ceph-to-S3 implementation, so this might be the issue here.
The machine is dedicated to this purpose; it is a "beefy" one with 16 cores and 32 GB of RAM.
There is no considerable load on the machine; it consumes 4 GB of RAM and 2 cores with the current setup, and while the web interface freezes we see no significant increase in resource consumption.
We are not using direct upload, although we've tried it without success.

Zacarias Benta

May 5, 2022, 4:20:41 AM
to Dataverse Users Community
Thanks for the tip, Don; we'll look into it.

James Myers

May 5, 2022, 6:57:04 AM
to dataverse...@googlegroups.com

Another thought would be to set dataverse.files.<id>.connection-pool-size to a value greater than 256. That option was introduced in v5.1.1, after Dataverse switched to using a pool of S3 connections in 5.1 to be more efficient. If you are on Dataverse >= 5.1 and you are doing tests that open many S3 connections (uploads, thumbnail retrievals, etc.), and/or those connections aren't getting closed quickly (more likely on older Dataverse versions, as we have found and fixed cases where Dataverse leaves a connection open for a while, but it is also possible that your Ceph implementation has a longer timeout than AWS, which would make this worse for you), then increasing this value should help. Increasing the pool uses some memory, but making it 10-20 times bigger should still be fine if that helps the freezing problem.
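For reference, options under dataverse.files.<id> are set as JVM options on Payara; a minimal sketch, assuming a store id of "s3" and an example pool size of 4096 (pick a value that fits your memory budget):

# Set a larger S3 connection pool for the store with id "s3"
./asadmin create-jvm-options "-Ddataverse.files.s3.connection-pool-size=4096"
# Restart Payara afterwards for the new option to take effect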

 

-- Jim

Zacarias Benta

May 9, 2022, 5:36:30 AM
to Dataverse Users Community
Thanks for the tips, Jim.

Here is a simplified architecture of our deployment:

Untitled Diagram.jpg

We are now trying to use direct upload to the bucket, so we can spare Dataverse the job of collecting all the data and then sending it to the bucket.
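(For reference, direct upload is enabled per store with the upload-redirect JVM option; a minimal sketch, assuming the store id is "s3":)

./asadmin create-jvm-options "-Ddataverse.files.s3.upload-redirect=true"
# Restart Payara for the change to take effect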
CORS is now our main issue.
We enabled CORS on the bucket:

{
    "CORSRules": [
        {
            "AllowedHeaders": [
                "x-requested-with"
            ],
            "AllowedMethods": [
                "GET",
                "PUT",
                "DELETE",
                "POST"
            ],
            "AllowedOrigins": [
                "https://mydataverse.com"
            ],
            "ExposeHeaders": [
                "ETag"
            ]
        }
    ]
}
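(For context, a configuration like this can be applied to the bucket with the AWS CLI; the bucket name and file name below are just examples:)

aws s3api put-bucket-cors --bucket my-dataverse-bucket --cors-configuration file://cors.json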

We also added rules to nginx (which we use as a reverse proxy in front of the container running Dataverse) to append the "Access-Control-Allow-Origin" header to the requests.

We tried adding this setting in the server section:
add_header 'Access-Control-Allow-Origin' 'mystoragehost.com';

We also tried adding the settings everyone recommends:


Still, we get the same error reporting that:

Any ideas on how to solve this issue?

James Myers

May 9, 2022, 7:10:50 AM
to dataverse...@googlegroups.com

The simplest solution would be to open up the CORS origin as discussed in https://guides.dataverse.org/en/latest/developers/big-data-support.html (you can drop POST and DELETE, though). FWIW: direct upload and download both use signed URLs that Dataverse creates only for users who should be able to access a given resource and that are only valid for a short (configurable) time, so allowing * does not open up access as much as it would with endpoints that offer public access or have simple username/password controls, etc.
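A minimal sketch of the relaxed rules, in the same format as the configuration posted above (the exact rules in the guide may differ slightly, so treat this as an example rather than the canonical config):

{
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["ETag"]
        }
    ]
}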

 

If you want to tighten things up, you may also want to look at https://github.com/GlobalDataverseCommunityConsortium/dataverse-previewers/wiki/Using-Previewers-with-download-redirects-from-S3, which discusses other security mechanisms you'd have to address (specifically, Content-Security-Policy by default prohibits the Javascript used for direct upload from adding an Origin header, which I think is why you're seeing 'missing' in the error). If you figure out what's needed there, we'd be happy to have a PR to get that into the guides for others to follow.
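If your nginx is also setting a Content-Security-Policy header for the Dataverse pages, one hypothetical place to start (a sketch only, using mystoragehost.com as the placeholder from the earlier message) is allowing connections from the page's Javascript to the storage endpoint via connect-src:

# Sketch: permit XHR/fetch from the Dataverse pages to the S3 endpoint
add_header Content-Security-Policy "connect-src 'self' https://mystoragehost.com";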

Zacarias Benta

May 13, 2022, 5:10:17 AM
to Dataverse Users Community
We came up with a new approach.
We are now using a MinIO server with CORS enabled, and it works like a charm.
The only weird thing we notice is that the files are uploaded to the bucket on MinIO (we can see them in the bucket structure), but the Dataverse web interface progress bar keeps moving.
As you can see below, the last PUT was successful; it took about 3 minutes.
Screenshot from 2022-05-13 09-56-31.png
The file is already in the bucket, as you can see below:

Screenshot from 2022-05-13 09-57-24.png

If we open the dataset in another window, the file is not there yet:

Screenshot from 2022-05-13 10-00-37.png


We have no idea what Dataverse is doing; the progress bar in the UI keeps moving, and there is no load on the container or on the MinIO machine.

The MinIO machine:
Screenshot from 2022-05-13 10-04-05.png

The Dataverse container:
Screenshot from 2022-05-13 10-03-34.png


All the containers' stats:
Screenshot from 2022-05-13 10-03-45.png

It took about 7 minutes from selecting the file to upload until the whole process finalized. The upload to MinIO took about 3 minutes, and the web interface only finished the progress bar 4 minutes after the file was uploaded. This is weird behavior; we see no load on any of the machines and have no idea what is going on in the background. Any thoughts?

James Myers

May 13, 2022, 7:23:09 AM
to dataverse...@googlegroups.com

Direct upload via the UI includes a second pass through the file to calculate the file hash (which can be used to verify that Dataverse and future downloaders have exactly the same file as the one on the uploader's disk). Depending on the relative speeds of your network and machine, the first half of the progress bar (uploading to S3) and the second half (calculating the hash) can proceed at fairly different rates.

 

Also, as with normal upload, the files are uploaded prior to being added to the dataset, i.e. it is only when you hit Save that the dataset registers the new files as part of it.

 

FWIW: the DVUploader combines upload with hash calculation in one pass through the file, hence it can be somewhat faster. The decision to upload and hash sequentially in the UI was solely due to not knowing of any Javascript library that would allow doing both in parallel (which is straightforward in Java/the DVUploader).

Zacarias Benta

May 16, 2022, 5:06:53 AM
to Dataverse Users Community
Thanks for the tips, Jim.
And now, "for something completely different", we have another issue. Whenever we try to upload several files at the same time, we get no error on the interface, but the last few files on the list are never processed, and if we press Done the files that were lucky enough to be processed are deleted from the MinIO bucket, and the dataset has no trace of any new files ever being uploaded to it.

Screenshot from 2022-05-16 10-02-27.png
Here are the logs from our Dataverse and nginx containers:

dataverse        | [#|2022-05-16T09:02:52.401+0000|INFO|Payara 5.2021.1|edu.harvard.iq.dataverse.util.FileUtil|_ThreadID=134;_ThreadName=http-thread-pool::http-listener-1(8);_TimeMillis=1652691772401;_LevelValue=800;|
dataverse        |   Deleting minio://dataversefccn:180cc1987b5-9974eb3ad4a8|#]
dataverse        |
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "DELETE /dataversefccn/10.82210/H70SOA/180cc1987b5-9974eb3ad4a8 HTTP/1.1" 204 0 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?prefix=10.82210%2FH70SOA%2F180cc1987b5-9974eb3ad4a8.&encoding-type=url HTTP/1.1" 200 335 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
dataverse        | [#|2022-05-16T09:02:52.421+0000|INFO|Payara 5.2021.1|edu.harvard.iq.dataverse.util.FileUtil|_ThreadID=134;_ThreadName=http-thread-pool::http-listener-1(8);_TimeMillis=1652691772421;_LevelValue=800;|
dataverse        |   Deleting minio://dataversefccn:180cc1989a3-74473b01e127|#]
dataverse        |
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "DELETE /dataversefccn/10.82210/H70SOA/180cc1989a3-74473b01e127 HTTP/1.1" 204 0 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?prefix=10.82210%2FH70SOA%2F180cc1989a3-74473b01e127.&encoding-type=url HTTP/1.1" 200 335 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
dataverse        | [#|2022-05-16T09:02:52.441+0000|INFO|Payara 5.2021.1|edu.harvard.iq.dataverse.util.FileUtil|_ThreadID=134;_ThreadName=http-thread-pool::http-listener-1(8);_TimeMillis=1652691772441;_LevelValue=800;|
dataverse        |   Deleting minio://dataversefccn:180cc198c63-322a6658b1e3|#]
dataverse        |
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "DELETE /dataversefccn/10.82210/H70SOA/180cc198c63-322a6658b1e3 HTTP/1.1" 204 0 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?prefix=10.82210%2FH70SOA%2F180cc198c63-322a6658b1e3.&encoding-type=url HTTP/1.1" 200 335 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/?acl HTTP/1.1" 200 309 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"
nginx            | 192.168.1.101 - - [16/May/2022:09:02:52 +0000] "GET /dataversefccn/10.82210/H70SOA/dataset_logo.thumb140 HTTP/1.1" 404 405 "-" "aws-sdk-java/1.11.762 Linux/5.4.0-109-generic OpenJDK_64-Bit_Server_VM/11.0.7+10-LTS java/11.0.7 vendor/Azul_Systems,_Inc." "-"

And as you can see, no files named Screenshot* are present in the dataset:

Screenshot from 2022-05-16 10-05-34.png

Don Sizemore

May 16, 2022, 7:10:41 AM
to dataverse...@googlegroups.com
Hello,

Are you running 5.10+? I'm wondering if you're hitting this:
https://github.com/IQSS/dataverse/pull/8409

Don

Zacarias Benta

May 23, 2022, 7:22:58 AM
to Dataverse Users Community
Thanks for the heads-up, Don.

Just applied the patch and it works like a charm.