Migrate dataverse file storage from local filesystem to S3 storage

Michel Bamouni

Jan 22, 2019, 8:16:50 AM

to Dataverse Users Community

Hi,

We have been using Dataverse in production for a year, and uploaded files are stored on the local filesystem.

With the ability to use custom S3 storage in the latest Dataverse release, we are planning to replace local filesystem storage with S3.

Is there a way to migrate the existing files from the filesystem to S3 and ensure that these files are still accessible from Dataverse?

Best regards,

Michel

Philip Durbin

Jan 22, 2019, 8:56:17 AM

to dataverse...@googlegroups.com

Hi Michel,

Harvard Dataverse migrated from physical servers with files stored on NFS to virtual servers (AWS EC2 instances) with files stored on S3. There's a checklist about this migration linked from https://github.com/IQSS/dataverse/issues/4309 and it mentions a "files-to-S3 script" (which I wasn't able to find, unfortunately), but I think the checklist does a good job of explaining the steps that were taken.

I hope this helps,

Phil

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/76597081-1b1a-442f-99df-0b1ecee88476%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Philip Durbin
Software Developer for http://dataverse.org
http://www.iq.harvard.edu/people/philip-durbin

James Myers

Jan 22, 2019, 9:50:39 AM

to dataverse...@googlegroups.com

Michel,

Phil pointed me at the DV issue and scripts (somewhere in Google Docs), and here’s my summary, with a slight script update.

The key steps are to move the files to S3 and then to update the database to indicate that those files are to be retrieved from S3.

For QDR, I cd’d to the Dataverse files/ directory and ran the following.

Do a dry run:

aws s3 sync ./10.5072 s3://qdr-dataverse-dev/10.5072 --dryrun

Run the sync of everything in the 10.5072 dir tree to the bucket:

aws s3 sync ./10.5072 s3://qdr-dataverse-dev/10.5072

and then the same thing for our production authority. The temp file directory is still needed and doesn’t get transferred to S3. (--dryrun is a great way to make sure you’ve got the paths correct and that the command does what you want before you run it for real.)
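One rough way to check that the sync copied everything is to compare the number of local files with the number of objects under the S3 prefix. This is only a sketch, not part of Jim's procedure: the function names are made up for illustration, and a matching count doesn't prove the contents are identical.

```shell
# Count regular files under a local directory tree.
count_local_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Count objects under an S3 prefix. Needs the AWS CLI and valid credentials.
count_s3_objects() {
  aws s3 ls "$1" --recursive | wc -l | tr -d ' '
}

# Report whether the two counts agree; returns non-zero on a mismatch.
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "OK: $1 files locally and in S3"
  else
    echo "MISMATCH: $1 local vs $2 in S3"
    return 1
  fi
}
```

From the files/ directory, usage would look like: compare_counts "$(count_local_files ./10.5072)" "$(count_s3_objects s3://qdr-dataverse-dev/10.5072/)". Re-running the sync with --dryrun and seeing no pending uploads is another quick check.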

To be cautious, I just moved the file directories rather than deleting the data from the file system, and kept them until I had everything working in S3.

I then ran the following updates in SQL. The first updates the storageidentifier for DataFiles and the second does the same for Datasets. (As far as I know, the latter isn’t really needed since the storageidentifier for Datasets is not used, but leaving them as file://... doesn’t make much sense and could cause problems if it ever gets used or if I’ve missed somewhere it is used.)

You should definitely back up your databases first. That is always a good idea, but here it’s easy to cut/paste with the wrong bucket (i.e. from a dev/test server to production), and then it can be tricky to recover. From my notes, I think these queries came from Dataverse, with the exception that I think I added a check for NULL storageidentifiers (QDR had some from older versions/testing).

Then run the following update queries in Postgres:
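For the backup, something along these lines should work. It's a minimal sketch that assumes pg_dump is on the path and can connect as the current user; the function names are hypothetical, and the database name you pass in depends on your installation.

```shell
# Build a timestamped file name for a dump of the given database.
backup_filename() {
  echo "$1-$(date +%Y%m%d-%H%M%S).dump"
}

# Dump the database in pg_dump's custom format and print the file name.
backup_db() {
  out="$(backup_filename "$1")"
  pg_dump --format=custom --file="$out" "$1" && echo "$out"
}
```

Call it as backup_db dvndb (dvndb is an assumed database name here). A custom-format dump is restored with pg_restore rather than psql.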

UPDATE dvobject
SET storageidentifier = 's3://qdr-dataverse-dev:' || storageidentifier
WHERE id IN (
  SELECT o.id
  FROM dvobject o, dataset s
  WHERE o.dtype = 'DataFile'
    AND s.id = o.owner_id
    AND s.harvestingclient_id IS NULL
    AND o.storageidentifier NOT LIKE 's3://%'
    AND o.storageidentifier IS NOT NULL
);

UPDATE dvobject
SET storageidentifier = overlay(storageidentifier placing 's3://' from 1 for 7)
WHERE id IN (
  SELECT o.id
  FROM dvobject o, dataset s
  WHERE o.dtype = 'Dataset'
    AND s.id = o.id
    AND s.harvestingclient_id IS NULL
    AND o.storageidentifier LIKE 'file://%'
);
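Before running the updates, it may be worth previewing what they would change. The sketch below prints a read-only SELECT that uses the same conditions as the DataFile update, and shows in plain shell what the overlay() call in the Dataset update does to a value. The function names and the sample identifier are made up for illustration.

```shell
# Print a read-only query listing DataFile rows the first UPDATE would
# change, with the current and the rewritten storageidentifier side by side.
# $1 is the bucket name.
preview_datafile_sql() {
  cat <<EOF
SELECT o.id,
       o.storageidentifier AS current_value,
       's3://$1:' || o.storageidentifier AS new_value
FROM dvobject o, dataset s
WHERE o.dtype = 'DataFile'
  AND s.id = o.owner_id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier NOT LIKE 's3://%'
  AND o.storageidentifier IS NOT NULL
ORDER BY o.id
LIMIT 20;
EOF
}

# Shell equivalent of overlay(... placing 's3://' from 1 for 7): swap the
# 7-character 'file://' prefix for 's3://'. Only meaningful for values that
# start with 'file://', which the SQL guarantees with its LIKE condition.
rewrite_dataset_id() {
  echo "s3://${1#file://}"
}
```

The preview can be piped straight into psql, e.g. preview_datafile_sql qdr-dataverse-dev | psql dvndb (the database name is an assumption). As an illustration of the second update, rewrite_dataset_id 'file://10.5072/FK2/EXAMPLE' prints s3://10.5072/FK2/EXAMPLE.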

Also, FWIW, since the cached thumbnails and metadata exports will also be in S3, you’ll need an aws command to delete them rather than a file system rm (e.g. to refresh them when a new version updates the content of metadata files). The following S3 command deletes just the cached files:

aws s3 rm s3://qdr-dataverse-dev/ --recursive --exclude "*" --include "*.cached"
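Since a recursive delete is hard to undo, a small wrapper that defaults to --dryrun can help. This is a sketch only: the function name and the --for-real switch are made up for illustration, while the aws s3 rm options are the same ones used in the command above plus --dryrun.

```shell
# Delete only the *.cached objects in a bucket.
# Lists what would be deleted unless --for-real is passed as second argument.
purge_cached() {
  if [ "${2:-}" = "--for-real" ]; then
    aws s3 rm "s3://$1/" --recursive --exclude "*" --include "*.cached"
  else
    aws s3 rm "s3://$1/" --recursive --exclude "*" --include "*.cached" --dryrun
  fi
}
```

Running purge_cached qdr-dataverse-dev only prints the deletions it would make; purge_cached qdr-dataverse-dev --for-real actually removes the cached files. The filter order matters: --exclude "*" drops everything first, and the later --include "*.cached" adds the cached files back.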

Hope that helps,

--Jim


Michel Bamouni

Jan 24, 2019, 4:09:39 AM

to Dataverse Users Community

Hi Jim and Philip,

Thanks for your answers. I will try the steps you describe, Jim.

Best regards,

Michel
