Migration of records without moving the file data?


Bethany Seeger

Aug 28, 2024, 4:56:23 PM
to Dataverse Users Community
  ** Cross-posted from Zulip. There was the thought I might get more visibility here with these questions. **

Hello,

We have a few collections we'd like to migrate into Dataverse where the files are already in an S3 bucket and curated by another application. Ideally we wouldn't have to move the files, as they could, in theory, be accessed from where they already are; plus they already have Handles pointing to them there (not that we couldn't change those pointers, I think). We'd like to just give Dataverse access to this other bucket, in addition to its other datastores.

I know it'd be straightforward, via the native API, to move the *metadata* into Dataverse.  For the files, if we didn't want to migrate them, would we essentially be following the process for moving a large data set?  
0) ensure that the second S3 bucket is configured to be accessible by Dataverse

1) have the metadata migration create placeholder files for the datasets

2) have a script that manipulates the Dataverse database to point to the right S3 bucket and location within it. (This would be more than just replacing a placeholder, as the files wouldn't be where the placeholder was set.)
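For what it's worth, here is a rough sketch of what step 2 might involve. It assumes a datafile's location lives in the `dvobject` table's `storageidentifier` column in a form like `s3://<bucket>:<object-name>`; the table, column, and identifier format here are assumptions to verify against your own Dataverse database before scripting anything. The sketch only builds the SQL, it doesn't run it:

```python
# Hypothetical sketch: build (but do not execute) the UPDATE statement that
# would repoint one datafile's storage identifier at an existing object in
# a second bucket. The "s3://<bucket>:<key>" form and the
# dvobject.storageidentifier column are assumptions -- inspect a real
# Dataverse database first.

def make_storage_identifier(store_label: str, bucket: str, key: str) -> str:
    """Compose a storage identifier in the assumed Dataverse S3 form."""
    return f"{store_label}://{bucket}:{key}"

def make_update_sql(dvobject_id: int, identifier: str) -> str:
    """Emit the SQL that would repoint one datafile (for review, not execution)."""
    return (
        "UPDATE dvobject SET storageidentifier = "
        f"'{identifier}' WHERE id = {dvobject_id};"
    )

identifier = make_storage_identifier("s3", "curated-bucket", "path/to/file.csv")
print(make_update_sql(42, identifier))
```

Reviewing the generated SQL against a database backup first seems prudent, since (as noted below in the thread) hand-editing the database is not a supported path.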

Would this work?

There are a few unknowns for us --
1. Can Dataverse link to multiple S3 buckets? Yes; Phil Durbin already confirmed this.
2. Is manipulating the database the only way to make the connection from the datasets to the files in S3?

Note: As mentioned, we do have Handles on the files that point directly to the files in the buckets, and one thought we've had is to just use those as links to the data in the Dataverse record.

(I don't think OAI-PMH harvesting would be enough for this collection, because the datasets wouldn't technically be hosted elsewhere to point to. The goal here is to have the data in one place and have both the curation tool and the public access website (Dataverse) access it from there.)

I'm still very new to Dataverse, so there might be other options I'm missing. I'd love to hear some perspectives on this.

Best,
Bethany

qqm...@hotmail.com

Sep 9, 2024, 1:16:03 PM
to Dataverse Users Community
Bethany,

There are different options depending on what your overall goal is w.r.t. what Dataverse manages. If you just have files that are currently web accessible and want to refer to them in Dataverse, you could configure a remote store. There's an API to add files when using such a store that lets you submit the URL where they can be accessed. That is best for a case where you're creating the dataset in Dataverse as the official one and just want to refer to public files elsewhere.
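To make the remote-store option concrete, here is a sketch of what the add-file call might look like. The endpoint path, the `<store-id>://<path>` identifier form, and all field names here are my assumptions to check against the Dataverse API Guide; the instance URL, store id, and dataset PID are placeholders:

```python
import json

# Sketch of registering an already-web-accessible file with a dataset via
# the native API's add-file call when a remote store is configured.
# Endpoint path, the "<store-id>://<path>" identifier form, and the field
# names are assumptions to verify against the Dataverse API Guide.

API_BASE = "https://demo.dataverse.example"   # hypothetical instance
DATASET_PID = "doi:10.5072/FK2/EXAMPLE"       # hypothetical dataset PID

def remote_file_payload(store_id: str, relative_url: str, label: str) -> dict:
    """Build the JSON metadata describing a file that stays in the remote store."""
    return {
        "storageIdentifier": f"{store_id}://{relative_url}",  # points at the remote file
        "fileName": label,
        "description": "File registered in place; bytes stay in the remote store.",
    }

endpoint = f"{API_BASE}/api/datasets/:persistentId/add?persistentId={DATASET_PID}"
payload = remote_file_payload("curated", "collection1/file.csv", "file.csv")
print(endpoint)
print(json.dumps(payload, indent=2))
```

In practice this payload would be POSTed to the endpoint with an API token in the `X-Dataverse-key` header.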

Given that your files have Handles, another possible option would be to consider moving the files and configuring Dataverse to be able to manage their Handles, which would allow Dataverse to redirect those Handles to the new file locations. Configuring Dataverse for multiple PID providers is described in the Guides (multiple so that you can continue to use whatever Handle/DOI provider you use now for new datasets, and use this Handle provider just for these files). We have migration APIs that can retain existing PIDs for datasets; I don't recall if they also work for file PIDs, and if not, this approach would mean adding the Handles to the db directly. In this case, Dataverse would keep the Handle records up-to-date if you ever change file names/paths/metadata, etc., and the files would be stored and managed the standard way Dataverse does for all files (which could still be a completely separate bucket, or even the original bucket if that is needed).
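For the migration-API route Jim mentions, the native API's import-with-existing-PID call is the relevant piece. The sketch below just assembles the URL for it; the base URL, collection alias, and Handle are placeholders, and the endpoint and parameters should be double-checked against the Dataverse API Guide:

```python
from urllib.parse import urlencode

# Sketch of the URL for the native "import dataset with existing PID" call.
# The instance URL, collection alias, and Handle below are hypothetical;
# verify the endpoint and query parameters against the API Guide.

API_BASE = "https://demo.dataverse.example"  # hypothetical instance
COLLECTION_ALIAS = "mycollection"            # hypothetical collection alias

def import_url(pid: str, release: bool = False) -> str:
    """Build the :import endpoint URL, keeping the dataset's existing PID."""
    query = urlencode({"pid": pid, "release": "yes" if release else "no"})
    return f"{API_BASE}/api/dataverses/{COLLECTION_ALIAS}/datasets/:import?{query}"

# The dataset's JSON metadata would be POSTed to this URL
# (with an API token in the X-Dataverse-key header).
print(import_url("hdl:20.500.12345/ABCDE"))
```

Whether this call retains file-level Handles (as opposed to the dataset PID) is, as noted above, something to confirm before relying on it.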

While it may still be technically possible to configure Dataverse to use your existing bucket and to edit the db to refer to the files via their existing paths/names (treating them the same as internal files, including allowing Dataverse to delete them, store auxiliary files and thumbnails in that bucket, etc.), I wouldn't recommend this, as we usually store files in paths associated with the dataset PID with names that are UUIDs. You can't add files with other paths/names via API (which was allowed at one time), and there may be newer code that will break when this assumption doesn't hold. (Basically I think this should be considered deprecated now that we have remote stores and other options.)

I think you're right about harvesting not being good for your use case. That said, there are groups looking at what I would call an OAI-PMH proxy that would let you define a dataset via an XML metadata file and provide that to Dataverse via OAI-PMH, which would allow treating a website with files as a harvested dataset. This approach would make sure the dataset is indexed in Dataverse (based on the machine-readable metadata in your XML file), but there would be no dataset or file pages in Dataverse; Dataverse would just redirect people to the URLs you provide for those.

Hopefully that helps some. If you start to narrow down which approach looks best, there are groups who have used remote stores, multiple PID providers, etc. who could help with any given approach.

-- Jim

