Dataverse Migration Scripts (4.20 to 5.1)


Kaitlin Newson

Dec 1, 2021, 12:40:04 PM
to Dataverse Users Community
Hi everyone,

The Scholars Portal Dataverse and University of Alberta Dataverse teams recently completed a project to migrate around 700 datasets between the two Dataverse installations, and I wanted to share the scripts developed for this project with the community for those who may find them useful in their own migration projects. These were developed by Victoria Lubitch at Scholars Portal and were used to migrate datasets from version 4.20 to 5.1.

Get in touch if you have any questions about them or the overall migration process! We plan to share more about the project with the community in the future as well, so stay tuned!

Philip Durbin

Dec 2, 2021, 10:00:02 AM
to dataverse...@googlegroups.com
Hi Kaitlin,

Great stuff! I do have a few questions.

I'm thinking we should link to these scripts from the guides somewhere. Maybe a subheading called "migrating datasets from one Dataverse installation to another"? What do you think? Perhaps a future version of this page: https://guides.dataverse.org/en/5.8/admin/dataverses-datasets.html

How are you dealing with DOIs? I assume a dataset had a DOI associated with the University of Alberta and then got a new DOI under Scholars Portal? Did you put the old DOI in the alternativepersistentidentifier table? (This table was added in https://github.com/IQSS/dataverse/pull/5064 .)

For a long time I've thought about how some installations of Dataverse could be incubators for future installations. That is, Harvard Dataverse could host Whatever University for a while until that university launches its own installation of Dataverse. At that point, maybe the datasets get migrated using your scripts. Again, great stuff.

Thanks,

Phil


Kaitlin Newson

Dec 3, 2021, 10:18:44 AM
to Dataverse Users Community
Hi Phil,

Linking in the guides sounds great! I don't expect we'll be updating the scripts for future versions of DV unless we have a use case, but of course anyone could take these and update them.

For DOIs, datasets that were published in the University of Alberta installation kept their DOIs - this was possible because we migrated the DOIs from the UofA DataCite account over to the Scholars Portal account. If your DataCite account doesn't own these DOIs, you'll get an error when trying to publish. The Dataverse import APIs support importing the PID. We didn't migrate file-level DOIs because these can't currently be imported in Dataverse. Unpublished datasets in UofA got a new DOI, since their DOIs had never been minted.
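For anyone curious what the PID-preserving import looks like, here is a rough sketch against the Dataverse native ":import" endpoint. The host, collection alias, DOI, and token below are placeholders for illustration, not our actual values:

```python
# Sketch: importing a dataset into a Dataverse collection while keeping
# its existing DOI, via the native ":import" endpoint. All names below
# (host, alias, DOI) are placeholders.
import json
import urllib.parse
import urllib.request

def build_import_url(base_url, collection_alias, pid, release=False):
    """Build the native-API URL that imports a dataset under an existing PID."""
    query = urllib.parse.urlencode(
        {"pid": pid, "release": "yes" if release else "no"},
        safe=":/",  # keep the DOI readable in the query string
    )
    return f"{base_url}/api/dataverses/{collection_alias}/datasets/:import?{query}"

url = build_import_url("https://demo.dataverse.org", "root",
                       "doi:10.7939/DVN/EXAMPLE", release=True)

# To actually run the import, POST the dataset's JSON (as exported from the
# source installation) with an API token, e.g.:
# req = urllib.request.Request(
#     url,
#     data=json.dumps(dataset_json).encode(),
#     headers={"X-Dataverse-key": API_TOKEN,
#              "Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

Publishing (release=yes) is the step that fails if your DataCite account doesn't own the DOI, since that's when the DOI registration gets updated.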

So if Harvard wanted to go this route and migrate DOIs to the new repository, then at migration time you'd need to get a list of the DOIs and transfer them to the new institution's DataCite account in order to maintain them.

Philip Durbin

Dec 3, 2021, 10:53:41 AM
to dataverse...@googlegroups.com
Thanks, Kaitlin! Interesting about the DataCite arrangements. I did go ahead and create an issue to add a link to your scripts to the guides: https://github.com/IQSS/dataverse/issues/8275

John Huck

Dec 3, 2021, 1:50:05 PM
to dataverse...@googlegroups.com
Hi all,

And thanks, Kaitlin, for sharing the migration news and details about the scripts.

Jim had previously set up a repo on the DGCC GitHub at my request to share these kinds of utilities. We never got around to adding the things we were going to put there, but maybe this would be a good place for Victoria's scripts. I believe this was the repo (since there's nothing in it):


Regards,

John
--
John Huck, MLIS
Metadata Librarian / University of Alberta Library / he/him/his
ᐊᒥᐢᑲᐧᒋᐋᐧᐢᑲᐦᐃᑲᐣ / Amiskwaciwâskahikan / Edmonton
The University of Alberta respectfully acknowledges that we are situated on Treaty 6 territory, traditional lands of First Nations and Métis people.


Sherry Lake

Dec 6, 2021, 10:39:22 AM
to Dataverse Users Community
Another question, about workflow & DOIs ...

So on migration, are the datasets deleted (deaccessioned) or destroyed in the original repository?

If deaccessioned, then the original DOI (location) would show the deaccessioned landing page. Or does the script destroy (totally remove) the datasets from their original location?

--
Sherry Lake

John Huck

Dec 6, 2021, 11:33:54 AM
to dataverse...@googlegroups.com
Hi Sherry,

The source Dataverse repository has been decommissioned, so the datasets are no longer available at the old location. Scholars Portal is now our "home" platform.

Best Regards,

John


Philipp at UiT

Dec 14, 2021, 10:50:16 AM
to Dataverse Users Community
Thanks for sharing this, Kaitlin!

You mentioned that file-level DOIs currently cannot be imported in Dataverse. Does this mean that datasets that were published in the Alberta Dataverse don't have file-level DOIs anymore in the Scholars Portal Dataverse?

What is the reason why file-level DOIs cannot be imported/migrated? Is it because there is no alternativepersistentidentifier table at file level?

We are planning to move the TROLLing repo (https://trolling.uit.no/), which is currently a sub-collection of DataverseNO (https://dataverse.no/), to its own installation. I guess we'd like to migrate the file-level DOIs as alternativepersistentidentifiers. Also, when we started with TROLLing, we still used Handles, which are now placed in the alternativepersistentidentifier field. When TROLLing is migrated to its own installation, I guess all datasets will need to get a new DOI. So I'm wondering where we put the current DOI, as the alternativepersistentidentifier field is already occupied by the old Handles. Do we need a second alternativepersistentidentifier field?

Best,
Philipp

John Huck

Dec 14, 2021, 12:20:50 PM
to dataverse...@googlegroups.com
Hi Philipp,

I can answer your question. The reason file-level DOIs were not migrated is that Scholars Portal has chosen not to activate this feature. U of A decided earlier this year that continuing to assign them was not a priority for us either, so we had also deactivated the feature in our installation before the migration. So we didn't investigate whether they could be migrated with the scripts. It's likely possible with some additional work.

This probably doesn't pertain to your question, but might be of interest to other folks: the solution we came up with for handling the approximately 2,000 existing file-level DOIs was to use the DataCite API to update each DOI's registered URL to point to the URL form of the DOI of its parent dataset. The DOIs of the datasets will remain active and updated, so it should not be necessary to update the file-level DOIs again. If necessary, they could be updated the same way.

We also changed the state of these DOIs to "registered" (i.e., active, but not findable or indexed), and ultimately we will transfer ownership of these DOIs to Scholars Portal as well (we haven't finished the task yet).
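To make the two DataCite updates above concrete (repointing a file-level DOI's URL at its parent dataset, and hiding it from the index), here is a rough sketch of the JSON:API request; the DOIs below are made-up placeholders, and the real calls need your DataCite repository credentials:

```python
# Sketch of the two DataCite REST API updates described above:
# 1. repoint a file-level DOI's URL to the URL form of its parent dataset DOI
# 2. move the DOI from "findable" to "registered" via the "hide" event
# The DOIs here are placeholders; real requests use HTTP basic auth.
import json

DATACITE_API = "https://api.datacite.org/dois/"

def build_doi_update(file_doi, parent_doi, hide=True):
    """Return (method, url, body) for a DataCite JSON:API DOI update."""
    attributes = {
        # Point the file DOI at the parent dataset's landing page.
        "url": f"https://doi.org/{parent_doi}",
    }
    if hide:
        attributes["event"] = "hide"  # findable -> registered
    body = {"data": {"type": "dois", "attributes": attributes}}
    return "PUT", DATACITE_API + file_doi, json.dumps(body)

method, url, body = build_doi_update("10.7939/DVN/EXAMPLE/FILE1",
                                     "10.7939/DVN/EXAMPLE")

# Send with any HTTP client, e.g.:
# requests.put(url, data=body, auth=(REPOSITORY_ID, PASSWORD),
#              headers={"Content-Type": "application/vnd.api+json"})
```

Looping a builder like this over the list of file DOIs (pulled from the database or a DataCite export) is roughly what the bulk update amounts to.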

This is not a perfect solution, but it was a practical option: it means we did not need to create and maintain a tombstone page (or pages) for these DOIs, and it gets a user as close to the file as possible (for files that still exist and haven't been deaccessioned). Now that the file-level DOIs are no longer searchable, only DOIs that have been cited somewhere will need to be dereferenced, and we think the number of files cited in this way is probably quite small. So, on balance, we think this was a reasonable solution.

Regards,

John


Philipp at UiT

Dec 14, 2021, 12:42:11 PM
to Dataverse Users Community
Hi John,

Thanks for your answer. Your approach sounds reasonable.

File-level DOIs are a kind of double-edged sword for us. On the one hand, granular PIDs are recommended, but on the other hand, file DOIs can cause trouble when users want to publish many files in one dataset and don't want them packed into container files for archiving (e.g. because they want the files to be accessible to previewers). We usually experience problems (DataCite timeouts?) when trying to publish datasets with more than 250-300 files or so. So I guess we'll need to reconsider file DOIs before we migrate TROLLing to its own Dataverse installation.

Best,
Philipp