OAI-ORE and BagIt for archiving - feedback requested


Jim Myers

May 22, 2018, 10:52:58 AM
to Dataverse Users Community
Hi All,
As I mentioned in the last community call, QDR is interested in being able to archive published datasets in DPN (http://dpn.org). To that end, I've done some proof-of-concept work to generate, for a published dataset, an OAI-ORE map file and a BagIt bag (which uses and includes the ORE map file) that I hope can form the basis for a DPN submission. We're very interested in any feedback on the conceptual design as well as on the specifics of the code and the details of the generated files.
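For orientation, the bag is structured roughly as shown below. The layout and file names here are illustrative only; the linked documentation and example files are authoritative:

    mybag/
       bagit.txt                  # BagIt declaration: spec version and tag-file encoding
       bag-info.txt               # bag-level metadata (source organization, bagging date, size, ...)
       manifest-<alg>.txt         # per-file checksums for everything under data/
       tagmanifest-<alg>.txt      # checksums for the tag files themselves
       metadata/oai-ore.jsonld    # the ORE map describing the dataset (illustrative name/path)
       data/
          ...                     # the dataset's data files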

I've posted some documentation that describes the use case and design rationale, runs down some of the choices we've made to get to a proof of concept, and lists some open issues. That documentation links to two example files: a JSON-LD ORE map and a BagIt bag for a test dataset. (FWIW: we're developing in the QDR fork of Dataverse at https://github.com/QualitativeDataRepository/dataverse/tree/feature/QDR-953. For now, this branch is 4.8.6-compatible and makes it possible to generate the ORE map and BagIt files from the metadata export menu.)

To start, I'd be happy to get feedback and answer general questions through this community forum. If anyone is interested in more in-depth/detailed discussions, we can jump out to email or github issues and I'll periodically update things here.

Thanks,

 -- Jim

Philip Durbin

May 22, 2018, 11:53:04 AM
to dataverse...@googlegroups.com
Wow! This is a lot of great detail, Jim! Thanks! I haven't had time to dig into it yet, but I wanted to let you know that I just added this project to the "dev efforts by the Dataverse community" spreadsheet: https://groups.google.com/d/msg/dataverse-community/X2diSWYll0w/ikp1TGcfBgAJ

Thanks!

Phil


Jim Myers

Jun 12, 2018, 2:10:50 PM
to Dataverse Users Community
A quick update before the Dataverse 2018 Meeting: 

I've been continuing to work on this and have posted new example files on GitHub (see the end of https://github.com/QualitativeDataRepository/dataverse/wiki/Data-and-Metadata-Packaging-for-Archiving). The changes since the initial post include updating to the BagIt 1.0 spec, using namespaces in the ResourceMap JSON-LD file, including all of the license/terms-of-access/terms-of-use entries from Dataverse, consolidating the terms and vocabulary mapping into a couple of classes, and using SHA-256 hashes (required by DPN).
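To give a flavor of the namespacing change: the ORE map's @context maps short prefixes to the vocabularies in use, so terms in the file resolve to full URIs. The fragment below is illustrative only; the real examples are on the wiki page above:

    {
      "@context": {
        "ore":     "http://www.openarchives.org/ore/terms/",
        "dcterms": "http://purl.org/dc/terms/",
        "schema":  "http://schema.org/"
      },
      "@type": "ore:ResourceMap",
      "ore:describes": {
        "@type": "ore:Aggregation",
        "dcterms:title": "An example dataset title",
        "ore:aggregates": [ ... one entry per file ... ]
      }
    }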

For those attending the meeting, I'm presenting on this topic on Friday and look forward to the chance to discuss it before or after the session. If you won't be there, I'll post the slides and capture the results of any discussion here.

Thanks,
 Jim


Pete Meyer

Jun 12, 2018, 4:46:11 PM
to Dataverse Users Community



This reminds me of the discussions around supporting alternative checksum algorithms (i.e., MD5 or SHA-1), versus supporting multiple checksum algorithms simultaneously (i.e., MD5, SHA-1, SHA-256, ...). For https://github.com/IQSS/dataverse/issues/3354 we went with the first approach for lower complexity. Any thoughts on whether it makes sense to revisit that?
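(For context on the "multiple" option: BagIt itself handles it naturally, since a bag may carry one payload manifest per algorithm, each listing the same files. Placeholder hashes below:)

    manifest-md5.txt:
        <md5-hex>     data/file1.csv
        <md5-hex>     data/file2.txt

    manifest-sha256.txt:
        <sha256-hex>  data/file1.csv
        <sha256-hex>  data/file2.txt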

Jim Myers

Jun 19, 2018, 12:35:54 PM
to Dataverse Users Community
Pete - yeah, I think only one algorithm is needed, so the plan is to just add SHA-256 and SHA-512 as options alongside the current MD5 and SHA-1. I may also add a test-and-upgrade method that verifies a file's existing hash in one algorithm and replaces it with a hash computed in the currently configured algorithm. In any case, I'll post this as a separate issue, linked to the one you mention and to the overall ORE/BagIt issue, to get feedback...
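Roughly, that upgrade step would look something like the sketch below (invented class and method names, not actual Dataverse code):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class ChecksumUpgrader {

        // Verify the stored checksum under its original algorithm, then
        // return a replacement hash computed with the new algorithm.
        public static String rehashIfValid(Path file, String oldAlg, String oldHex,
                                           String newAlg) throws Exception {
            if (!hash(file, oldAlg).equalsIgnoreCase(oldHex)) {
                throw new IllegalStateException("existing " + oldAlg
                        + " checksum does not match; refusing to upgrade");
            }
            return hash(file, newAlg); // caller stores this as the new checksum
        }

        // Stream the file through MessageDigest; alg is e.g. "MD5" or "SHA-256".
        private static String hash(Path file, String alg) throws Exception {
            MessageDigest md = MessageDigest.getInstance(alg);
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }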

Thanks,
  Jim

Amber Leahey

Jul 30, 2018, 3:31:57 PM
to Dataverse Users Community
Hi Jim, thanks again for sharing this and opening the discussion with the community. At SP / OCUL we are planning something similar but a bit different: using the Archivematica tool to generate BagIt AIPs of our Dataverse datasets. Archivematica inherently produces a METS metadata file for tracking data and metadata transfers and different kinds of preservation events (it uses the PREMIS vocabulary to record preservation events and encapsulates this, along with some descriptive metadata, in the METS file). METS has been described as a somewhat complicated and lengthy standard, but it has a lot of components that we think will help us better recreate and reuse the preserved digital objects in the future. It will be interesting to compare to your approach, no doubt.

For now, we are working with the Artefactual team to finalize the data and metadata transfer procedures in Archivematica, which uses the Dataverse Data Access API to get files. The transfer grabs the Dataverse- and user-generated metadata in JSON format, along with the original and derivative data files; processes these and creates METS and PREMIS metadata to accompany the data; and then, as a final step, packages the files into the AIP structure and transfers it to our locally hosted distributed cloud system for long-term preservation. You can review the basic steps on the Archivematica wiki: https://wiki.archivematica.org/Dataverse
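For reference, the Dataverse API calls involved are roughly of this shape (server, DOI, and file id are placeholders):

    # dataset-level metadata as JSON (native API)
    GET $SERVER/api/datasets/:persistentId/?persistentId=doi:10.5072/FK2/EXAMPLE

    # a data file; format=original requests the original of an ingested tabular file
    GET $SERVER/api/access/datafile/$FILE_ID?format=original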

One of the more interesting parts of the project is the mapping between the Dataverse-generated DDI metadata and METS for storing the original descriptive metadata. We envision reusing this METS metadata in some kind of archival storage / digital asset management layer for identifying preserved objects in our shared digital research archive (similar to DPN). Open metadata about the preserved objects will, in all likelihood, be key to ensuring integration between repositories and preservation systems. Either way, our projects sound very similar, and I'm looking forward to digging into your OAI-ORE-generated metadata and AIPs!
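For anyone who hasn't worked with METS: it can wrap the original DDI record directly in a descriptive-metadata section, along the lines of this abbreviated, illustrative fragment:

    <mets:dmdSec ID="dmdSec_1" xmlns:mets="http://www.loc.gov/METS/">
      <mets:mdWrap MDTYPE="DDI">
        <mets:xmlData>
          <codeBook xmlns="ddi:codebook:2_5">
            <!-- DDI title, authors, abstract, etc., as exported by Dataverse -->
          </codeBook>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:dmdSec>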

We are anticipating a production release of Archivematica sometime in the fall, so things are moving quickly in terms of testing on our end. We will keep you posted on what we find. For now, I've attached a METS file that captures some additional things, though I haven't fully dug into it all yet! :)

Best, 
Amber
Attachment: METS.c5b5eba7-43e7-4cdc-9ca9-2ae81569071c.xml