Request for Comments re: proposed OAI-ORE and archival Bag updates


James Myers

Nov 26, 2025, 5:00:14 PM
to Dataverse Users Community

All,

 

With support from DANS and QDR, I'm going to be making changes to the archival bag processing to address a couple of relatively minor issues with the current bags and to generally improve scaling to larger bags, probably in time for Dataverse 6.10 next spring.

 

The scaling changes should generally be transparent to any Dataverse instance using archival bags, but the fixes require a change to the OAI_ORE metadata export (which is included in the bag) and to the directory structure within the bag.

 

With this email, I'm writing to ask whether anyone using the OAI_ORE export and/or archival bags has concerns, or has a reason why these changes would need to be optional or otherwise provide backward compatibility. Please contact me if you have concerns or need additional information.

 

Thanks,

-- Jim

 

James D. Myers

qqm...@hotmail.com

217-417-1786

 

 

Change 1: Use a URI for the hash algorithm used for files in the OAI_ORE metadata export.

 

Currently the OAI_ORE export includes file checksums in the following format:

"dvcore:checksum":{"@type":"SHA-512","@value":"eb76035019729f6b24cec6c792e95b4e1b5108d0f6895d02ab2732c77980f1ee2249570c5e8964c009dbb58a677d46444cffc41a1137f3e200f8e2b04d52fe40"}}

 

However, the JSON-LD 1.1 specification requires that an @type entry be an IRI (or a term that maps to one). The current plan is to adopt the URIs from https://www.w3.org/TR/xmlsec-algorithms/#digest-method-uris for the algorithms Dataverse uses, probably mapped through an @context entry for readability, e.g.

 

"@context":{"SHA-512":"http://www.w3.org/2001/04/xmlenc#sha512"}

 

"dvcore:checksum":{"SHA-512","@value":"eb76035019729f6b24cec6c792e95b4e1b5108d0f6895d02ab2732c77980f1ee2249570c5e8964c009dbb58a677d46444cffc41a1137f3e200f8e2b04d52fe40"}}

 

Nominally this change would not break JSON parsing of the OAI-ORE (e.g. code expecting the value "SHA-512") and would enable JSON-LD 1.1 validation/parsing.
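
For illustration, here is a minimal sketch of how a JSON-LD 1.1 processor would resolve the "SHA-512" term once the @context entry is in place. This is not part of the proposal itself; the pyld library, the dvcore prefix IRI, and the abbreviated hash value are assumptions used only for the sketch:

from pyld import jsonld  # assumed JSON-LD 1.1 processor

# Illustrative fragment; the dvcore prefix IRI is assumed for this sketch.
doc = {
    "@context": {
        "dvcore": "https://dataverse.org/schema/core#",
        "SHA-512": "http://www.w3.org/2001/04/xmlenc#sha512"
    },
    "dvcore:checksum": {"@type": "SHA-512", "@value": "eb7603...52fe40"}
}

# Expansion replaces the "SHA-512" term with the full digest-method IRI,
# while plain JSON consumers still see the familiar "SHA-512" string.
print(jsonld.expand(doc))
# [{'https://dataverse.org/schema/core#checksum':
#    [{'@type': 'http://www.w3.org/2001/04/xmlenc#sha512', '@value': 'eb7603...52fe40'}]}]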

 

(The Schema.org additionalType reported for the OAI-ORE file would be incremented to “Dataverse OREMap Format v1.0.2” as we’ve recently begun versioning the OAI-ORE export.)

 

Change 2: Drop use of the dataset title as a directory under the /data directory in the bag

 

Currently, the Dataverse bag structure places data files at /{dataset PID}/data/{dataset title}/{file directory path including internal / chars}/{original file name}. Long titles, i.e. longer than 255 characters, can cause errors when unzipping the bag on a local file system. (As the directory path and file name fields are limited to 255 characters in Dataverse, they do not cause this problem.)

 

As including the title is nominally for readability and does not affect processing of the archival bags, the proposal is to drop use of the title and to place files at /{dataset PID}/data/{file directory path including internal / chars}/{original file name}. (In this case, the inclusion of the dataset PID still ensures that bags for multiple datasets can be unzipped in the same directory.)
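
For example, using the placeholders above and a hypothetical relative path of survey/wave1.csv, a file currently unpacked to

/{dataset PID}/data/{dataset title}/survey/wave1.csv

would instead be unpacked to

/{dataset PID}/data/survey/wave1.csv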

 

It would be possible to retain some backward compatibility by using a truncated title, or by allowing use of the full title as a configurable option (which would mean that long titles could still cause unzipping issues).

 

None of these options should affect automated processing: generic bag processing should find files via the manifest-{alg}.txt file, and Dataverse-specific processing should find each file's @id and metadata via the metadata/oai-ore.jsonld file and then use the metadata/pid-mapping.txt file to locate the files within the bag based on their @id. In either case, the directory structure/conventions should not be hardcoded.
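
As a rough illustration of that metadata-driven approach, here is a minimal Python sketch. It assumes that pid-mapping.txt pairs an identifier with a bag-relative path separated by whitespace and that the OREMap lists files under ore:describes/ore:aggregates; both assumptions, as well as the bag path used, should be checked against the bags you actually hold.

import json
from pathlib import Path

def load_pid_mapping(bag_root: Path) -> dict:
    # Assumed format: one "identifier path" pair per line, whitespace-separated.
    mapping = {}
    for line in (bag_root / "metadata" / "pid-mapping.txt").read_text().splitlines():
        line = line.strip()
        if line:
            identifier, _, path = line.partition(" ")
            mapping[identifier] = path.strip()
    return mapping

def aggregated_file_ids(bag_root: Path) -> list:
    # Assumed OREMap layout: the Aggregation under ore:describes lists files in ore:aggregates.
    ore = json.loads((bag_root / "metadata" / "oai-ore.jsonld").read_text())
    aggregates = ore.get("ore:describes", {}).get("ore:aggregates", [])
    return [entry["@id"] for entry in aggregates if "@id" in entry]

bag_root = Path("unzipped-bag")  # hypothetical path to an unzipped bag
locations = load_pid_mapping(bag_root)
for file_id in aggregated_file_ids(bag_root):
    print(file_id, "->", locations.get(file_id, "not listed in pid-mapping.txt"))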

 

 

 

 

Julie

Dec 18, 2025, 2:34:41 PM
to Dataverse Users Community
Hi Jim,

Borealis makes use of the archival bag export feature alongside the Archivematica integration to support institutional preservation workflows. We have no concerns with the proposed changes. Thank you for sharing your plans!

That said, there are other changes we're interested in seeing for the bag export workflow, e.g. 
- the option to include full DDI XML, variable metadata and other metadata files in the bag export
- the option to include tabular derivatives in the export (currently only the originals are present)
- the option to include auxiliary files (e.g. file provenance) in the export
- the ability for all users to see the "archived" status of a dataset version

We're still discussing these other changes and will come back with a better picture when that's available! We're also interested in any conversations about OCFL and handling versioning :)

Best,
Julie

On behalf of the Borealis team

James Myers

Dec 19, 2025, 10:26:20 AM
to dataverse...@googlegroups.com

Thanks Julie,

I think all these other changes make sense – they just haven't been done. FWIW: I think there is probably a split between the content required to recreate the dataset in Dataverse (which includes any human-generated content like DDI variable metadata changes and submitted aux files) and the extra info Dataverse creates (tab files, other metadata exports, etc.) that is useful for preservation. The former should always be added to the bags, whereas the latter probably needs a switch, as some places won't want to store info that can be re-derived.

-- Jim

