Handling of RData derivatives with Archivematica integration

Julie

Mar 10, 2025, 10:59:07 AM
to Dataverse Users Community
Hello hello!

We have been testing the Dataverse integration with Archivematica and have encountered an issue with processing the RData derivatives that DV generates for tabular data files. 

We would like to get some feedback on how the community would like these derivatives to be handled so that we can implement a fix to push to the main codebase.

Archivematica context: Archivematica is a preservation processing system that ingests files selected for preservation, runs various tasks (format identification, metadata extraction, normalization, etc.), and generates an archival package (AIP) and metadata record (METS file) for long-term management.

The integration was sponsored by Scholars Portal/OCUL from 2015-2018 to support research data preservation.

The issue: The integration uses the DV API to query DV, display a list of published datasets in the Archivematica dashboard, and fetch the selected dataset for processing in Archivematica. 

For each dataset, Archivematica references the dataset.json file to confirm that all expected dataset files are received and to parse metadata uploaded to or generated by DV into the METS file for the archival package.

The dataset.json file does not mention the RData derivatives, but when Archivematica finds a tabular data file in dataset.json, it is set up to automatically add an entry for the corresponding RData derivative. Archivematica does not, however, ask DV to generate the derivative on the fly; it assumes that the RData derivative already exists.
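
As a rough illustration of the behaviour described above (not the actual integration code), the expected-file list could be built along these lines. The record shape (`filename`, `originalFileFormat`) is a simplified assumption about what dataset.json contains, and `expected_files` is a hypothetical helper name.

```python
def expected_files(dataset_files):
    """Return the filenames Archivematica would expect for a dataset.

    `dataset_files` is a simplified, assumed stand-in for the file entries
    in dataset.json: dicts with "filename" and, for ingested tabular
    files, "originalFileFormat".
    """
    expected = []
    for entry in dataset_files:
        name = entry["filename"]
        expected.append(name)
        # Assumption for this sketch: a tabular file is one that records
        # the format of the original upload.
        if entry.get("originalFileFormat"):
            stem = name.rsplit(".", 1)[0]
            # The RData derivative is added even though dataset.json never
            # lists it and DV may not have generated it yet.
            expected.append(stem + ".RData")
    return expected
```

With one tabular file and one plain text file, this list contains three names, and the transfer is considered incomplete if the `.RData` file does not arrive.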

This presents us with a problem: DV only creates the RData file when a user manually requests it through the DV interface. Once generated, DV then caches the RData file for future use and can send the file to AM for processing. In these cases, processing proceeds as normal. If an RData derivative has not been requested before, however, Archivematica does not receive an RData file and so processing fails.

The possible fixes: We haven’t fully explored the technical feasibility of the options below, but we're weighing a few possibilities:
  • Derivatives are required: If the JSON file lists tabular files, DV creates RData derivatives on the fly (if they don’t already exist) and sends them to Archivematica for processing with the rest of the dataset (this has been suggested by folks from the University of Victoria)
  • Derivatives are optional: If RData derivatives are not available, Archivematica skips the file, excludes it from subsequent jobs, and continues processing. A log of skipped RData derivatives should also be generated
  • Derivatives are excluded: Archivematica only receives user-uploaded data files (i.e. tab-delimited and RData derivatives from DV are not sent to AM for processing), with the rationale that they could all be recreated.
In all cases, an exception should be made for files that are originally in RData format: an RData derivative is unnecessary and should not be expected. Generating an RData derivative may also cause an overwrite of the original, which will create a separate problem.
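
To make the three options easier to compare, here is a minimal sketch of how each one might behave. Everything in it is hypothetical: the file record shape (`tabular`, `original_is_rdata`), the policy names, and the `fetch_rdata` callable, which stands in for whatever request the integration would make to DV. It returns the derivative bytes, or `None` if no derivative is available.

```python
import logging

logger = logging.getLogger("rdata_sketch")


def plan_derivatives(files, policy, fetch_rdata):
    """Decide which RData derivatives to include under a given policy.

    files: list of dicts with "filename", "tabular" (bool) and
        "original_is_rdata" (bool); a hypothetical shape.
    policy: "required", "optional" or "excluded".
    fetch_rdata: callable(filename, generate=bool) -> bytes or None.
    Returns (included, skipped) lists of filenames.
    """
    if policy not in ("required", "optional", "excluded"):
        raise ValueError(f"unknown policy: {policy}")
    included, skipped = [], []
    for f in files:
        # Exception for all options: non-tabular files and files that
        # were uploaded as RData never get a derivative.
        if not f.get("tabular") or f.get("original_is_rdata"):
            continue
        if policy == "excluded":
            continue
        # Only the "required" option asks DV to generate on the fly.
        data = fetch_rdata(f["filename"], generate=(policy == "required"))
        if data is not None:
            included.append(f["filename"])
        elif policy == "optional":
            logger.warning("No RData derivative for %s; skipping", f["filename"])
            skipped.append(f["filename"])
        else:
            raise RuntimeError(f"RData derivative missing for {f['filename']}")
    return included, skipped
```

In this sketch the "optional" policy is the only one that produces a skipped list, which is what the log of skipped derivatives would be built from.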

What we would like to know: which approach makes the most sense to you? Is there another approach that we might take instead? In general, how are you using or planning to use the Archivematica integration?

Related conversations: We are aware of recent conversations about ongoing support for generating RData derivatives for tabular data files, so it would be great to hear if there are any updates on this front that we should consider for our integration work. 

At Scholars Portal, we're also interested to hear if others would like to see derivatives generated by Dataverse included in BagIt exports as well. At the moment, BagIt exports only include the original files uploaded by users. Having derivatives included would mean more consistency in the contents of preservation packages created via the Archivematica integration and BagIt export workflows.

Many thanks in advance!

Julie
Scholars Portal
