Hello hello!
We have been testing the Dataverse integration with
Archivematica and have encountered an issue with processing the RData
derivatives that DV generates for tabular data files.
We would like to get some feedback
on how the community would like these derivatives to be handled so that
we can implement a fix to push to the main codebase.
Archivematica context: Archivematica is a preservation processing system that ingests files selected for preservation, runs various tasks (format identification, metadata extraction, normalization, etc.), and generates an archival package (AIP) and metadata record (METS file) for long-term management.
The integration was sponsored by Scholars Portal/OCUL from 2015-2018 to support research data preservation.
The issue: The integration uses the DV API to query DV, display a list of
published datasets in the Archivematica dashboard, and fetch the selected dataset
for processing in Archivematica.
For each dataset, Archivematica references the dataset.json file to confirm that all expected dataset files are received and to
parse metadata uploaded to or generated by DV into the METS file for the archival package.
The dataset.json file does not mention the RData derivatives, but when Archivematica finds a tabular data file in dataset.json, it is set up to
automatically add an entry for it. Archivematica, however, is not requesting to have an RData derivative generated on the fly -- it assumes that the RData derivative already exists.
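To make the mismatch concrete, here is a rough Python sketch of the behaviour described above. It is not the actual Archivematica code: the simplified record layout (a `filename` plus an `is_tabular` flag) is a hypothetical stand-in for the file entries in dataset.json, and the assumption that the derivative shares the file's name with an `.RData` extension is for illustration only.

```python
from pathlib import PurePosixPath


def expected_files(records):
    """Build the list of files expected for a dataset.

    `records` is a simplified, hypothetical stand-in for the file entries
    in dataset.json: dicts with a `filename` and an `is_tabular` flag.
    """
    expected = []
    for rec in records:
        expected.append(rec["filename"])
        if rec["is_tabular"]:
            # Assumed naming: the derivative keeps the stem and gets .RData.
            derivative = PurePosixPath(rec["filename"]).with_suffix(".RData").name
            expected.append(derivative)
    return expected


def missing_files(records, received):
    """Return expected files that were not actually received."""
    received = set(received)
    return [name for name in expected_files(records) if name not in received]


records = [
    {"filename": "survey.tab", "is_tabular": True},
    {"filename": "readme.txt", "is_tabular": False},
]
# No derivative was requested or generated, so only the listed files arrive.
received = ["survey.tab", "readme.txt"]
missing = missing_files(records, received)
print(missing)
```

In this sketch the derivative entry is added without checking whether the derivative exists, so it shows up as missing when the received files are compared against the expected list.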
The possible fixes: We haven’t fully explored the technical feasibility of the options below, but we're weighing a few possibilities:
- Derivatives are required:
If the JSON file lists tabular files, DV creates RData derivatives on
the fly (if they don’t already exist) and sends them to Archivematica
for processing with the rest of the dataset (this has been suggested by folks from the University of Victoria).
- Derivatives are optional:
If RData derivatives are not available, Archivematica skips the file,
excludes it from subsequent jobs, and continues processing. A log of
skipped RData derivatives should also be generated.
- Derivatives are excluded:
Archivematica only receives user-uploaded data files (i.e.
tab-delimited and RData derivatives from DV are not sent to Archivematica for
processing), with the rationale that they could all be recreated.
In all cases, an exception should be made for files that are originally in
RData format: an RData derivative is unnecessary and should not be
expected. Generating an RData derivative may also cause an overwrite of
the original, which will create a separate problem.
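As a way of comparing the three options side by side, here is a small Python sketch of how each policy might treat a dataset's tabular files. Everything in it is hypothetical: the `DerivativePolicy` enum, the `plan_derivatives` helper, and the use of the `.RData` extension to detect files that are originally in RData format are assumptions for discussion, not existing Archivematica or Dataverse code.

```python
from enum import Enum


class DerivativePolicy(Enum):
    REQUIRED = "required"   # ask DV to generate missing derivatives
    OPTIONAL = "optional"   # skip and log missing derivatives
    EXCLUDED = "excluded"   # never expect derivatives


def is_original_rdata(filename):
    # Assumption: original RData files can be recognised by extension.
    return filename.lower().endswith(".rdata")


def plan_derivatives(originals, available, policy):
    """Decide what to do about the RData derivative of each tabular file.

    `originals` lists the tabular files by their original names;
    `available` is the set of those files whose derivative already exists.
    Returns (process, request, skipped) lists of original filenames.
    """
    process, request, skipped = [], [], []
    for name in originals:
        # Exception for all policies: no derivative for original RData files.
        if is_original_rdata(name) or policy is DerivativePolicy.EXCLUDED:
            continue
        if name in available:
            process.append(name)
        elif policy is DerivativePolicy.REQUIRED:
            request.append(name)   # to be generated on the fly by DV
        else:
            skipped.append(name)   # to be written to the skip log
    return process, request, skipped


originals = ["survey.csv", "census.sav", "model.RData"]
available = {"survey.csv"}
for policy in DerivativePolicy:
    print(policy.value, plan_derivatives(originals, available, policy))
```

The only difference between the "required" and "optional" options in this sketch is what happens to a tabular file whose derivative is missing: it is either queued for generation or recorded as skipped.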
What we would like to know: which approach makes the most sense
to you? Is there another approach that we might take instead? In
general, how are you using or planning to use the Archivematica integration?
Related conversations: We are aware of recent conversations about ongoing support for generating RData derivatives for tabular data files, so it would be great to hear if there are any updates on this front that we should consider for our integration work.
At Scholars Portal, we're also interested to hear if others would like to see derivatives generated by Dataverse included in
BagIt exports as well. At the moment, BagIt exports only include the original files uploaded by users. Having derivatives included would mean more consistency in the contents of preservation packages created via the Archivematica integration and BagIt export workflows.
Many thanks in advance!
Julie
Scholars Portal