Discussion: Handling of RData derivatives with Dataverse integration

Julie S

Mar 3, 2025, 4:54:17 PM

to archivematica

Hello!

We have been testing the Dataverse integration with Archivematica and have encountered an issue with processing RData derivatives generated by Dataverse. We would like to get some feedback on how the community would like these derivatives to be handled so that we can implement a fix to push to the main codebase.

Dataverse context:
Dataverse is repository software for depositing and sharing datasets. For each dataset, Dataverse creates a JSON file (dataset.json) that lists all files in the dataset (user-uploaded and Dataverse-generated) and the metadata assigned to each.

When certain tabular data file types (e.g. CSV, XLSX, Stata, etc.) are ingested, Dataverse creates derivatives in tab-delimited and RData formats.

Dataverse integration:
The Archivematica-Dataverse (AM-DV) integration uses the Dataverse (DV) API to query DV, display a list of published datasets in the Archivematica (AM) dashboard, and fetch the selected dataset for processing in Archivematica.

Archivematica references the JSON file to confirm that all expected dataset files are received and to parse metadata uploaded to or generated by DV into the AIP METS.

The issue:
When Archivematica finds a tabular data file in dataset.json, it automatically adds an entry to the Dataverse METS XML for an RData derivative but does not request to have an RData derivative generated on the fly. The assumption is that the RData derivative already exists.

Processing in AM then sometimes fails during the “Parse Dataverse METS XML” job in the “Parse external files” microservice, with the error:
parsedataverse_v0.0: INFO 2024-04-24 15:21:04,195 archivematica.mcp.client.parse_dataverse_mets.get_db_objects:57 Looking for file type: 'Item' using path: originalFormatStata/originalFormatStata.RData
parsedataverse_v0.0: ERROR 2024-04-24 15:21:04,200 archivematica.mcp.client.parse_dataverse_mets.get_db_objects:119 Could not find file type: 'Item' in the database: originalFormatStata.RData with path: %transferDirectory%objects/originalFormatStata.RData. Checksum: 'None'
parsedataverse_v0.0: ERROR 2024-04-24 15:21:04,201 archivematica.mcp.client.parse_dataverse_mets.parse_dataverse_mets:287 Exiting. Returning the database objects for our Dataverse files has failed.
Exiting. Returning the database objects for our Dataverse files has failed.
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/job.py", line 103, in JobContext
    yield
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_dataverse_mets.py", line 321, in call
    job.set_status(init_parse_dataverse_mets(job))
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_dataverse_mets.py", line 307, in init_parse_dataverse_mets
    return parse_dataverse_mets(job, transfer_dir, transfer_uuid)
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_dataverse_mets.py", line 288, in parse_dataverse_mets
    raise ParseDataverseError(no_map)
clientScripts.parse_dataverse_mets.ParseDataverseError: Exiting. Returning the database objects for our Dataverse files has failed.

We’ve determined that this error appears when the dataset that AM receives is missing the RData derivative for one or more tabular files. DV only creates the RData file when a user manually requests it through the DV interface. Once generated, DV then caches the RData file for future use and can send the file to AM for processing. In these cases, processing proceeds as normal.

Folks at the University of Victoria came to a similar conclusion.

The possible fix:
With the caveat that we haven’t fully explored the technical feasibility of the options below, we’re weighing a few approaches:

Derivatives are required: If the JSON file lists tabular files, DV creates RData derivatives on the fly (if they don’t already exist) and sends them to Archivematica for processing with the rest of the dataset.
Derivatives are optional: If RData derivatives are not available, Archivematica skips the file, excludes it from subsequent jobs, and continues processing. A log of skipped RData derivatives should also be generated.
Derivatives are excluded: Archivematica only receives user-uploaded data files (i.e. tab-delimited and RData derivatives from DV are not sent to AM for processing), with the rationale that they could all be recreated.

In all cases, an exception should be made for files that are originally in RData format: an RData derivative is unnecessary and should not be expected. Generating an RData derivative may also cause an overwrite of the original, which will create a separate problem.
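
As a rough illustration only, the "optional" approach combined with the RData exception could look something like the sketch below. It has not been tested against Archivematica or Dataverse. The function name `plan_rdata_derivatives`, the `original_name` key, and the `<stem>/<stem>.RData` naming rule are all assumptions made for the example (the naming simply mirrors the path shown in the error log above); the real dataset.json layout and client script would differ.

```python
import logging
from pathlib import PurePosixPath

logger = logging.getLogger("rdata_derivatives")


def plan_rdata_derivatives(tabular_files, received_paths):
    """Split expected RData derivatives into (present, skipped) lists.

    tabular_files: list of dicts with a hypothetical "original_name" key.
    received_paths: relative paths of files actually present in the transfer.
    """
    received = set(received_paths)
    present, skipped = [], []
    for entry in tabular_files:
        original = PurePosixPath(entry["original_name"])
        # Files deposited as RData need no derivative, and generating one
        # could overwrite the original, so never expect one.
        if original.suffix.lower() == ".rdata":
            continue
        # Assumed naming rule: <stem>/<stem>.RData
        expected = f"{original.stem}/{original.stem}.RData"
        if expected in received:
            present.append(expected)
        else:
            # Record the gap and carry on instead of failing the job.
            logger.warning("RData derivative not available, skipping: %s", expected)
            skipped.append(expected)
    return present, skipped
```

Under these assumptions, only the entries in `present` would be written to the Dataverse METS and passed to later jobs, while `skipped` would feed the log of missing derivatives.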

What we would like to know is: which approach makes the most sense to you? Is there another approach that we might take instead? In general, how are you using or planning to use the DV integration?

Many thanks in advance!

Julie

Scholars Portal

Corey Davis

Apr 15, 2025, 5:25:15 AM

to archivematica

Our preferred approach at UVic Libraries would be to treat RData derivatives as optional and log their absence when they are not present in the dataset at the time of export from Dataverse.

Rationale:

Archivematica workflows should only process what is explicitly present in the dataset. Automatically generating RData derivatives on the fly introduces preservation risks, especially when there are original .RData files that could be unintentionally overwritten, as you mention.
Logging the absence of a derivative provides clarity and auditability for curators and end users reviewing the preservation history.
This approach accommodates varying depositor practices and doesn’t assume consistent use of the RData generation feature across datasets.

If technically feasible, we’d also recommend incorporating a brief PREMIS event in the METS file to note when a derivative was expected but not available. This would ensure that any skipped processing steps are clearly recorded in the preservation metadata, supporting future audit and validation processes.
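
For discussion purposes, such an event might be built along the lines of the sketch below. This is only an assumption about what the record could contain: the PREMIS 3 namespace is used here for illustration, and the event type "derivative check", the outcome value "skipped", and the function name `missing_derivative_event` are hypothetical, not existing Archivematica vocabulary or code.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS_NS = "http://www.loc.gov/premis/v3"  # assumed PREMIS version
ET.register_namespace("premis", PREMIS_NS)


def _sub(parent, tag, text=None):
    """Append a PREMIS-namespaced child element, optionally with text."""
    el = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
    if text is not None:
        el.text = text
    return el


def missing_derivative_event(expected_path):
    """Build a PREMIS event noting an expected but unavailable derivative."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ident = _sub(event, "eventIdentifier")
    _sub(ident, "eventIdentifierType", "UUID")
    _sub(ident, "eventIdentifierValue", str(uuid.uuid4()))
    _sub(event, "eventType", "derivative check")  # hypothetical event type
    _sub(event, "eventDateTime", datetime.now(timezone.utc).isoformat())
    detail = _sub(event, "eventDetailInformation")
    _sub(detail, "eventDetail", "Checked for Dataverse-generated RData derivative")
    outcome_info = _sub(event, "eventOutcomeInformation")
    _sub(outcome_info, "eventOutcome", "skipped")  # hypothetical outcome value
    outcome_detail = _sub(outcome_info, "eventOutcomeDetail")
    _sub(
        outcome_detail,
        "eventOutcomeDetailNote",
        f"Expected RData derivative not available: {expected_path}",
    )
    return event
```

Calling `ET.tostring(missing_derivative_event("originalFormatStata/originalFormatStata.RData"), encoding="unicode")` would serialize the event for inspection; where it would attach in the AIP METS is left open here.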