Hello!
We have been testing the Dataverse integration with
Archivematica and have encountered an issue with processing RData
derivatives generated by Dataverse. We would like to get some feedback
on how the community would like these derivatives to be handled so that
we can implement a fix to push to the main codebase.
Dataverse context:
Dataverse is repository software for depositing and sharing datasets. For each
dataset, a JSON file (dataset.json) is created that lists all files
(user-uploaded and Dataverse-generated) in the dataset and the metadata
assigned to each.
When certain tabular data file types (e.g. CSV,
XLSX, Stata, etc.) are ingested, Dataverse creates derivatives in
tab-delimited and RData formats.
Dataverse integration:
The Archivematica-Dataverse (AM-DV) integration uses the DV API to query DV, display a list of
published datasets in the AM dashboard, and fetch the selected dataset
for processing in Archivematica.
Archivematica references the
JSON file to confirm that all expected dataset files are received and to
parse metadata uploaded to or generated by DV into the AIP METS.
The issue:
When Archivematica finds a tabular data file in dataset.json, it
automatically adds an entry to the Dataverse METS XML for an RData derivative
but does not request to have an RData derivative generated on the fly.
The assumption is that the RData derivative already exists.
Processing
in AM then sometimes fails during the “Parse Dataverse METS XML” job in
the “Parse external files” microservice, with the error:
parsedataverse_v0.0: INFO 2024-04-24 15:21:04,195 archivematica.mcp.client.parse_dataverse_mets.get_db_objects:57 Looking for file type: 'Item' using path: originalFormatStata/originalFormatStata.RData
parsedataverse_v0.0: ERROR 2024-04-24 15:21:04,200 archivematica.mcp.client.parse_dataverse_mets.get_db_objects:119 Could not find file type: 'Item' in the database: originalFormatStata.RData with path: %transferDirectory%objects/originalFormatStata.RData. Checksum: 'None'
parsedataverse_v0.0: ERROR 2024-04-24 15:21:04,201 archivematica.mcp.client.parse_dataverse_mets.parse_dataverse_mets:287 Exiting. Returning the database objects for our Dataverse files has failed.
Exiting. Returning the database objects for our Dataverse files has failed.
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/job.py", line 103, in JobContext
    yield
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_dataverse_mets.py", line 321, in call
    job.set_status(init_parse_dataverse_mets(job))
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_dataverse_mets.py", line 307, in init_parse_dataverse_mets
    return parse_dataverse_mets(job, transfer_dir, transfer_uuid)
  File "/usr/lib/archivematica/MCPClient/clientScripts/parse_dataverse_mets.py", line 288, in parse_dataverse_mets
    raise ParseDataverseError(no_map)
clientScripts.parse_dataverse_mets.ParseDataverseError: Exiting. Returning the database objects for our Dataverse files has failed.
We’ve determined that this error appears when the dataset that AM receives is
missing the RData derivative for one or more tabular files.
DV only creates the RData file when a user manually requests it through the DV interface.
Once the file has been generated, DV caches it for future use and can
send it to AM for processing. In these cases, processing proceeds
as normal.
Folks at the University of Victoria came to a
similar conclusion.
The possible fix:
With the caveat that we haven’t fully explored the technical feasibility of the options below, we're weighing a few approaches:
- Derivatives are required:
If the JSON file lists tabular files, DV creates RData derivatives on
the fly (if they don’t already exist) and sends them to Archivematica
for processing with the rest of the dataset.
- Derivatives are optional:
If RData derivatives are not available, Archivematica skips the missing
derivatives, excludes them from subsequent jobs, and continues processing. A log of
skipped RData derivatives should also be generated.
- Derivatives are excluded:
Archivematica only receives user-uploaded data files (i.e.
tab-delimited and RData derivatives from DV are not sent to AM for
processing), with the rationale that they could all be recreated.
In
all cases, an exception should be made for files that are originally in
RData format: an RData derivative is unnecessary and should not be
expected. Generating an RData derivative may also cause an overwrite of
the original, which will create a separate problem.
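To make the "derivatives are optional" approach and the RData exception a bit more concrete, here is a minimal Python sketch. It is not taken from the Archivematica codebase: the function name and its two inputs (a list of tabular file names from dataset.json and a list of paths actually received in the transfer) are hypothetical, and the expected derivative layout (<stem>/<stem>.RData) is only inferred from the path shown in the log output above.

```python
import logging
from pathlib import PurePosixPath

logger = logging.getLogger("dataverse_derivatives_sketch")


def split_expected_derivatives(tabular_files, received_paths):
    """Sort expected RData derivatives into (found, skipped).

    tabular_files: names of tabular files listed in dataset.json (assumed input).
    received_paths: relative paths of files actually present in the transfer
    (assumed input).
    """
    received = set(received_paths)
    found, skipped = [], []
    for name in tabular_files:
        path = PurePosixPath(name)
        # Exception: a file that is already in RData format needs no derivative.
        if path.suffix.lower() == ".rdata":
            continue
        # Layout inferred from the log message, not from the real code.
        derivative = f"{path.stem}/{path.stem}.RData"
        if derivative in received:
            found.append(derivative)
        else:
            # Log the skip instead of raising, so processing can continue.
            logger.warning("Skipping missing RData derivative: %s", derivative)
            skipped.append(derivative)
    return found, skipped
```

With something like this, only the derivatives in the first list would get entries in the Dataverse METS XML, and the second list would feed the log of skipped derivatives.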
What we
would like to know is: which approach makes the most sense
to you? Is there another approach that we might take instead? In
general, how are you using or planning to use the DV integration?
Many thanks in advance!
Julie
Scholars Portal