Failure to serialise large METS files


frank....@gmail.com

Jan 4, 2021, 7:01:02 AM
to archivematica
Hi,

Running Archivematica 1.11.0 on a single server with 4 Xeon CPUs and 24 GB RAM.

We are trying to archive a large dataset: 229 GB, with 998 directories and 211305 files. It fails during "Ingest", at the microservice "Generate AIP METS", with the error:

"SerialisationError: unknown error -1953173928". 

After a little bit of googling, I found this discussion on the Archivematica GitHub, and it seems to describe our problem:


I checked the failed-directory, and found the METS file in the package, and it seems to be generated correctly. The size is 2.2 GB.
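
One possible reading of the error number, offered as an assumption rather than a confirmed diagnosis: if a byte count of roughly 2.2 GB were stored in a signed 32-bit integer, it would wrap around to a negative value. The arithmetic below reverses that wraparound for the number in the error message. The result is close to the size of the METS file found in the failed directory, although the match could be a coincidence.

```python
# Assumption: the number in the error message is a byte count that
# overflowed a signed 32-bit integer. This is a guess, not a confirmed cause.
reported = -1953173928            # number from the SerialisationError message
as_unsigned = reported % 2**32    # undo the signed 32-bit wraparound

print(as_unsigned)                        # 2341793368 bytes
print(round(as_unsigned / 2**30, 2))      # 2.18 (GiB)
print(as_unsigned > 2**31 - 1)            # True: too large for a signed 32-bit int
```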

Is there a workaround for this problem, or do we have to start splitting the material into smaller chunks? We are about to receive several more sets that are much larger, so this is a problem we need to solve in one way or another.

Sincerely,
Frank Skagemo

Ross Spencer

Jan 4, 2021, 12:00:35 PM
to archivematica
Hi Frank,

Depending on what the data in the dataset looks like and what its preservation requirements are (e.g. are we talking about a large group of plain-text-like files, or images, video, and other complex types?), you might find some success turning off tools like FITS in the Format Policy Registry. We know that for each tool output that is switched off, there will be a reduction in the number of lines in the resulting METS.

Given that you have a 2.2 GB METS to look at, you can inspect it to understand where the preservation output of other tools is of limited value for the data that you are working with. You probably don't want to do a huge amount of trial and error, so you might want to cut a lot out early and then see if that at least creates an AIP.
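
A file of that size is awkward to open in an editor, so one way to inspect it is to stream through it and count elements by tag. The following is a minimal sketch using the Python standard library; the function name `count_elements` and the `top` parameter are made up for illustration and are not part of Archivematica.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(mets_path, top=15):
    """Stream through an XML file and return the most common element tags."""
    counts = Counter()
    # iterparse yields each element once its closing tag has been read,
    # so the file is processed incrementally rather than loaded up front.
    for _event, elem in ET.iterparse(mets_path, events=("end",)):
        counts[elem.tag] += 1
        # Free the element's text and children once counted. The emptied
        # element stays attached to its parent, so memory still grows,
        # but far more slowly than with a fully built tree.
        elem.clear()
    return counts.most_common(top)
```

Calling `count_elements` with the path to the METS file from the failed directory returns a list of `(tag, count)` pairs. Namespaced tags are reported in the form `{namespace-uri}localname`, so the counts show directly which METS or PREMIS elements dominate the file.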

Outputting that much data is something we have been looking at. We don't think there is a quick fix in code that will simply enable that much to be output, not without some refactoring of the XML representation in memory in the module doing the heavy lifting. We have investigated different changes to the METS to reduce redundancy, e.g. removing unused optional PREMIS containers. That work has yet to become a current project and still requires sponsorship.

It will be interesting to hear from others what they're doing in a similar situation and what they'd like to see to improve this. In the meantime, if reducing some of that tool output works for you, or playing with some of the other settings in the processing configuration (e.g. not documenting empty directories), then it would be useful to hear how that goes on this forum, or on the ticket you linked to.

Best,
Ross

frank....@gmail.com

Jan 19, 2021, 4:12:34 AM
to archivematica
Hi, and thanks for the reply Ross,

FITS is already turned off in our setup. We haven't had the time to dig further into this problem yet, but I will try to split up the input later. Not an ideal solution, but if it's necessary, then so be it :)

Regards,
Frank
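
For planning such a split, a rough sketch follows. It assumes that the top-level subdirectories of the dataset can be ingested as separate transfers and that the number of files is a reasonable proxy for the size of the resulting METS; neither assumption is confirmed in this thread. The function names and the `max_files` threshold are hypothetical, and files sitting directly in the dataset root are not counted.

```python
import os


def files_per_subdir(root):
    """Return {top-level subdirectory name: number of files beneath it}."""
    totals = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            totals[entry.name] = sum(
                len(files) for _dir, _subdirs, files in os.walk(entry.path)
            )
    return totals


def plan_chunks(totals, max_files):
    """Greedily group subdirectories so each chunk holds at most max_files.

    A single subdirectory with more than max_files files still gets a
    chunk of its own and would need to be split further by hand.
    """
    chunks, current, current_count = [], [], 0
    # Largest first, so oversized directories surface at the start of the plan.
    for name, count in sorted(totals.items(), key=lambda kv: -kv[1]):
        if current and current_count + count > max_files:
            chunks.append(current)
            current, current_count = [], 0
        current.append(name)
        current_count += count
    if current:
        chunks.append(current)
    return chunks
```

A suitable `max_files` value would have to be found by experiment, for example by starting from a transfer size that is already known to produce an AIP successfully.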
