Should we change Archivematica's default format normalization rules?


Evelyn McLellan

Oct 15, 2019, 2:56:44 PM
to archivematica

As Archivematica users know, a standard installation of Archivematica includes a Format Policy Registry (FPR) that contains rules, commands and tools for a wide variety of preservation actions that are performed automatically during ingest. One type of rule is normalization: there are hundreds of rules for normalizing (converting file formats to a select set of preservation formats) during ingest. If the user chooses to normalize during ingest, these rules are invoked automatically on any ingested file for which there is a normalization rule. 


There are valid reasons to normalize extensively upon ingest. First, it means narrowing your holdings down to a smaller number of formats for long-term preservation, formats that are today considered to be sustainable and “preservation-friendly”. This means keeping an eye on, say, a dozen formats rather than several dozen or even hundreds of formats, depending on the diversity of your content producers. Second, it allows you to spot and address issues with formats during ingest, rather than discovering them years down the road when they may be harder to address. For example, that image file may not normalize properly because it has a colourspace issue; better to fix that issue now, with current tools and knowledge, than discover and attempt to fix it sometime in the future. Third, it means a certain amount of work up front, permitting a higher level of confidence that a lot of the heavy lifting on digital preservation has been done by the time the content is placed into long-term storage - that AIP is DONE and it won’t have to be touched for a long time. 


The downside of extensive use of normalization is the size of your AIPs, particularly when it comes to video files. Nearly all ingested born-digital video files are compressed, and when Archivematica runs the default normalization rule - convert to ffv1/lpcm in an mkv wrapper - a small video file can produce a very large master derivative. If you’re interested in finding out more about why this happens, see Ashley Blewer’s blog post at https://bits.ashleyblewer.com/blog/2019/09/19/ffv1-bigger-than-before/. The same can be true for raster images - a JPEG file can be highly compressed, and an uncompressed TIFF preservation copy can be much larger than the JPEG file. On a small scale this might not make much of a difference, but JPEGs are ubiquitous, and a few thousand JPEGs across a few SIPs can have a noticeable impact on processing time and storage.
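
To make the raster case concrete, here is a rough sketch of the arithmetic in Python. It assumes an 8-bit RGB image stored with no compression and ignores TIFF header and metadata overhead. The function name, the image dimensions and the JPEG size are illustrative assumptions, not figures from Archivematica.

```python
def uncompressed_raster_bytes(width, height, channels=3, bits_per_channel=8):
    """Approximate pixel-data size of an uncompressed raster image, in bytes."""
    return width * height * channels * bits_per_channel // 8


# Hypothetical example: a 4000 x 3000 pixel photo that was saved as a 3 MB JPEG.
jpeg_bytes = 3 * 1000 * 1000
tiff_bytes = uncompressed_raster_bytes(4000, 3000)
growth = tiff_bytes / jpeg_bytes

print(f"Uncompressed pixel data: {tiff_bytes / 1e6:.0f} MB")
print(f"Growth factor over the JPEG: {growth:.0f}x")
```

The uncompressed size depends only on pixel dimensions and bit depth, so the more aggressively the original JPEG was compressed, the larger the growth factor.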


Ubiquity is the key here, and this brings us to the main point of this post. Should we change the default settings in Archivematica to skip normalization for highly ubiquitous files like JPEGs and h264-encoded mp4 files? Keep in mind that the settings could always be changed: the normalization rules would still be there but they would just be disabled for certain formats. However, we are aware that not all users edit FPR rules, and that the defaults Archivematica ships with are often considered de facto recommendations by Artefactual Systems.


We would love to hear from digital curators and preservationists out there. What is your opinion on normalizing everything that can be normalized? Do you edit the default FPR rules, and if so, why? Would such a change in Archivematica’s default rules have a negative impact on you, or, in your opinion, on the wider community of users? Do you have opinions about specific formats? An open discussion on this discussion list would be great, but if you’re feeling shy, please email me at evelyn[at]artefactual[dot]com.


Regards,


Evelyn McLellan

Systems Archivist & Metadata Specialist

Artefactual Systems



Timothy Walsh

Oct 17, 2019, 9:57:30 AM
to archivematica
Hi Evelyn,

Thanks for starting this conversation! For some reason Google Groups keeps marking my post as spam and deleting it, so I've published some of my thoughts on my blog here:


Looking forward to hearing from others!

Tim

Sarah Romkey

Oct 17, 2019, 10:00:15 AM
to archiv...@googlegroups.com
Thanks for doing that Tim! I hit publish on your post twice and it still kept being deleted ¯\_(ツ)_/¯

Sarah Romkey, MAS, MLIS
Archivematica Program Manager
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory




--
You received this message because you are subscribed to the Google Groups "archivematica" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archivematic...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/archivematica/35b4d602-674b-4ceb-ad34-3cd95094b09a%40googlegroups.com.

Timothy Walsh

Oct 17, 2019, 11:17:02 AM
to archivematica
No problem, Sarah, and thanks for your help!

I should add re: format normalization, there is also one category of documents that we are not currently normalizing but very much wish we were: legacy (pre-Office Open XML) Microsoft Office documents. Given the ubiquity of these formats and the problems we already face with them, legacy Word, WordPad, Excel, and PowerPoint files are pretty clearly at the forefront of the "Watch" category of preservation planning for us. I have some experience in previous positions using LibreOffice to normalize these formats to their Office Open XML and/or OpenOffice/LibreOffice equivalents, with mixed results. For now we are keeping the formats as-is, but have identified this as a priority target for research when possible.

Tim


Evelyn McLellan

Oct 17, 2019, 3:30:50 PM
to archivematica
Hi Tim,

Thanks very much for your replies and blog post. Your comment about pre-OOXML docs is interesting. Back in the early days of Archivematica we tried to incorporate normalization to ODF (Open Document Format), but the results were so unreliable that we gave up. Same with converting them to PDF/A (not ideal when you're talking about presentations and spreadsheets anyway, though). I think it's still the case that the best way to convert a proprietary office document format to another format is to use a plugin with the original software. That's something that could be done in certain cases prior to ingest into Archivematica, but of course that's only a partial solution.

I liked this suggestion in your blog post: "I wonder if it might be useful to have a few easily selectable FPR “profiles” available to users. I can imagine a future where users are asked on setup to select how conservative they want to be with formats generally (say “Normalize everything possible” vs “Do not normalize ubiquitous formats”), and based on that selection the individual FPR rules for common formats like jpeg, png, H264 mp4 are enabled/disabled accordingly in one action." We've talked about that kind of feature here at Artefactual - it would be great to get funding for something like that.
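
Purely as an illustration of the "profiles" idea quoted above, a profile could be little more than a named set of source formats whose normalization rules get disabled in one action. The Python sketch below is hypothetical: the `NormalizationRule` class, the profile names, `apply_profile` and the example rules are all assumptions made for this example and do not reflect the actual FPR data model or any planned feature.

```python
from dataclasses import dataclass


@dataclass
class NormalizationRule:
    """Hypothetical stand-in for an FPR normalization rule."""
    source_format: str
    target_format: str
    enabled: bool = True


# Hypothetical profiles: each one names the source formats whose rules it disables.
PROFILES = {
    "normalize_everything": set(),
    "skip_ubiquitous": {"JPEG", "PNG", "H.264 MP4"},
}


def apply_profile(rules, profile_name):
    """Return copies of the rules with `enabled` set according to the profile."""
    skipped = PROFILES[profile_name]
    return [
        NormalizationRule(r.source_format, r.target_format, r.source_format not in skipped)
        for r in rules
    ]


# Illustrative rules only; the real FPR contains hundreds.
rules = [
    NormalizationRule("JPEG", "TIFF"),
    NormalizationRule("H.264 MP4", "FFV1/LPCM MKV"),
    NormalizationRule("BMP", "TIFF"),
]

for rule in apply_profile(rules, "skip_ubiquitous"):
    state = "enabled" if rule.enabled else "disabled"
    print(f"{rule.source_format} -> {rule.target_format}: {state}")
```

Because the profile only flips the `enabled` flag, the rules themselves stay in place and can be switched back on individually.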

I don't know if you've seen Jenny Mitcham's blog post on the FPR. There are a number of interesting comments there on this topic, including the following: 

"We had a discussion about the benefits (or not) of normalising a compressed file (such as a JPEG) to an uncompressed format (such as TIFF). I had already mentioned in my presentation earlier that this default migration rule was turning 5GB of JPEG images into 80GB of TIFFs - and this is without improving the quality or the amount of information contained within the image. The same situation would apply to compressed audio and video which would increase even more in size when converted to an uncompressed format.

"If storage space is at a premium (or if you are running this as a service and charging for storage space used) this could be seen as a big problem. We discussed the reasons for and against leaving this rule in the FPR. It is true that we may have more confidence in the longevity of TIFFs and see them as more robust in the face of corruption, but if we are doing digital preservation properly (checking checksums, keeping multiple copies etc) shouldn't corruption be easily spotted and fixed?"
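
The "checking checksums, keeping multiple copies" approach mentioned in that quote can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of how Archivematica or its Storage Service performs fixity checks; the function names and file names are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_intact_copy(copies, expected_digest):
    """Return the first copy whose checksum matches the recorded one, else None."""
    for path in copies:
        if sha256_of(path) == expected_digest:
            return path
    return None


# Demonstration with two temporary copies, one of which is deliberately corrupted.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "copy_a.jpg"
    replica = Path(tmp) / "copy_b.jpg"
    original.write_bytes(b"example image bytes")
    replica.write_bytes(b"example image bytes")
    recorded = sha256_of(original)  # checksum recorded at ingest time

    original.write_bytes(b"example image bytez")  # simulate corruption of one copy
    corrupted = sha256_of(original) != recorded
    intact = find_intact_copy([original, replica], recorded)
    print("corruption detected:", corrupted)
    print("intact copy:", intact.name if intact else "none")
```

A mismatch against the recorded checksum flags the damaged copy, and any copy that still matches can be used to replace it, regardless of whether the file is a compressed JPEG or an uncompressed TIFF.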

Anyway, thanks again Tim. Looking forward to hearing from others as well.

-Evelyn

John Richan

Oct 18, 2019, 10:54:35 AM
to archivematica

Hi Evelyn,


Thanks for posing this question to the group. As our digital preservation program is relatively new and we have been Archivematica users for just over a year now, these are questions we are starting to ask ourselves often, if not daily. For a little context, like Tim I am also at Concordia University, but in the Records Management and Archives department, which sits outside the library system.


My comment here is more general and does not get into specific formats but could still be useful for overall consideration in terms of how we are working. Generally speaking our normalization workflows have started to diverge between digital content arriving via private donations versus content arriving through institutional transfers.


We have decided to follow the default Archivematica FPR normalization rules more rigorously for digital content arriving from private sources than for content arriving from institutional units. The dynamics at play when an individual or group decides to donate digital records to our department are quite different. There is often an emotional, conscious decision involved with private donations, whereas institutional transfers are an obligation of staff at the University. Whether or not it is true in all cases, the bar feels higher to implement digital preservation best practice when content is arriving from a private source; expectations, detailed negotiations and trust all come into the larger picture. All of this to say that we are leaning more heavily on the default Archivematica preservation rules when it comes to private donations.


On the Institutional side of things we have been working with the default preservation normalization rules on a case-by-case basis. In a recent example, we received an accession of convocation videos in .mov and .mp4 video containers. Needless to say these videos are very large in size. We have made the decision not to do any normalization up front and only create access copies in an “on demand” scenario. As a secondary example, we also recently received a transfer of approximately 1,000 files of administrative records from a faculty at the school. There was very little media included in this accession so in this case we did attempt preservation normalization on this accession.

I think (and I hope 🙂) as we receive more and more content clearer patterns will emerge and we will be able to react more consistently. We have also been including a document in the submission documentation sub-directory of the AIP which we hope rationalizes our decision making a little.


The other opportunity that presents itself on the institutional records side is our ability to work with and have some influence on records creators within units to create and send us files in a more standardized way that would fall in line with digital preservation best practice. With private donors this would likely never be an option.

Again we are still relatively new to working with digital content and with Archivematica but these are some early observations from our side of things.


John Richan

Digital Archivist – Concordia University

Evelyn McLellan

Dec 4, 2019, 1:36:23 PM
to archivematica
Thanks to everyone who responded to this thread. I've provided an update to the issue on GitHub, excerpting discussions on this and other lists. I've also added the meeting notes from a recent Archivematica users' group meeting in which the topic was discussed in considerable detail. If you're interested, go to https://github.com/archivematica/Issues/issues/912 and scroll down to my comment from November 28.

It's hard to summarize all the feedback, except to say that community consensus on when to normalize is quite elusive. In general there was a tilt away from wholesale normalization, and I don't think there would be much pushback from the community if we were to make Archivematica's normalization rules a bit more opt-in than opt-out. We are in the process of deciding whether and how to change the rules in upcoming releases - updates will be posted on the GitHub ticket!

Cheers,

Evelyn McLellan
Systems Archivist & Metadata Specialist
Artefactual Systems
