Hi everyone,
I'm working with a large number of files that had no checksum generated when they were created or first transferred to a server. They encompass a wide range of formats: TIFF, JP2, PDF, etc. I would like to identify corrupt files (ones that can no longer be opened) on the server.
This is the kind of strategy I am thinking about now:
1. Run JHOVE on the files to identify seriously damaged files
2. Look for or create scripts/tools like Jpylyzer to validate specific file types
Does this seem like a good course of action? Are there any other software tools anybody would recommend?
**IMPORTANT SAFETY NOTICE
Before using standard tools to read your possibly corrupt files, for great safety, always be sure to set the ulimits for memory, CPU, etc. to reasonable values.**
One of the dangers of parsing a corrupted file is that it is very easy for the parsing code to get stuck in an infinite loop, or for inappropriate data to find its way into a length field. Sometimes it is possible to sanity check size fields, but sometimes there is no immediate way to tell if a size is nonsensical or whether this is BF Data.
Setting conservative ulimits for the test processes may generate a few false positives, but these are easy enough to recheck manually.
Using virtualized servers can also contain the damage, but you can still cause serious pain to other guests on the same host if they share resources (e.g. if processors are not dedicated).
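For what it's worth, the ulimit advice above could be scripted roughly like this. This is only a sketch, assuming a POSIX system and Python 3; `run_limited` and its default limits are made-up names and numbers for illustration, and the command you pass in would be whatever validator you settle on.

```python
import resource
import subprocess
import sys


def _cap(kind, value):
    """Set soft and hard limit to value, never above the current hard limit."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(cmd, cpu_seconds=60, memory_bytes=2 << 30, wall_seconds=300):
    """Run cmd under CPU, address-space and wall-clock limits.

    Returns "ok", "failed", "killed" or "timeout". The last two are
    the possible false positives worth rechecking by hand.
    """
    def apply_limits():
        # Runs in the child process just before exec (POSIX only).
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    try:
        proc = subprocess.run(
            cmd,
            preexec_fn=apply_limits,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=wall_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # Negative return code: the child was terminated by a signal,
        # e.g. for exceeding its CPU limit.
        return "killed"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a parser stuck in an infinite loop.
    print(run_limited([sys.executable, "-c", "while True: pass"], cpu_seconds=1))
```

Java-based tools tend to reserve a lot of virtual memory at startup, so the address-space limit probably needs to be more generous for them than for a small native tool.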
Simon // Do Not Taunt Happy Tar Ball.
Thanks everyone for your advice and resources. I'll keep everyone posted with how we end up implementing this.
Thanks,
I'm rather fond of using FITS (File Information Tool Set) for testing files, then having a script check the FITS files for the values of <valid> and <well-formed> in the <filestatus> section. If either or both are not "true", I consider the file corrupt. The value of FITS is that it conglomerates multiple existing tools with which to test the files; some are better suited to some kinds of files than others. You *did* say you had a range of file types.
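A checking script along those lines might look something like the sketch below. It assumes each FITS output is an XML document with a <filestatus> element containing <well-formed> and <valid> children, as described above; the function names are my own, and treating a missing value as "unknown" rather than corrupt is an assumption, not part of the rule stated above. Tags are matched by local name so the sketch does not depend on the exact XML namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def fits_status(xml_text):
    """Collect the text of <well-formed> and <valid> under <filestatus>."""
    values = {"well-formed": [], "valid": []}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "filestatus":
            continue
        for child in elem:
            name = local_name(child.tag)
            if name in values:
                values[name].append((child.text or "").strip().lower())
    return values


def classify(xml_text):
    """Return "corrupt", "ok" or "unknown" for one FITS output document."""
    status = fits_status(xml_text)
    # Any reported value other than "true" marks the file as corrupt.
    if any(v != "true" for vals in status.values() for v in vals):
        return "corrupt"
    # Both elements present and all "true".
    if all(status.values()):
        return "ok"
    # Assumption: nothing reported for one or both elements.
    return "unknown"
```

Files that come back "unknown" are probably formats none of the bundled tools could validate, so they would need a format-specific checker instead.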
---
Jody L. DeRidder Head, Digital Services University of Alabama Libraries Tuscaloosa, AL 35487 Phone: 205.348.0511 "Hope lies in dreams, in imagination, and in the courage of those who dare to make dreams into reality." --Jonas Salk