Identifying Corrupt Files


Little, James Clarence IV

Mar 20, 2015, 2:37:24 PM
to digital-...@googlegroups.com

Hi everyone,


I'm working with a large number of files that had no checksums generated when they were created or first transferred to a server. They encompass a wide range of formats: TIFF, JP2, PDF, etc. I would like to identify corrupt (can't be opened anymore) files on the server.


This is the kind of strategy I am thinking about now:


1. Run JHOVE on the files to identify seriously damaged files (a rough sketch of this step follows below)

2. Look for or create scripts/tools like Jpylyzer to validate specific file types
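
A rough sketch of step 1 in shell; this assumes JHOVE's plain-text report includes a "Status:" line of the form shown, which is worth verifying against your JHOVE version:

#!/bin/bash
# Rough sketch: print a file's path unless JHOVE reports it as
# "Well-Formed and valid". The Status line format is an assumption;
# check it against your JHOVE version's plain-text output.
jhove "$1" 2>/dev/null | grep -q 'Status: Well-Formed and valid' || echo "$1"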


Does this seem like a good course of action? Are there any other software tools anybody would recommend? 


Thanks,

Jamie Little
University of Miami Libraries

Michael Kjörling

Mar 21, 2015, 10:46:12 AM
to digital-...@googlegroups.com
On 20 Mar 2015 17:29 +0000, from j.li...@miami.edu (Little, James Clarence IV):
> I'm working with a large number of files that had no checksums
> generated when they were created or first transferred to a server.
> They encompass a wide range of formats: TIFF, JP2, PDF, etc. I would
> like to identify corrupt (can't be opened anymore) files on the
> server.
/snip/
> Does this seem like a good course of action? Are there any other
> software tools anybody would recommend?

Given the file name extensions you list, it seems like ImageMagick
might work well. You can use its "convert" tool to try to convert each
file in turn to some other format, and act on a failure result.
ImageMagick handles just about everything that can be considered in
any way graphical, including PDF. On a *nix host (which should include
OS X as well, although I am most familiar with Linux and GNU), this could
be accomplished with a short shell script like:

#!/bin/bash
# Print the file's path if "convert" cannot read it; the tool's own
# (often verbose) error output is discarded.
convert "$1" "/tmp/dummy.bmp" &>/dev/null || echo "$1"

and running that through find with something similar to:

$ find /archive -type f -iname '*.jp2' -exec checker-script.sh '{}' ';'

The "&>/dev/null" part in the script redirects both standard output
and standard error to /dev/null, so that the (often fairly verbose)
error messages from "convert" won't be printed for files that can't be
read. '|| echo "$1"' causes the first parameter to the script to be
printed if "convert" returns a failure status; that is the full path
to the file.

I would recommend putting the output location on a RAM-backed file
system to avoid stressing the storage more than necessary, since you
won't care about the output itself but rather only whether it was
possible to read the input file and make any sense at all of the data.
The choice of a .bmp output is arbitrary. The output of the above
would be a list of files which warrant further scrutiny.
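
On Linux, for instance, a minimal variant of the checker script could write its throwaway output to /dev/shm, which is a RAM-backed tmpfs mount on most distributions (the exact path is an assumption; check your system):

#!/bin/bash
# Variant of checker-script.sh that writes the throwaway output to a
# RAM-backed tmpfs mount (/dev/shm on most Linux systems), so the test
# conversions add no write load on the archive storage.
OUT="/dev/shm/corruption-check-$$.bmp"
convert "$1" "$OUT" &>/dev/null || echo "$1"
rm -f "$OUT"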

This approach would, of course, only catch those files which are so
badly damaged that they can't be read at all. Things like minor damage
within image files that only affects the display of the image but
doesn't cause the file itself to be corrupted beyond readability won't
be caught by an automated process like this (and while I'd happily be
shown to be wrong, I doubt any automated process can do that reliably
in the absence of any fixity data).

Once you have weeded out the files that are so badly damaged that they
can't even be opened, you should obviously generate checksums for the
remaining files in order to catch any further degradation, and put
regular fixity checks in place.
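
As a minimal sketch of that last step, assuming GNU coreutils (the manifest location outside /archive is a deliberate but otherwise arbitrary choice, so that the manifest does not list itself):

# Generate a checksum manifest for everything under /archive.
find /archive -type f -exec sha256sum '{}' '+' > /var/local/archive-manifest.sha256

# Periodic fixity check: verify the files against the manifest,
# reporting only mismatches and unreadable files.
sha256sum --check --quiet /var/local/archive-manifest.sha256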

Hope this helps.

--
Michael Kjörling • https://michael.kjorling.se • mic...@kjorling.se
OpenPGP B501AC6429EF4514 https://michael.kjorling.se/public-keys/pgp
“People who think they know everything really annoy
those of us who know we don’t.” (Bjarne Stroustrup)

Paul Wheatley

Mar 23, 2015, 10:32:34 AM
to digital-...@googlegroups.com
Hi Jamie,

Further to Michael's excellent advice, if you want to add a validation approach, there are some tools that may be of use:

Kost-Val covers most of the formats you listed, and jpylyzer will cover the JP2s; a rough sketch of driving it from the shell follows below. Obviously, validation doesn't give all the answers: you have to interpret the results and separate indications of genuine corruption from deviations from the spec that viewer software will tolerate, which of course isn't always easy. This approach may, however, allow you to narrow your target set down to a number of files that you can reasonably investigate manually.
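
For instance, something along these lines; it assumes jpylyzer's XML report contains an isValid element whose text is "True" for valid files, which is worth verifying against your installed version:

#!/bin/bash
# Rough sketch: flag a JP2 file unless jpylyzer's XML report marks it
# valid. The '>True</isValid>' pattern is an assumption about the
# report format; check it against your jpylyzer version first.
jpylyzer "$1" 2>/dev/null | grep -q '>True</isValid>' || echo "$1"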

Cheers

Paul

Simon Spero

Mar 23, 2015, 12:02:50 PM
to digital-...@googlegroups.com

**IMPORTANT SAFETY NOTICE

Before using standard tools to read your possibly corrupt files, always be sure to set ulimits for memory, CPU time, etc. to reasonable values.**

One of the dangers of parsing a corrupted file is that it is very easy for the parsing code to get stuck in an infinite loop, or for inappropriate data to find its way into a length field. Sometimes it is possible to sanity check size fields, but sometimes there is no immediate way to tell if a size is nonsensical or whether this is BF Data.

Setting conservative ulimits for the test processes may generate a few false positives, but these are easy enough to recheck manually.
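
A minimal sketch of that idea, wrapped around the checker script from earlier in the thread (the specific limits are illustrative assumptions, not recommendations):

#!/bin/bash
# Illustrative only: run the test conversion under conservative
# resource limits, so a malformed file cannot hang the process forever
# or exhaust the machine's memory.
ulimit -t 60        # cap CPU time at 60 seconds
ulimit -v 1048576   # cap virtual memory at 1 GiB (value is in KiB)
convert "$1" "/tmp/dummy.bmp" &>/dev/null || echo "$1"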

Using virtualized servers can also contain the damage, but you can still cause serious pain to other guests on the same host if they share resources (e.g. if processors are not dedicated).

Simon // Do Not Taunt Happy Tar Ball.

Little, James Clarence IV

Mar 23, 2015, 1:28:50 PM
to digital-...@googlegroups.com

Thanks everyone for your advice and resources. I'll keep everyone posted on how we end up implementing this.

Thanks,


Jamie Little
University of Miami Libraries 


Jody L. DeRidder

Apr 28, 2015, 4:08:11 PM
to digital-...@googlegroups.com

I'm rather fond of using FITS (File Information Tool Set) for testing files, then having a script check the FITS output for the values of <valid> and <well-formed> in the <filestatus> section. If either or both are not "true", I consider the file corrupt. The value of FITS is that it aggregates multiple existing tools with which to test the files; some are better suited to certain kinds of files than others. You *did* say you had a range of file types.
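
A rough sketch of that check in shell, assuming FITS is installed with its fits.sh launcher and that the XML report carries <well-formed> and <valid> elements under <filestatus> as described above (verify both assumptions against your FITS version):

#!/bin/bash
# Rough sketch: print the file's path unless both <well-formed> and
# <valid> report "true" in the FITS <filestatus> section. The element
# names and the fits.sh invocation are assumptions; check them against
# your FITS installation before relying on this.
REPORT=$(mktemp)
fits.sh -i "$1" -o "$REPORT" >/dev/null 2>&1
if ! grep -q '<well-formed[^>]*>true<' "$REPORT" || \
   ! grep -q '<valid[^>]*>true<' "$REPORT"
then
    echo "$1"
fi
rm -f "$REPORT"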

---
Jody L. DeRidder
Head, Digital Services
University of Alabama Libraries
Tuscaloosa, AL 35487
Phone: 205.348.0511
"Hope lies in dreams, in imagination, and in the courage of those who dare to make dreams into reality." --Jonas Salk