Re: [fedora-tech] Best way to check file integrity of all objects / datastreams files


Jared Whiklo

Apr 10, 2018, 12:05:40 PM
to fedor...@googlegroups.com, isla...@googlegroups.com
Hey Brian,

What you'll need to do is turn on checksums for any object and
datastream that is missing them; then you can use the
compareDatastreamChecksum [1] API call.

Because I know you are an Islandora site, I would suggest two modules to
solve this issue: islandora checksum [2] and islandora checksum checker [3].

Lastly, as the fedora community in general has moved on to Fedora 4,
you'll want to specify that you are still using Fedora 3 when you ask a
question on this listserv, just so they have the correct context.

cheers,
jared

[1] https://wiki.duraspace.org/display/FEDORA38/API-M#API-M-compareDatastreamChecksumcompareDatastreamChecksum
[2] https://github.com/Islandora/islandora_checksum
[3] https://github.com/Islandora/islandora_checksum_checker

On 2018-04-10 10:50 AM, bgilling...@gmail.com wrote:
> I am possibly looking at a recovery situation with some of our objects,
> but I need to scan the remainder of our fedora files to determine
> whether or not any others need to be restored or purged/reingested.
>
> I have to admit that the fedora.fcfg did not have the "autoChecksum"
> value set to TRUE, so many of our objects do not have md5 checksums
> stored in the foxml files.  The exception is that an Islandora checksum
> process did generate checksums for some objects, but it appears that
> this process did not complete and newly ingested objects do not seem to
> have checksums.
>
> I am hoping that there is a script that performs this check already.  I
> am lost as to how to check the integrity of the actual object foxml
> files.  My strategy for checking datastream files' integrity was going
> to be to just check their values such as:
>   datastream file size compared to the
> /foxml:datastream/foxml:datastreamVersion[SIZE]
>   datastream file timestamp compared to their
> /foxml:datastream/foxml:datastreamVersion[CREATED]
>   datastream file mimetype compared to their
> /foxml:datastream/foxml:datastreamVersion[MIMETYPE] (but this could
> potentially fail for files such as "application/xml" vs "text/xml")
>
> Would the fedora-rebuild.sh script to rebuild the database or RI perform
> this check, with any bad objects / datastreams listed in the ERROR
> output from running fedora-rebuild?
>
> Any help is greatly appreciated,
>
> Brian Gillingham
>
> University of Pittsburgh | University Library System
>
> --
> You received this message because you are subscribed to the Google
> Groups "Fedora Tech" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to fedora-tech...@googlegroups.com
> <mailto:fedora-tech...@googlegroups.com>.
> To post to this group, send email to fedor...@googlegroups.com
> <mailto:fedor...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/fedora-tech.
> For more options, visit https://groups.google.com/d/optout.
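
A minimal sketch of the size and checksum comparison described in the quoted
message, in case it helps. It assumes the usual FOXML layout, where each
foxml:datastreamVersion carries SIZE, MIMETYPE and CREATED attributes and an
optional foxml:contentDigest child. The function names and the MIME alias
table are made up for illustration, and mapping a datastream version to its
file on disk is left to the caller. SIZE may be absent or 0 for some
datastreams, so a size mismatch is a hint rather than proof of damage.

```python
import hashlib
import os
import xml.etree.ElementTree as ET

FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# MIME types this sketch treats as equivalent (an assumption, extend as needed).
MIME_ALIASES = {"text/xml": "application/xml"}


def normalize_mime(mime):
    """Map equivalent MIME types to one canonical value before comparing."""
    mime = (mime or "").strip().lower()
    return MIME_ALIASES.get(mime, mime)


def recorded_versions(foxml_text):
    """Yield what the FOXML records for every datastreamVersion."""
    root = ET.fromstring(foxml_text)
    for ds in root.findall("foxml:datastream", FOXML_NS):
        for version in ds.findall("foxml:datastreamVersion", FOXML_NS):
            digest = version.find("foxml:contentDigest", FOXML_NS)
            size = version.get("SIZE")
            yield {
                "dsid": ds.get("ID"),
                "version": version.get("ID"),
                "size": int(size) if size is not None else None,
                "mimetype": version.get("MIMETYPE"),
                "created": version.get("CREATED"),
                "digest_type": digest.get("TYPE") if digest is not None else None,
                "digest": digest.get("DIGEST") if digest is not None else None,
            }


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datastreams do not need to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def check_file(record, path):
    """Return a list of problems found when comparing a file with its record."""
    if not os.path.isfile(path):
        return ["missing file"]
    problems = []
    actual_size = os.path.getsize(path)
    if record["size"] is not None and record["size"] != actual_size:
        problems.append(f"size {actual_size} != recorded {record['size']}")
    # Only MD5 digests are compared here; other or disabled types are skipped.
    if record["digest_type"] == "MD5" and record["digest"]:
        if md5_of(path) != record["digest"].lower():
            problems.append("MD5 mismatch")
    return problems
```

An empty list from check_file means nothing contradicted the FOXML record; it
does not prove the file is intact when no checksum was stored.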

--
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
Oh, they have the Internet on computers now. -- Homer Simpson

Jared Whiklo

Apr 11, 2018, 9:36:31 AM
to bgilling...@gmail.com, isla...@googlegroups.com
Moving this from fedora-tech to islandora.

The Islandora Checksum module has a process for re-applying checksums to
already ingested objects.

I would recommend having a look at that module.

cheers,
jared

On 2018-04-10 12:31 PM, bgilling...@gmail.com wrote:
> Jared,
>
> Thanks again!  Yup -- we are using fedora 3.8.1 behind our Islandora sites.
>
> Since we did not get all of the objects processed with the Islandora
> Checksum module, we will need to check them all out another way.  I
> assume that the fedora 3.8 fedora-rebuild script process does in effect
> perform an integrity check as it is rebuilding the database or RI, so we
> may start with that rather than write a custom script that would loop
> through all of the foxml files and then check the related
> datastreamStore files.
>
> Also, moving forward, I will update our fedora.fcfg to set autoChecksum
> to TRUE.
>
> Brian Gillingham
>

--
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
In some cultures what I do would be considered normal.

bgil...@pitt.edu

Apr 11, 2018, 12:15:35 PM
to islandora
Jared,

I appreciate moving this to the Islandora group.

As we had something happen with the NFS mount to our fedora datastore, I am more interested in checking the integrity of these files "as is" because it may end up being a recovery situation for some objects.  Performing checksums on the objects at the current point in time does not really address our current need, but we certainly will be doing this moving forward.

thanks again,

Brian Gillingham

University of Pittsburgh | University Library System



Jared Whiklo

Apr 11, 2018, 12:57:26 PM
to isla...@googlegroups.com
What you sound like you want is to ensure that the image/video/audio
files all got ingested correctly.

You could try to automate this: you'd need to know what type of file to
expect and then use an appropriate tool (identify for images, ffmpeg for
audio/video) and check for a "seemingly" correct return value.
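
A rough sketch of that kind of automation, assuming ImageMagick's identify
and ffmpeg are installed and on the PATH. The function names are made up for
illustration, and treating any ffmpeg error output as a failure is a choice
made for this sketch, since ffmpeg can exit with status 0 even after logging
decode errors.

```python
import subprocess


def probe_command(path, mimetype):
    """Return the external command to run for this MIME type, or None."""
    if mimetype.startswith("image/"):
        # ImageMagick exits non-zero when it cannot read the image.
        return ["identify", path]
    if mimetype.startswith(("audio/", "video/")):
        # Decode the whole file, discard the output, and log only errors.
        return ["ffmpeg", "-v", "error", "-i", path, "-f", "null", "-"]
    return None


def looks_intact(path, mimetype, run=subprocess.run):
    """True or False for probed files, None when no tool applies to the type."""
    cmd = probe_command(path, mimetype)
    if cmd is None:
        return None
    result = run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return False
    # For ffmpeg, any logged error counts as a failure even with exit status 0.
    if cmd[0] == "ffmpeg" and result.stderr.strip():
        return False
    return True
```

A True result only means the tool could read the file; it says nothing about
whether the content is the right content, which is where the human QA comes in.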

Truly what you need is humans to do QA, but I appreciate that it can be
a lot of tedious work.

cheers,
jared

> --
> For more information about using this group, please read our Listserv
> Guidelines: http://islandora.ca/content/welcome-islandora-listserv
> ---
> You received this message because you are subscribed to the Google
> Groups "islandora" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to islandora+...@googlegroups.com
> <mailto:islandora+...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/islandora.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/islandora/a0368f6e-6a29-4a80-8000-6f2a949258eb%40googlegroups.com
> <https://groups.google.com/d/msgid/islandora/a0368f6e-6a29-4a80-8000-6f2a949258eb%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.

--
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
I will always cherish the initial misconceptions I had about you.

bgil...@pitt.edu

Apr 16, 2018, 2:12:17 PM
to islandora
Jared,

We actually had an issue with our NFS server: we cannot reliably reboot it without losing the mount entirely. Before we do anything that might take a long time to resolve, we just want to check whether the files on disk match the objects we expect (that is, match our recent Solr reindex, RI rebuild, and fedora mysql database reindex).  We are pretty sure that they were ingested correctly already.

I realize that this is not a standard issue, but it would possibly be helpful to have a script that just does this integrity checking on the foxml and related datastreams.  I already wrote a file checking routine for the 1.8 million fedora objects and 10+ million datastream files.

The foxml file check is just a loop that tries to call tuque's getObject on the PID values that come from the fedora data objectStore files.  (This reads from an index file that we created by running $ find . -name '*' > objectStore-index.txt from within that folder, which is much faster than grep based on PID values from the mysql fedora index database.)  I have created a separate database for tracking all of these objects and store various values for each object / datastream;  whenever any one of these cannot be found or loaded as expected, the reason is stored in the related objectStore / datastreamStore mysql table.
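
For illustration, here is a small sketch of turning such an index file into PID values. It assumes the objectStore uses the Akubra-style naming in which each file name is the URL-encoded form of info:fedora/<PID> (for example info%3Afedora%2Fislandora%3A123); a repository on legacy storage would name its files differently. The function names are hypothetical.

```python
import os
from urllib.parse import unquote

PREFIX = "info:fedora/"


def pid_from_index_line(line):
    """Return the PID encoded in one line of find output, or None."""
    name = unquote(os.path.basename(line.strip()))
    if not name.startswith(PREFIX):
        return None  # a directory entry or an unrelated file
    pid = name[len(PREFIX):]
    # An object PID looks like namespace:identifier with no further path parts.
    if "/" in pid or ":" not in pid:
        return None
    return pid


def pids_from_index(lines):
    """Yield every PID found in an iterable of index lines."""
    for line in lines:
        pid = pid_from_index_line(line)
        if pid is not None:
            yield pid
```

Passing an open file handle for objectStore-index.txt to pids_from_index streams the PIDs one at a time, so the 1.8 million entries never need to be held in memory.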

The datastreams are checked by reading tuque's ds_info values via a call to $ds_info = $repository->api->m->getDatastream($PID, $dsid, array('validateChecksum' => TRUE)); 
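
As a sketch of what that call amounts to at the REST level: Fedora 3's getDatastream accepts a validateChecksum parameter and returns a datastream profile whose fields correspond to the keys tuque exposes (dsSize, dsChecksumType, dsChecksum, dsChecksumValid). The base URL below is an assumption, the function names are made up, and the HTTP request and authentication are left out. I am not certain what Fedora reports in dsChecksumValid when no checksum is stored, so the sketch reports that case separately rather than trusting the flag.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode


def datastream_profile_url(base_url, pid, dsid):
    """Build the getDatastream URL that asks Fedora to validate the checksum."""
    query = urlencode({"format": "xml", "validateChecksum": "true"})
    return (
        f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}"
        f"/datastreams/{quote(dsid)}?{query}"
    )


def parse_profile(xml_text):
    """Turn a datastreamProfile document into a dict keyed by element name."""
    root = ET.fromstring(xml_text)
    profile = {}
    for child in root:
        # Drop the XML namespace so keys read dsSize, dsChecksum, and so on.
        profile[child.tag.rsplit("}", 1)[-1]] = (child.text or "").strip()
    return profile


def checksum_status(profile):
    """Return 'valid', 'invalid', or 'not-checked' when no checksum is stored."""
    if profile.get("dsChecksumType", "DISABLED") == "DISABLED":
        return "not-checked"
    return "valid" if profile.get("dsChecksumValid") == "true" else "invalid"
```

The 'not-checked' case matters for datastreams ingested while autoChecksum was off: with no stored checksum there is nothing for the repository to compare against.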

It is not the quickest script.  In separate processes, we are processing about 365 objects per minute and about 1,190 datastreams per minute.  So far only one object seems to have disappeared for no good reason; all of the datastreams seem to be fine.
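
To put those rates in perspective, here is a quick back-of-the-envelope calculation. It assumes the rates stay constant for the whole run and takes the datastream count as exactly 10 million, so the second figure is a lower bound.

```python
def days_to_scan(total_items, items_per_minute):
    """Convert an item count and a per-minute rate into days of runtime."""
    minutes = total_items / items_per_minute
    return minutes / (60 * 24)


print(round(days_to_scan(1_800_000, 365), 1))     # objects: prints 3.4
print(round(days_to_scan(10_000_000, 1_190), 1))  # datastreams: prints 5.8
```

So a full pass takes roughly three and a half days for the objects and close to six days for the datastreams.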

I plan to release this utility script to the community after I am sure that it is doing a good job.

thanks!

Brian Gillingham

University of Pittsburgh | University Library System

