Questions about files integrity check

mauro....@cmcc.it

May 18, 2022, 6:46:14 AM
to iRODS-Chat
Dear Users,

Our iRODS users are happy to use iRODS commands (iput and/or irsync) to archive their files on our storage resource.
However, before deleting the source files from the source file system, users would like to be sure that every file saved on the iRODS (destination) storage resource is intact (no corruption during the network transfer, no errors, and so on).

Some users are asking me how to check file integrity during the "put" phase; others are asking how to check it after the files have already been saved on the iRODS storage resource.
Could you please help me answer all of them?

I read that the irsync "-K" option can verify the checksum, but it seems it doesn't save the checksum in the database (in addition, I think it is useful only during the "put" phase).
The ichksum command could be used to check file integrity after the "put" phase, but it is very slow...

I think your experience will help us meet our iRODS users' needs.

Thank you in advance,
Mauro

Terrell Russell

May 18, 2022, 7:31:50 AM
to irod...@googlegroups.com
Hi Mauro,

The irsync -K option *does* calculate/verify and store the checksum in the catalog:

$ ls -l mytest
total 8
-rw-rw-r-- 1 tgr tgr 4 May 18 07:18 again
-rw-rw-r-- 1 tgr tgr 5 Feb 26  2021 thefile

$ irsync -rK mytest i:imytest
Running recursive pre-scan... pre-scan complete... transferring data...
C- /tempZone/home/alice/imytest:
   again                           0.000 MB | 0.089 sec | 0 thr |  0.000 MB/s
   thefile                         0.000 MB | 0.074 sec | 0 thr |  0.000 MB/s

$ ils -L imytest
/tempZone/home/alice/imytest:
  alice             0 demoResc            4 2022-05-18.07:19 & again
    sha2:3Jh27QbVRbZ8nvfhcKPiJaOO3iyjKw8SSS5dabqINcw=    generic    /var/lib/irods/Vault/home/alice/imytest/again
  alice             0 demoResc            5 2022-05-18.07:19 & thefile
    sha2:BXj/69UYoghc2PjZWPMF7enmxBFA7cc3gdze7u8X4cI=    generic    /var/lib/irods/Vault/home/alice/imytest/thefile

The same happens with `iput -rK`.

$ iput -rK mytest puttest
Running recursive pre-scan... pre-scan complete... transferring data...

$ ils -L puttest
/tempZone/home/alice/puttest:
  alice             0 demoResc            4 2022-05-18.07:23 & again
    sha2:3Jh27QbVRbZ8nvfhcKPiJaOO3iyjKw8SSS5dabqINcw=    generic    /var/lib/irods/Vault/home/alice/puttest/again
  alice             0 demoResc            5 2022-05-18.07:23 & thefile
    sha2:BXj/69UYoghc2PjZWPMF7enmxBFA7cc3gdze7u8X4cI=    generic    /var/lib/irods/Vault/home/alice/puttest/thefile


And yes, `ichksum` can be used after the data objects are within iRODS - but like irsync and iput, the calculation of the checksum will be synchronous (and possibly slow, depending on the size of the files being checked).
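
For users who want to confirm a match themselves before deleting the source, one option is to compute the same base64-encoded SHA-256 locally and compare it with the value in the catalog. This is a minimal sketch, not an official tool: the function names are made up for illustration, and it assumes the zone uses the sha2 scheme shown in the `ils -L` output above, with `sha256sum`, `sed` and `base64` available on the client.

```shell
#!/bin/bash
# Sketch only: compare a local file's SHA-256 with the checksum in the iRODS catalog.
# Function names are hypothetical; only iquest is an iRODS command.

# Print the local checksum in the same "sha2:<base64>" form that ils -L shows.
local_sha2() {
    local hex
    hex=$(sha256sum < "$1" | cut -c1-64)
    # Turn the hex digest into raw bytes (\xHH escapes), then base64-encode them.
    printf 'sha2:%s\n' "$(printf "$(sed 's/../\\x&/g' <<<"$hex")" | base64)"
}

# Print the catalog checksum(s) of a data object.
# Args: collection path, data object name.
catalog_sha2() {
    iquest "%s" "select DATA_CHECKSUM where COLL_NAME = '$1' and DATA_NAME = '$2'"
}

# Succeed only if the catalog holds exactly one checksum value and it matches the local file.
# Args: local file, collection path, data object name.
same_checksum() {
    local want got
    want=$(local_sha2 "$1")
    got=$(catalog_sha2 "$2" "$3" | sort -u)
    [[ -n "$got" && "$got" == "$want" ]]
}
```

With the example above, `same_checksum mytest/again /tempZone/home/alice/imytest again` should succeed, and a user could make deleting the local copy conditional on that result.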

If you're interested in asynchronous checksums, you have two options...

1) Via PEPs, you can synchronously put the checksum calculations on the delay queue - so they run later, in the background.

or

2) You can implement a query-based recurring sweeper that checks the catalog for data objects without a checksum, and enqueues them to be calculated.
   This could be a crontab entry at the OS level, or a delay rule itself that is set to run hourly/daily, etc.
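
As a rough illustration of option 2 run from cron, the sweeper could look something like the sketch below. It is a simplification: it calls `ichksum` directly rather than enqueueing onto the delay queue, the function names are hypothetical, and it assumes that a missing checksum is stored as an empty string in the catalog (worth confirming against your own catalog database before relying on it).

```shell
#!/bin/bash
# Sketch only: checksum every data object under a collection that has none yet.
# Assumption: a missing checksum appears as DATA_CHECKSUM = '' in the catalog.

# List logical paths of data objects with an empty catalog checksum.
# Arg: collection to sweep (the second query covers its sub-collections).
list_missing_checksums() {
    local coll=$1
    {
        iquest --no-page "%s/%s" \
            "select COLL_NAME, DATA_NAME where COLL_NAME = '$coll' and DATA_CHECKSUM = ''"
        iquest --no-page "%s/%s" \
            "select COLL_NAME, DATA_NAME where COLL_NAME like '$coll/%' and DATA_CHECKSUM = ''"
    } 2>/dev/null | grep '^/' | sort -u   # keep only path lines, drop "no rows" messages
}

# Compute and register the missing checksums, one object at a time.
# Returns non-zero if any ichksum call failed.
sweep_missing_checksums() {
    local path failed=0
    while IFS= read -r path; do
        ichksum "$path" </dev/null >/dev/null \
            || { echo "checksum failed: $path" >&2; failed=1; }
    done < <(list_missing_checksums "$1")
    return "$failed"
}
```

A small wrapper script that calls `sweep_missing_checksums /tempZone/home/alice` could then be scheduled hourly or daily from crontab.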

This same conversation is happening in this issue:

I hope that helps answer your question.

Terrell

--
--
The Integrated Rule-Oriented Data System (iRODS) - https://irods.org
 
iROD-Chat: http://groups.google.com/group/iROD-Chat
---
You received this message because you are subscribed to the Google Groups "iRODS-Chat" group.
To unsubscribe from this group and stop receiving emails from it, send an email to irod-chat+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/irod-chat/0d02a1a0-179b-49fc-822e-42bfa7ddc498n%40googlegroups.com.

jc...@sanger.ac.uk

May 18, 2022, 9:16:45 AM
to iRODS-Chat
Hi Mauro,

Some words of caution here, depending on which version you are on. 

In 4.2.7, neither `irsync` nor `iput -r` reliably uploads files recursively, at least not for us with a composite tree which has a replication resource in the mix.

We've seen variants of:
and our workaround until we upgrade past 4.2.9 is to give users a small shell script:

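```
#!/bin/bash
# $1 = sub-collection under /archive/HCA, $2 = local directory to upload

set -euo pipefail

# NUL-delimited read so file names containing spaces or newlines survive
while IFS= read -r -d '' FID
do
    # extract the relative path to the file, sans file name
    RELPATH=$(dirname "$FID")
    # create the path on iRODS; imkdir -p won't error if the path already exists
    imkdir -p "/archive/HCA/$1/$RELPATH"
    # upload the file, verifying the checksum (-K)
    iput -K "$FID" "/archive/HCA/$1/$RELPATH"
done < <(find "$2" -type f -print0)
```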

Hope that helps and doesn't scare!

John

Mauro Tridici

May 18, 2022, 10:07:29 AM
to irod...@googlegroups.com
Hi John, Hi Terrell,

thank you for your support.
I will merge your answers to try to define a solution for our users.

Meanwhile, in the irsync manual, I found this about the "-s" option:

-s   use the size instead of the checksum value for determining
      synchronization.

It is not a checksum, but it could be enough (for the moment).

Question: do you know what the "-s" option does? Does it use the logical size of the file (something like the Linux "stat" command reports) or the size on disk (something like the Linux "du" command reports)?
This is important for me because the source file system and the destination iRODS storage resource have different block sizes.
Source and destination files could show a different "size on disk" even though the file was transferred successfully.

Thank you in advance,
Mauro

Terrell Russell

May 18, 2022, 10:25:53 AM
to irod...@googlegroups.com
Yes, the idea is that -s uses only the size to determine whether to sync.

The sync will check the size of the source vs the size of the destination.

If either of those is in the catalog, then the catalog value will be used.
If not, then the size reported by a local stat() operation will be used.
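
To make the stat-versus-du distinction concrete, here is a small sketch that prints both numbers for a file. It assumes GNU coreutils `stat` (the `-c` format option), and it assumes that the size compared by -s is the logical byte count (`st_size`), which is not confirmed here; the function names are illustrative only.

```shell
#!/bin/bash
# Sketch only: logical size (stat's st_size) vs. space allocated on disk.
# Assumes GNU stat; function names are hypothetical.

# Logical size in bytes; does not depend on the file system's block size.
logical_size() {
    stat -c %s "$1"
}

# Bytes allocated on disk: number of allocated blocks times the unit
# in which stat reports them. This is the figure du is based on.
allocated_size() {
    local blocks unit
    read -r blocks unit < <(stat -c '%b %B' "$1")
    echo $(( blocks * unit ))
}

# Demo on a 5-byte file: the logical size is 5, while the allocated size
# depends on the file system it happens to live on.
f=$(mktemp)
printf 'hello' > "$f"
echo "logical:   $(logical_size "$f") bytes"
echo "allocated: $(allocated_size "$f") bytes"
rm -f "$f"
```

If that assumption holds, two copies of the same file on file systems with different block sizes would report the same logical size even when their allocated sizes differ.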

The different block size is an interesting wrinkle. We'd like to learn whether that matters in your case.

Terrell


