Questions about files integrity check

mauro....@cmcc.it

May 18, 2022, 6:46:14 AM

to iRODS-Chat

Dear Users,

our iRODS users are happy to use iRODS commands (iput and/or irsync) to archive their files on our storage resource.

However, before deleting the source files from the source file system, users want to be sure that every file saved on the iRODS (destination) storage resource is intact (no corruption during the network transfer, no errors, and so on).

Some users are asking me how to check file integrity during the "put" phase; others are asking how to check it after the files have already been saved on the iRODS storage resource.

Could you please help me answer them?

I read that the irsync "-K" option can verify the checksum, but it seems it doesn't save the checksum in the database (and I think it is useful only during the "put" phase).

The ichksum command could be used to check file integrity after the "put" phase, but it is very slow...

I think your experience will help us meet our iRODS users' needs.

Thank you in advance,

Mauro

Terrell Russell

May 18, 2022, 7:31:50 AM

to irod...@googlegroups.com

Hi Mauro,

The irsync -K option *does* calculate/verify and store the checksum in the catalog:

$ ls -l mytest
total 8
-rw-rw-r-- 1 tgr tgr 4 May 18 07:18 again
-rw-rw-r-- 1 tgr tgr 5 Feb 26 2021 thefile

$ irsync -rK mytest i:imytest

Running recursive pre-scan... pre-scan complete... transferring data...
C- /tempZone/home/alice/imytest:
again 0.000 MB | 0.089 sec | 0 thr | 0.000 MB/s
thefile 0.000 MB | 0.074 sec | 0 thr | 0.000 MB/s

$ ils -L imytest
/tempZone/home/alice/imytest:
alice 0 demoResc 4 2022-05-18.07:19 & again
sha2:3Jh27QbVRbZ8nvfhcKPiJaOO3iyjKw8SSS5dabqINcw= generic /var/lib/irods/Vault/home/alice/imytest/again
alice 0 demoResc 5 2022-05-18.07:19 & thefile
sha2:BXj/69UYoghc2PjZWPMF7enmxBFA7cc3gdze7u8X4cI= generic /var/lib/irods/Vault/home/alice/imytest/thefile

The same happens with `iput -rK`.

$ iput -rK mytest puttest
Running recursive pre-scan... pre-scan complete... transferring data...

$ ils -L puttest
/tempZone/home/alice/puttest:
alice 0 demoResc 4 2022-05-18.07:23 & again
sha2:3Jh27QbVRbZ8nvfhcKPiJaOO3iyjKw8SSS5dabqINcw= generic /var/lib/irods/Vault/home/alice/puttest/again
alice 0 demoResc 5 2022-05-18.07:23 & thefile
sha2:BXj/69UYoghc2PjZWPMF7enmxBFA7cc3gdze7u8X4cI= generic /var/lib/irods/Vault/home/alice/puttest/thefile

And yes, `ichksum` can be used after the data objects are within iRODS - but like irsync and iput, the calculation of the checksum will be synchronous (and possibly slow, depending on the size of the files being checked).
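For users who want an independent confirmation before deleting their source files, one possible approach is to recompute the checksum locally and compare it with the value in the catalog. The sketch below assumes the zone uses the `sha2:` scheme shown in the `ils -L` output above (a base64-encoded SHA-256 digest), that `openssl` and `base64` are installed, and that `iquest` is on the PATH. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: compare a local file's checksum with the iRODS catalog value.
# Assumes the sha2 scheme (base64-encoded SHA-256), as in the ils -L output.

# local_irods_checksum FILE
# Prints the checksum of FILE in the "sha2:<base64>" form used by the catalog.
local_irods_checksum() {
    printf 'sha2:%s\n' "$(openssl dgst -sha256 -binary "$1" | base64)"
}

# catalog_checksum COLLECTION DATA_NAME
# Prints the checksum stored in the catalog for one data object.
catalog_checksum() {
    iquest "%s" "select DATA_CHECKSUM where COLL_NAME = '$1' and DATA_NAME = '$2'"
}

# verify_upload LOCAL_FILE COLLECTION
# Returns 0 if the local and catalog checksums match, 1 otherwise.
verify_upload() {
    name=$(basename "$1")
    expected=$(local_irods_checksum "$1")
    # An object with several replicas may return several rows; take the first.
    actual=$(catalog_checksum "$2" "$name" | head -n 1)
    if [ "$expected" = "$actual" ]; then
        echo "OK $2/$name"
    else
        echo "MISMATCH $2/$name (local $expected, catalog $actual)" >&2
        return 1
    fi
}
```

For example, `verify_upload mytest/thefile /tempZone/home/alice/puttest` would print an `OK` line only if the two values agree. Note that this compares against what the catalog recorded at upload time; it does not re-read the bytes in the vault, which is what `ichksum` with its verification option is for.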

If you're interested in asynchronous checksums, you have two options...

1) Via PEPs, you can synchronously put the checksum calculations on the delay queue - so they run later, in the background.

or

2) You can implement a query-based recurring sweeper that checks the catalog for data objects without a checksum and enqueues them for calculation.

This could be a crontab entry at the OS level, or a delay rule itself that is set to run hourly/daily, etc.
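As a rough illustration of the second option, a sweeper driven from the OS level might look like the sketch below. It assumes that data objects without a checksum have an empty `DATA_CHECKSUM` in the catalog, that `iquest` and `ichksum` are on the PATH of an authenticated user, and that `iquest` reports an empty result with a line starting with `CAT_NO_ROWS_FOUND`. The function names are hypothetical, and this runs the checksums directly rather than placing them on the delay queue.

```shell
#!/bin/sh
# Sketch only: find data objects with no checksum and calculate one for each.

# list_unchecksummed COLLECTION_PREFIX
# Prints one logical path per line for data objects lacking a checksum.
# The prefix is matched with "like 'PREFIX%'", so it also matches sibling
# collections whose names start with the same characters.
list_unchecksummed() {
    iquest --no-page "%s/%s" \
        "select COLL_NAME, DATA_NAME where COLL_NAME like '$1%' and DATA_CHECKSUM = ''"
}

# sweep_missing_checksums COLLECTION_PREFIX
# Runs ichksum on every path reported by list_unchecksummed.
sweep_missing_checksums() {
    list_unchecksummed "$1" | while IFS= read -r path; do
        case "$path" in
            # Skip blank lines and the "nothing found" message.
            ''|CAT_NO_ROWS_FOUND*) continue ;;
        esac
        ichksum "$path"
    done
}
```

Saved as a script with a final line such as `sweep_missing_checksums /tempZone/home/alice`, it could be run from an hourly crontab entry like `0 * * * * /usr/local/bin/irods_checksum_sweep.sh` (the script path is hypothetical).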

This same conversation is happening in this issue:

https://github.com/irods/irods/issues/6385#issuecomment-1129114084

I hope that helps answer your question.

Terrell

--
--
The Integrated Rule-Oriented Data System (iRODS) - https://irods.org

iROD-Chat: http://groups.google.com/group/iROD-Chat
---
You received this message because you are subscribed to the Google Groups "iRODS-Chat" group.
To unsubscribe from this group and stop receiving emails from it, send an email to irod-chat+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/irod-chat/0d02a1a0-179b-49fc-822e-42bfa7ddc498n%40googlegroups.com.

jc...@sanger.ac.uk

May 18, 2022, 9:16:45 AM

to iRODS-Chat

Hi Mauro,

Some words of caution here, depending on which version you are on.

In 4.2.7, neither `irsync` nor `iput -r` reliably uploads files recursively, at least not for us, with a composite tree that has a replication resource in the mix.

We've seen variants of:

https://github.com/irods/irods/issues/3093

https://github.com/irods/irods/issues/5742

https://github.com/irods/irods/issues/5288

https://github.com/irods/irods/issues/4657

Our workaround until we upgrade past 4.2.9 is to give users a small shell script:

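```
#!/bin/bash
# Arguments, as used below:
#   $1: name of the collection under /archive/HCA/ to upload into
#   $2: local directory to upload
set -euo pipefail

# Read NUL-delimited paths so file names containing spaces survive intact.
find "$2" -type f -print0 | while IFS= read -r -d '' FID
do
    # Extract the relative path to the file, sans file name.
    RELPATH=$(dirname "$FID")
    # Create the path on iRODS. imkdir -p won't error if the path already exists.
    imkdir -p "/archive/HCA/$1/$RELPATH"
    # Upload the file with -K so the checksum is verified and stored.
    iput -K "$FID" "/archive/HCA/$1/$RELPATH"
done
```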

Hope that helps and doesn't scare!

John

Mauro Tridici

May 18, 2022, 10:07:29 AM

to irod...@googlegroups.com

Hi John, Hi Terrell,

thank you for your support.

I will combine your answers to define a solution for our users.

Meanwhile, in the irsync manual, I found this about the "-s" option:

-s use the size instead of the checksum value for determining synchronization.

It is not a checksum, but it could be enough (for now).

Question: do you know what the "-s" option does? Does it compare the size of the file (as reported by something like the Linux "stat" command) or the size on disk (as reported by something like the Linux "du" command)?

It is important for me because the source file system and the destination iRODS storage resource have different block sizes.

Source and destination files could show a different "size on disk" even though the file was transferred successfully.

Thank you in advance,

Mauro


Terrell Russell

May 18, 2022, 10:25:53 AM

to irod...@googlegroups.com

Yes, the idea is that -s uses only the size to determine whether to sync.

The sync will check the size of the source vs the size of the destination.

If either of those is in the catalog, then that value will be used.

If not, then the size reported by a local stat() operation will be used.

The different block size is an interesting wrinkle; we'd like to learn whether it matters in your case.
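To make the distinction between the two kinds of size concrete, here is a small sketch. It assumes GNU coreutils (`stat -c`) on Linux, and the function names are hypothetical. The byte length reported by stat() does not depend on the file system's block size, whereas the allocation reported by `du` does.

```shell
#!/bin/sh
# Sketch only: byte length versus space allocated on disk for one file.

# apparent_size FILE
# Prints the file length in bytes (st_size from stat()); independent of
# the block size of the underlying file system.
apparent_size() {
    stat -c %s "$1"
}

# allocated_kib FILE
# Prints the space allocated on disk in KiB, as du reports it; this value
# is rounded to whole blocks and so varies with the file system.
allocated_kib() {
    du -k "$1" | cut -f1
}
```

For a 5-byte file such as `thefile` above, `apparent_size` prints `5` on any file system, while `allocated_kib` typically prints the size of one block (for example `4`).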

Terrell
