Checksum data object in compound archive and unstick delayed execution

J.P. Mc Farland

Mar 16, 2021, 7:52:04 AM3/16/21
to iRODS-Chat
Hello,

There are two issues to be considered here:

1.  We are connecting a Tivoli Storage Manager (TSM) to iRODS 4.2.8 using UnivMSS.  I have created an API that replicates a data object to the compound resource connected to the TSM, verifies with checksums that the data was replicated correctly, trims all but the replica stored on the compound archive, and replicates the archived data object back to the compound cache when required.  The auto_repl switch is turned off, and I use msisync_to_archive in acPostProcForPut to add a replica to the compound archive.
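
Schematically, the flow is equivalent to something like the following rule (the resource names and option strings are placeholders, and the microservice argument lists are from memory, so treat this as a sketch rather than our actual implementation):

archiveDataObj(*objPath) {
    # Replicate to the compound resource; the data lands on its cache first.
    msiDataObjRepl(*objPath, "destRescName=compResc", *replStatus);
    # Verify: (re)compute and compare checksums across the replicas.
    msiDataObjChksum(*objPath, "ChksumAll=", *chksum);
    # Trim the replica(s) on the source resource, keeping the compound's.
    msiDataObjTrim(*objPath, "sourceResc", "null", "1", "null", *trimStatus);
}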

If none of the replicas have been checksummed prior to replicating to the compound resource, the verify procedure fails at the compound resource with a HIERARCHY_ERROR when it attempts to checksum the replica on the compound archive.  If the compound cache replica is checksummed before it is synced to the compound archive, the checksum appears simply to be copied to the new replica, since direct attempts to checksum that replica always result in a HIERARCHY_ERROR.  I have verified this checksum copying separately, outside the compound resource.

The whole point of the verify procedure is to checksum the data that has been written to tape, to confirm that the replica is faithful before trimming all other replicas.  This appears to be impossible to accomplish.  The behavior is the same whether I use a real TSM interface or a placeholder unixfilesystem for the UnivMSS archive resource.

What is actually happening here?  Is the pre-calculated checksum indeed just being copied?  Is it at all possible to checksum the replica stored on the compound archive?

2.  In addition, I have noticed while testing on both production and local systems that delayed rules quite often (a) fail to execute and remain in the queue indefinitely, (b) execute yet still remain in the queue indefinitely, or (c) appear not to make it into the queue at all.  These circumstances are detrimental to this effort, as we are relying on a delayed rule to execute the vital msisync_to_archive microservice.  If that doesn't happen, data objects remain in the cache and never make it into the archive.

So far, I have been able to determine that if the rulebase file containing the delay rule is modified, delayed rules tend to fail to execute.  If iRODS is restarted after the modification, everything works as expected.  Also, if new delay rules are submitted after a stuck rule, they tend not to execute, even if the stuck rule(s) are removed with iqdel.  Again, iRODS must be restarted for the new rules to execute.  I find this behavior quite bizarre!

Help diagnosing why this happens and how to fix or work around it would be greatly appreciated.

--John Mc Farland (University of Groningen)

Terrell Russell

Mar 16, 2021, 2:58:49 PM3/16/21
to irod...@googlegroups.com
Hi John,


Issue 1... direct checksum of replica on archive

No, it is not possible to directly checksum the replica on the compound archive.  iRODS must currently read the content of the file to the cache to calculate a checksum (a file could be arbitrarily large and bigger than the system's available memory).

We have discussed adding a checksum operation to the resource plugin interface itself - allowing a storage medium to give us a pre-calculated checksum value if it has one (as multiple S3-compatible devices on the market already do).  See https://github.com/irods/irods/issues/3127  Does TSM have that capability?

Happy to think through other means to meet your requirements.  Here, or on a call where the bandwidth might be higher.



Issue 2... delayed rule behavior

Your description of requiring a restart when the policies (rulebases) change might be easier to explain and understand with a bit more detail.

Are these the correct steps to reproduce the error(s) you're seeing?

1 - add a rule to the delay queue, calling a rule defined in an existing, live rulebase
2 - make an unrelated change to the existing, live rulebase on the server where the rule is to be run
3 - see the delayed rule execute, and fail 

What is the error you're seeing?  The rule is stuck?  What is stuck?



Thanks,

Terrell





J.P. Mc Farland

Mar 18, 2021, 2:22:19 PM3/18/21
to iRODS-Chat
Hi Terrell,

Thanks for the prompt and detailed response.

The checksumming news is unfortunate, but not unexpected.  There are CRC values (per tape block) in TSM that we might be able to use somehow.  I will look into that eventually, but that CRC would also have to be computed for the local data file, so there would effectively be two separate checksums to store.  There's no huge rush for this because we're just beginning real testing.

I think the most secure method for now is replicating back to the cache with a forced checksum, after first trimming the cache of its replica but not the original source replica(s).  Expensive, but reliable.  What might potentially be a bit cheaper is to checksum the data stream from the archive and just discard it instead of writing it.  That way, an extra replica's worth of storage space is saved where it matters, and a filesystem write operation is avoided.  This method may not be achievable in all cases, however (streaming multiple threads of multiple large files could be a challenge).  I've used this trick in Python to extract headers and checksum only the data segments of local and remote files.  It can work quite well.  Food for thought.
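
In rule-language terms, the first method would look roughly like this (resource names are placeholders, and it is an assumption that the checksum call alone forces the restage from the archive once the cache replica is gone; otherwise an explicit replication back to the cache would be needed first):

verifyArchivedReplica(*objPath) {
    # 1. Drop the cache replica so the bytes must come back from the archive.
    msiDataObjTrim(*objPath, "cacheResc", "null", "1", "null", *trimStatus);
    # 2. Force a fresh checksum; reading through the compound resource is
    #    assumed to stage the archived bytes back onto the cache first.
    msiDataObjChksum(*objPath, "forceChksum=", *chksum);
    writeLine("serverLog", "Restaged checksum for *objPath: *chksum");
}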

As for the delayed rule issue, the definition of "sticking" here is that, regardless of whether or not the rule is executed, the entry seen with iqstat remains in the queue indefinitely unless it is removed with iqdel.  Its presence in the queue appears to prevent execution of any delayed rules following it.  In fact, on one of our production systems I recently discovered several stuck rules that cannot be removed, and new delay rules never execute!  There is no error message I can find that appears related to any of them.  I have determined definitively that these rules don't execute by simply adding a writeLine("serverLog", ...) call at the very beginning (see the code example below).  When the rule sticks, I never see this message.

I'll try to work up an isolated example on a clean system to demonstrate this, but to clarify the steps:

1 - have a rule defined in an existing, live rulebase that will submit a delay rule to the queue (see code example below)
2 - make an unrelated change to the existing, live rulebase on the server where the rule is to be run
3 - run the defined rule
4 - see the delayed rule submitted by the defined rule
5 - see the delayed rule succeed or fail to execute, leaving the entry in the delayed rule queue (seen with iqstat)
6 - see subsequent delayed rules stay in the queue without executing

acPostProcForPut {
    writeLine("serverLog", "Submitting delayed rule for $objPath");
    # Capture the session variable; it is not available inside the delayed body.
    *objPath = $objPath;
    delay("<PLUSET>0s</PLUSET><EF>1m DOUBLE UNTIL SUCCESS OR 8 TIMES</EF><INST_NAME>irods_rule_engine_plugin-irods_rule_language-instance</INST_NAME>") {
        writeLine("serverLog", "Executing delayed rule for *objPath");
        # The actual msisync_to_archive call goes here.
    }
}

Varying the PLUSET and EF values seems to have no effect.  We are currently using a separate rule to run the delay function, due to packing issues when using GenQuery; the same thing happens there as well.

Cheers,


-=John

Terrell Russell

Mar 23, 2021, 8:38:52 PM3/23/21
to irod...@googlegroups.com
J.P.,

We have reproduced and isolated the 'rules staying in the delay queue' issue.


We have not yet gone back in time to see how long this particular functionality has been broken.

Terrell





J.P. Mc Farland

Mar 26, 2021, 8:44:32 AM3/26/21
to iRODS-Chat
Hi Terrell,

Thanks!  This will definitely help.  Is there any recommendation until it is fixed?  Perhaps just not using <EF>?

This reminds me: I have noticed in the past that the unit suffix in <PLUSET> appears not to be honored, though I haven't checked rigorously.  For simplicity, I have always used only seconds in the very few cases I wrote.  Perhaps that is related to this parsing bug as well.

Looking back at the checksum of a data object in the compound archive, I noticed in the irepl help text that the checksum is indeed calculated for the destination copy:

Note that if the source copy has a checksum value associated with it, a checksum will be computed for the replicated copy and compare with the source value for verification.

How is that done in this case?  From my perspective, the checksum appears to be copied when replicating to the compound archive.

Cheers,


--John

Terrell Russell

Mar 26, 2021, 9:05:48 AM3/26/21
to irod...@googlegroups.com
Recommendation for now, until we understand the extent of the problem better: yes, do not depend on EF if you don't need to.

Indeed, you're on a roll - the PLUSET unit bug is https://github.com/irods/irods/issues/4055

Regarding the checksum after replication... I believe that is how irepl (and replication in general) is intended to work - but the case of a compound archive could still be a slightly different codepath; it could ignore the flag and just copy the checksum value (again, because it could be tape).  Also, all of this logic and code has been rewritten in the last few months in the run-up to 4.2.9.  So I think the help text is correct, if not quite... comprehensive.

I hope this explanation helps you make some decisions - I realize it's not the strongest of answers.  This process is how we make it better.

Thanks,

Terrell



J.P. Mc Farland

Mar 26, 2021, 12:23:52 PM3/26/21
to iRODS-Chat
Hi Terrell,

Thanks for all the help!  I now have some clarification at least, and can continue with the project with functional work-arounds for now.  If I happen to come up with any useful insights or solutions along the way, I'll be sure to pass them along.

Cheers,


--John 

Jean-Yves Nief

Oct 25, 2021, 10:58:59 AM10/25/21
to iRODS-Chat
hello,

      on the subject of the checksum for the compound resource: as John mentioned, the checksum which has been calculated on the cache resource is being copied directly to the replica on the compound resource (v4.2.10 in production).
In former iRODS releases (v3), when the file /<physical path>/filename was copied to the compound resource (and the latter does not have the possibility to do checksumming), a copy of this replica was made again on the cache resource under the path /<physical path>/filename.<timestamp>, and the checksum was then added to the compound replica.
There is an extra step (staging files again, which is not necessarily bad, since the data can still be on disk on the compound resource before the true internal migration to tape, provided you are not writing to tape directly), but it handled this case gracefully.
Would it be possible to have this in 4.2?
cheers,
JY

Alan King

Oct 25, 2021, 1:56:21 PM10/25/21
to irod...@googlegroups.com
Just to make sure I understand what you would like...

Copying the checksum calculated on the replica in the cache resource to the replica in the archive resource is not the expected/desired behavior (as seen here, in 4.0.0: https://github.com/irods/irods/blob/659c77f2168d89894e5d4f54a8a52ad8a14985b0/iRODS/server/api/src/rsDataObjClose.cpp#L811-L818).

The legacy behavior (<4.x) - and what I'm perceiving to be the expected/desired behavior - would actually copy the at-rest replica in the archive to the cache resource, perform the checksum on that file, and then copy that checksum to the catalog entry for the replica in the archive resource (as seen here, in 3.3.1: https://github.com/irods/irods-legacy/blob/ff4eaa47a34f1bb5990d5560f825975c26bab118/iRODS/server/core/src/physPath.c#L498-L535).

We can think of 3 options for checksumming a replica in an archive resource:

1. Copy the checksum of the source replica (current behavior; not necessarily representative of actual bits in the archive)
2. Perform the checksum directly on the replica in the archive (as you noted, this is probably not possible)
3. Don't record the checksum (this is viable because of the parenthetical notes on options 1 and 2 - the true representation is not possible to obtain)

The legacy behavior proposed for restoration would be option #4.  What is the use case for this?  Why would the copy from the archive to the cache be any more trustworthy than that provided in option #1?



--
Alan King
Software Developer | iRODS Consortium

Jean-Yves Nief

Oct 26, 2021, 6:34:35 AM10/26/21
to irod...@googlegroups.com, Alan King
Thanks for your answer.
See my response below:

Alan King wrote:
> [...]
> The legacy behavior proposed for restoration would be option #4.
> What is the use case for this? Why would the copy from the archive to
> the cache be any more trustworthy than that provided in option #1?
It is because in most use cases, data will be ingested into the compound resource through an upload, and hence will arrive first on the cache resource (e.g., "iput -K").  That certifies that the copy stored in iRODS on the cache resource is exactly the same as the original data sent by the producer.  From the cache resource, the files will be migrated to the archive resource, and a checksum is also needed in that process to check the consistency of the data against the cache version, which has already been validated as a good version of the file (i.e., not altered from the original).
cheers,
JY

Alan King

Oct 27, 2021, 6:47:44 PM10/27/21
to Jean-Yves Nief, irod...@googlegroups.com
Okay, I think it's becoming clearer. Thanks for your patience :)

As you explained, the checksum on the replica in the cache resource only verifies that the data arriving on the cache resource from some source is correct.  The checksum which would be performed on the replica in the archive resource would verify that the data replicated from the cache into the archive is correct.  Seeing as performing a checksum on the replica in the archive resource is not possible by definition (i.e. tape), your proposed approach (that is, the legacy behavior) is to replicate the data back to the cache resource and perform the checksum on that data, counting it as verification for the replica in the archive.

I think this explains the difference in how data verification is performed in the compound resource between then and now.  My question has more to do with why this data verification is any better than using the data verification from the cache resource.

The current approach - that is, simply using the checksum from the replica on the cache resource - is an act of submission to the return code from the sync-to-archive operation.  If the sync-to-archive operation returns successfully, the compound resource trusts that claim of success, which may or may not include some kind of underlying data verification built into the archival technology.  If the sync-to-archive operation returns with an error, the compound resource trusts that claim of failure.  In summary, the default policy of the compound resource is to trust the return code of the sync-to-archive operation as defined by the storage technology of the archive's resource plugin type.

From the perspective of the Consortium, replicating the data back to a filesystem to perform a checksum is not truly representative of the data in the archive because the data in the archive is not the data being verified. The verification is being performed on different bits in physical space and time. Therefore, if the technology storing the replica in the archive says that the data is good, then the data is good. Anything beyond that - again, in the view of the Consortium - is comfort for the user. The reason why the Consortium views the 4.0.0+ approach as sufficient for verifying the replica in the archive is because there is no way to truly verify the data on the archive by definition at the iRODS layer.

This is merely a defense of the decision to make this behavior the default policy of the compound resource plugin.  Perhaps an example implementation of a policy which captures the use case you have in mind might be worth trying?
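
As a starting point, a hypothetical sketch of such a policy in the rule language (the resource name, the error code, and the source of the ingest-time checksum are all placeholders - this is not stock behavior):

verifyArchiveViaCache(*objPath, *ingestChksum) {
    # Stage the at-rest archive bytes back onto the cache resource.
    msiDataObjRepl(*objPath, "destRescName=cacheResc", *replStatus);
    # Checksum the freshly staged copy.
    msiDataObjChksum(*objPath, "forceChksum=", *stagedChksum);
    # Compare against the checksum recorded at ingest; fail loudly on mismatch.
    if (*stagedChksum != *ingestChksum) {
        failmsg(-1, "archive verification failed for *objPath");
    }
}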

We have also been discussing the idea of moving checksum calculation into the resource plugins themselves, and therefore closer to the bits.  Here is the issue for this: https://github.com/irods/irods/issues/3127.  This would allow for more control over verification of bits in the archive by the underlying technology, on the actual data in time and space which we wish to verify.

J.P. Mc Farland

Nov 30, 2021, 7:32:21 AM11/30/21
to iRODS-Chat
Hi Alan,

As Jean-Yves says, option 4 is more desirable here because the source of the bytes is different.  In option 1, the test says that the replica copied from the incoming data is consistent with the original, not with what was written to the archive.  In option 4, the test says that the replica copied back from the archive is consistent with the replica originally copied to the cache, which was already tested in option 1.  From this perspective, they are complementary, not interchangeable.  As you point out, data in the archive cannot be checksummed by iRODS, but option 4 gives something more representative of what is actually there.

This is the approach I certainly would prefer and will ultimately implement as part of our archiving workflow.  If it becomes a default part of iRODS, all the better.

Cheers,


--John

Terrell Russell

Jan 13, 2022, 2:15:07 PM1/13/22
to irod...@googlegroups.com
We have implemented support for two additional flags in the rsDataObjRepl() API, which should give some more control over the desired behavior when replicating to an archive resource that cannot efficiently checksum a replica.

The truth table is here...


Terrell



