no checksum in S3 resource on irods 4.3.1+


Carsten Grzemba

May 13, 2024, 9:51:31 AM
to iRODS-Chat
I have configured a cached S3 resource:
$ ilsresc lza
lza:passthru
└── archive:compound
    ├── archives3:s3
    └── cacheRescS3:unixfilesystem
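For reference, a hierarchy like this is typically assembled with iadmin along the following lines (the hostname, auth-file path and the abbreviated S3 context string below are placeholders, not the exact configuration used here):
$ iadmin mkresc lza passthru
$ iadmin mkresc archive compound
$ iadmin mkresc cacheRescS3 unixfilesystem irods-host:/data-s3cache
$ iadmin mkresc archives3 s3 irods-host:/culda-ue-prod-bucket/irods "S3_DEFAULT_HOSTNAME=s3.example.org;S3_AUTH_FILE=/etc/irods/s3.keypair;..."
$ iadmin addchildtoresc lza archive
$ iadmin addchildtoresc archive cacheRescS3 cache      # designate the cache child
$ iadmin addchildtoresc archive archives3 archive      # designate the archive child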

I get checksums only on the cache resource:
$ ils -L /culdaef/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz
  rods              0 lza;archive;cacheRescS3        51994 2024-05-13.15:37 & smoke_test.2024-05-13+15-37-25.tgz
    b13dfc6936eeed54d6532b5e75f83aa0    generic    /data-s3cache/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz
  rods              1 lza;archive;archives3        51994 2024-05-13.15:37 & smoke_test.2024-05-13+15-37-25.tgz
        generic    /culda-ue-prod-bucket/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz

If I try to generate checksums for all replicas, I get an error:
$ ichksum -a /culdaef/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz
remote addresses: xx.xx.xx.xx ERROR: chksumDataObjUtil: rcDataObjChksum error for /culdaef/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz status = -1814000 DIRECT_ARCHIVE_ACCESS
Level 0: Could not compute/update checksum for replica [1].
remote addresses: xx.xx.xx.xx ERROR: chksumUtil: chksum error for /culdaef/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz, status = -1814000 status = -1814000 DIRECT_ARCHIVE_ACCESS

For the S3 resource the following parameters are set:
S3_PROTO=HTTPS;S3_WAIT_SEC=5;S3_RETRY_COUNT=5;S3_ENABLE_MD5=1;ARCHIVE_NAMING_POLICY=consistent;S3_REGIONNAME=us-east-1;S3_MPU_CHUNK=500
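For reference, a context string like this is applied (or later changed) with iadmin modresc, roughly:
$ iadmin modresc archives3 context "S3_PROTO=HTTPS;S3_WAIT_SEC=5;S3_RETRY_COUNT=5;S3_ENABLE_MD5=1;ARCHIVE_NAMING_POLICY=consistent;S3_REGIONNAME=us-east-1;S3_MPU_CHUNK=500"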

This worked in 4.2 but does not work in 4.3.1 or 4.3.2.

How can I fix this?

James, Justin Kyle

May 13, 2024, 2:44:36 PM
to iRODS-Chat
I've reproduced this error with both an S3 archive resource and a local unixfilesystem resource as the archive.

I think this eliminates the S3 resource plugin as the culprit.  We'll have to look at it more.


Kory Draughn

May 13, 2024, 3:18:02 PM
to irod...@googlegroups.com
Hi Carsten,

The behavior you're witnessing is expected/correct.


A replica created on an archive resource in a compound resource hierarchy by a sync-to-archive operation will not have its checksum calculated, and no checksum will be applied to the replica's entry in the catalog. This is because the resource plugin may not support calculating the checksum, or doing so may be extremely expensive. Historically, the server attempted to calculate the checksum on archive resources, and the resulting error, DIRECT_ARCHIVE_ACCESS, was returned and caught internally. It was then waved away as a non-error, and the replica on the archive resource received the checksum of the replica on the cache resource. This is no longer done because that recorded checksum cannot be trusted. Therefore, no checksum is recorded for replicas synced to archive resources.

That is also true for 4.2.12.

Have you considered replacing the compound resource with a single S3 resource configured to run in cacheless mode? See the following:

We have an issue open in irods/irods to move the checksum logic into the resource plugins. That will make it possible for each resource to take advantage of the storage device's checksum capabilities. See the following:
Until that issue is resolved, checksums on replicas in an archive resource are not supported.
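In the meantime, a checksum can still be computed or verified against the cache replica alone by targeting its replica number, for example (replica 0 is the cache replica in the listing above):
$ ichksum -n 0 /culdaef/aip/TEST/smoke_test.2024-05-13+15-37-25.tgz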

Kory Draughn
Chief Technologist
iRODS Consortium



Carsten Grzemba

unread,
May 14, 2024, 3:28:00 AMMay 14
to irod...@googlegroups.com
Is it possible to migrate from a cached S3 resource to a cacheless S3 resource without touching or migrating the already stored data?

James, Justin Kyle

May 14, 2024, 10:47:42 AM
to irod...@googlegroups.com
There should be no issue in migrating from cache to cacheless. (A rough command sketch follows the steps below.)

  1. First, trim the replicas that are on the cache (UFS) resource. If there happen to be data objects that are only on the cache and not in the archive, you can do one of the following:
    1. Replicate the data object to the archive and then trim it from the cache.
    2. Leave it as is; that data object will remain on the UFS resource.
  2. Remove both the cache and archive resources from the compound resource (iadmin rmchildfromresc) and delete the compound resource.
  3. Update HOST_MODE in the context string to cacheless_attached.
  4. If there are no replicas on the cache UFS resource, it can be removed.
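A rough command sketch of those steps, using the resource names from this thread (the collection path is only an example; note that iadmin modresc ... context replaces the entire context string, so the full existing string must be re-supplied with HOST_MODE added):
$ itrim -rM -N 1 -S cacheRescS3 /culdaef/aip          # 1. trim the replicas held on the cache resource
$ iadmin rmchildfromresc lza archive                  #    detach the compound from its passthru parent
$ iadmin rmchildfromresc archive cacheRescS3          # 2. detach both children ...
$ iadmin rmchildfromresc archive archives3
$ iadmin rmresc archive                               #    ... and delete the compound resource
$ iadmin modresc archives3 context "S3_PROTO=HTTPS;...;HOST_MODE=cacheless_attached"   # 3. switch the S3 resource to cacheless
$ iadmin rmresc cacheRescS3                           # 4. remove the now-empty cache resource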




Carsten Grzemba

May 14, 2024, 4:00:25 PM
to irod...@googlegroups.com
Because I replaced the top compound resource with the child S3 resource, I had to update the resc_id in the r_data_main table for the data objects that already existed in the previous compound resource.

Kory Draughn

May 14, 2024, 4:42:54 PM
to irod...@googlegroups.com
Hmm. What do you mean by "replaced the top compound resource with the child S3 resource"?

You shouldn't need to touch the database directly unless something terrible has happened. Can you list the steps you took exactly?


Kory Draughn
Chief Technologist
iRODS Consortium

Carsten Grzemba

May 15, 2024, 2:29:54 AM
to irod...@googlegroups.com
OK, I made a mistake:
I renamed the passthrough resource, detached the S3 resource from the compound, and created a new S3 resource. Instead of creating a new one, I should have modified the existing resource.
$ iadmin modresc lza name lzaS3cached
$ iadmin rmchildfromresc archive archives3
WRONG: $ iadmin mkresc lza s3 ....
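Something along these lines would have reused the existing S3 resource instead (just a sketch; the context string is abbreviated):
$ iadmin modresc lza name lzaS3cached            # rename the old passthru out of the way
$ iadmin rmchildfromresc archive archives3       # detach the existing S3 resource
$ iadmin modresc archives3 name lza              # reuse it under the old top-level name
$ iadmin modresc lza context "S3_PROTO=HTTPS;...;HOST_MODE=cacheless_attached"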


Kory Draughn

May 15, 2024, 8:42:38 AM
to irod...@googlegroups.com
Got it.

Were there any replicas in the cache and archive resources?
Did you handle them according to what Justin outlined?

Are things working now or are you still tweaking things?

Kory Draughn
Chief Technologist
iRODS Consortium

Carsten Grzemba

May 15, 2024, 10:27:18 AM
to irod...@googlegroups.com
Unfortunately, delivery with irsync is not working; it ends with SYS_INTERNAL_ERR:
$ irsync -KVR lza  i:/culdaj/repl/thulbuser/1-2024051525107.pack_1.tar  i:/culdaef/federated/culdaj/aip/thulbuser/1-2024051525107/1-2024051525107.pack_1.tar
and creates a replica only on the S3 resource:
$ ils -L /culdaef/federated/culdaj/aip/thulbuser/1-2024051525118/1-2024051525118.pack_1.tar
  rods              0 lza            0 2024-05-15.12:16 X 1-2024051525118.pack_1.tar
        generic    /culda-ue-prod-bucket/federated/culdaj/aip/thulbuser/1-2024051525118/1-2024051525118.pack_1.tar

Carsten Grzemba

May 15, 2024, 11:37:55 AM
to irod...@googlegroups.com
iput stops after some time:
$ iput -KVR lza /data-working/repl/thulbuser/1-2024051525148.pack_1.tar /culdaef/federated/culdaj/aip/thulbuser/1-2024051525148/1-2024051525148.pack_1.tar
From server: NumThreads=16, addr:culda-ue.uni-erfurt.de, port:20091, cookie=1856856643
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 229326, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcvTranHeader: toread = 24, read = 0
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 4108834, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 433750, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 617310, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1733986, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 2013616, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1037600, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 260958, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1705382, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 3009488, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 969122, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1031206, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 360560, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 10.8.2.1, 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 453604, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 141.35.106.32 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 3504984, errno = 32 status = -27032 SYS_COPY_LEN_ERR, Broken pipe
remote addresses: 141.35.106.32 ERROR: putUtil: put error for /culdaef/federated/culdaj/aip/thulbuser/1-2024051525148/1-2024051525148.pack_1.tar, status = -27104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer

Carsten Grzemba

May 15, 2024, 3:06:11 PM
to iRODS-Chat
I copied the file to the target node with plain rsync and tried to put it to the S3 resource locally. It ends with a similar error message:

$ time irsync -KVR lza /data-s3cache/1-2024051525118.pack_1.tar i:/culdaef/federated/culdaj/aip/thulbuser/1-2024051525118/1-2024051525118.pack_1.tar
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1633683, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1040944, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1564808, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1633683, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1699166, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1027376, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1285916, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1633683, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1699166, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcvTranHeader: toread = 24, read = 0
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1289308, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1633683, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1633683, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer
remote addresses: 194.95.117.171 ERROR: rcPartialDataPut: toWrite 4194304, bytesWritten 1620115, errno = 104 status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer

remote addresses: 194.95.117.171 ERROR: rcvTranHeader: toread = 24, read = 0
remote addresses: 194.95.117.171 ERROR: rcvTranHeader: toread = 24, read = 0
 ERROR: rsyncUtil: rsync error for /culdaef/federated/culdaj/aip/thulbuser/1-2024051525118/1-2024051525118.pack_1.tar status = -27104 SYS_COPY_LEN_ERR, Connection reset by peer

real 2m11.860s
user 1m10.215s
sys 0m24.204s

Alan King

Jun 10, 2024, 12:05:43 PM
to irod...@googlegroups.com
Hi Carsten,

Have you tried uploading data to the S3 resource without calculating checksums (i.e. no -K)?

It may be helpful to share your S3 resource configuration (the context string) so we can better understand the situation. Please remember to redact any sensitive information such as hostnames and key-pair locations.

If you are able to build the S3 plugin and try it out in a test deployment, you may wish to try building with the commits in this pull request: https://github.com/irods/irods_resource_plugin_s3/pull/2200
Other users have reported issues regarding the number of threads used when transferring files to the S3 resource plugin, and this could be related to the situation you are seeing.
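For example, one of the earlier uploads repeated without -K would look like:
$ iput -VR lza /data-working/repl/thulbuser/1-2024051525148.pack_1.tar /culdaef/federated/culdaj/aip/thulbuser/1-2024051525148/1-2024051525148.pack_1.tar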



--
Alan King
Senior Software Developer | iRODS Consortium

Carsten Grzemba

Jun 17, 2024, 3:22:48 AM
to irod...@googlegroups.com
I compiled the S3 resource plugin for 4.3 and tested it. I still have problems with iget and irsync:
$ iget /culdail/federated/culdaj/aip/thulbuser/1-2024052725813/1-2024052725813.pack_1.tar .
remote addresses: x.x.x.x ERROR: rcPartialDataGet: toGet 41943040, bytesRead 24 status = -27000 SYS_COPY_LEN_ERR
remote addresses: x.x.x.x ERROR: rcPartialDataGet: toGet 41943040, bytesRead 24 status = -27000 SYS_COPY_LEN_ERR
remote addresses: x.x.x.x ERROR: rcPartialDataGet: toGet 41943040, bytesRead 24 status = -27000 SYS_COPY_LEN_ERR
remote addresses: x.x.x.x ERROR: rcPartialDataGet: toGet 41943040, bytesRead 24 status = -27000 SYS_COPY_LEN_ERR
remote addresses: x.x.x.x ERROR: getUtil: get error for ./1-2024052725813.pack_1.tar status = -27000 SYS_COPY_LEN_ERR

James, Justin Kyle

Jun 17, 2024, 11:34:32 AM
to irod...@googlegroups.com
I have reproduced this with irsync.  In both your logs and my test, I am seeing timeouts reading from the circular buffer.  That generally happens because the client (via iRODS core code) is not sending the number of bytes that I am expecting to receive.

I suspect this is again a case where iRODS is not breaking the file up as expected.  I will investigate further and get back to you.

I also see some timeouts writing to the circular buffer.  I'll concentrate on the reading timeouts first.
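As a diagnostic (not a fix), it may also be worth checking whether a single-threaded transfer behaves differently, since that bypasses the parallel-transfer path, e.g.:
$ iput -N 0 -VR lza /data-working/repl/thulbuser/1-2024051525148.pack_1.tar /culdaef/federated/culdaj/aip/thulbuser/1-2024051525148/1-2024051525148.pack_1.tar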
