s3 plugin settings


joris luijsterburg

Feb 17, 2023, 4:01:12 AM
to iRODS-Chat
All,

I am having some trouble when doing performance tests on my S3 resources. I run three iRODS servers (v4.3.0) on Ubuntu 20. One is the provider, and the two consumers each have a connection to a storage machine via S3 (each to a different one). I put these two resources behind a replication resource so all data is sent to two data centres.
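
For reference, a replication-over-S3 hierarchy like this is built roughly as follows; the hostnames, vault paths, auth file and the abbreviated context strings below are placeholders, not our real values:

    # replication parent that fans writes out to both data centres
    iadmin mkresc acctapearchive replication
    # two cacheless S3 children, one per data centre (context strings abbreviated)
    iadmin mkresc s3nlacc01 s3 consumer1.example.org:/bucket1/irods/Vault "S3_DEFAULT_HOSTNAME=s3.dc1.example.org;S3_AUTH_FILE=/etc/irods/s3auth.keypair;S3_PROTO=HTTPS;HOST_MODE=cacheless_attached"
    iadmin mkresc s3nlacc02 s3 consumer2.example.org:/bucket2/irods/Vault "S3_DEFAULT_HOSTNAME=s3.dc2.example.org;S3_AUTH_FILE=/etc/irods/s3auth.keypair;S3_PROTO=HTTPS;HOST_MODE=cacheless_attached"
    # attach the children to the replication node
    iadmin addchildtoresc acctapearchive s3nlacc01
    iadmin addchildtoresc acctapearchive s3nlacc02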

I fired up 15 Docker containers that each do iputs to the replication resource, with file sizes doubling after every 10 files: 1K, 2K, 4K, 8K, etc.
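
Each container essentially runs a loop along these lines (file names are illustrative):

    #!/bin/bash
    # upload 10 files at each size, doubling from 1K up to 16M, targeting the replication resource
    size_kb=1
    while [ "$size_kb" -le 16384 ]; do
        for i in $(seq 1 10); do
            f="speedtest_${size_kb}K_${i}"
            dd if=/dev/urandom of="$f" bs=1K count="$size_kb" 2>/dev/null
            iput -R acctapearchive "$f"
        done
        size_kb=$((size_kb * 2))
    done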

The issue is that after some time the test usually fails. Not all 15 Docker containers fail, but every time I run the test some of them do, usually when the containers are at 4/8/16 MB. It never fails on the smaller files.

The question is: what limit am I hitting here, exactly? And what settings can I tinker with to make sure it doesn't fail?

I tried increasing the circular buffer, but that didn't help much.
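
For reference, the circular buffer size is set through the S3 resource context, so changing it means re-setting the whole context string, along these lines (the value here is only an example, the rest of the context stays as before):

    # raise CIRCULAR_BUFFER_SIZE on each S3 child resource
    iadmin modresc s3nlacc01 context "S3_DEFAULT_HOSTNAME=our_S3_storage_server;S3_AUTH_FILE=some_auth_file;HOST_MODE=cacheless_attached;CIRCULAR_BUFFER_SIZE=8;S3_MPU_CHUNK=5"
    iadmin modresc s3nlacc02 context "S3_DEFAULT_HOSTNAME=our_S3_storage_server;S3_AUTH_FILE=some_auth_file;HOST_MODE=cacheless_attached;CIRCULAR_BUFFER_SIZE=8;S3_MPU_CHUNK=5"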

The errors for this failure look something like this:
 {"log_category":"s3_transport_logging_category","log_facility":"local0","log_level":"error","log_message":"/irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:222 [store_and_log_status] [[
140312517400320]]  libs3_types::status: [AbortedByCallback] - 47","request_api_name":"DATA_COPY_AN","request_api_number":611,"request_api_version":"d","request_client_user":"irods","request_host"
:"X.X.X.X","request_proxy_user":"irods","request_release_version":"rods4.3.0","server_host":"SERVER_Y","server_pid":3765741,"server_timestamp":"2023-02-17T08:45:19.774Z","server_type":"agent
"}
 {"log_category":"s3_transport_logging_category","log_facility":"local0","log_level":"error","log_message":"/irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:225 [store_and_log_status] [[
140312517400320]]  S3Host: LOCALS3machine.X.X","request_api_name":"DATA_COPY_AN","request_api_number":611,"request_api_version":"d","request_client_user":"irods","request_host":"X.X.X.X
01","request_proxy_user":"irods","request_release_version":"rods4.3.0","server_host":"SERVER_Y","server_pid":3765741,"server_timestamp":"2023-02-17T08:45:19.774Z","server_type":"agent"}
 {"log_category":"s3_transport_logging_category","log_facility":"local0","log_level":"error","log_message":"/irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:229 [store_and_log_status] [[
140312517400320]]  Function: s3_upload::callback_for_write_to_s3_base::on_response_completion","request_api_name":"DATA_COPY_AN","request_api_number":611,"request_api_version":"d","request_client
_user":"irods","request_host":"X.X.X.X","request_proxy_user":"irods","request_release_version":"rods4.3.0","server_host":"SERVER_Y","server_pid":3765741,"server_timestamp":"2023-02-17T08:45:
19.774Z","server_type":"agent"}
 {"log_category":"s3_transport_logging_category","log_facility":"local0","log_level":"error","log_message":"/irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:486 (close) [[14031251740
0320]] flush_cache_file returned error","request_api_name":"DATA_COPY_AN","request_api_number":611,"request_api_version":"d","request_client_user":"irods","request_host":"X.X.X.X","request_pr
oxy_user":"irods","request_release_version":"rods4.3.0","server_host":"SERVER_Y","server_pid":3765741,"server_timestamp":"2023-02-17T08:45:21.776Z","server_type":"agent"}
 {"log_category":"legacy","log_facility":"local0","log_level":"error","log_message":"[-]\t/irods_source/server/api/src/rsFileClose.cpp:120:int _rsFileClose(rsComm_t *, fileCloseInp_t *) :  status
 [S3_PUT_ERROR]  errno [] -- message [fileClose failed for [4]]\n\t[-]\t/irods_source/server/drivers/src/fileDriver.cpp:167:irods::error fileClose(rsComm_t *, irods::first_class_object_ptr) :  st
atus [S3_PUT_ERROR]  errno [] -- message [failed to call 'close']\n\t\t[-]\t/irods_source/plugins/resources/replication/src/librepl.cpp:765:irods::error repl_file_close(irods::plugin_context &) :
  status [S3_PUT_ERROR]  errno [] -- message [Failed while calling child operation.]\n\t\t\t[-]\t/irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:487:virtual bool irods::experimental
::io::s3_transport::s3_transport<char>::close(const irods::experimental::io::on_close_success *) [CharT = char] :  status [S3_PUT_ERROR]  errno [] -- message [flush_cache_file returned error]\n\n
","request_api_name":"DATA_COPY_AN","request_api_number":611,"request_api_version":"d","request_client_user":"irods","request_host":"X.X.X.X","request_proxy_user":"irods","request_release_ver
sion":"rods4.3.0","server_host":"SERVER_Y","server_pid":3765741,"server_timestamp":"2023-02-17T08:45:21.776Z","server_type":"agent"}

Kind regards,
Joris Luijsterburg
DevOps RDM, Wageningen University

James, Justin Kyle

Feb 17, 2023, 1:38:02 PM
to iRODS-Chat
Can you enable debug logging on the S3 plugin by adding the "s3_resource_plugin": "debug" entry shown below to the log_level section of the /etc/irods/server_config.json file on each iRODS server, re-run the test, and send me the logs when you see the failure?

    "log_level": {
        "agent": "info",
        "agent_factory": "info",
        "api": "info",
        "authentication": "info",
        "database": "info",
        "delay_server": "info",
        "legacy": "info",
        "microservice": "info",
        "network": "info",
        "resource": "info",
        "rule_engine": "info",
        "server": "info",
        "s3_resource_plugin": "debug"
    },

In addition, can you send the resource configurations using ilsresc -l <resc name>?

I am curious why it is using a cache file.  That might be because the clients are connected to a different server than the one attached to the S3 resource, as described here:


That shouldn't cause a failure, though.  If it is possible, you could set the host_mode to cacheless_detached so that the redirect doesn't happen.
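
If you want to try that, something along these lines should do it (the resource name and the elided parts of the context string are placeholders; keep the rest of your context as it is):

    # detached mode lets any iRODS server talk to the S3 backend directly,
    # so the connection is not redirected to the host attached to the resource
    iadmin modresc <resc name> context "S3_DEFAULT_HOSTNAME=...;S3_AUTH_FILE=...;HOST_MODE=cacheless_detached;..."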



joris luijsterburg

Feb 20, 2023, 10:31:16 AM
to iRODS-Chat
Some extra info on the matter:

- When running directly on the S3 resource, no errors occur. It is somehow related to the replication resource.
- After your post I noticed the retries. I added a retry on both the S3 resource and the replication resource. This manages the issue for the client: after enough retries it no longer fails, but the errors still appear in the iRODS logging.
- I briefly tried detached mode, but that gave different errors. I haven't had time to look at that yet.

My resource contexts now look like this:
s3 resource 1 & 2:
S3_DEFAULT_HOSTNAME=our_S3_storage_server;S3_AUTH_FILE=some_auth_file;S3_REGIONNAME=some_region;S3_RETRY_COUNT=2;S3_WAIT_TIME_SECONDS=3;S3_PROTO=HTTPS;ARCHIVE_NAMING_POLICY=consistent;HOST_MODE=cacheless_attached;S3_ENABLE_COPYOBJECT=0;CIRCULAR_BUFFER_SIZE=4;S3_MPU_CHUNK=5

replication resource:
retry_attempts=5;first_retry_delay_in_seconds=1;backoff_multiplier=2.0
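
(The retry context on the replication parent was set with iadmin modresc, roughly like this:)

    # retry settings live in the plugin context of the replication parent
    iadmin modresc acctapearchive context "retry_attempts=5;first_retry_delay_in_seconds=1;backoff_multiplier=2.0"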

ilsresc:
acctapearchive:replication
├── s3nlacc01:s3
└── s3nlacc02:s3

Some logs that I found:
On the provider host (some info removed):
   log_message: [replicate_data_object:778] - failed to replicate [/RDMacc/home/irods/speedtest2/speed1test5/1676904742/4742N3_4M_4]
   log_message: [replicate_data:674] - error finalizing replica; [SYS_COPY_LEN_ERR: [finalize_destination_replica:208] - failed to get size in vault [error_code=[-27000], path=[/RDMacc/home/irods/speedtest2/speed1test5/1676904742/4742N3_4M_4], hierarchy=[acctapearchive;s3nlacc02]]
   log_message: [get_size_in_vault:169] - size in vault does not equal target size[target size=[4194304], size in vault=[2097152], path=[/RDMacc/home/irods/speedtest2/speed1test5/1676904742/4742N3_4M_4], hierarchy=[acctapearchive;s3nlacc02]]

Some logs on the resource host:
   log_message: [-]    /irods_source/server/api/src/rsFileClose.cpp:120:int _rsFileClose(rsComm_t *, fileCloseInp_t *) :  status [S3_PUT_ERROR]  errno [] -- message [fileClose failed for [4]]
    [-]    /irods_source/server/drivers/src/fileDriver.cpp:167:irods::error fileClose(rsComm_t *, irods::first_class_object_ptr) :  status [S3_PUT_ERROR]  errno [] -- message [failed to call 'close']
        [-]    /irods_source/plugins/resources/replication/src/librepl.cpp:765:irods::error repl_file_close(irods::plugin_context &) :  status [S3_PUT_ERROR]  errno [] -- message [Failed while calling child operation.]
            [-]    /irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:487:virtual bool irods::experimental::io::s3_transport::s3_transport<char>::close(const irods::experimental::io::on_close_success *) [CharT = char] :  status [S3_PUT_ERROR]  errno [] -- message [flush_cache_file returned error]
   log_message: /irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:486 (close) [[140710225368832]] flush_cache_file returned error
   log_message: /irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:2040 (s3_upload_part_worker_routine) [[140710324528896]] S3_upload_part returned error [status=AbortedByCallback][attempt=1][retry_count_limit=2].  Sleeping between 1 and 3 seconds
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:229 [store_and_log_status] [[140710324528896]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:225 [store_and_log_status] [[140710324528896]]  S3Host: our_S3_storage_server
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:222 [store_and_log_status] [[140710324528896]]  libs3_types::status: [AbortedByCallback] - 47
   log_message: /irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:2040 (s3_upload_part_worker_routine) [[140710208583424]] S3_upload_part returned error [status=AbortedByCallback][attempt=1][retry_count_limit=2].  Sleeping between 1 and 3 seconds
   log_message: /irods_plugin_source/s3/s3_transport/include/s3_transport.hpp:2040 (s3_upload_part_worker_routine) [[140710233761536]] S3_upload_part returned error [status=AbortedByCallback][attempt=1][retry_count_limit=2].  Sleeping between 1 and 3 seconds
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:229 [store_and_log_status] [[140710208583424]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:225 [store_and_log_status] [[140710208583424]]  S3Host: our_S3_storage_server
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:229 [store_and_log_status] [[140710233761536]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:225 [store_and_log_status] [[140710233761536]]  S3Host: our_S3_storage_server
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:222 [store_and_log_status] [[140710233761536]]  libs3_types::status: [AbortedByCallback] - 47
   log_message: /irods_plugin_source/s3/s3_transport/src/s3_transport.cpp:222 [store_and_log_status] [[140710208583424]]  libs3_types::status: [AbortedByCallback] - 47
   log_message: Agent process [811885] exited with status [1]
   log_message: [-]    /irods_source/server/core/src/rodsAgent.cpp:588:int runIrodsAgentFactory(sockaddr_un) :  status [SSL_SHUTDOWN_ERROR]  errno [] -- message [failed to call 'agent stop']
    [-]    /irods_source/lib/core/src/sockComm.cpp:160:irods::error sockAgentStop(irods::network_object_ptr) :  status [SSL_SHUTDOWN_ERROR]  errno [] -- message [failed to call 'agent stop']
        [-]    /irods_source/plugins/network/src/ssl.cpp:850:irods::error ssl_agent_stop(irods::plugin_context &) :  status [SSL_SHUTDOWN_ERROR]  errno [] -- message [error completing shutdown of SSL connection]

Best regards,
Joris

James, Justin Kyle

Feb 20, 2023, 2:01:53 PM
to iRODS-Chat
I think there is some good information here.  I will try to reproduce it on my end and write up an issue for it.

Thanks.



joris luijsterburg

Mar 27, 2023, 3:42:04 AM
to iRODS-Chat
Were you able to reproduce the bug and/or do you need more info?

James, Justin Kyle

Mar 27, 2023, 12:38:04 PM
to iRODS-Chat
I have reproduced it, or something very close to it.  I am troubleshooting the root cause now.


James, Justin Kyle

Mar 27, 2023, 1:16:56 PM
to iRODS-Chat
I don't know for certain that this is your issue, but here is the issue I wrote up.

I have two S3 resources under a replication node. When I write an 8M file to the replication node, I get the following calls to the S3 write operation: Mar 27 16:46:20 pid:5459 NOTICE: /github/irod...



joris luijsterburg

Mar 28, 2023, 6:24:58 AM
to iRODS-Chat
Thanks!

That is probably the one, indeed. It also describes the underlying technical implementation, so I might not have recognised it as ours at first.

Regards,
Joris