python put timeouts


joris luijsterburg

May 13, 2024, 8:44:52 AM
to iRODS-Chat
All,

We are encountering some strange behaviour. We are uploading files and encounter timeouts. We do not encounter these in all circumstances, but in some we do. We have tried various variations to find out where exactly things go wrong, but could use some extra ideas.

The situation is as follows:
We have an iRODS 4.3.0 server running on Ubuntu 20.04, and a 1 TB file on a different server that we send to iRODS via a Python script (python-irodsclient v2.0.0). We get timeouts on the traffic, and the file does not end up completely in iRODS. We see that multiple threads are used to send the file, sometimes 2, sometimes 3. If we send the file via the iCommands it does not give any errors and arrives safely.
If we look at the Linux filesystem of the receiving system, we see that sometimes half of the file arrives and the other half doesn't (hinting that some of the parallel threads go wrong?).
The Python script usually runs for a long time before it fails; we have seen errors after 1-5 hours, depending on the file size. We sometimes see errors in the iRODS server log, but the timing of those errors is usually right after starting the transfer, not when the transfer ends according to the Python script.



Variations we have tried:
- Sending from a Windows and an Ubuntu server; both get errors
- Different python-irodsclient versions (2.0.0, 2.0.1, 1.1.6)
- We tested the network ports separately; they are open, both 1247 and the 20000 range
- I briefly checked keepalive settings, but if this were the reason I'd expect the file transfer to always stop at the same moment
- PAM and SSL is our standard setup; we also tried iRODS native authentication with CS_NEG_REFUSE, which still gives a timeout.
- We tried disabling parallel transfer, but haven't yet succeeded in doing so via Python. We used:
    options = {'regChksum': True, 'verifyChksum': True, 'ChksumAll': True, 'numThreads': 1}
    ic = irods.session.iRODSSession(password = password, **ienv)
    ic.data_objects.put(local_path, irods_path, **options)
with result:
    [2024-05-13 14:24:18,313] {connection.py:538} INFO - PAM authorization validated
    [2024-05-13 14:24:18,345] {parallel.py:343} INFO - num_threads = 2 ; bytes_per_thread = 766643717120



Specific errors encountered:
Python-irods client:
Exception has occurred: NetworkException X
Could not receive server response
ValueError: non-zero flags not allowed in calls to recv() on <class 'ssl.SSLSocket'>
 
During handling of the above exception, another exception occurred:
 
TimeoutError: the read operation timed out
 
During handling of the above exception, another exception occurred:  
File "xxx\main.py", line 70, in <module>    
upload(session, local_path, irods_path, overwrite=True, options=options)
irods.exception.NetworkException: Could not receive server response


irods-serverlog:

we get this error, but not always: {"log_category":"legacy","log_facility":"local0","log_level":"error","log_message":"[-]\t/irods_source/server/core/src/rsApiHandler.cpp:560:int readAndProcClientMsg(rsComm_t *, int) :  status [Unknown iRODS error]  errno [No buffer space available] -- message [failed to call 'read header']\n\t[-]\t/irods_source/lib/core/src/sockComm.cpp:198:irods::error readMsgHeader(irods::network_object_ptr, msgHeader_t *, struct timeval *) :  status [Unknown iRODS error]  errno [No buffer space available] -- message [failed to call 'read header']\n\t\t[-]\t/irods_source/plugins/network/src/ssl.cpp:528:irods::error ssl_read_msg_header(irods::plugin_context &, void *, struct timeval *) :  status [Unknown iRODS error]  errno [No buffer space available] -- message [read -1 expected 4]\n\n","request_api_name":"AUTH_RESPONSE_AN","request_api_number":704,"request_api_version":"d","request_client_user":"XXX","request_host":"XXX","request_proxy_user":"XXX","request_release_version":"rods4.3.2","server_host":"XXX","server_pid":366722,"server_timestamp":"2024-05-13T12:13:58.475Z","server_type":"agent"}

And we see a thread being killed: {"log_category":"legacy","log_facility":"local0","log_level":"error","log_message":"[-]\t/irods_source/server/core/src/rodsAgent.cpp:588:int runIrodsAgentFactory(sockaddr_un) :  status [SSL_SHUTDOWN_ERROR]  errno [] -- message [failed to call 'agent stop']\n\t[-]\t/irods_source/lib/core/src/sockComm.cpp:160:irods::error sockAgentStop(irods::network_object_ptr) :  status [SSL_SHUTDOWN_ERROR]  errno [] -- message [failed to call 'agent stop']\n\t\t[-]\t/irods_source/plugins/network/src/ssl.cpp:850:irods::error ssl_agent_stop(irods::plugin_context &) :  status [SSL_SHUTDOWN_ERROR]  errno [] -- message [error completing shutdown of SSL connection]\n\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"XXX","request_host":"XXX","request_proxy_user":"XXX","request_release_version":"rods4.3.2","server_host":"XXX","server_pid":373095,"server_timestamp":"2024-05-13T12:24:05.439Z","server_type":"agent"}

kind regards,
Joris

jc...@sanger.ac.uk

May 13, 2024, 9:14:18 AM
to iRODS-Chat
Hi Joris,

Have you seen the troubleshooting section about the TCP keepalives? 

Might be worth ruling that out if you haven't already.

John

joris luijsterburg

May 14, 2024, 6:47:54 AM
to iRODS-Chat
Hey John,

I think I can rule that one out, since I see errors occurring in the logs straight after starting the transfer. The weird thing is that the errors only appear after an hour or a couple of hours on the sending side, though.

jc...@sanger.ac.uk

May 14, 2024, 8:09:00 AM
to iRODS-Chat
Something that isn't obvious from the documentation (at least, that I can see in 4.3.0) is that there are two parts to an iRODS connection - the control connection and the data connection. This only applies when using parallel transfer or more than one thread, so it might be interesting to explore if you see the behaviour with only one thread.

I'm hoping the following description is still valid, and if not, more knowledgeable folks will be along shortly to correct me, here goes...

The control connection is established at the start of the process and goes something like

client -> Provider -> Consumer

Whereas once the control connection has established where to get the replica from the data connections go

client -> Consumer

so the client then has N+1 connections:

client -> Provider -> Consumer (x1)
client -> Consumer (x N threads)

Without the TCP keepalives being aligned, the control connection is then idle until a replica is written and it needs to update the catalog. 
If the connection has timed out in the meantime, however, the port will have been dropped at the Provider, and when the client goes to send the data size, checksum etc. of the replica to the Provider, it finds the socket closed, you get messages like the ones you mentioned, and the whole upload fails.
Depending on the size of the file and the capacity of the networking between the client and the Consumer, an upload - especially for the kind of large files we see in genomics - can easily take longer than the default 20 minutes most OSes have (or had on 18.04; I haven't checked 22.04 and similar). However, some uploads can succeed, again depending on network congestion and the like, which can make it seem random.
We had this behaviour when we wrote replicas to two sets of servers connected by a WAN link, and when the link was congested, we got these errors. Fixing the keep-alives fixed the issue for us.
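In case it's useful, the knobs involved on the Linux side are the standard kernel keepalive settings. The values below are only an example of "shorter than whatever idle timeout sits between client and server", not a recommendation - the iRODS troubleshooting docs cover the details:

    # example sysctl values (e.g. in /etc/sysctl.d/); tune to be shorter than any firewall/NAT idle timeout
    net.ipv4.tcp_keepalive_time = 300     # idle seconds before the first keepalive probe is sent
    net.ipv4.tcp_keepalive_intvl = 75     # seconds between subsequent probes
    net.ipv4.tcp_keepalive_probes = 9     # failed probes before the connection is considered dead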

I can't explain why the errors show up immediately with this, but that might explain why others only show up later.

It might be worth using tcpdump or similar to see what's happening at the server end(s) when you try the upload, in case firewalls or routing mean the connections aren't coming in on the port/path/IP you expect.

Hope that's of some help?

John

dmoore.renci

May 14, 2024, 11:01:14 AM
to iRODS-Chat
hello Joris!

  Could you send an excerpt showing where you create the iRODSSession?  
  Also, are you
    - creating your own SSL context, or relying on the python-irodsclient to create it internally?
    - using self-signed certs in this environment?
thanks - Daniel

Alan King

May 14, 2024, 11:28:35 AM
to irod...@googlegroups.com
Hi,

What John has described - accurately, I think - is the legacy parallel transfer which uses the "high ports". This is only applicable to some clients, most notably the iCommands iput and iget.

PRC uses a different mechanism for parallel transfer which we call "multi-1247" parallel transfer. This establishes a number of independent connections (must be the same server) on the iRODS port (rather than the "high ports") and it performs concurrent, coordinated writes to transfer the data.

Two things stand out to me in the errors that you shared:

1. From the server logs: "status [Unknown iRODS error]  errno [No buffer space available] -- message [read -1 expected 4]"

This indicates that something is leaking network buffers (according to The Internet) in the server. Maybe that thing is iRODS, but maybe it's something else running in the server... Are you able to tell whether the agent printing this log message is the one servicing the PRC put request?

2. ValueError: non-zero flags not allowed in calls to recv() on <class 'ssl.SSLSocket'>

We are investigating an issue very similar to this raised by another user as it relates to ingest, so it's interesting to see it appear here apart from ingest. If you can share your python script as Dan requested, we may be able to find a pattern. However, you stated that you already tried disabling SSL and still see timeouts, so that is probably not the issue. I think the real issue is whatever is causing that "No buffer space available".

I wanted to make one final comment about adjusting the number of threads used with the put. DataObjectManager.put() has a num_threads argument which you must set to 1 to get the desired streaming behavior. The numThreads keyword being set in your example code will only set the numThreads for the API call, but PRC makes the determination about whether to use the "multi-1247" parallel transfer before it makes the API call. So try setting num_threads=1 in your put() call and see if that helps (although, I imagine transferring a 1TB file will take an absurdly long time with 1 thread!).
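For example, keeping your existing options dict but adding the keyword (illustrative only, reusing the names from your snippet):

    # num_threads=1 (the keyword argument, not the numThreads option) is what
    # makes PRC skip the "multi-1247" parallel transfer and stream single-threaded
    options = {'regChksum': True, 'verifyChksum': True, 'ChksumAll': True}
    ic.data_objects.put(local_path, irods_path, num_threads=1, **options)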

Alan



--
Alan King
Senior Software Developer | iRODS Consortium

joris luijsterburg

May 16, 2024, 3:06:52 AM
to iRODS-Chat
Hey all,

A lot to take in here!
- We have a single-server setup, only a provider, with a unixfilesystem resource (and a passthru) on the provider. I didn't know that Python does everything via port 1247 (are the high ports thus considered legacy and in line to be removed in future versions?).
- Python code at the bottom.
- We use externally signed certificates. I think the root of that authority is known; at least I have seen no problems with this root before.
- It looks like the SSL context is not created explicitly here; see the code below.
- No update on the buffer space yet; I am gathering some help to dig deeper there.
- With the correct num_threads option we tried again, with interesting results. The Python application got the same error, but nothing was found in the rodsLog. When the error appeared in Python I could still see the iRODS process running on the iRODS server that was likely serving the Python script (based on start time; I don't know how to relate the two directly). Apparently the person who started the process had put a breakpoint in VS Code, so the Python process was also still running in the background. The weird thing is that the next day, when I checked the file again, the status had changed from '?' to '&' and it had a checksum calculated. The number of bytes was correct, so I assume the transfer was a success. So somehow the user gets an error, but the transfer still continues and finishes! The exact timing of the error seen in Python is unknown (some logging was disabled to exclude it as a cause), but it was between 15 and 210 minutes after starting. (Our keepalive settings should keep lines open for up to at least 4 hours.)

Best regards,
Joris

import json
from pathlib import Path
from os import path
from getpass import getpass
import irods.session

if __name__ == "__main__":
    # Environment file
    env_file = Path(path.expanduser("~")).joinpath(".irods", "irods_environment.json")
    if not path.exists(env_file):
        print('Environment file not found')
        exit()

    # Authenticate
    with open(env_file, "r") as f:
        ienv = json.load(f)

    password = getpass("Your iRODS password")

    ic = irods.session.iRODSSession(password = password, **ienv)
    _ = ic.server_version

    # Upload
    local_path = "xx\G5_20211204_col_only.tar"
    irods_path = f'/{ic.zone}/home/{ic.username}/G5_20211204_col_only.tar'

    options = {'regChksum': True, 'verifyChksum': True, 'ChksumAll': True, 'numThreads': 1}
    if not ic.data_objects.exists(irods_path):
        print(f"Uploading {local_path} to {irods_path}")
        ic.data_objects.put(local_path, irods_path, **options, num_threads=1)

Smeele, A.P.M. (Ton)

May 21, 2024, 2:26:57 AM
to irod...@googlegroups.com
Hi Joris,

From what you mention it looks as if the server-side process continued to finish the job after the connection with the client had been lost. We have seen similar cases in the past, in particular when the client request is at a stage where the server does not require any further input from the client to complete the API request (e.g. when all data has been transmitted). For instance, after transmission of a larger data file the server process might need to calculate a checksum from the stored file, or synchronous replication might need to take place, before the 'complete' signal can be sent back to the client.

In the past, we have found this issue to be related to 1) an intermediate firewall policy that closes a seemingly inactive TCP connection, or 2) the client program having terminated (e.g. forcefully, by an impatient user).

Ton 



joris luijsterburg

Jun 11, 2024, 6:26:56 AM
to iRODS-Chat
All,

A small update on this issue. We spoke at the troubleshooting session at the UGM, and one of the possible measures was upgrading to 4.3.1. We did tests at the UGM and it seemed to work: we sent files of 1, 2, 5, 10, 20 and 50 GB without issues. We proceeded to install 4.3.1 on the relevant servers. However, the actual files were much larger, and then the problem sort of reappeared. When sending 100 GB files we encountered timeout issues again. There are some slight changes in behaviour, though:
- The files that we tested, up to 50 GB, seemed to consistently work; in 4.3.0 things would already go awry there.
- We now consistently see that the complete file gets uploaded, and then the client gets a 'ValueError: non-zero flags not allowed in calls to recv() on <class 'ssl.SSLSocket'>' error.
- The server side continues working after this, and in the end the file is successfully transferred and has a checksum calculated.
- We also tested a 100 GB and a 250 GB file without asking iRODS to calculate a checksum; this does not give errors.
- We didn't find an exact cutoff between 50 and 100 GB where everything goes wrong (99 GB also failed); we also didn't try very hard to find one, as that takes a lot of testing.


It is clear that the problem now seems to be in the last stage of the transfer, where something with the connection fails, probably during checksum calculation. I then did another test, sending 250 GB with checksum calculation, where I tried to get an exact timeline:
10:55:05 - transfer start
11:12:38 - transfer complete, start calculating checksum (inferred from the last access time of the file in POSIX)
11:18:38 - error in Python code
11:31:?? - checksum calculated, file has status '&' (inferred from the timestamp of the file in ils)

The difference is exactly 360 seconds. Now, I can imagine that calculating a checksum for a 250 GB file takes more than 6 minutes, so that seems reasonable. The 360 seconds is also not a random number, because it is a setting in the Python code:

    ic.connection_timeout = 360

This makes it seem like the connection is not kept alive during the checksum calculation phase. The server-side keepalive settings should kick in after 250 seconds, so they don't help here, though.

After this I tried another test: I set connection_timeout to 10 seconds and transferred a 60 GB file.
This had the same timeline:

11:27:53 - start transfer
11:32:33 - transfer complete, start calculating checksum
11:32:43 - error in PRC
11:36:?? - checksum done


There might be a few possible workarounds:
- Give a really high connection_timeout in the PRC (see the sketch after this list)
- Don't calculate checksums
- Can the PRC do something to maintain the connection during the checksum phase?
- Other ideas? We also tried setting "refresh_time": 300 in irods_environment.json, which is then lower than the connection_timeout = 360 in our Python code, but that hasn't worked yet.
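For illustration, the first workaround would just be a larger version of the timeout setting shown above; something like this (the value is only an example and would need to comfortably exceed the checksum time for the largest files):

    # assumption: the timeout only needs to outlast the server-side checksum phase
    ic = irods.session.iRODSSession(password = password, **ienv)
    ic.connection_timeout = 4 * 60 * 60   # seconds; 4 hours as an illustration, not a recommendation
    ic.data_objects.put(local_path, irods_path, num_threads=1, **options)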

Regards,

Joris

Terrell Russell

Jun 11, 2024, 7:45:00 AM
to irod...@googlegroups.com
Good detective work.

> - Can the PRC do something to maintain the connection during checksum-phase?

I think this is probably the most reasonable approach.

Terrell

joris luijsterburg

Jun 12, 2024, 4:04:05 AM
to iRODS-Chat
Thnx!

I made an issue for it in the PRC repository. This does leave me with an additional question about the checksum calculation, though: does iRODS also parallelize this?

Best regards,
Joris

Terrell Russell

Jun 12, 2024, 7:13:45 AM
to irod...@googlegroups.com
The checksum calculation itself is singularly not parallel.  The default hashing algorithm of sha256 is used and is not parallelizable for a single file.  We have discussed parallel hashing functions, but those are not currently implemented.

In general, everything in iRODS is synchronous because of the transactional nature of the database backing the catalog and the call/response nature of the network calls that are being made.

However, the delay queue is operated on concurrently by the advanced setting `number_of_concurrent_delay_rule_executors` (default 4).  So, if a checksum calculation for a particular data object is put onto the delay queue, then it will be handled in parallel along with other delayed rules.  This could parallelize the work of many files, however, at the cost of a time window where your data objects may be incorrect and already at rest in iRODS, available for other tooling to see/use them.  Determining whether this is okay is a matter of policy for your organization and/or particular workflow.
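For reference, that advanced setting lives in server_config.json; a minimal sketch of the relevant section (not a complete config), with the default value:

    "advanced_settings": {
        ...
        "number_of_concurrent_delay_rule_executors": 4,
        ...
    }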

Terrell



Terrell Russell

Jun 12, 2024, 7:17:19 AM
to irod...@googlegroups.com
For future readers... the new issue from Joris is at https://github.com/irods/python-irodsclient/issues/564

Terrell
