irods error

kxk...@gmail.com

Oct 27, 2020, 3:05:45 PM
to iRODS-Chat
Hello,

Our Galaxy server uses iRODS for storage in our Test environment. The Galaxy and iRODS servers are on different hosts. Occasionally, I get this NetworkException in the Galaxy server log (I have not been able to reproduce the problem, BTW):

timeout: timed out
 File "irods/connection.py", line 91, in recv msg = iRODSMessage.recv(self.socket)
 File "irods/message/__init__.py", line 74, in recv rsp_header_size = _recv_message_in_len(sock, 4)
 File "irods/message/__init__.py", line 27, in _recv_message_in_len buf = sock.recv(size_left, socket.MSG_WAITALL)

NetworkException: Could not receive server response: timed out
 File "galaxy/objectstore/irods.py", line 314, in _data_object_exists self.session.data_objects.get(data_object_path)
 File "irods/manager/data_object_manager.py", line 45, in get parent = self.sess.collections.get(irods_dirname(path))
 File "irods/manager/collection_manager.py", line 17, in get result = query.one()
 File "irods/query.py", line 220, in one results = self.execute()
 File "irods/query.py", line 174, in execute result_message = conn.recv()
 File "irods/connection.py", line 95, in recv raise NetworkException("Could not receive server response: " + str(e))


Looking at the iRODS log, I see the following:

Oct 26 11:41:30 pid:7677 remote addresses: 129.114.58.190 ERROR: [-]    /irods/server/core/src/rsApiHandler.cpp:540:int readAndProcClientMsg(rsComm_t *, int) :  status [SYS_SOCK_READ_ERR]  errno [Connection timed out] -- message [failed to call 'read header']
        [-]     /irods/lib/core/src/sockComm.cpp:201:irods::error readMsgHeader(irods::network_object_ptr, msgHeader_t *, struct timeval *) :  status [SYS_SOCK_READ_ERR]  errno [Connection timed out] -- message [failed to call 'read header']
                [-]     /irods/plugins/network/tcp/libtcp.cpp:190:irods::error tcp_read_msg_header(irods::plugin_context &, void *, struct timeval *) :  status [SYS_SOCK_READ_ERR]  errno [Connection timed out] -- message [error reading from socket after [0] bytes read]
                        [-]     /irods/plugins/network/tcp/libtcp.cpp:71:irods::error tcp_socket_read(int, void *, int, int &, struct timeval *) :  status [SYS_SOCK_READ_ERR]  errno [Connection timed out] -- message [error reading from socket after [0] bytes read]

Oct 26 11:41:30 pid:7677  ERROR: Agent [7677] exiting with status = -116110
Oct 26 11:41:30 pid:24472  ERROR: Agent process [7677] exited with status [114]

I ran ierror on -116110 and it's SYS_SOCK_READ_ERR. Any idea why this happens on the server side? I see the Agent ERROR prior to this exception being thrown. Not sure if it's a cause of this exception or a result.

Any help is greatly appreciated.
-Kaivan
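
A note on the numbers in the log above, as a rough sketch rather than anything authoritative: the agent's exit status [114] appears to be just the low byte of -116110, which is all a POSIX parent process sees of a child's exit code. The snippet also assumes the usual iRODS convention that the base error code is a multiple of 1000 and the last three digits carry the OS errno (110 is ETIMEDOUT on Linux, matching "Connection timed out"). The helper name split_irods_code is made up for illustration.

```python
import errno


def split_irods_code(code):
    """Split an iRODS-style error code into (base_code, os_errno).

    Assumption: the base code is a multiple of 1000 and the last
    three digits carry the OS errno.
    """
    magnitude = abs(code)
    os_errno = magnitude % 1000
    base = -(magnitude - os_errno)
    return base, os_errno


base, os_errno = split_irods_code(-116110)
print(base, os_errno)  # -116000 110

# Platform-dependent lookup; on Linux, errno 110 is ETIMEDOUT.
print(errno.errorcode.get(os_errno, "unknown"))

# A POSIX parent only sees the low 8 bits of a child's exit code.
exit_status = -116110 & 0xFF
print(exit_status)  # 114
```

So the two Agent ERROR lines report the same failure, once as the full code and once truncated to a byte.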




Terrell Russell

Oct 27, 2020, 3:19:31 PM
to irod...@googlegroups.com
Kaivan,

This is with python-irodsclient v0.8.4, which includes your stale connection PR?

Can you add some logging to determine how long connections have been open on the client/Galaxy side? It appears the iRODS server thought nobody was still on the other end.

Terrell

 


xk302a

Oct 27, 2020, 3:49:35 PM
to irod...@googlegroups.com
Yes, this has the stale connection logic and I still get the error message :(
I'll log connection duration. This requires me to add a create_time to the Connection class. I'll do this in my fork (if needed, I can create a PR to merge it to master later).
Thanks Terrell,
-Kaivan
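
For readers following along, here is a minimal sketch of the kind of logging discussed above: stamp a create_time when a connection is opened and log its age at points of interest. This is not the python-irodsclient Connection class; TimedConnection, log_connection_age, and the logger name are hypothetical and only illustrate the idea.

```python
import logging
import time

logger = logging.getLogger("irods.connection_age")  # hypothetical logger name


class TimedConnection:
    """Hypothetical wrapper that remembers when a connection was opened."""

    def __init__(self, raw_connection, clock=time.monotonic):
        self.raw = raw_connection
        self._clock = clock
        self.create_time = clock()  # stamped once, at creation

    def age(self):
        """Seconds elapsed since this connection was created."""
        return self._clock() - self.create_time


def log_connection_age(conn, event):
    """Log how long the connection has been open when `event` happens."""
    age = conn.age()
    logger.info("connection %s: open for %.1f s", event, age)
    return age
```

Calling log_connection_age(conn, "timeout") from the handler that catches the NetworkException would show whether the failures cluster around long-lived connections.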

dmoore.renci

Oct 29, 2020, 6:05:09 PM
to iRODS-Chat
Kaivan,
I know the existing anti-staleness connection logic records the time a connection is returned to the pool as its last-use time. Depending on the use pattern (are sessions ever held on to for a significant amount of time before cleanup?), a fixed-interval threshold for assuming staleness may need to be larger to accommodate the increased error margin. Does a lower threshold cause this error to happen more often?
Dan

kxk...@gmail.com

Oct 29, 2020, 8:37:21 PM
to iRODS-Chat
Thanks Dan,

I found one problem at my end. I was not setting the connection_refresh_time config parameter, so it defaulted to the no-refresh behavior. I fixed that and another issue (old connections had to be dropped even if they were used recently), and deployed the revised code. I will check the server logs tomorrow to see whether the error is still there, and will report back :)

kxk...@gmail.com

Oct 30, 2020, 12:09:26 PM
to iRODS-Chat

So, I have not seen a timeout for 24 hours, which is very encouraging. I will check the log and if there are no timeouts over the weekend, I will then declare victory.

Terrell, I will be creating a new PR, as I've changed the logic to re-create connections based on their age (time since creation) rather than on how long they have gone unused.

Best,
-Kaivan
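
The change described above (refresh by age, not by idle time) can be sketched roughly as follows. This is a toy pool written for illustration, not the code in the fork or in python-irodsclient. Only the connection_refresh_time name comes from the thread; SketchPool, its methods, and the factory argument are assumptions.

```python
import time


class SketchPool:
    """Toy pool that re-creates connections older than a refresh time."""

    def __init__(self, factory, connection_refresh_time, clock=time.monotonic):
        self._factory = factory                  # callable returning a new connection
        self._refresh = connection_refresh_time  # maximum age in seconds
        self._clock = clock
        self._idle = []                          # list of (connection, create_time)

    def get(self):
        """Return (connection, create_time), replacing connections that are too old."""
        while self._idle:
            conn, created = self._idle.pop()
            if self._clock() - created < self._refresh:
                return conn, created
            # Too old: drop it even if it was used a moment ago.
            close = getattr(conn, "close", None)
            if close:
                close()
        return self._factory(), self._clock()

    def release(self, conn, created):
        """Return a connection to the pool, keeping its original create time."""
        self._idle.append((conn, created))
```

The point of keeping the original create time on release is that frequent use never extends a connection's life: once it is older than connection_refresh_time, the next get() replaces it.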

Terrell Russell

Oct 30, 2020, 12:12:26 PM
to irod...@googlegroups.com
Tentatively... excellent.

Terrell


kxk...@gmail.com

Nov 2, 2020, 11:56:56 AM
to iRODS-Chat

No errors over the weekend. I think re-creating old connections will solve this issue. Will create a PR soon. Thanks.

Terrell Russell

Nov 2, 2020, 12:13:50 PM
to irod...@googlegroups.com
Gotcha - thanks.

Terrell

