iRODS resource server stops responding


d hu

Mar 19, 2018, 10:56:43 AM
to iRODS-Chat
Hi,

We recently upgraded to 4.2.2 and noticed a problem with resource servers.

The resource server stops responding on ports 1247/1248. On the resource server, no daemon is listening on port 1247,
so the original "/usr/sbin/irodsServer" process appears to have crashed.
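
For anyone wanting to check the same symptom from another machine, here is a minimal sketch of a TCP probe. It only tests whether something accepts connections on a port; it does not speak the iRODS protocol. The function name `port_is_listening` is made up for this example.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    This is a plain connect test, not an iRODS handshake. A refused
    connection, a timeout, or a name resolution failure all count as
    "not listening".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_listening("asa-data-7", 1247)` would be expected to return False on a node in the state described above, assuming the host name resolves from where the script runs.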

The following is some log output; I am not sure whether it is related to this issue. Is this a known issue, and is there a fix or workaround?

Thanks,
Dong


==
Mar 18 05:43:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 05:48:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 05:53:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 05:58:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:03:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:08:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:13:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:18:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:22:15 pid:2292 remote addresses: 172.17.18.239 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:22:15 pid:2292 remote addresses: 172.17.18.239 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:22:16 pid:2292 remote addresses: 172.17.18.239 ERROR: readWorkerTask - readStartupPack failed. -4000
Mar 18 06:22:16 pid:2292  ERROR: [-]    /tmp/tmpJzsKTL/server/core/src/irods_server_control_plane.cpp:1143:irods::error irods::server_control_executor::process_operation(const zmq::message_t &, std::string &) :  status [Unknown iRODS error]  errno [] -- message [failed in EVP_DecryptFinal_ex - error:0606506D:lib(6):func(101):reason(109)]
        [-]     /tmp/tmpJzsKTL/lib/core/src/irods_buffer_encryption.cpp:327:irods::error irods::buffer_crypt::decrypt(const array_t &, const array_t &, const array_t &, array_t &) :  status [Unknown iRODS error]  errno [] -- message [failed in EVP_DecryptFinal_ex - error:0606506D:lib(6):func(101):reason(109)]

Mar 18 06:22:16 pid:2292  ERROR: [-]    /tmp/tmpJzsKTL/server/core/src/irods_server_control_plane.cpp:768:void irods::server_control_executor::operator()() :  status [Unknown iRODS error]  errno [] -- message [failed in EVP_DecryptFinal_ex - error:0606506D:lib(6):func(101):reason(109)]
        [-]     /tmp/tmpJzsKTL/server/core/src/irods_server_control_plane.cpp:1144:irods::error irods::server_control_executor::process_operation(const zmq::message_t &, std::string &) :  status [Unknown iRODS error]  errno [] -- message [failed in EVP_DecryptFinal_ex - error:0606506D:lib(6):func(101):reason(109)]
                [-]     /tmp/tmpJzsKTL/lib/core/src/irods_buffer_encryption.cpp:327:irods::error irods::buffer_crypt::decrypt(const array_t &, const array_t &, const array_t &, array_t &) :  status [Unknown iRODS error]  errno [] -- message [failed in EVP_DecryptFinal_ex - error:0606506D:lib(6):func(101):reason(109)]

terminating with uncaught exception of type zmq::error_t: Operation cannot be accomplished in current state
Mar 19 09:34:32 pid:142822 NOTICE: Agent factory process pid = [142823]
    LocalHostName:  localhost, asa-data-7, asa-data-7. Port Num: 1247.

Zone Info:
    ZoneName: resarchivezone   Type: LOCAL_ICAT    HostAddr: irodsa  PortNum: 1247

===
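
To see the pattern in the readStartupPack errors more easily, the log lines can be grouped by remote address. The following is a rough sketch that assumes the line format shown in the log above; the function names are made up for this example, and the year is an assumption because the log timestamps do not include one.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines such as:
# Mar 18 05:43:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) pid:\d+ "
    r"remote addresses: (?P<addr>\S+) ERROR: readWorkerTask - "
    r"readStartupPack failed\. (?P<code>-?\d+)"
)


def startup_pack_failures(lines, year=2018):
    """Group readStartupPack failure timestamps by remote address."""
    by_addr = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            # The log omits the year, so an assumed one is supplied for parsing.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            by_addr[m["addr"]].append(ts)
    return dict(by_addr)


def gaps_in_seconds(timestamps):
    """Seconds between consecutive timestamps, in sorted order."""
    ordered = sorted(timestamps)
    return [int((b - a).total_seconds()) for a, b in zip(ordered, ordered[1:])]


# A few lines copied from the log above.
SAMPLE = [
    "Mar 18 05:43:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000",
    "Mar 18 05:48:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000",
    "Mar 18 05:53:58 pid:2292 remote addresses: 172.20.1.25 ERROR: readWorkerTask - readStartupPack failed. -4000",
    "Mar 18 06:22:15 pid:2292 remote addresses: 172.17.18.239 ERROR: readWorkerTask - readStartupPack failed. -4000",
]
```

Running `gaps_in_seconds(startup_pack_failures(SAMPLE)["172.20.1.25"])` on these lines gives gaps of 300 seconds, which matches the five-minute spacing visible in the log for that address. The connections from 172.17.18.239 arrive in a burst right before the control plane errors.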

Terrell Russell

Mar 19, 2018, 12:57:59 PM
to irod...@googlegroups.com
Hi Dong,

Can you share a bit more about what version you upgraded from?  And on how many servers?

We've not seen this error before directly.   Can you share any logs from the resource server that isn't coming up?

Terrell




--
--
"iRODS: the Integrated Rule-Oriented Data-management System; A community driven, open source, data grid software solution" https://www.irods.org
 
iROD-Chat: http://groups.google.com/group/iROD-Chat

---
You received this message because you are subscribed to the Google Groups "iRODS-Chat" group.
To unsubscribe from this group and stop receiving emails from it, send an email to irod-chat+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

d hu

Mar 19, 2018, 2:02:32 PM
to iRODS-Chat
Hi Terrell,

Thanks for your response. We upgraded from 4.2.1 to the current 4.2.2.
We have a total of 16 resource servers, configured as below. So far the iRODS resource daemon has died on asa-data-1, 2, 3, 4, and 7. This happened over the last few weeks; the most recent occurrence was on node asa-data-7.
resarchiveResc:replication
├── asaResc:random
│   ├── asa-data-1Resource:unixfilesystem
│   ├── asa-data-2Resource:unixfilesystem
│   ├── asa-data-3Resource:unixfilesystem
│   ├── asa-data-4Resource:unixfilesystem
│   ├── asa-data-5Resource:unixfilesystem
│   ├── asa-data-6Resource:unixfilesystem
│   ├── asa-data-7Resource:unixfilesystem
│   └── asa-data-8Resource:unixfilesystem
└── asbResc:random
    ├── asb-data-1Resource:unixfilesystem
    ├── asb-data-2Resource:unixfilesystem
    ├── asb-data-3Resource:unixfilesystem
    ├── asb-data-4Resource:unixfilesystem
    ├── asb-data-5Resource:unixfilesystem
    ├── asb-data-6Resource:unixfilesystem
    ├── asb-data-7Resource:unixfilesystem
    └── asb-data-8Resource:unixfilesystem

The server just stopped responding on port 1247, and a further check showed that the daemon was no longer listening on port 1247.

When I tried to stop it, I got the following errors:
==
_irods@asa-data-1:~$ /var/lib/irods/irodsctl status
irodsServer :
  Process 54764
  Process 97144
  Process 191739
_irods@asa-data-1:~$ /var/lib/irods/irodsctl stop
Stopping iRODS server...
Error encountered in graceful shutdown.
iRODS server processes remain after "irods-grid shutdown".
irodsServer :
  Process 54764
  Process 97144
  Process 191739
Killing forcefully...
Killing /usr/sbin/irodsServer, pid 54764
Killing /usr/sbin/irodsServer, pid 97144
Killing /usr/sbin/irodsServer, pid 191739
Success
===
The log I included in my first post WAS the log from the resource server that stopped responding. Please let me know if you need any more info from my side.

Thanks and best regards,
Dong

Zoey Greer

Mar 20, 2018, 11:22:22 AM
to irod-chat
Hey Dong,

Have you restarted iRODS on all your servers since upgrading? I would be curious if this problem goes away after restarting if you haven't already. Upgrading doesn't force a bounce of the server, which can leave you running stale binaries which could have various possible incompatibilities leading to a hang or a crash (and is definitely the most common thing I know of to cause errors in the graceful shutdown process).

On the other hand, if you have already restarted then I'll see if I can figure out what's going wrong with your decryption.

Thanks,
-Zoey


d hu

Mar 20, 2018, 2:43:48 PM
to iRODS-Chat
Thanks Zoey for looking into this problem.

All servers were restarted as part of the upgrade, including the PostgreSQL database server.

On the node with the problem, the original "/usr/sbin/irodsServer" daemon just disappeared, and no process is listening on port 1247.
The daemons with connections in "ESTABLISHED" status on port 1247 were fine. Please let me know if you need more info.
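
To tell "still has ESTABLISHED connections" apart from "nothing is listening any more", the socket table can be checked directly. Below is a minimal sketch for Linux that parses text in the `/proc/net/tcp` format (IPv4 only; IPv6 sockets are in `/proc/net/tcp6`). The function names and the sample rows are made up for illustration and are not taken from our servers.

```python
# Hex state codes used in /proc/net/tcp; only the two relevant here are named.
TCP_STATES = {"01": "ESTABLISHED", "0A": "LISTEN"}


def sockets_on_port(proc_net_tcp_text, port=1247):
    """Return the TCP states of sockets whose local port matches `port`."""
    states = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local address as HEXIP:HEXPORT.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            states.append(TCP_STATES.get(fields[3], fields[3]))
    return states


def has_listener(proc_net_tcp_text, port=1247):
    """True if at least one socket is in LISTEN state on `port`."""
    return "LISTEN" in sockets_on_port(proc_net_tcp_text, port)


# Fabricated sample: two ESTABLISHED sockets on port 1247 (0x04DF), no listener.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:04DF 0100007F:C350 01 00000000:00000000 00:00000000 00000000   999        0 1111
   1: 0100007F:04DF 0100007F:C351 01 00000000:00000000 00:00000000 00000000   999        0 1112
   2: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1113
"""
```

On a live node this would be called as `has_listener(open("/proc/net/tcp").read())`. A result of False together with ESTABLISHED entries corresponds to the situation described above: agents are still serving existing connections, but the main server process that accepts new ones is gone.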

Thanks and best regards,
Dong


Zoey Greer

Mar 20, 2018, 4:59:14 PM
to irod-chat
I took a closer look at the log, and I'm pretty sure you've uncovered a bug that causes the server to crash when a particular codepath is hit. I have a reasonably good guess as to what the codepath is, though I have no idea how it's getting triggered. Something is connecting to the control plane's port and (apparently?) transmitting malformed data, I think. I can fix the bug and put it in the next release, and then you won't get the crashes, but I don't know why you're getting bad startupPacks and bad control plane messages in the first place, particularly sporadically. I will say that it is concerning to me that the resource server log thinks its type is "LOCAL_ICAT". Do you have some kind of complicated domain name resolution for load balancing running?
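
To illustrate the general class of bug (this is a toy model, not the iRODS code, and not necessarily the exact codepath): a ZeroMQ reply-style socket requires a strict receive-then-send alternation, and calling receive twice without a send in between fails with "Operation cannot be accomplished in current state". The sketch below assumes the control plane uses such a socket. The class `ReplySocketModel` and the functions `decrypt` and `serve` are hypothetical stand-ins written for this example.

```python
class ReplySocketModel:
    """Toy model of a strict receive-then-send reply socket (not real ZeroMQ)."""

    def __init__(self, inbox):
        self._inbox = list(inbox)
        self._awaiting_reply = False
        self.sent = []

    def recv(self):
        if self._awaiting_reply:
            # Mirrors the state error a reply socket raises on a second recv.
            raise RuntimeError("Operation cannot be accomplished in current state")
        self._awaiting_reply = True
        return self._inbox.pop(0)

    def send(self, reply):
        if not self._awaiting_reply:
            raise RuntimeError("Operation cannot be accomplished in current state")
        self._awaiting_reply = False
        self.sent.append(reply)


def decrypt(message):
    """Stand-in for real decryption: anything without the marker is rejected."""
    if not message.startswith(b"OK:"):
        raise ValueError("failed to decrypt")
    return message[3:]


def serve(sock, count, reply_on_error):
    """Handle `count` messages; optionally reply even when decryption fails."""
    for _ in range(count):
        message = sock.recv()
        try:
            payload = decrypt(message)
        except ValueError:
            if reply_on_error:
                sock.send(b"error")  # keeps the recv/send alternation intact
            continue
        sock.send(b"ack:" + payload)
```

With `reply_on_error=False`, a single malformed message leaves the socket waiting for a reply that never comes, so the next `recv` raises, and an uncaught exception at that point would take the process down. With `reply_on_error=True`, the loop survives the bad message and keeps serving.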

Thanks,
-Zoey


d hu

Mar 21, 2018, 1:43:07 PM
to iRODS-Chat
Hi Zoey,

Thanks so much for your update.

It looks to me like the log was just reporting the remote iRODS server's info, which was correct. Our iRODS server is "irodsa.stg.ccm.sickkids.ca", so the type is LOCAL_ICAT for that server, even though the log was on a resource server?
The log did show "LocalHostName:  localhost, asa-data-7, asa-data-7.stg.ccm.sickkids.ca, Port Num: 1247", which was the resource server itself.
Is there anything else we should check?

Thanks,
Dong

==
Zone Info:
    ZoneName: resarchivezone   Type: LOCAL_ICAT    HostAddr: irodsa.stg.ccm.sickkids.ca   PortNum: 1247

==