We are observing a recurring failure in which the iRODS delay server stops processing the delay queue. We identified the condition on multiple independent deployments and zones. So far the only recovery strategy we managed to find requires restarting the affected iRODS server.
When the issue occurs, the iRODS server itself remains alive and responsive, but no delayed rules are dequeued or executed. Queued rules accumulate and their expected execution time is not updated.
We were wondering if there is any experience with this type of issue that we can leverage.
-- Luca Le Preux (University of Groningen)
The issue seems to occur independently of the specific rules being queued to the delay server and manually removing the rules from the queue does not recover the process.
We also could not link the issue to any particular load condition of the server. The fault occurs independently of the up-time of the server: sometimes it can go for almost a month without issues, other times the issue appears several times in a week. Each time, we recover by manually restarting the iRODS server.
No explicit diagnostic logs are produced by the delay server at trace log level that we can attribute to this issue. Since the first occurrence of the issue we set the delay server log level to “trace”. Looking at the server logs around the time the issue occurs we cannot see any errors and the only obvious sign of fault is the delay server going silent.
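Since the delay server goes silent without logging an error, an external check on the queue itself may be a practical way to notice the stall. The following is a minimal sketch, not a tested monitoring solution. It assumes the queue has been dumped as lines of the form `<rule id> <execution time>`, for example with something like `iquest "%s %s" "select RULE_EXEC_ID, RULE_EXEC_TIME"`, and that the execution time is an epoch timestamp in seconds. The function names and the grace period are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_queue_lines(text: str) -> List[Tuple[str, int]]:
    """Parse lines of the (assumed) form '<rule_id> <exec_time_epoch>'."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines with exactly two fields and a numeric time;
        # this skips blank lines and any status or error messages.
        if len(parts) == 2 and parts[1].isdigit():
            rows.append((parts[0], int(parts[1])))
    return rows


def overdue_rules(
    rows: Iterable[Tuple[str, int]], now: int, grace_seconds: int = 600
) -> List[str]:
    """Return IDs of rules whose execution time passed more than grace_seconds ago."""
    return [rule_id for rule_id, exec_time in rows if now - exec_time > grace_seconds]
```

If the list returned by `overdue_rules` stays non-empty across several consecutive checks, the queue is probably no longer being processed. The grace period should be longer than the delay server's normal polling interval in your deployment.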
iRODS version: 4.3.3 and 4.3.4
Delay server logging level: trace
Enabled rule engines (in configured order):
irods-rule-engine-plugin-metadata-guard
irods-rule-engine-plugin-logical-quotas
irods_rule_engine_plugin-irods_rule_language-instance
irods-rule-engine-plugin-audit-amqp
irods_rule_engine_plugin-python-instance
irods_rule_engine_plugin-cpp_default_policy-instance
--
The Integrated Rule-Oriented Data System (iRODS) - https://irods.org
iROD-Chat: http://groups.google.com/group/iROD-Chat
---
You received this message because you are subscribed to the Google Groups "iRODS-Chat" group.
To unsubscribe from this group and stop receiving emails from it, send an email to irod-chat+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/irod-chat/730290dc-032c-4cf3-bb48-a43a5877ca21n%40googlegroups.com.
Hi Kory,
Thank you for your reply. Here is more information based on your suggestions:
Before and during the issue window the memory usage of the affected machine remains at about 50%. The machine does not look like it is under excessive load.
The delay server process still looks active in the process tree, though it no longer produces any logs (at trace level).
We have noticed that during the failure, ips reports an increased number of connections from the delay server. At other times, ips shows only one connection from the delay server.
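The connection count could be tracked over time to see when it starts to grow. The sketch below counts the lines of `ips` output that mention the delay server. The marker string `irodsDelayServer` is an assumption about how the delay server is labelled in `ips` output, so adjust it to whatever your deployment actually shows. The function names are hypothetical.

```python
import subprocess


def count_connections(ips_output: str, marker: str = "irodsDelayServer") -> int:
    """Count lines of ips output containing the marker (assumed delay server label)."""
    return sum(1 for line in ips_output.splitlines() if marker in line)


def current_connection_count(marker: str = "irodsDelayServer") -> int:
    """Run ips and count the connections attributed to the delay server."""
    result = subprocess.run(["ips"], capture_output=True, text=True, check=True)
    return count_connections(result.stdout, marker)
```

Logging the result of `current_connection_count()` with a timestamp every minute or so would show whether the increase precedes the moment the delay server goes silent or follows it.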
So far we have not been able to reproduce the issue in an isolated environment, and we do not know of any triggering conditions that would make the system fail in a controlled manner. For these reasons we are not yet able to test the issue with iRODS 5.0.2.
Best regards,
Luca Le Preux
University of Groningen