Delay server stops processing the delay queue without obvious reasons

L. Le Preux

Feb 9, 2026, 7:37:32 AM
to iRODS-Chat

We are observing a recurring failure in which the iRODS delay server stops processing the delay queue. We have seen the condition on multiple independent deployments and zones. So far, the only recovery we have found is to restart the affected iRODS server.

When the issue occurs, the iRODS server itself remains alive and responsive, but no delayed rules are dequeued or executed. Queued rules accumulate and their expected execution time is not updated.
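
The symptom described here (rules piling up in the queue while their scheduled execution time stays in the past) lends itself to an automated check. The following is a minimal Python sketch, not an iRODS tool. It assumes the caller has already fetched (rule ID, scheduled execution time) pairs from the catalog, for example through a GenQuery over RULE_EXEC_ID and RULE_EXEC_TIME, and that the times are epoch seconds, possibly as zero-padded strings. The names `overdue_rules` and `grace_seconds` are hypothetical and exist only for this illustration.

```python
import time


def overdue_rules(rows, now=None, grace_seconds=300):
    """Return (rule_id, seconds_overdue) pairs, most overdue first.

    rows: iterable of (rule_id, exec_time) pairs. exec_time is assumed to be
    an epoch timestamp, given as an int or a (zero-padded) decimal string.
    grace_seconds: how far past its scheduled time a rule may be before it
    is reported (hypothetical default, tune to the delay server's sleep time).
    """
    if now is None:
        now = int(time.time())
    late = []
    for rule_id, exec_time in rows:
        overdue = now - int(exec_time)
        if overdue > grace_seconds:
            late.append((str(rule_id), overdue))
    # Sort so the rule that has waited longest is listed first.
    late.sort(key=lambda item: item[1], reverse=True)
    return late
```

A steadily growing result across successive runs, while the server still answers client requests, would match the stalled state described in this post.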


We were wondering if there is any experience with this type of issue that we can leverage.


-- Luca Le Preux (University of Groningen)


Observations

The issue seems to occur independently of the specific rules being queued to the delay server, and manually removing the rules from the queue does not recover the process.

We also could not link the issue to any particular load condition on the server. The fault occurs independently of the uptime of the server: sometimes it can run for almost a month without issues, other times the issue appears several times in a week, with the iRODS server manually restarted each time.


Server logs

Since the first occurrence of the issue, we have kept the delay server log level at “trace”. Even at that level, the delay server produces no explicit diagnostic logs that we can attribute to this issue. Around the time the issue occurs, the server logs show no errors, and the only obvious sign of the fault is the delay server going silent.


Environments' configuration
  • iRODS version: 4.3.3 and 4.3.4
  • Delay server logging level: trace
  • Enabled rule engines (in configured order):
    • irods-rule-engine-plugin-metadata-guard
    • irods-rule-engine-plugin-logical-quotas
    • irods_rule_engine_plugin-irods_rule_language-instance
    • irods-rule-engine-plugin-audit-amqp
    • irods_rule_engine_plugin-python-instance
    • irods_rule_engine_plugin-cpp_default_policy-instance

Kory Draughn

Feb 11, 2026, 10:30:38 AM
to irod...@googlegroups.com
Hi,

Is the delay server process still running (visible in the process tree) when you encounter the issue?
What does memory usage look like during that time?
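
Both questions can be answered from the Linux /proc filesystem. The following is a minimal Python sketch under stated assumptions: it assumes a Linux host and that the delay server process is named `irodsDelayServer` (verify this against your own process tree). The helper names `parse_status` and `find_processes` are made up for this example.

```python
import os

# Assumption: process name of the delay server. Adjust to your installation.
DELAY_SERVER_NAME = "irodsDelayServer"


def parse_status(text):
    """Extract (process name, resident memory in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    rss = fields.get("VmRSS")  # not present for kernel threads
    rss_kb = int(rss.split()[0]) if rss else None
    return fields.get("Name"), rss_kb


def find_processes(name=DELAY_SERVER_NAME, proc_root="/proc"):
    """Return [(pid, rss_kb)] for live processes whose name matches."""
    # The kernel truncates the Name field to 15 characters, so compare
    # against the truncated form of the wanted name.
    wanted = name[:15]
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path, errors="replace") as fh:
                proc_name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # process exited in the meantime or is not readable
        if proc_name == wanted:
            found.append((int(entry), rss_kb))
    return sorted(found)
```

Calling `find_processes()` periodically and logging the result gives a record of whether the process was present, and how much resident memory it held, before and during a stall.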

We are aware of two open memory leak issues. Perhaps they are related to what you're seeing.

> The fault occurs independently of the up-time of the server: sometimes it can go for almost a month without issues, other times the issue appears several times in a week,

That makes me think it's more of a timing / sequencing issue. Especially if the delay server process is still live following the fault.

Are you able to reproduce the problem with iRODS 5.0.2 in a test environment?

Kory Draughn
Chief Technologist
iRODS Consortium



L. Le Preux

Feb 13, 2026, 4:03:15 AM
to iRODS-Chat

Hi Kory,


Thank you for your reply. Here is more information based on your suggestions:

  1. Before and during the issue window the memory usage of the affected machine remains at about 50%. The machine does not look like it is under excessive load.

  2. The delay server process still appears active in the process tree, though it no longer produces any logs (at trace level).

  3. We have noticed that during the failure, ips reports an increased number of connections from the delay server. At other times, it reports only one connection from the delay server.

At the moment, we cannot reproduce the issue in an isolated environment, and we lack triggering conditions to make the system fail in a controlled manner. For these reasons, we are not yet able to test the issue with iRODS 5.0.2.

Best regards,

Luca Le Preux

University of Groningen

Kory Draughn

Feb 13, 2026, 9:38:03 AM
to irod...@googlegroups.com
Hmm, I wonder if the connections aren't letting go because the rule being executed is stalled due to an infinite loop or blocked in some way.

Do any of your delay rules contain loops?
Can you summarize what your delay rules do?

Aside from that, the information reported by ips is stored in the /var/lib/irods/log/proc directory. The filenames in that directory match the PID numbers of live agents.
Have you tried confirming the delay server connections reported by ips match actual iRODS processes in the OS? If not, please give that a try next time you encounter the issue.
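
The check suggested here can be scripted. The following is a minimal Python sketch, assuming a POSIX host and read access to the directory named above. It relies only on the fact stated in this message, that the filenames in that directory are PID numbers. The function names are hypothetical.

```python
import os

# Directory named in the message above; filenames are expected to be PIDs.
PROC_DIR = "/var/lib/irods/log/proc"


def pid_is_alive(pid):
    """Return True if the OS currently has a process with this PID (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pids_from_filenames(names):
    """Keep only purely numeric filenames and return them as sorted ints."""
    return sorted(int(n) for n in names if n.isascii() and n.isdigit())


def split_live_and_stale(pids, is_alive=pid_is_alive):
    """Partition PIDs into (live, stale) according to the liveness check."""
    live, stale = [], []
    for pid in pids:
        (live if is_alive(pid) else stale).append(pid)
    return live, stale


if __name__ == "__main__":
    live, stale = split_live_and_stale(pids_from_filenames(os.listdir(PROC_DIR)))
    print("files with a matching OS process:", live)
    print("files without a matching OS process:", stale)
```

A PID can be reused by an unrelated process, so a "live" result only shows that some process has that number. Comparing the process name as well would make the check stricter.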

Kory Draughn
Chief Technologist
iRODS Consortium
