Worker and MAVIS process lifecycle


Andrzej Mendel-Nykorowycz

Mar 1, 2026, 7:22:35 AM
to Event-Driven Servers
Hello,

I am currently stress-testing my tac_plus-ng + LDAP PoC with the goal of properly sizing the VMs for eventual deployment in production. I am using a Python script to run multiple TACACS+ requests in parallel (each repeated immediately on completion) and noticed four issues:
  1. It is possible to trigger a DoS by causing tac_plus-ng to spawn too many Perl processes for MAVIS backends. Each Perl process takes about 25 MiB of memory, so if I hit tac_plus-ng with more than available_memory/25MiB requests at the same time, so many Perl processes spawn that the host exhausts all memory and becomes effectively unusable (even the console is unresponsive): after the OOM killer reaps one Perl process, other processes still spawning immediately consume the freed memory. Setting maximum instances and maximum users to values below available_memory/25MiB protects me from this, but puts a hard cap on the number of requests tac_plus-ng can handle simultaneously.
  2. If I don't set the limits mentioned above, but keep parallel requests below available_memory/25MiB, I can sometimes still cause memory exhaustion. It appears that in some cases tac_plus-ng will spawn new MAVIS backends to handle incoming requests before terminating old ones.
  3. When limiting instances and users, some requests will understandably time out.
  4. In other cases, a worker restarts, and for a short period the listener process resets new connections ("[Errno 104]", raised as ConnectionResetError). I can trigger this semi-reliably: it repeats with some stress-test settings (e.g. a 0.5 s delay between requests from a single stress-test thread) but not with others (e.g. a 0.1 s delay).
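To make the sizing concern concrete, here is a minimal sketch of the back-of-the-envelope math described in point 1. The 25 MiB per-backend figure and the function name are my own illustration, not anything from tac_plus-ng itself:

```python
def max_parallel_backends(available_mib: int, per_backend_mib: int = 25) -> int:
    """Rough upper bound on how many ~25 MiB Perl MAVIS backends a host can hold
    before memory is exhausted (illustrative estimate only)."""
    return available_mib // per_backend_mib

# e.g. a VM with 4 GiB of RAM, ignoring memory used by everything else:
print(max_parallel_backends(4096))  # → 163
```

In practice you would leave headroom for the OS, the workers themselves, and the LDAP library, so the real safe limit is lower.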
I will gladly provide debug output, configs, captures etc., but first I would like to better understand how tac_plus-ng handles processes so I can create a better test case:
  1. Are mavis backends created on a per-request basis or are they reused between requests?
  2. When are mavis backends terminated?
  3. Is it possible to put a hard limit on the number of mavis backend processes spawned?
  4. If I limit the amount of memory available for tac_plus-ng (e.g. by using MemoryMax= in systemd), how will it handle mavis backend processes being OOM-killed?
  5. Under what circumstances are worker processes terminated?
I hope this is not too much to ask.

Kind regards,
Andrzej

Marc Huber

Mar 1, 2026, 7:53:21 AM
to event-driv...@googlegroups.com

Hi,

the maximum number of MAVIS processes forked by the "external" module is controlled using "childs max" and defaults to 20, so yes, you can max out memory if your "instances max" (the upper limit of server processes) isn't limited reasonably. By default, the "external" module will run "childs min = 4" processes and start additional ones if further requests come in and the "max" isn't yet reached. MAVIS backend processes are reused and won't be terminated.
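For reference, a sketch of where those knobs live in a MAVIS "external" module declaration. The script path and the surrounding context are placeholders; check the MAVIS documentation for the exact form your setup needs:

```
mavis module = external {
    # defaults per Marc's reply: 4 resident backends, up to 20 per worker
    childs min = 4
    childs max = 20
    exec = /usr/local/lib/mavis/mavis_tacplus_ldap.pl   # path is a placeholder
}
```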

Regarding the number of worker processes: The load-balancing algorithm is detailed in https://projects.pro-bono-publico.de/event-driven-servers/doc/spawnd.html#AEN493 -- to copy-paste from there:

spawnd allows configuration of upper and lower limits for users and processes. The distribution algorithm will try to assign new connections to one of the running servers with less than users_min connections. If all servers already have at least users_min active connections and the total number of servers doesn't exceed servers_max, an additional server process is started, and the connection is assigned to that process. If no more processes may be started, the connection is assigned to the server process with less than users_max users, which serves the lowest number of connections. Otherwise, the connection will stall until an existing connection terminates.
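The quoted algorithm can be sketched in a few lines of Python. This is purely illustrative pseudologic mirroring the prose above, not spawnd's actual code; the names users_min, users_max and servers_max follow the docs:

```python
def assign(servers, users_min, users_max, servers_max):
    """Assign one incoming connection per the spawnd distribution rules above.
    Each server is a dict {"id": int, "conns": int}. Returns the chosen
    server, or None if the connection must stall."""
    # 1. Prefer a running server with fewer than users_min connections.
    candidates = [s for s in servers if s["conns"] < users_min]
    if candidates:
        s = min(candidates, key=lambda s: s["conns"])
        s["conns"] += 1
        return s
    # 2. Otherwise start a new server process if servers_max allows it.
    if len(servers) < servers_max:
        s = {"id": len(servers), "conns": 1}
        servers.append(s)
        return s
    # 3. Otherwise pick the least-loaded server still below users_max.
    candidates = [s for s in servers if s["conns"] < users_max]
    if candidates:
        s = min(candidates, key=lambda s: s["conns"])
        s["conns"] += 1
        return s
    # 4. Everything is full: the connection stalls until a slot frees up.
    return None
```

Running it with users_min=2, users_max=3, servers_max=2 shows the fill order: both servers ramp to users_min, then to users_max, and the next connection stalls.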

OOM behavior is mostly undefined. I think the "external" module can handle a fork() failure, but wouldn't vouch for that and frankly don't know how that interacts with systemd memory boundaries.

Cheers,

Marc


Andrzej Mendel-Nykorowycz

Mar 1, 2026, 8:53:09 AM
to Event-Driven Servers
Thank you, I missed the "childs max" option in the documentation. Several follow-up questions to clarify:
  1. Do I understand correctly, that the "childs max" is applied per-worker, meaning that effective maximum number of MAVIS backend processes will be instances_max*childs_max?
  2. What happens when the number of requests (as determined by users_max) is higher than the number of backend processes? Are they queued by the worker process?
  3. "MAVIS backend processes are being reused and won't be terminated." - I saw the number of Perl processes spike while serving requests and drop afterwards, so it would seem they are terminated after all.
  4. "Otherwise, the connection will stall until an existing connection terminates." - Does this mean that the listener receives a SYN from TACACS+ client, but doesn't respond with a SYN+ACK until a slot opens?
  5. Is it expected behaviour for tac_plus-ng workers to terminate when "retire limit" and "retire timeout" are not configured? This is what I saw in my tests and I wonder if I managed to somehow crash a worker process.
Best regards,
Andrzej

Marc Huber

Mar 1, 2026, 11:51:25 AM
to event-driv...@googlegroups.com

Hi Andrzej,

On 01.03.2026 14:53, Andrzej Mendel-Nykorowycz wrote:
  1. Do I understand correctly, that the "childs max" is applied per-worker, meaning that effective maximum number of MAVIS backend processes will be instances_max*childs_max?
yes.
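So the worst case for the earlier memory concern multiplies out directly. A quick illustration (the instances_max value here is hypothetical, not a documented default; childs_max = 20 and ~25 MiB per backend are the figures from this thread):

```python
instances_max = 8     # hypothetical "instances max" setting, pick your own
childs_max = 20       # "childs max" default per worker, per Marc's reply
per_backend_mib = 25  # observed Perl backend footprint from this thread

worst_case_mib = instances_max * childs_max * per_backend_mib
print(worst_case_mib, "MiB of Perl backends in the worst case")  # → 4000 MiB
```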
  2. What happens when the number of requests (as determined by users_max) is higher than the number of backend processes? Are they queued by the worker process?
Spawnd will handle this (default: "overload = queue", other options are "close" (accept connection, but close it immediately) and "reset" (temporarily closes listening sockets)).
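As a sketch, the overload policy is a spawnd-level setting; the exact nesting below is my guess at a minimal fragment and should be checked against the spawnd documentation:

```
id = spawnd {
    listen = { port = 49 }
    spawn = {
        instances min = 2
        instances max = 8
    }
    overload = queue    # alternatives per Marc: close, reset
}
```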
  3. "MAVIS backend processes are being reused and won't be terminated." - I saw the number of perl processes spike when serving requests and drop afterwards, so it would seem they are terminated after all.
The Perl processes will terminate if their parent process (the worker) terminates, or they can terminate voluntarily. I don't remember adding any auto-termination code, at least not outside the error path.
  4. "Otherwise, the connection will stall until an existing connection terminates." - Does this mean that the listener receives a SYN from TACACS+ client, but doesn't respond with a SYN+ACK until a slot opens?
No, that would require kernel-level operations. That's just the "overload = queue" option: the OS still completes the TCP handshake, but the daemon stops accepting until a free slot becomes available.
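This kernel behaviour is easy to demonstrate: a listening socket whose owner never calls accept() still completes handshakes for connections up to the listen() backlog. A minimal self-contained demo (plain sockets, nothing tac_plus-specific):

```python
import socket

# Server side: listen, but deliberately never call accept().
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # port 0 = let the OS pick a free port
server.listen(4)                   # kernel queues up to ~4 completed handshakes
port = server.getsockname()[1]

# Client side: connect() succeeds anyway -- the SYN+ACK comes from the kernel,
# exactly the situation "overload = queue" relies on.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(2)
client.connect(("127.0.0.1", port))
connected = True
print("connected without accept()")

client.close()
server.close()
```

Once the backlog itself overflows, further SYNs are dropped or refused by the kernel, which is when clients start seeing timeouts or resets.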
  5. Is it expected behaviour for tac_plus-ng workers to terminate when "retire limit" and "retire timeout" are not configured? This is what I saw in my tests and I wonder if I managed to somehow crash a worker process.

No, the workers should not terminate voluntarily without retire options being set. This might be a bug, or an OOM issue. You could check syslog for crash info; the daemon should log such issues.

Cheers,

Marc

Andrzej Mendel-Nykorowycz

Mar 2, 2026, 6:03:17 AM
to Event-Driven Servers
Hi Marc,

I looked into cases where workers terminate and I don't see any segfaults. The processes appear to exit cleanly, outputting "Exiting" to the log. Below are logs for a case where two worker processes terminated simultaneously (1780147 and 1780149) and the listener process reported a Connection refused error, which I also logged on the client end:

Mar  2 11:42:24 tacacs_server tac_plus-ng[1780149]: 1.2.3.4 looking for user user_redacted in MAVIS backend
[the above is repeated additional 8 times]
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780147]: 1.2.3.4 looking for user user_redacted in MAVIS backend
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780147]: 1.2.3.4 result for user user_redacted is ACK [3 ms]
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780147]: 1.2.3.4 shell login for 'user_redacted' from python_device on python_tty0 succeeded (profile=admin)
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780149]: 1.2.3.4 looking for user user_redacted in MAVIS backend
[the above is repeated additional 4 times]
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780149]: 1.2.3.4 result for user user_redacted is ACK [37 ms]
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780149]: 1.2.3.4 shell login for 'user_redacted' from python_device on python_tty0 succeeded (profile=admin)
[the two lines above are repeated additional 13 times (with different response times for mavis, obviously)]
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780147]: - Exiting.
Mar  2 11:42:24 tacacs_server tac_plus-ng[1780149]: - Exiting.

Mar  2 11:42:24 tacacs_server tac_plus-ng[1777924]: scm_send_msg: sendmsg: Connection refused
Mar  2 11:42:24 tacacs_server tac_plus-ng[1777924]: scm_send_msg (/home/user_redacted/event-driven-servers-master/mavis/spawnd_accepted.c:438), pid: 1780149: Connection refused
