cqueues patch

4 views
Skip to first unread message

Thilo Molitor

unread,
Jan 8, 2021, 5:58:34 AM1/8/21
to proso...@googlegroups.com
This patch should prevent the cqueues layer from locking up the main queue of
the server effectively DOSing the whole prosody process.

- tmolitor
cqueues.patch

Matthew Wild

unread,
Jan 8, 2021, 7:06:33 AM1/8/21
to Prosody IM Developers Group
Hi Thilo,

Thanks for the contribution!

On Fri, 8 Jan 2021 at 10:58, Thilo Molitor <th...@eightysoft.de> wrote:
>
> This patch should prevent the cqueues layer from locking up the main queue of
> the server effectively DOSing the whole prosody process.

Can you update the commit message to match our usual format? And below
the first line include a description of exactly how the fix works. If
this is a workaround that should be removed in the future, please also
add a COMPAT comment in the code linking to the issue it is working
around.

You moved some things around, but as far as I can tell the actual
change is to switch from 'return timeout' to 'timer.add_task(timeout,
...)'.

From what you wrote on the lua-http issue report I understand this is
to prevent rescheduling the timer (which would be immediately
executed) but create a new one instead (which will be executed on the
next "tick").

However...... I don't see that this is actually the case. In Prosody
`return timeout` from a timer callback already creates a new timer. It
doesn't reschedule the current one. So I feel like I'm missing
something. Any ideas? Also, what network backend are you experiencing
the issue with? Each has slightly different handling of timers.

All this said, I'm not opposed to merging the patch (with the changes
mentioned above) because I'm fairly confident nobody but you is using
net.cqueues right now and it's a critical production issue for you.
But I'd still like to understand the problem and solution to make sure
we're not missing something (e.g. a deeper bug in Prosody's timer
handling).

Regards,
Matthew

Thilo Molitor

unread,
Jan 8, 2021, 2:53:22 PM1/8/21
to proso...@googlegroups.com
> Can you update the commit message to match our usual format? And below
> the first line include a description of exactly how the fix works. If
> this is a workaround that should be removed in the future, please also
> add a COMPAT comment in the code linking to the issue it is working
> around.
Sure

> You moved some things around, but as far as I can tell the actual
> change is to switch from 'return timeout' to 'timer.add_task(timeout,
> ...)'.
Yes it is, but I did not want to have this timeout handling in two places, so
I coalesced it into one single function.

> From what you wrote on the lua-http issue report I understand this is
> to prevent rescheduling the timer (which would be immediately
> executed) but create a new one instead (which will be executed on the
> next "tick").
Yes

> However...... I don't see that this is actually the case. In Prosody
> `return timeout` from a timer callback already creates a new timer. It
> doesn't reschedule the current one. So I feel like I'm missing
> something. Any ideas? Also, what network backend are you experiencing
> the issue with? Each has slightly different handling of timers.
I'm using the epoll backend and here it does reschedule the current one.
net/server_epoll.lua line 133 inserts a new timer using the variable "elapsed"
which is invariant in the whole timer loop --> returning 0.0 from the timer
sets the new timer's timestamp to "elapsed".
That timer gets peeked later in the loop and thus the if in line 124 does not
trigger (peek > elapsed), which makes the loop execute the rescheduled timer
again without ever leaving the timer loop at all.
Adding a new timer (with zero timeout) through the timer api will not use the
"frozen" elapsed variable, but a new call to monotonic() to calculate the
timeout, which will be biggern than "elapsed" and will be executed only after
the next epoll() call.

net/server_select.lua does not have this behaviour as far as I can tell.

> All this said, I'm not opposed to merging the patch (with the changes
> mentioned above) because I'm fairly confident nobody but you is using
> net.cqueues right now and it's a critical production issue for you.
> But I'd still like to understand the problem and solution to make sure
> we're not missing something (e.g. a deeper bug in Prosody's timer
> handling).
Sure, maybe a better fix would be to fix net/server_epoll.lua to not allow
creating endless loops through timers at all.
Initially I didn't want to patch server_epoll because I was not sure what
assumptions the prosody codebase makes about the network backend module.
But given that server_select does not exhibit the same behaviour, patching
server_epoll would bring the epoll backend back in line with the select based
one.
That makes my cqueues patch obsolete and I propose the following change of
line 132 instead:
-local next_time = elapsed+ret;
+local next_time = monotonic()+ret;

-tmolitor



Reply all
Reply to author
Forward
0 new messages