mod_mam CPU usage causing timeouts

20 views
Skip to first unread message

Geoff Stevenson

unread,
Apr 17, 2023, 4:56:03 AM4/17/23
to Prosody IM Users
Hello,
I run 2 separate prosody 0.11.8 servers with ~1000 users for a few years. Both have decent CPUs ( Xeon E-2336 @ 4.8ghz ). The clients all use a fork of Conversations. 

The last 6 months an issue has been occurring randomly every few weeks where Prosody CPU usage hits 100% and connections timeout. This can last for several minutes to over an hour, then eventually the problem resolves itself.

Looking through the logs, I tracked the issue down to mod_mam archive queries. This occurs on both groups and one-on-one conversations. Strangely it seems to mostly happen cross-server ( i.e: user A from example.com querying group B on example2.com ). The queries only take a few seconds each, but are followed one after another by more queries which is blocking the server.

Here's a snippet from the most recent outage (with usernames redacted):

Apr 12 20:03:17 c2s5644a83f4b00 debug   Received[c2s]: <iq type='set' from='XYZ/Chat.STv5' id='jSiYtL7_fhQ1'>
Apr 12 20:03:17 example.com:mam      debug   Archive query, id nuat69qfv1 with FOOBAR from 1674081740 until 1680912929)
Apr 12 20:03:21 example.com:mam      debug   Archive query nuat69qfv1 completed
Apr 12 20:03:21 c2s5644a83f4b00 debug   Sending[c2s]: <iq id='jSiYtL7_fhQ1' to='XYZ/Chat.STv5' type='result'>
Apr 12 20:03:21 c2s5644a83f4b00 debug   Calling delayed_ack_function directly (still waiting for ack)
Apr 12 20:03:21 c2s5644a83f4b00 debug   Firing event 'smacks-ack-delayed', queue = 146
Apr 12 20:03:21 c2s5644a83f4b00 debug   Handled 68 incoming stanzas
Apr 12 20:03:21 c2s5644a83f4b00 debug   Received[c2s]: <iq type='set' from='XYZ/Chat.STv5' id='XmZG3HZMm_wo'>
Apr 12 20:03:21 example.com:mam      debug   Archive query, id clmgs2ho0n with FOOBAR from 1680912929 until 1680912927)
Apr 12 20:03:24 example.com:mam      debug   Archive query clmgs2ho0n completed
Apr 12 20:03:24 c2s5644a83f4b00 debug   Sending[c2s]: <iq id='XmZG3HZMm_wo' to='XYZ/Chat.STv5' type='result'>
Apr 12 20:03:24 c2s5644a83f4b00 debug   Calling delayed_ack_function directly (still waiting for ack)
Apr 12 20:03:24 c2s5644a83f4b00 debug   Firing event 'smacks-ack-delayed', queue = 147
Apr 12 20:03:24 c2s5644a83f4b00 debug   Handled 69 incoming stanzas
Apr 12 20:03:24 c2s5644a83f4b00 debug   Received[c2s]: <iq type='set' from='XYZ/Chat.STv5' id='aJOndG1knKJn'>
Apr 12 20:03:24 example.com:mam      debug   Archive query, id lm9d3creil with ABC from 1675094398 until 1680912929)
Apr 12 20:03:27 example.com:mam      debug   Archive query lm9d3creil completed

I've tried a number of changes which had no effect:
- Tuning the gc to use threshold = 200
- archive_expires_after = "1w"
- max_archive_query_results = 20
- restarting the server

Does anyone know what's causing this or how it can be fixed? Is this related to https://issues.prosody.im/1523 ? Will upgrading to 0.12.3 resolve this ? I'd appreciate any help I can get. Besides this problem I'm really happy with prosody overall.


Geoff Stevenson

unread,
Apr 17, 2023, 4:56:10 AM4/17/23
to Prosody IM Users
Hello,
I have 2 separate Debian servers using Prosody 0.11.8 with a few thousand users for the past 4 years. The clients use a fork of Conversations. Everything works well except for the past few months an issue started occurring randomly every few weeks, where prosody CPU usage hits 100% and all the users timeout, losing their connection. This can last a few minutes to over an hour. Only 1 server is affected at a time.

I tracked the problem down to archive queries which are blocking prosody and eating up CPU usage. This occurs on queries for both groups and users. Strangely, they seem to involve users or groups on the other server.

From example1.com:
Mar 27 22:11:57 example1.com:mam    debug   Archive query, id 26p5u5m6sb with foo...@group.example2.com from the dawn of time until now)

From example2.com:
Apr 12 20:03:27 example2.com:mam    debug   Archive query, id tvoml8re7k with jo...@example1.com from 1680912929 until 1680912927)

I have tried a few fixes, none of which have worked:
- archive_expires_after = "1w"
- max_archive_query_results = 20
- Tuning the gc to use threshold = 200
- Restarting the server while the issue is happening

Eventually the issue resolves itself and connectivity is restored. Any idea on how to fix this or what's causing it ? Is this related to https://issues.prosody.im/1523 ? Would upgrading to Prosody 0.12.3 resolve this ? 

Cheers
Reply all
Reply to author
Forward
0 new messages