MaxScale 23.02.1 crash


Eoin Kim

Apr 17, 2023, 10:47:28 PM
to MaxScale
Hi Team,

I'm wondering if I could get some help regarding an issue I'm having.

MaxScale is running on a VM and it crashes from time to time. I've enabled gdb-stacktrace and managed to catch some messages, although they're cryptic to me.

I have extracted the relevant part of the log file and attached it here. I've also attached the current configuration file. The issue happens on an older version as well - our production system is running 6.2.1 and hasn't been upgraded yet.

If this crash is related to misconfiguration, I'd like to get some advice on how to correct it. If further information is required, please let me know and I'll be happy to provide it.

Hope you can help. Thank you.

Eoin
maxscale.log
maxscale.cnf

Markus Mäkelä

Apr 18, 2023, 12:58:37 AM
to maxs...@googlegroups.com

Hi,

Based on the stacktrace, it looks like it might be a systemd watchdog timeout. If this is the case, the output of journalctl -u maxscale should have a message about a watchdog timeout. The stacktrace itself doesn't show anything that would immediately tell me it's a bug, which makes me think it might be related to the virtualized environment somehow stalling things enough to trigger it. You can test this theory by editing the /lib/systemd/system/maxscale.service file and increasing the timeout from WatchdogSec=60s to WatchdogSec=120s; this should reduce the occurrence of these problems.
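For reference, a minimal sketch of that change (assuming the packaged unit file; an override created with systemctl edit maxscale works just as well):

[Service]
WatchdogSec=120s

$ sudo systemctl daemon-reload
$ sudo systemctl restart maxscale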

Another thing to pay attention to is whether there are any errors in the applications that connect through MaxScale. If they seem to end up timing out at roughly the same time as MaxScale does, it might indicate that a thread in MaxScale somehow stalls and stops responding. The stacktrace does not suggest that this is the case, but it's possible that some pattern of events results in an infinite loop. If the applications that use MaxScale are not timing out when this problem occurs, it would support the first theory where the VM is "too slow" to respond in time to the systemd watchdog.

If you can, please file a bug report on the MariaDB Jira under the MaxScale project. Regardless of the cause, we should be able to prevent this even in virtualized environments.

Markus

-- 
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation

Eoin Kim

Apr 18, 2023, 1:33:57 AM
to MaxScale
Hi Markus,

Thanks for your prompt answer, much appreciated. You're correct: I had a look at the journals and yes, I could see a watchdog timeout. There was another restart this afternoon, but with a different signal. I guess the system ran out of memory (the VM has 2 CPUs and only 4 GB RAM)? Is there a way to control how much memory MaxScale can use?

$ sudo journalctl -u maxscale --since '2023-04-18 05:00:00'
-- Logs begin at Mon 2022-02-07 11:26:34 AEST, end at Tue 2023-04-18 15:18:42 AEST. --
Apr 18 05:42:34 me1mspp systemd[1]: maxscale.service: Watchdog timeout (limit 1min)!
Apr 18 05:42:34 me1mspp systemd[1]: maxscale.service: Killing process 76329 (maxscale) with signal SIGABRT.
Apr 18 05:43:53 me1mspp systemd[1]: maxscale.service: Main process exited, code=dumped, status=6/ABRT
Apr 18 05:43:53 me1mspp systemd[1]: maxscale.service: Failed with result 'watchdog'.
Apr 18 05:43:58 me1mspp systemd[1]: maxscale.service: Scheduled restart job, restart counter is at 2.
Apr 18 05:43:58 me1mspp systemd[1]: Stopped MariaDB MaxScale Database Proxy.
Apr 18 05:43:58 me1mspp systemd[1]: Starting MariaDB MaxScale Database Proxy...
Apr 18 05:43:59 me1mspp maxscale[77659]: /etc/maxscale.cnf.d does not exist, not reading.
Apr 18 05:43:59 me1mspp maxscale[77659]: Module 'mariadbmon' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libmariadbmon.so'.
Apr 18 05:43:59 me1mspp maxscale[77659]: Module 'readwritesplit' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so'.
Apr 18 05:43:59 me1mspp maxscale[77659]: Using up to 586.92MiB of memory for query classifier cache
Apr 18 05:43:59 me1mspp systemd[1]: Started MariaDB MaxScale Database Proxy.
Apr 18 14:15:46 me1mspp systemd[1]: maxscale.service: Main process exited, code=killed, status=9/KILL
Apr 18 14:15:46 me1mspp systemd[1]: maxscale.service: Failed with result 'signal'.
Apr 18 14:15:52 me1mspp systemd[1]: maxscale.service: Scheduled restart job, restart counter is at 3.
Apr 18 14:15:52 me1mspp systemd[1]: Stopped MariaDB MaxScale Database Proxy.
Apr 18 14:15:52 me1mspp systemd[1]: Starting MariaDB MaxScale Database Proxy...
Apr 18 14:15:53 me1mspp maxscale[84949]: /etc/maxscale.cnf.d does not exist, not reading.
Apr 18 14:15:53 me1mspp maxscale[84949]: Module 'mariadbmon' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libmariadbmon.so'.
Apr 18 14:15:53 me1mspp maxscale[84949]: Module 'readwritesplit' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so'.
Apr 18 14:15:53 me1mspp maxscale[84949]: Using up to 586.92MiB of memory for query classifier cache
Apr 18 14:15:53 me1mspp systemd[1]: Started MariaDB MaxScale Database Proxy.


Regarding your additional comment about errors in the applications that connect through MaxScale: as far as I've heard from our dev team, the applications connecting to the databases via MaxScale don't close their connections themselves. Closing idle sessions/connections is handled by MaxScale (I've set connection_timeout). I also checked the application logs and there were no errors or signs of database connection issues. Do you think this setup could be contributing to the issue as well? Can the problem be mitigated by limiting the maximum number of threads, and is there a configuration option for that in MaxScale? Or would having our applications close their connections fix this?
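For context, the relevant part of my maxscale.cnf looks roughly like this (the service name and timeout value here are illustrative rather than the exact production values; the full file is attached above):

[RW-Split-Service]
type=service
router=readwritesplit
connection_timeout=30m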

Thanks again.

Eoin

Markus Mäkelä

Apr 18, 2023, 1:49:45 AM
to maxs...@googlegroups.com

Hi,

The second KILL/9 signal most likely has a corresponding kernel message (you'll see it if you drop the -u maxscale part) about the system running out of memory and killing the MaxScale process. There is an open bug report about MaxScale running out of memory (MXS-4582) and I believe it is under investigation. If you can confirm that this is indeed an OOM situation, you could leave a comment on that bug report stating that you're also facing a similar problem. It would also be helpful if you could copy the kernel messages showing how much memory the MaxScale process was using there as well.
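For example, something along these lines should surface the kernel's OOM killer messages (the grep patterns are just an illustration; the exact wording of the kernel messages can vary):

$ sudo journalctl -k --since '2023-04-18 14:00:00' | grep -i -E 'out of memory|oom'
$ sudo dmesg -T | grep -i -E 'killed process|oom'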

You can reduce the amount of memory MaxScale uses by disabling the query classifier cache, but given that it uses only 15% of available memory by default, it should not cause this problem. To lower it from the ~500 MB it's using now to 1 MB, you can add query_classifier_cache_size=1M under the [maxscale] section. If this is some sort of memory leak, it is unlikely that this will solve the problem, but it's worth trying out.
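In other words, something like this in maxscale.cnf, with the rest of the [maxscale] section left as it is:

[maxscale]
query_classifier_cache_size=1M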

If you can monitor the memory usage of MaxScale and it keeps growing, that would be supporting evidence that a memory leak is causing the watchdog timeout. If the system ends up running out of memory and has to resort to swap, MaxScale might become so slow that it can't respond to the watchdog in time. This would explain both why this is happening and why it seems to happen on a regular basis.
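As a rough way to track it, you could log the resident set size of the process periodically, for example (just a sketch; any monitoring tool you already have works equally well):

$ while true; do ps -o rss=,vsz= -C maxscale | awk -v d="$(date -Is)" '{print d, $1, $2}' >> /tmp/maxscale-mem.log; sleep 60; done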

Markus

Eoin Kim

Apr 18, 2023, 2:04:16 AM
to MaxScale
Hi Markus,

Thanks a lot, you gave me a hint - the kernel messages. The dmesg output shows the OOM. It looks like I can't post anything to the Jira issue you linked, so I'll just attach the dmesg log here.

Your comment exactly describes what's happening on our system. The memory usage of MaxScale keeps growing, eventually swap gets used, and then MaxScale gets killed one night (or day). We've configured a restart procedure for these sorts of events, so we're somewhat covered. But it looks like there is a memory leak, as you said. If that's the case, I guess there's nothing I can do for a while?

Eoin
dmesg.txt

Markus Mäkelä

Apr 18, 2023, 2:33:14 AM
to maxs...@googlegroups.com

Hi,

If you can, I'd highly recommend registering on the MariaDB Jira and leaving a comment. It's a far more convenient place to store logs, configs and other files than this mailing list. It will also help anyone else who runs into this problem if the stacktrace is shared there and they have a matching one.

If the problem happens constantly (and it seems like it does), you could try using jemalloc to profile the heap. Here's a Red Hat article on how to configure it on their systems, but based on the log messages you're using Ubuntu. You'll probably have to adapt it a bit, but I think adding the following to the [Service] section of the /lib/systemd/system/maxscale.service file should make it work after installing the libjemalloc2 package:

Environment=MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19,prof_final:true,stats_print:true,prof_prefix:/var/log/maxscale/jeprof
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

This should generate a file in /var/log/maxscale/ that contains the heap profiling information when MaxScale is stopped. The file shows whether any memory leaks on shutdown and, if it does, where the leak is located. This would greatly help us speed up fixing any problems that we aren't able to reproduce ourselves.
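Assuming a stock Ubuntu setup, applying this would look roughly like the following (the exact name of the generated profile file under /var/log/maxscale/ depends on the jemalloc version):

$ sudo apt install libjemalloc2
  (add the two Environment= lines to the [Service] section of the unit file)
$ sudo systemctl daemon-reload
$ sudo systemctl restart maxscale
  (later, when you want to collect the profile)
$ sudo systemctl stop maxscale
$ ls /var/log/maxscale/jeprof*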

Markus

Eoin Kim

Apr 18, 2023, 3:37:33 AM
to MaxScale
Hi Markus,

Thanks for the information. Okay, I'll sign up for the Jira and post the whole conversation there. I'll also implement your suggestion on our system and see how it goes.

To get another crash, I may need to wait for a few days. I'll keep you posted. Thank you.

Eoin