Hi,
Based on the stacktrace, it looks like it might be a systemd watchdog timeout. If that is the case, the output of journalctl -u maxscale should contain a message about a watchdog timeout. The stacktrace itself doesn't show anything that would immediately tell me it's a bug, which makes me think it might be related to the virtualized environment somehow stalling things enough to trigger it. You can test this theory by editing the /lib/systemd/system/maxscale.service file and increasing the timeout from WatchdogSec=60s to WatchdogSec=120s; this should reduce the occurrence of these problems.
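For example, after the edit the relevant part of the [Service] section would look roughly like this (a systemctl daemon-reload followed by a restart of MaxScale is needed for the change to take effect):

[Service]
WatchdogSec=120s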
Another thing to pay attention to is whether there are any errors in the applications that connect through MaxScale. If they seem to end up timing out at roughly the same time as MaxScale does, it might indicate that a thread in MaxScale somehow stalls and is not responding. The stacktrace does not suggest that this is the case, but it's possible that some pattern of events results in an infinite loop. If the applications that use MaxScale are not timing out when this problem occurs, it would support the first theory where the VM is "too slow" to respond to the systemd watchdog in time.
If you can, please file a bug report on the MariaDB Jira under the MaxScale project. Regardless of the cause, we should be able to prevent this even in virtualized environments.
Markus
--
Markus Mäkelä, Senior Software Engineer
MariaDB Corporation
Hi,
The second KILL/9 signal most likely has a corresponding kernel message (you'll see it if you drop the -u maxscale part) about the system running out of memory and killing the MaxScale process. There is an open bug report about MaxScale running out of memory (MXS-4582) and I believe it is under investigation. If you can confirm that this is indeed an OOM situation, you could leave a comment on that bug report stating that you're also facing a similar problem. It would also be helpful if you could copy there the kernel messages about how much memory the MaxScale process is using.
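To check for that, something along these lines should show the relevant kernel messages (just a sketch; adjust the patterns as needed):

journalctl -k | grep -i -E 'out of memory|oom|killed process'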
You can reduce the amount of memory MaxScale uses by disabling the query classifier cache but, given that it uses only 15% of available memory by default, it should not cause this problem. To lower it from the ~500MB it's using now to 1MB, you can add query_classifier_cache_size=1M under the [maxscale] section. If this is a memory leak of some sort, it is unlikely that this will solve the problem but it's worth trying out.
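That is, the relevant part of the MaxScale configuration (usually /etc/maxscale.cnf) would look roughly like this:

[maxscale]
query_classifier_cache_size=1M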
If you can monitor the memory usage of MaxScale and see if it keeps growing, it would be supporting evidence that a memory leak is what's causing the watchdog timeout. If the system ends up running out of memory and has to resort to swap, MaxScale might be so slow that it can't respond to the watchdog in time. This could explain why this is happening and why it seems to happen on a regular basis.
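One simple way to collect that data is something like the following (just a sketch; adjust the interval and the output path to whatever suits you):

# Log the resident set size of the maxscale process (in kilobytes) once a minute
while true; do
  echo "$(date -Is) $(ps -o rss= -C maxscale)" >> /tmp/maxscale-rss.log
  sleep 60
done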
Markus
Hi,
If you can, I'd highly recommend registering on the MariaDB Jira and leaving a comment. It's a far more convenient place to store logs, configs and such compared to this mailing list. It'll also help anyone else that runs into this problem if the stacktrace is shared there and they have a matching one.
If the problem happens constantly (it seems like it does), you could try using jemalloc to profile the heap. Here's a Red Hat article on how to configure it on their system but, based on the log messages, you're using Ubuntu, so you'll probably have to adapt it a bit. I think adding the following to the [Service] section of the /lib/systemd/system/maxscale.service file should make it work after installing the libjemalloc2 package:
Environment=MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19,prof_final:true,stats_print:true,prof_prefix:/var/log/maxscale/jeprof
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
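After adding those two lines, systemd needs to re-read the unit file and MaxScale needs to be restarted, e.g.:

systemctl daemon-reload
systemctl restart maxscale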
This should generate a file in /var/log/maxscale/ that contains the heap profiling information when MaxScale is stopped. The file shows whether any memory is leaked on shutdown and, if it is, where the leak is located. This would greatly help us speed up fixing any problems that we aren't able to reproduce.
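Once you have that file, it can be inspected with the jeprof tool that ships with jemalloc (my assumption is that on Ubuntu it comes with the libjemalloc-dev package), roughly like this:

jeprof --text $(command -v maxscale) /var/log/maxscale/jeprof.*.heap

The text output should list the call sites that still held memory when the process exited.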
Markus