After monitoring this for a further week, the issue hasn't been fixed.
The extra node did seem to make a difference initially, but unfortunately the problem still persists.
I've also upgraded the disks used by all 4 Elasticsearch nodes from standard HDDs to SSDs, in case that was the bottleneck, but I haven't seen any immediate improvement in the few hours since the upgrade.
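In case storage is still part of the problem despite the SSD upgrade, I'm going to keep an eye on disk I/O on the Elasticsearch nodes during busy hours; a rough sketch of what I'll run, assuming the sysstat package is installed:
# extended disk stats every 5 seconds; consistently high %util or await during busy hours would point back at storage
iostat -x 5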
Looking at the cluster in Stack Monitoring in Kibana, CPU usage isn't maxed out on any of the nodes: it ranges from 30-95% but never hits 100%.
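For reference, I've been cross-checking what Kibana shows against the cat nodes API; a minimal sketch, with localhost:9200 standing in for the actual host and credentials:
# per-node CPU, 1-minute load and heap usage straight from the cluster
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,cpu,load_1m,heap.percent'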
I also noticed that on the Indices tab in Monitoring, sorting by "Index Rate" shows some indices indexing at nearly 300/s, whereas the Wazuh index sits at around 30-100/s at any one time, which seems very low.
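To double-check that rate outside Kibana, I'm sampling the indexing stats for the Wazuh indices twice and diffing index_total; again just a sketch, assuming the usual wazuh-alerts-* index pattern:
# take two samples 60s apart; (second - first) / 60 gives the indexing rate per second
curl -s 'http://localhost:9200/wazuh-alerts-*/_stats/indexing?filter_path=_all.primaries.indexing.index_total'
sleep 60
curl -s 'http://localhost:9200/wazuh-alerts-*/_stats/indexing?filter_path=_all.primaries.indexing.index_total'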
Looking at the Filebeat logs on each of the Wazuh nodes, if I'm reading them correctly, each node is only shipping around 200 events per 30 seconds, which is far lower than the volume of logs coming into the alerts.json file:
Mar 14 15:41:24 waz-01 filebeat[27698]: 2022-03-14T15:41:24.973Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1190,"time":{"ms":8}},"total":{"ticks":4400,"time":{"ms":23},"value":4400},"user":{"ticks":3210,"time":{"ms":15}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":15},"info":{"ephemeral_id":"****","uptime":{"ms":2430050},"version":"7.14.2"},"memstats":{"gc_next":56313840,"memory_alloc":37211368,"memory_total":818280384,"rss":132743168},"runtime":{"goroutines":47}},"filebeat":{"events":{"added":350,"done":350},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":200,"active":200,"batches":4,"total":200},"read":{"bytes":45718},"write":{"bytes":561300}},"pipeline":{"clients":2,"events":{"active":4117,"published":350,"total":350},"queue":{"acked":350}}},"registrar":{"states":{"current":2,"update":350},"writes":{"success":4,"total":4}},"system":{"load":{"1":0.1,"15":0.21,"5":0.19,"norm":{"1":0.025,"15":0.0525,"5":0.0475}}}}}}
Mar 14 15:41:54 -waz-01 filebeat[27698]: 2022-03-14T15:41:54.972Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1200,"time":{"ms":12}},"total":{"ticks":4420,"time":{"ms":22},"value":4420},"user":{"ticks":3220,"time":{"ms":10}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":15},"info":{"ephemeral_id":"****","uptime":{"ms":2460053},"version":"7.14.2"},"memstats":{"gc_next":56313840,"memory_alloc":41090448,"memory_total":822159464,"rss":132743168},"runtime":{"goroutines":47}},"filebeat":{"events":{"added":200,"done":200},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":200,"active":200,"batches":4,"total":200},"read":{"bytes":45719},"write":{"bytes":586748}},"pipeline":{"clients":2,"events":{"active":4117,"published":200,"total":200},"queue":{"acked":200}}},"registrar":{"states":{"current":2,"update":200},"writes":{"success":1,"total":1}},"system":{"load":{"1":0.2,"15":0.22,"5":0.21,"norm":{"1":0.05,"15":0.055,"5":0.0525}}}}}}
Mar 14 15:42:24 pro-waz-01 filebeat[27698]: 2022-03-14T15:42:24.975Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1210,"time":{"ms":5}},"total":{"ticks":4450,"time":{"ms":30},"value":4450},"user":{"ticks":3240,"time":{"ms":25}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":15},"info":{"ephemeral_id":"****","uptime":{"ms":2490050},"version":"7.14.2"},"memstats":{"gc_next":56313840,"memory_alloc":44361472,"memory_total":825430488,"rss":132743168},"runtime":{"goroutines":47}},"filebeat":{"events":{"added":150,"done":150},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":150,"active":200,"batches":3,"total":150},"read":{"bytes":34292},"write":{"bytes":423821}},"pipeline":{"clients":2,"events":{"active":4117,"published":150,"total":150},"queue":{"acked":150}}},"registrar":{"states":{"current":2,"update":150},"writes":{"success":3,"total":3}},"system":{"load":{"1":0.33,"15":0.23,"5":0.23,"norm":{"1":0.0825,"15":0.0575,"5":0.0575}}}}}}
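For what it's worth, rather than eyeballing each line, this is roughly how I'm pulling the acked counts out of the journal (assuming Filebeat runs under systemd and jq is installed):
# one number per 30s monitoring line: how many events the output actually had acked by Elasticsearch
journalctl -u filebeat --since "1 hour ago" | grep -o '{"monitoring.*' | jq '.monitoring.metrics.libbeat.output.events.acked // 0'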
What also seems odd is that when things are quieter (e.g. out of hours), Filebeat keeps up with the incoming events and the same monitoring lines show figures in the thousands per 30 seconds, e.g.:
Mar 09 04:27:57 waz-01 filebeat[5491]: 2022-03-09T04:27:57.431Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":103990,"time":{"ms":180}},"total":{"ticks":543800,"time":{"ms":813},"value":543800},"user":{"ticks":439810,"time":{"ms":633}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":16},"info":{"ephemeral_id":"*****","uptime":{"ms":45060049},"version":"7.14.2"},"memstats":{"gc_next":72352496,"memory_alloc":58068936,"memory_total":116420989176,"rss":148754432},"runtime":{"goroutines":52}},"filebeat":{"events":{"active":1,"added":8443,"done":8442},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":8442,"active":200,"batches":169,"total":8442},"read":{"bytes":1928969},"write":{"bytes":30186763}},"pipeline":{"clients":2,"events":{"active":4117,"published":8442,"total":8442},"queue":{"acked":8442}}},"registrar":{"states":{"current":4,"update":8442},"writes":{"success":115,"total":115}},"system":{"load":{"1":0.33,"15":0.39,"5":0.46,"norm":{"1":0.0825,"15":0.0975,"5":0.115}}}}}}
Mar 09 04:28:27 waz-01 filebeat[5491]: 2022-03-09T04:28:27.431Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":104240,"time":{"ms":251}},"total":{"ticks":544930,"time":{"ms":1131},"value":544930},"user":{"ticks":440690,"time":{"ms":880}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":16},"info":{"ephemeral_id":"*****","uptime":{"ms":45090049},"version":"7.14.2"},"memstats":{"gc_next":76545680,"memory_alloc":53250976,"memory_total":116666359584,"rss":149356544},"runtime":{"goroutines":52}},"filebeat":{"events":{"active":-1,"added":12037,"done":12038},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":11938,"active":200,"batches":239,"total":11938},"read":{"bytes":2727597},"write":{"bytes":41486590}},"pipeline":{"clients":2,"events":{"active":4117,"published":12038,"total":12038},"queue":{"acked":12038}}},"registrar":{"states":{"current":4,"update":12038},"writes":{"success":152,"total":152}},"system":{"load":{"1":0.28,"15":0.38,"5":0.43,"norm":{"1":0.07,"15":0.095,"5":0.1075}}}}}}
Mar 09 04:28:57 waz-01 filebeat[5491]: 2022-03-09T04:28:57.431Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":104580,"time":{"ms":342}},"total":{"ticks":546410,"time":{"ms":1475},"value":546410},"user":{"ticks":441830,"time":{"ms":1133}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":16},"info":{"ephemeral_id":"*****","uptime":{"ms":45120050},"version":"7.14.2"},"memstats":{"gc_next":62474240,"memory_alloc":41939504,"memory_total":116981560640,"rss":151396352},"runtime":{"goroutines":52}},"filebeat":{"events":{"added":14986,"done":14986},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":14986,"active":200,"batches":300,"total":14986},"read":{"bytes":3423914},"write":{"bytes":54949313}},"pipeline":{"clients":2,"events":{"active":4117,"published":14986,"total":14986},"queue":{"acked":14986}}},"registrar":{"states":{"current":4,"update":14986},"writes":{"success":190,"total":190}},"system":{"load":{"1":0.34,"15":0.39,"5":0.44,"norm":{"1":0.085,"15":0.0975,"5":0.11}}}}}}
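In case Elasticsearch is pushing back during busy hours, I'm also keeping an eye on the write thread pool on each node; a quick check along these lines (localhost:9200 again a placeholder):
# a growing queue or non-zero rejected here would suggest Elasticsearch can't keep up with the bulk requests from Filebeat
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed'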
It seems as if Filebeat is capping the number of events it sends to Elasticsearch? The busy-hour numbers are also suspiciously round (200 events acked across 4 batches, i.e. 50-event bulks), compared to the irregular counts when it's keeping up.
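To rule out the output settings, I'm checking whether worker or bulk_max_size are set anywhere in filebeat.yml; as far as I can tell the Elasticsearch output defaults are worker: 1 and bulk_max_size: 50, which would line up with the 200 events / 4 batches above (/etc/filebeat/filebeat.yml is the path on my nodes):
# nothing printed means neither option is set, i.e. the defaults are in effect
grep -E 'worker|bulk_max_size' /etc/filebeat/filebeat.yml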
There are no warnings or errors on any of the Elasticsearch nodes apart from this one, which always occurs on the master node, though I'm not sure it's related:
[2022-03-14T16:36:26,393][WARN ][r.suppressed ] [ela-02] path: /_xpack/usage, params: {}
java.util.concurrent.CancellationException: Task cancelled
at org.elasticsearch.xpack.core.action.TransportXPackUsageAction.lambda$masterOperation$2(TransportXPackUsageAction.java:66) ~[x-pack-core-7.14.2.jar:7.14.2]
at org.elasticsearch.xpack.core.common.IteratingActionListener.onResponse(IteratingActionListener.java:121) [x-pack-core-7.14.2.jar:7.14.2]
at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) [elasticsearch-7.14.2.jar:7.14.2]
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) [elasticsearch-7.14.2.jar:7.14.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.14.2.jar:7.14.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.2.jar:7.14.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
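To see what the nodes are actually busy with while ingest is lagging, I'm also planning to grab hot threads while it's happening; just a sketch:
# dump the 5 busiest threads on each node while indexing is slow
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'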
I'm running out of ideas, and this is starting to become a real problem for us.
Does anyone have any suggestions on where to look to try to resolve this?
Thanks in advance,
Gen.