Wazuh stops processing events each day


Gen

Feb 24, 2022, 5:29:51 AM
to Wazuh mailing list
Hi team,

Hoping to get some assistance with an issue that started last week.

Last Friday we added around 100 new agents to the cluster (2 nodes), which appears to have started generating a lot more events.

Typically, in the hour between 00:00 and 01:00 each day we get around 1.5m events. Since deploying these new agents this has gone up to around 2.4m, and you can see from the screenshots that the cluster stops processing new events after a couple of hours each morning.

wazuh-manager is running on both nodes and I couldn't see any relevant errors in the /var/ossec/logs/ossec.log files.
Filebeat is also reporting as running fine.

Also, as you can see from the third screenshot, the ossec-alerts.json file is being populated and has current data; the events just aren't making their way into the wazuh index.

Once I did an ossec-control restart, it started processing events again from the point where it dropped off today.

Is the cluster overloaded? Where can I look to determine why it stops processing events?

Thanks in advance,
Gen.
Attachments: Capture.PNG (3 screenshots)

John Soliani

Feb 24, 2022, 1:33:22 PM
to Wazuh mailing list

Hello Gen,

Thank you for using our community!

You did a great job detailing the issue, and you've also done some troubleshooting. Well done!

The most important thing here is checking that the managers are still generating alerts; this would mean that the agents are sending data to them and that the managers are active, processing events and generating alerts. You can check this by executing this command on each manager to verify the date and time of the last alert the manager has generated:

  • tail -1 /var/ossec/logs/alerts/alerts.json | grep timestamp

In your case this looks OK, so we need to check the database (Elasticsearch) state and the Filebeat service (the parser). First, Elasticsearch:

  • systemctl status elasticsearch || service elasticsearch status
  • journalctl -u elasticsearch --no-pager | grep -E "ERRO|WARN|Fail" # It’s possible that you reached the maximum shards limit 1000/1000 (a quick check for this is shown right after this list).
  • curl -k -u <USER>:<PASSWORD> -XGET https://<ELASTICSEARCH>:9200/_cluster/health?pretty=true
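Regarding the shard limit, a quick way to check it would be something like this (user, password and host are placeholders, and 1000 per data node is the default cluster.max_shards_per_node unless you have changed it):

  # Count open shards across the cluster (primaries + replicas)
  curl -sk -u <USER>:<PASSWORD> "https://<ELASTICSEARCH>:9200/_cat/shards?h=index" | wc -l
  # Show the effective max_shards_per_node value; the cluster-wide ceiling is
  # this number multiplied by the number of data nodes
  curl -sk -u <USER>:<PASSWORD> "https://<ELASTICSEARCH>:9200/_cluster/settings?include_defaults=true&filter_path=**.max_shards_per_node"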

Let’s also check Filebeat logs:

  • systemctl status filebeat || service filebeat status
  • journalctl -u filebeat --no-pager | grep -E "ERRO|WARN" | tail
  • filebeat test output

The possible reasons I can think of are:

  • You could have reached the max limit in shards per node in Elasticsearch.
  • Elasticsearch turned into a read-only state to protect itself after detecting more than 95% of the disk/partition used.
  • Filebeat can’t load data into Elasticsearch.

In the first two situations you will need to clean the DB and probably set retention policies so that the DB and the managers always have enough resources (disk space in this case). The third situation will depend on the problem; we need more data.
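
For the second case (the read-only protection), a rough way to check whether any index has the flood-stage block set, and to clear it once disk space has been freed, is something like this (again, credentials and host are placeholders):

  # List indices that currently carry the read-only block
  curl -sk -u <USER>:<PASSWORD> "https://<ELASTICSEARCH>:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete&pretty"
  # After freeing disk space (or deleting old indices), remove the block
  curl -sk -u <USER>:<PASSWORD> -XPUT "https://<ELASTICSEARCH>:9200/_all/_settings" \
    -H 'Content-Type: application/json' \
    -d '{"index.blocks.read_only_allow_delete": null}'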

Let me know what you’ve found.

Regards,
John.-


Gen

Feb 25, 2022, 3:29:12 AM
to Wazuh mailing list
Hi John,

Thanks a lot for the support and the troubleshooting steps!

The managers are indeed still generating alerts.

"tail -1 /var/ossec/logs/alerts/alerts.json | grep timestamp" - displays events from the current time on both wazuh servers.
"systemctl status elasticsearch" - Active: active (running) on all 3 elastic servers
"journalctl -u elasticsearch --no-pager | grep -E "ERRO|WARN|Fail"" - only 1 log line from 20th of January which is unrelated on all 3 nodes. "_cluster/health?pretty=true":

{
  "cluster_name" : "elastic-cluster-01",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 870,
  "active_shards" : 1740,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
"systemctl status filebeat" - Active: active (running) on both wazuh servers "journalctl -u filebeat --no-pager | grep -E "ERRO|WARN" | tail" - There are a few errors on both wazuh servers but not sure that they're related due to the timestamps and when it seems the events stopped being processed:
Capture.PNG

Here are those errors:

Feb 24 16:29:47 waz-02 filebeat[909]: 2022-02-24T16:29:47.724Z        ERROR        [elasticsearch]        elasticsearch/client.go:221        failed to perform any bulk index operations: Post "https://ela-01:9200/_bulk": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 24 16:29:49 waz-02 filebeat[909]: 2022-02-24T16:29:49.404Z        ERROR        [publisher_pipeline_output]        pipeline/output.go:180        failed to publish events: Post "https://ela-01:9200/_bulk": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
"filebeat test output" - All "OK" except 1 Warning: "TLS... security... WARN server's certificate chain verification is disabled" on both wazuh servers

Apologies for the formatting - I couldn't work out how to use the code formatting.

It's also probably useful to know that we have data coming into the elastic cluster from various other sources, but it seems only Wazuh is having this problem. Here's an example of events from another Filebeat instance (screenshot attached: Capture.PNG).

Kind Regards,
Gen.

Gen

Feb 28, 2022, 7:08:25 AM
to Wazuh mailing list
I've realised we have the elastic logs going into separate files in the /var/logs/ folder. Looking at these log files, there are a few WARN messages, though I'm not sure whether they're related. I've attached the logs from one of the nodes for yesterday.

Thanks,
Gen.
ela-01 logs.txt

Gen

Mar 2, 2022, 3:32:25 AM
to Wazuh mailing list
Any ideas?

I've now deployed a second worker node hoping that this would help, but I'm still seeing the same sort of pattern - it possibly holds out a little longer, but it still drops off at some point in the morning, and even restarting the Wazuh service only helps slightly, as the rate of processing logs from the alerts.json file into Elastic is very slow. In the screenshot you can see that I deployed the second worker node after 3pm yesterday (screenshot attached: Capture.PNG).

John Soliani

Mar 2, 2022, 4:50:17 PM
to Wazuh mailing list

Hi Gen,

First, keep in mind this is a PUBLIC post (remove sensitive data from logs).
I had to do some research with the data you provided, and it looks like these problems are related to insufficient resources in the Elasticsearch cluster (mostly CPU).

If your Elasticsearch servers are virtual you could try adding more CPU cores, but a better solution would be to add another node to the cluster to cover the number of requests per second you are reaching during heavy-load hours. We can see that after midnight Filebeat is able to load data and doesn’t drop events because of timeouts.

I would check resources on the Elastic servers between 10 am and 4 pm, especially CPU usage. The most common tool for this is top: pressing the I key disables Irix mode so you can see total values in the CPU column, and pressing the P key sorts the list by the most CPU-consuming processes. Also check RAM usage. Setting the JVM heap memory on each Elastic node (the /etc/elasticsearch/jvm.options file) to up to half of the total RAM installed in the system (and no higher than 32 GB) will also help.
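
As a rough example only (the 16 GB figure is illustrative, for a node with 32 GB of RAM), on recent 7.x versions you can also set the heap with a small file under /etc/elasticsearch/jvm.options.d/ instead of editing jvm.options itself, and then restart the service:

  # /etc/elasticsearch/jvm.options.d/heap.options  (example values - adjust to your RAM)
  -Xms16g
  -Xmx16g

  systemctl restart elasticsearch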

Let me know if this helped.
John.-

Gen

Mar 6, 2022, 4:20:14 AM
to Wazuh mailing list
Hi John,

I believe I've redacted any sensitive information in the logs - did you spot anything that I haven't removed?

Thanks a lot for your help! You're right, the CPU load was 90%+ on all 3 nodes. I've deployed a new Elastic node and my problem has been fixed!

Now I have a question regarding indexing the Wazuh logs that have been missed since the issue started, from the /var/ossec/logs/alerts/2022/xxx/ossec-alerts-**.json.gz files. What's the best way to get the missing data from these files into Elastic without duplicating what's already there?

I can probably remove the indexes from Elastic for those days and have Filebeat process the whole file from the beginning, but I'm wondering if there's a better way?
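
If I do end up removing the indexes first, I assume it would look roughly like this, based on the default wazuh-alerts-4.x-YYYY.MM.DD index naming (the date below is just an example):

  # Confirm the exact index names first
  curl -sk -u <USER>:<PASSWORD> "https://<ELASTICSEARCH>:9200/_cat/indices/wazuh-alerts-*?v"
  # Delete a single day's index (irreversible - make sure the alerts backup covers that day)
  curl -sk -u <USER>:<PASSWORD> -XDELETE "https://<ELASTICSEARCH>:9200/wazuh-alerts-4.x-2022.02.25"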

Thanks!
Gen.

Gen

Mar 6, 2022, 5:45:25 AM
to Wazuh mailing list
I've found this blog article which I'm attempting to follow now: https://wazuh.com/blog/recover-your-data-using-wazuh-alert-backups/

However, it seems to be taking a long time to run the script on each log file, and I have 15 days' worth of missing data to index. Is it normal for it to take this long? (It has been stuck running the script for about 20 minutes now on a 2GB .gz file.)
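
(For reference, a rough way to check whether the reindexing is actually progressing is to watch the document count for that day grow; the index name below is assumed from the default wazuh-alerts-4.x-* pattern.)

  curl -sk -u <USER>:<PASSWORD> "https://<ELASTICSEARCH>:9200/_cat/count/wazuh-alerts-4.x-2022.02.25?v"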

Is this script still valid for wazuh 4.2.5?

Thanks!
Gen.

Gen

Mar 14, 2022, 12:44:03 PM
to Wazuh mailing list
After monitoring this for a further week, it doesn't seem that my issue has been fixed.

The extra node did seem to initially make a difference but the issue still persists unfortunately.

I've also upgraded the disks used by all 4 elastic nodes from standard HDDs to SSDs, as I thought that might be the issue, but I haven't seen any immediate improvement since the upgrade a few hours ago.

Looking at the cluster in Stack Monitoring in Kibana, the CPU usage isn't overly high on any of the nodes - it ranges from 30-95% but never hits 100%.

I also noticed that in the Indices tab in Monitoring, if I sort by "Index Rate", some indexes have a rate of nearly 300/s whereas the wazuh index ranges from 30-100/s at any one time, which seems very low.

Looking at the Filebeat logs on each of the Wazuh nodes, if I'm reading them correctly, it seems that each node is only processing around 200 events per 30 seconds, which is far lower than the number of logs coming into the alerts.json file:

Mar 14 15:41:24 waz-01 filebeat[27698]: 2022-03-14T15:41:24.973Z        INFO        [monitoring]        log/log.go:145        Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1190,"time":{"ms":8}},"total":{"ticks":4400,"time":{"ms":23},"value":4400},"user":{"ticks":3210,"time":{"ms":15}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":15},"info":{"ephemeral_id":"****","uptime":{"ms":2430050},"version":"7.14.2"},"memstats":{"gc_next":56313840,"memory_alloc":37211368,"memory_total":818280384,"rss":132743168},"runtime":{"goroutines":47}},"filebeat":{"events":{"added":350,"done":350},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":200,"active":200,"batches":4,"total":200},"read":{"bytes":45718},"write":{"bytes":561300}},"pipeline":{"clients":2,"events":{"active":4117,"published":350,"total":350},"queue":{"acked":350}}},"registrar":{"states":{"current":2,"update":350},"writes":{"success":4,"total":4}},"system":{"load":{"1":0.1,"15":0.21,"5":0.19,"norm":{"1":0.025,"15":0.0525,"5":0.0475}}}}}}
Mar 14 15:41:54 -waz-01 filebeat[27698]: 2022-03-14T15:41:54.972Z        INFO        [monitoring]        log/log.go:145        Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1200,"time":{"ms":12}},"total":{"ticks":4420,"time":{"ms":22},"value":4420},"user":{"ticks":3220,"time":{"ms":10}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":15},"info":{"ephemeral_id":"****","uptime":{"ms":2460053},"version":"7.14.2"},"memstats":{"gc_next":56313840,"memory_alloc":41090448,"memory_total":822159464,"rss":132743168},"runtime":{"goroutines":47}},"filebeat":{"events":{"added":200,"done":200},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":200,"active":200,"batches":4,"total":200},"read":{"bytes":45719},"write":{"bytes":586748}},"pipeline":{"clients":2,"events":{"active":4117,"published":200,"total":200},"queue":{"acked":200}}},"registrar":{"states":{"current":2,"update":200},"writes":{"success":1,"total":1}},"system":{"load":{"1":0.2,"15":0.22,"5":0.21,"norm":{"1":0.05,"15":0.055,"5":0.0525}}}}}}
Mar 14 15:42:24 pro-waz-01 filebeat[27698]: 2022-03-14T15:42:24.975Z        INFO        [monitoring]        log/log.go:145        Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1210,"time":{"ms":5}},"total":{"ticks":4450,"time":{"ms":30},"value":4450},"user":{"ticks":3240,"time":{"ms":25}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":15},"info":{"ephemeral_id":"****","uptime":{"ms":2490050},"version":"7.14.2"},"memstats":{"gc_next":56313840,"memory_alloc":44361472,"memory_total":825430488,"rss":132743168},"runtime":{"goroutines":47}},"filebeat":{"events":{"added":150,"done":150},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":150,"active":200,"batches":3,"total":150},"read":{"bytes":34292},"write":{"bytes":423821}},"pipeline":{"clients":2,"events":{"active":4117,"published":150,"total":150},"queue":{"acked":150}}},"registrar":{"states":{"current":2,"update":150},"writes":{"success":3,"total":3}},"system":{"load":{"1":0.33,"15":0.23,"5":0.23,"norm":{"1":0.0825,"15":0.0575,"5":0.0575}}}}}}
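
A quick way to pull just the output "acked" counts out of these monitoring lines (a rough hack that relies on the field ordering in the metrics JSON):

  journalctl -u filebeat --no-pager | grep "Non-zero metrics" | grep -o '"output":{"events":{"acked":[0-9]*' | tail -20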

What also seems weird is that when it's "quieter" (e.g. out of hours), Filebeat seems able to keep up with the incoming events, and the same log lines show numbers in the tens of thousands, e.g.:

Mar 09 04:27:57 waz-01 filebeat[5491]: 2022-03-09T04:27:57.431Z        INFO        [monitoring]        log/log.go:145        Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":103990,"time":{"ms":180}},"total":{"ticks":543800,"time":{"ms":813},"value":543800},"user":{"ticks":439810,"time":{"ms":633}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":16},"info":{"ephemeral_id":"*****","uptime":{"ms":45060049},"version":"7.14.2"},"memstats":{"gc_next":72352496,"memory_alloc":58068936,"memory_total":116420989176,"rss":148754432},"runtime":{"goroutines":52}},"filebeat":{"events":{"active":1,"added":8443,"done":8442},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":8442,"active":200,"batches":169,"total":8442},"read":{"bytes":1928969},"write":{"bytes":30186763}},"pipeline":{"clients":2,"events":{"active":4117,"published":8442,"total":8442},"queue":{"acked":8442}}},"registrar":{"states":{"current":4,"update":8442},"writes":{"success":115,"total":115}},"system":{"load":{"1":0.33,"15":0.39,"5":0.46,"norm":{"1":0.0825,"15":0.0975,"5":0.115}}}}}}
Mar 09 04:28:27 waz-01 filebeat[5491]: 2022-03-09T04:28:27.431Z        INFO        [monitoring]        log/log.go:145        Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":104240,"time":{"ms":251}},"total":{"ticks":544930,"time":{"ms":1131},"value":544930},"user":{"ticks":440690,"time":{"ms":880}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":16},"info":{"ephemeral_id":"*****","uptime":{"ms":45090049},"version":"7.14.2"},"memstats":{"gc_next":76545680,"memory_alloc":53250976,"memory_total":116666359584,"rss":149356544},"runtime":{"goroutines":52}},"filebeat":{"events":{"active":-1,"added":12037,"done":12038},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":11938,"active":200,"batches":239,"total":11938},"read":{"bytes":2727597},"write":{"bytes":41486590}},"pipeline":{"clients":2,"events":{"active":4117,"published":12038,"total":12038},"queue":{"acked":12038}}},"registrar":{"states":{"current":4,"update":12038},"writes":{"success":152,"total":152}},"system":{"load":{"1":0.28,"15":0.38,"5":0.43,"norm":{"1":0.07,"15":0.095,"5":0.1075}}}}}}
Mar 09 04:28:57 waz-01 filebeat[5491]: 2022-03-09T04:28:57.431Z        INFO        [monitoring]        log/log.go:145        Non-zero metrics in the last 30s        {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":104580,"time":{"ms":342}},"total":{"ticks":546410,"time":{"ms":1475},"value":546410},"user":{"ticks":441830,"time":{"ms":1133}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":16},"info":{"ephemeral_id":"*****","uptime":{"ms":45120050},"version":"7.14.2"},"memstats":{"gc_next":62474240,"memory_alloc":41939504,"memory_total":116981560640,"rss":151396352},"runtime":{"goroutines":52}},"filebeat":{"events":{"added":14986,"done":14986},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":14986,"active":200,"batches":300,"total":14986},"read":{"bytes":3423914},"write":{"bytes":54949313}},"pipeline":{"clients":2,"events":{"active":4117,"published":14986,"total":14986},"queue":{"acked":14986}}},"registrar":{"states":{"current":4,"update":14986},"writes":{"success":190,"total":190}},"system":{"load":{"1":0.34,"15":0.39,"5":0.44,"norm":{"1":0.085,"15":0.0975,"5":0.11}}}}}}

It seems as if Filebeat is capping the number of events it's sending to Elastic? The numbers also look suspiciously round compared to the times when it's running properly.

There are no warnings or errors on any of the Elasticsearch nodes apart from this one, which always occurs on the master node, but I'm not sure it's related:

[2022-03-14T16:36:26,393][WARN ][r.suppressed             ] [ela-02] path: /_xpack/usage, params: {}
java.util.concurrent.CancellationException: Task cancelled
        at org.elasticsearch.xpack.core.action.TransportXPackUsageAction.lambda$masterOperation$2(TransportXPackUsageAction.java:66) ~[x-pack-core-7.14.2.jar:7.14.2]
        at org.elasticsearch.xpack.core.common.IteratingActionListener.onResponse(IteratingActionListener.java:121) [x-pack-core-7.14.2.jar:7.14.2]
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) [elasticsearch-7.14.2.jar:7.14.2]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) [elasticsearch-7.14.2.jar:7.14.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.14.2.jar:7.14.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.2.jar:7.14.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]


I'm running out of ideas but this is starting to become a real problem for us.

Does anyone have any ideas on where to look to try to resolve this?

Thanks in advance,
Gen.

Gen

Mar 22, 2022, 6:42:54 AM
to Wazuh mailing list
For anyone else who is having the same issue - I've managed to fix this.

The problem seemed to be that Filebeat was effectively rate-limiting itself: the default output configuration wasn't sufficient for the volume of logs we were getting per node.

We're averaging around 17m wazuh logs per day spread between 3 nodes.

The fix was to increase bulk_max_size (and potentially the worker count) in the elasticsearch output configuration of filebeat.yml.
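
For reference, the relevant part of /etc/filebeat/filebeat.yml looks roughly like this - the values shown are illustrative and need tuning for your own event volume and cluster capacity (other settings such as hosts, credentials and TLS are omitted):

  output.elasticsearch:
    # Filebeat's defaults are a single worker and a bulk_max_size of 50 events
    # per bulk request, which caps throughput once alert volume spikes.
    worker: 4            # illustrative value
    bulk_max_size: 1000  # illustrative value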
