Clusterd on worker node randomly crashing


Charl Jordan

Mar 12, 2020, 4:21:34 PM3/12/20
to Wazuh mailing list

Hi All,

 

I have a very odd issue whose root cause I am struggling to understand.

The clusterd process on my worker node is crashing at what appear to be random intervals; I cannot establish any time-frame pattern.

 

Extract of Worker node clusterd logs:

 

2020/03/09 07:21:42 wazuh-clusterd: DEBUG: [Worker wazuh-worker] [Integrity] Received 0 shared files to update from master.
2020/03/09 07:21:42 wazuh-clusterd: INFO: [Worker wazuh-worker] [Integrity] Updating files: End.
2020/03/09 07:23:13 wazuh-clusterd: INFO: [Worker wazuh-worker] [Agent info] Starting to send agent status files
2020/03/09 07:23:48 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculating.
2020/03/09 07:23:48 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculated.
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Main] Connection closed due to an unhandled error: [Errno 32] Broken pipe
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity:
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Agent info] Error synchronizing agent status files:

 

There is a gap of almost two minutes of zero activity in the logs.
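For what it's worth, gaps like this can be confirmed across the whole log with a short script. A rough sketch (the timestamp format matches the extracts above; the sample lines and the 60-second threshold are illustrative):

```python
# Rough sketch: flag silent gaps in a wazuh cluster.log.
# The sample lines below are abbreviated versions of the extracts above.
from datetime import datetime

def find_gaps(lines, threshold_s=60):
    """Return (prev_line, next_line, gap_seconds) for each silent period."""
    gaps, prev = [], None
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if prev and (ts - prev[0]).total_seconds() > threshold_s:
            gaps.append((prev[1], line, (ts - prev[0]).total_seconds()))
        prev = (ts, line)
    return gaps

if __name__ == "__main__":
    sample = [
        "2020/03/09 07:21:42 wazuh-clusterd: INFO: [Worker] Updating files: End.",
        "2020/03/09 07:23:13 wazuh-clusterd: INFO: [Worker] Starting to send agent status files",
        "2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker] Broken pipe",
    ]
    for before, after, secs in find_gaps(sample):
        # reports the 91-second gap before the 07:23:13 line
        print(f"{secs:.0f}s gap before: {after}")
```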

 

Corresponding time in the master clusterd log:

 

2020/03/09 07:23:06 wazuh-clusterd: ERROR: [Master] [Keep alive] No keep alives have been received from wazuh-worker in the last minute. Disconnecting
2020/03/09 07:23:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculated.
2020/03/09 07:23:06 wazuh-clusterd: INFO: [Worker wazuh-worker] [Main] Disconnected.
2020/03/09 07:23:06 wazuh-clusterd: INFO: [Worker wazuh-worker] [Main] Cancelling pending tasks.
2020/03/09 07:23:09 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:09 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:17 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:17 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:25 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:25 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:33 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:33 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:41 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:41 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:49 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:49 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:57 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:57 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:58 wazuh-clusterd: INFO: [Worker] [Main] Connection from ('10.0.0.38', 41940)
2020/03/09 07:23:58 wazuh-clusterd: DEBUG: [Worker] [Main] Command received: b'hello'
2020/03/09 07:24:05 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:24:05 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculating.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculated.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculating.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculated.


This then seems to set off a chain reaction of events:


Master:


2020/03/09 10:47:55 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Main] File is not a zip file


Worker:


2020/03/09 07:24:37 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error asking for permission: Error sending request: timeout expired.


2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity: 

2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity: 

2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Agent info] Error synchronizing agent status files: 

2020/03/09 11:15:30 wazuh-clusterd: ERROR: [Cluster] [Main] Unhandled exception:


Master:


2020/03/09 11:11:07 wazuh-clusterd: ERROR: [Master] [Keep alive] No keep alives have been received from wazuh-worker in the last minute. Disconnecting

2020/03/09 11:15:32 wazuh-clusterd: ERROR: [Worker] [Main] Error during handshake with incoming connection.


The end ... of the process

If I stop and start the wazuh-manager service, all is well again for a while.


Wazuh version: 3.10.2

Both master and worker are hosted in AWS in a single VPC and availability zone


Any assistance would be greatly appreciated.

Regards,

Charl

 

Daniel Ruiz

Mar 13, 2020, 4:44:14 AM3/13/20
to Wazuh mailing list
Hi Charl,

I'm sorry you are having problems with your Wazuh cluster. Let's do some troubleshooting and find the error.

Judging by the log extracts you provided, the first one is a consequence of the master node closing the connection because it did not receive the keep-alive from the worker; the next time the worker tries to send a message over the network, it gets a broken pipe.
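That mechanism is easy to reproduce outside Wazuh. A minimal sketch with plain Python sockets (not Wazuh code): once the peer has closed the connection, the next write raises exactly the [Errno 32] the worker logs:

```python
# Minimal illustration: writing to a socket whose peer has already closed
# the connection raises errno 32 (EPIPE) -- the "Broken pipe" in the worker
# log after the master disconnects it for a missed keep-alive.
import errno
import socket

worker, master = socket.socketpair()
master.close()  # master side drops the worker (keep-alive timeout)

try:
    worker.sendall(b"agent-info sync")  # the worker's next message
except BrokenPipeError as e:
    assert e.errno == errno.EPIPE  # [Errno 32] Broken pipe
    print("Broken pipe, as in the worker log")
worker.close()
```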

It would be great if I could get the full log to see lines below this one:
2020/03/09 11:15:30 wazuh-clusterd: ERROR: [Cluster] [Main] Unhandled exception:


I would like to know more about your environment to pinpoint the problem.
  • What type of AWS EC2 instances are running clusterd?
  • It would be interesting to monitor the free memory of your instances. For example, run 'top' and press 'M' to sort by memory consumption, and share the output with us.
  • Related to the previous point, I would also check CPU consumption; the machine could simply be underpowered. In 'top', the 'P' key sorts by CPU usage.
  • The clusterd process also needs some disk space. You can check it with 'df -k /var/ossec'.
  • How many agents does the cluster manage?
  • Do you have a load balancer configuration? Could you share it?
  • Are these machines fully dedicated to the wazuh-manager service?
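The memory, CPU and disk checks above can also be scripted for periodic monitoring. A minimal sketch, assuming the default /var/ossec install path (adjust if yours differs):

```python
# Scripted version of the manual checks above (a sketch, not an official
# Wazuh tool). /var/ossec is the default install path -- an assumption.
import os
import shutil

def disk_free_pct(path="/var/ossec"):
    """Percentage of free disk space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def load_averages():
    """1-, 5- and 15-minute load averages, as in the first line of `top`."""
    return os.getloadavg()

if __name__ == "__main__":
    print(f"free disk: {disk_free_pct('/'):.1f}%")
    print("load avg:", load_averages())
```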
I hope these clues help us to fix your cluster.

Regards,
Daniel Ruiz

Charl Jordan

Mar 16, 2020, 6:08:30 AM3/16/20
to Wazuh mailing list
Hi Daniel,

Thank you so much for the prompt response.

I had the same thoughts regarding the issue.

Not sure if I should post the content of the logs here or upload them to another platform and share a link; let me know what you prefer.
That being said, there is nothing below the unhandled exception; that's when clusterd crashes.

I have answered the questions below in-line

Again, thank you for the time thus far and let me know how else I can help to understand the issue.
Regards,
Charl

On Friday, March 13, 2020 at 10:44:14 AM UTC+2, Daniel Ruiz wrote:

I would like to know more about your environment to pinpoint the problem.
  • What type of AWS EC2 instances are running clusterd?
Initially, t3a.small for both master and worker; I then bumped the worker up to t3a.medium, since I also suspected a lack of resources, but this did not resolve the issue. So currently master = t3a.small and worker = t3a.medium.
  • It would be interesting to monitor the free memory of your instances. For example, run 'top' and press 'M' to sort by memory consumption, and share the output with us.
Master:

top - 09:44:49 up 100 days, 18:14,  1 user,  load average: 0.07, 0.06, 0.02

Tasks: 109 total,   1 running,  70 sleeping,   0 stopped,   0 zombie

%Cpu(s):  1.7 us,  0.2 sy,  0.0 ni, 97.8 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem : 59.5/2008484  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                        ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

31958 ossec     20   0 1236456 144308   2212 S   2.0  7.2 436:11.29 ossec-analysisd                                                                                                                        

 2644 root      20   0  897380  17648   4668 S   0.7  0.9 172:10.35 filebeat                                                                                                                               

31978 ossecr    20   0 1165344   6784   2344 S   0.7  0.3 180:34.67 ossec-remoted                                                                                                                          

32009 root      20   0  860648 150300   4820 S   0.3  7.5 141:53.00 wazuh-modulesd                                                                                                                         

32036 ossec     20   0  724044 514880   5304 S   0.3 25.6  75:57.51 python3                                                                                                                                

    1 root      20   0   43792   4436   2856 S   0.0  0.2   1:38.45 systemd                                                                                                                                

    2 root      20   0       0      0      0 S   0.0  0.0   0:01.41 kthreadd                                                                                                                               

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                           

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq                                                                                                                           

    7 root      20   0       0      0      0 S   0.0  0.0   0:26.42 ksoftirqd/0                                                                                                                            

    8 root      20   0       0      0      0 I   0.0  0.0  17:43.28 rcu_sched         

Worker:

 top - 09:58:02 up 4 days, 21:59,  1 user,  load average: 0.00, 0.00, 0.00

Tasks:  99 total,   1 running,  59 sleeping,   0 stopped,   0 zombie

%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem : 33.0/3990524  [|||||||||||||||||||||||||||||||||                                                                   ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

 2327 ossecr    20   0 1165308   7736   3544 S   0.7  0.2  15:27.56 ossec-remoted                                                                                                                          

 2305 ossec     20   0 1275348 107744   3236 S   0.3  2.7  20:58.84 ossec-analysisd                                                                                                                        

19977 root      20   0       0      0      0 I   0.3  0.0   0:00.02 kworker/1:1                                                                                                                            

    1 root      20   0   43716   5520   4016 S   0.0  0.1   0:06.52 systemd                                                                                                                                

    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd                                                                                                                               

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                           

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq       


  • Related to the previous point, I would also check CPU consumption; the machine could simply be underpowered. In 'top', the 'P' key sorts by CPU usage.

Master:

top - 09:45:25 up 100 days, 18:15,  1 user,  load average: 0.03, 0.05, 0.02

Tasks: 109 total,   1 running,  70 sleeping,   0 stopped,   0 zombie

%Cpu(s):  2.2 us,  0.3 sy,  0.0 ni, 97.2 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem : 59.4/2008484  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                        ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

31958 ossec     20   0 1236456 144308   2212 S   4.0  7.2 436:12.40 /var/ossec/bin/ossec-analysisd -d                                                                                                      

31978 ossecr    20   0 1165344   6784   2344 S   1.0  0.3 180:34.99 /var/ossec/bin/ossec-remoted -d                                                                                                        

32009 root      20   0  860648 150300   4820 S   0.7  7.5 141:53.06 /var/ossec/bin/wazuh-modulesd -d                                                                                                       

 2644 root      20   0  897380  17648   4668 S   0.3  0.9 172:10.48 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data+

31910 ossecm    20   0  139404  44116   1756 S   0.3  2.2   9:44.51 /var/ossec/bin/ossec-integratord -d                                                                                                    

    1 root      20   0   43792   4436   2856 S   0.0  0.2   1:38.45 /usr/lib/systemd/systemd --switched-root --system --deserialize 22                                                                     

    2 root      20   0       0      0      0 S   0.0  0.0   0:01.41 [kthreadd]                                                                                                                             

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [kworker/0:0H]                                                                                                                         

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [mm_percpu_wq]                                                                                                                         

    7 root      20   0       0      0      0 S   0.0  0.0   0:26.42 [ksoftirqd/0]                                                                                                                          

    8 root      20   0       0      0      0 I   0.0  0.0  17:43.28 [rcu_sched]  
 
Worker:

top - 09:58:38 up 4 days, 21:59,  1 user,  load average: 0.00, 0.00, 0.00

Tasks:  99 total,   1 running,  59 sleeping,   0 stopped,   0 zombie

%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.2 st

KiB Mem : 33.0/3990524  [|||||||||||||||||||||||||||||||||                                                                   ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

 2615 ossec     20   0  880064 675116   9512 S   0.7 16.9  18:33.62 /var/ossec/framework/python/bin/python3 /var/ossec/framework/scripts/wazuh-clusterd.py -d                                              

 2305 ossec     20   0 1275348 107744   3236 S   0.3  2.7  20:58.97 /var/ossec/bin/ossec-analysisd -d                                                                                                      

 2327 ossecr    20   0 1165308   7736   3544 S   0.3  0.2  15:27.65 /var/ossec/bin/ossec-remoted -d                                                                                                        

19996 ec2-user  20   0  152740   4584   3256 S   0.3  0.1   0:00.03 sshd: ec2-user@pts/0                                                                                                                   

    1 root      20   0   43716   5520   4016 S   0.0  0.1   0:06.52 /usr/lib/systemd/systemd --switched-root --system --deserialize 22                                                                     

    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 [kthreadd]                                                                                                                             

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [kworker/0:0H]                                                                                                                         

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [mm_percpu_wq]                                                                                                                         

    7 root      20   0       0      0      0 S   0.0  0.0   0:00.69 [ksoftirqd/0]                                                                                                                          

    8 root      20   0       0      0      0 I   0.0  0.0   0:27.98 [rcu_sched]         

  • The clusterd process also needs some disk space. You can check it with command 'df -k /var/ossec'.
 Master:

Filesystem      Size  Used Avail Use% Mounted on

/dev/nvme0n1p1   22G   17G  5.5G  75% /



Debugging mode killing disk here :)


[root@ip-10-0-0-72 ossec]# du -hs /var/ossec/* | sort -h
8.0K /var/ossec/backup
28K  /var/ossec/integrations
60K  /var/ossec/agentless
92K  /var/ossec/active-response
2.4M /var/ossec/stats
3.9M /var/ossec/ruleset
4.2M /var/ossec/etc
8.3M /var/ossec/var
18M  /var/ossec/lib
39M  /var/ossec/api
40M  /var/ossec/tmp
41M  /var/ossec/bin
66M  /var/ossec/wodles
252M /var/ossec/framework
1.4G /var/ossec/queue
9.7G /var/ossec/logs


Worker

 

Filesystem      Size  Used Avail Use% Mounted on

/dev/nvme0n1p1   22G  7.3G   15G  34% / 
  • How many agents does the cluster manage?
 22
  • Do you have a load balancer configuration? Could you share it?
It's a standard AWS Network Load Balancer with a target group containing both the master and the worker. If it helps at all, I used the Wazuh-built CloudFormation template and modified it to suit my needs.
  • Are these machines fully dedicated to wazuh-manager service?
Yes 

Charl Jordan

Mar 16, 2020, 8:18:50 AM3/16/20
to Wazuh mailing list
I need to correct an earlier statement.
The logs we are dissecting are from 09/03/2020. I changed the worker to t3a.medium on 11/03/2020 and the service has not crashed since.
That being said, because I don't have a specific time frame for how often this happens, I am uncertain whether the issue has been resolved.

Daniel Ruiz

Mar 16, 2020, 8:59:24 AM3/16/20
to Wazuh mailing list
Hi Charl,

thank you so much for the information. Now I have a pretty clear idea of how your environment works.

The key point here is the type of EC2 instance you are using. The t2, t3 and t3a families are burstable instances based on CPU credits earned per hour. This can make your cluster's performance very unpredictable: if an instance runs out of CPU credits, it has to wait until it earns more (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html). Keep in mind that the wazuh-clusterd process performs some CPU-intensive operations, such as compressing and decompressing synchronized files. A t3a.medium may work most of the time, but I would recommend upgrading your instances to a family with dedicated virtual CPUs, such as m4.large, or having a look at unlimited mode, which has an extra cost (t3 instances are launched in unlimited mode by default). More information here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-unlimited-mode.html.

Another important thing, apart from your initial problem, is disk space. I recommend preventing your servers from running out of space, or you will end up with a highly unstable environment that is very hard to operate or recover.

Regarding memory consumption, be aware of this issue, https://github.com/wazuh/wazuh/issues/4444, which we fixed in version 3.11.2. Your server could run out of memory after some time, normally a few days, depending on how stressed your cluster is.
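If you want to keep an eye on that leak, clusterd's resident memory can be sampled over time. A rough Linux-only sketch (the pgrep hint in the comment is illustrative, not an official tool):

```python
# Rough sketch: sample a process's resident memory from /proc (Linux-only),
# e.g. to watch wazuh-clusterd grow over days if the 3.11.2 leak applies.
import os

def rss_kb(pid):
    """Resident set size of `pid` in KiB, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    # In practice you would look up the clusterd PID, e.g. with
    # `pgrep -f wazuh-clusterd`; here we just sample our own process.
    print(f"own RSS: {rss_kb(os.getpid())} KiB")
```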

I hope you find this information useful and get your cluster working smoothly.

Do not hesitate to ask any other questions you may have.

Regards,
Daniel Ruiz

Charl Jordan

Mar 19, 2020, 2:28:06 AM3/19/20
to Wazuh mailing list
Thank you so much for the time and assistance.
In the past I have paid for support that was not up to this excellent standard, and for this I am grateful.

With regard to other questions, on a separate issue rather than this topic: is it advisable to start a different thread?



Daniel Ruiz

Mar 19, 2020, 3:25:32 AM3/19/20
to Charl Jordan, Wazuh mailing list
Thank you for your feedback!

Yes, I would recommend starting a new thread for a different topic. That way you help us keep the forum well organized.

Thank you again!



--
Daniel Ruiz Capilla
IT Security Developer
Wazuh | The Open Source Security Platform
Wazuh's Github | Wazuh's Twitter