Clusterd on worker node randomly crashing


Charl Jordan

Mar 12, 2020, 4:21:34 PM3/12/20
to Wazuh mailing list

Hi All,

 

I have a very odd issue whose root cause I am struggling to understand.

The clusterd process on my worker node is crashing at what appear to be random intervals; I cannot establish any time-frame pattern.

 

Extract of Worker node clusterd logs:

 

2020/03/09 07:21:42 wazuh-clusterd: DEBUG: [Worker wazuh-worker] [Integrity] Received 0 shared files to update from master.
2020/03/09 07:21:42 wazuh-clusterd: INFO: [Worker wazuh-worker] [Integrity] Updating files: End.
2020/03/09 07:23:13 wazuh-clusterd: INFO: [Worker wazuh-worker] [Agent info] Starting to send agent status files
2020/03/09 07:23:48 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculating.
2020/03/09 07:23:48 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculated.
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Main] Connection closed due to an unhandled error: [Errno 32] Broken pipe
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity:
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Agent info] Error synchronizing agent status files:

 

There is a gap of almost two minutes of zero activity in the logs.
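For what it's worth, gaps like this can be confirmed across the whole log with a short script. A rough sketch (the timestamp format matches the extracts above; the sample lines and the 60-second threshold are illustrative):

```python
# Rough sketch: flag silent gaps in a wazuh cluster.log.
# The sample lines below are abbreviated versions of the extracts above.
from datetime import datetime

def find_gaps(lines, threshold_s=60):
    """Return (prev_line, next_line, gap_seconds) for each silent period."""
    gaps, prev = [], None
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if prev and (ts - prev[0]).total_seconds() > threshold_s:
            gaps.append((prev[1], line, (ts - prev[0]).total_seconds()))
        prev = (ts, line)
    return gaps

if __name__ == "__main__":
    sample = [
        "2020/03/09 07:21:42 wazuh-clusterd: INFO: [Worker] Updating files: End.",
        "2020/03/09 07:23:13 wazuh-clusterd: INFO: [Worker] Starting to send agent status files",
        "2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker] Broken pipe",
    ]
    for before, after, secs in find_gaps(sample):
        # reports the 91-second gap before the 07:23:13 line
        print(f"{secs:.0f}s gap before: {after}")
```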

 

Corresponding time in the master clusterd log:

 

2020/03/09 07:23:06 wazuh-clusterd: ERROR: [Master] [Keep alive] No keep alives have been received from wazuh-worker in the last minute. Disconnecting
2020/03/09 07:23:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculated.
2020/03/09 07:23:06 wazuh-clusterd: INFO: [Worker wazuh-worker] [Main] Disconnected.
2020/03/09 07:23:06 wazuh-clusterd: INFO: [Worker wazuh-worker] [Main] Cancelling pending tasks.
2020/03/09 07:23:09 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:09 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:17 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:17 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:25 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:25 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:33 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:33 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:41 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:41 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:49 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:49 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:57 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:57 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:58 wazuh-clusterd: INFO: [Worker] [Main] Connection from ('10.0.0.38', 41940)
2020/03/09 07:23:58 wazuh-clusterd: DEBUG: [Worker] [Main] Command received: b'hello'
2020/03/09 07:24:05 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:24:05 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculating.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculated.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculating.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculated.


This then seems to set off a chain reaction of events:


Master:


2020/03/09 10:47:55 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Main] File is not a zip file


Worker:


2020/03/09 07:24:37 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error asking for permission: Error sending request: timeout expired.


2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity: 

2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity: 

2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Agent info] Error synchronizing agent status files: 

2020/03/09 11:15:30 wazuh-clusterd: ERROR: [Cluster] [Main] Unhandled exception:


Master:


2020/03/09 11:11:07 wazuh-clusterd: ERROR: [Master] [Keep alive] No keep alives have been received from wazuh-worker in the last minute. Disconnecting

2020/03/09 11:15:32 wazuh-clusterd: ERROR: [Worker] [Main] Error during handshake with incoming connection.


The end ... of the process

If I stop and start the wazuh-manager service, all is well again for a while.


Wazuh version: 3.10.2

Both master and worker are hosted in AWS in a single VPC and availability zone


Any assistance would be greatly appreciated.

Regards,

Charl

 

Daniel Ruiz

Mar 13, 2020, 4:44:14 AM3/13/20
to Wazuh mailing list
Hi Charl,

I'm sorry you are having problems with your Wazuh cluster. Let's do some troubleshooting and find the error.

Judging by the log extracts you provided, the first one is a consequence of the master node closing the connection because it did not receive the keep-alive from the worker; the next time the worker tries to send a message over the network, it gets a broken pipe.
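That mechanism is easy to reproduce outside Wazuh. A minimal sketch with plain Python sockets (not Wazuh code): once the peer has closed the connection, the next write raises exactly the [Errno 32] the worker logs:

```python
# Minimal illustration: writing to a socket whose peer has already closed
# the connection raises errno 32 (EPIPE) -- the "Broken pipe" in the worker
# log after the master disconnects it for a missed keep-alive.
import errno
import socket

worker, master = socket.socketpair()
master.close()  # master side drops the worker (keep-alive timeout)

try:
    worker.sendall(b"agent-info sync")  # the worker's next message
except BrokenPipeError as e:
    assert e.errno == errno.EPIPE  # [Errno 32] Broken pipe
    print("Broken pipe, as in the worker log")
worker.close()
```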

It would be great if I could get the full log to see lines below this one:
2020/03/09 11:15:30 wazuh-clusterd: ERROR: [Cluster] [Main] Unhandled exception:


I would like to know more about your environment to pinpoint the problem.
  • What type of AWS EC2 instances are running clusterd?
  • It would be interesting to monitor the free memory of your instances. For example, run 'top' and press 'M' to sort by memory consumption, and share the output with us.
  • Related to the previous point, I would also check CPU consumption; the machine could simply be underpowered. In 'top', the 'P' key sorts by CPU usage.
  • The clusterd process also needs some disk space. You can check it with 'df -k /var/ossec'.
  • How many agents does the cluster manage?
  • Do you have a load balancer configuration? Could you share it?
  • Are these machines fully dedicated to the wazuh-manager service?
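The memory, CPU and disk checks above can also be scripted for periodic monitoring. A minimal sketch, assuming the default /var/ossec install path (adjust if yours differs):

```python
# Scripted version of the manual checks above (a sketch, not an official
# Wazuh tool). /var/ossec is the default install path -- an assumption.
import os
import shutil

def disk_free_pct(path="/var/ossec"):
    """Percentage of free disk space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def load_averages():
    """1-, 5- and 15-minute load averages, as in the first line of `top`."""
    return os.getloadavg()

if __name__ == "__main__":
    print(f"free disk: {disk_free_pct('/'):.1f}%")
    print("load avg:", load_averages())
```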
I hope these clues help us to fix your cluster.

Regards,
Daniel Ruiz

Charl Jordan

Mar 16, 2020, 6:08:30 AM3/16/20
to Wazuh mailing list
Hi Daniel,

Thank you so much for the prompt response.

I had the same thoughts regarding the issue.

Not sure if I should post the content of the logs here or upload them to another platform and share a link; let me know what you prefer.
That being said, there is nothing below the unhandled exception; that's when clusterd crashes.

I have answered the questions below in-line

Again, thank you for the time thus far and let me know how else I can help to understand the issue.
Regards,
Charl

On Friday, March 13, 2020 at 10:44:14 AM UTC+2, Daniel Ruiz wrote:

I would like to know more about your environment to pinpoint the problem.
  • What type of AWS EC2 instances are running clusterd?
Initially, t3a.small for both master and worker; I then bumped the worker up to t3a.medium, since I also suspected a lack of resources, but this did not resolve the issue. So currently master = t3a.small and worker = t3a.medium.
  • It would be interesting to monitor the free memory of your instances. For example, run 'top' and press 'M' to sort by memory consumption, and share the output with us.
Master:

top - 09:44:49 up 100 days, 18:14,  1 user,  load average: 0.07, 0.06, 0.02

Tasks: 109 total,   1 running,  70 sleeping,   0 stopped,   0 zombie

%Cpu(s):  1.7 us,  0.2 sy,  0.0 ni, 97.8 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem : 59.5/2008484  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                        ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

31958 ossec     20   0 1236456 144308   2212 S   2.0  7.2 436:11.29 ossec-analysisd                                                                                                                        

 2644 root      20   0  897380  17648   4668 S   0.7  0.9 172:10.35 filebeat                                                                                                                               

31978 ossecr    20   0 1165344   6784   2344 S   0.7  0.3 180:34.67 ossec-remoted                                                                                                                          

32009 root      20   0  860648 150300   4820 S   0.3  7.5 141:53.00 wazuh-modulesd                                                                                                                         

32036 ossec     20   0  724044 514880   5304 S   0.3 25.6  75:57.51 python3                                                                                                                                

    1 root      20   0   43792   4436   2856 S   0.0  0.2   1:38.45 systemd                                                                                                                                

    2 root      20   0       0      0      0 S   0.0  0.0   0:01.41 kthreadd                                                                                                                               

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                           

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq                                                                                                                           

    7 root      20   0       0      0      0 S   0.0  0.0   0:26.42 ksoftirqd/0                                                                                                                            

    8 root      20   0       0      0      0 I   0.0  0.0  17:43.28 rcu_sched         

Worker:

 top - 09:58:02 up 4 days, 21:59,  1 user,  load average: 0.00, 0.00, 0.00

Tasks:  99 total,   1 running,  59 sleeping,   0 stopped,   0 zombie

%Cpu(s):  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem : 33.0/3990524  [|||||||||||||||||||||||||||||||||                                                                   ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

 2327 ossecr    20   0 1165308   7736   3544 S   0.7  0.2  15:27.56 ossec-remoted                                                                                                                          

 2305 ossec     20   0 1275348 107744   3236 S   0.3  2.7  20:58.84 ossec-analysisd                                                                                                                        

19977 root      20   0       0      0      0 I   0.3  0.0   0:00.02 kworker/1:1                                                                                                                            

    1 root      20   0   43716   5520   4016 S   0.0  0.1   0:06.52 systemd                                                                                                                                

    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd                                                                                                                               

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                           

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq       


  • Related to the previous point, I would also check CPU consumption; the machine could simply be underpowered. In 'top', the 'P' key sorts by CPU usage.

Master:

top - 09:45:25 up 100 days, 18:15,  1 user,  load average: 0.03, 0.05, 0.02

Tasks: 109 total,   1 running,  70 sleeping,   0 stopped,   0 zombie

%Cpu(s):  2.2 us,  0.3 sy,  0.0 ni, 97.2 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

KiB Mem : 59.4/2008484  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                        ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

31958 ossec     20   0 1236456 144308   2212 S   4.0  7.2 436:12.40 /var/ossec/bin/ossec-analysisd -d                                                                                                      

31978 ossecr    20   0 1165344   6784   2344 S   1.0  0.3 180:34.99 /var/ossec/bin/ossec-remoted -d                                                                                                        

32009 root      20   0  860648 150300   4820 S   0.7  7.5 141:53.06 /var/ossec/bin/wazuh-modulesd -d                                                                                                       

 2644 root      20   0  897380  17648   4668 S   0.3  0.9 172:10.48 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data+

31910 ossecm    20   0  139404  44116   1756 S   0.3  2.2   9:44.51 /var/ossec/bin/ossec-integratord -d                                                                                                    

    1 root      20   0   43792   4436   2856 S   0.0  0.2   1:38.45 /usr/lib/systemd/systemd --switched-root --system --deserialize 22                                                                     

    2 root      20   0       0      0      0 S   0.0  0.0   0:01.41 [kthreadd]                                                                                                                             

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [kworker/0:0H]                                                                                                                         

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [mm_percpu_wq]                                                                                                                         

    7 root      20   0       0      0      0 S   0.0  0.0   0:26.42 [ksoftirqd/0]                                                                                                                          

    8 root      20   0       0      0      0 I   0.0  0.0  17:43.28 [rcu_sched]  
 
Worker:

top - 09:58:38 up 4 days, 21:59,  1 user,  load average: 0.00, 0.00, 0.00

Tasks:  99 total,   1 running,  59 sleeping,   0 stopped,   0 zombie

%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.2 st

KiB Mem : 33.0/3990524  [|||||||||||||||||||||||||||||||||                                                                   ]

KiB Swap:  0.0/0        [                                                                                                    ]


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                

 2615 ossec     20   0  880064 675116   9512 S   0.7 16.9  18:33.62 /var/ossec/framework/python/bin/python3 /var/ossec/framework/scripts/wazuh-clusterd.py -d                                              

 2305 ossec     20   0 1275348 107744   3236 S   0.3  2.7  20:58.97 /var/ossec/bin/ossec-analysisd -d                                                                                                      

 2327 ossecr    20   0 1165308   7736   3544 S   0.3  0.2  15:27.65 /var/ossec/bin/ossec-remoted -d                                                                                                        

19996 ec2-user  20   0  152740   4584   3256 S   0.3  0.1   0:00.03 sshd: ec2-user@pts/0                                                                                                                   

    1 root      20   0   43716   5520   4016 S   0.0  0.1   0:06.52 /usr/lib/systemd/systemd --switched-root --system --deserialize 22                                                                     

    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 [kthreadd]                                                                                                                             

    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [kworker/0:0H]                                                                                                                         

    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 [mm_percpu_wq]                                                                                                                         

    7 root      20   0       0      0      0 S   0.0  0.0   0:00.69 [ksoftirqd/0]                                                                                                                          

    8 root      20   0       0      0      0 I   0.0  0.0   0:27.98 [rcu_sched]         

  • The clusterd process also needs some disk space. You can check it with command 'df -k /var/ossec'.
 Master:

Filesystem      Size  Used Avail Use% Mounted on

/dev/nvme0n1p1   22G   17G  5.5G  75% /



Debugging mode killing disk here :)


[root@ip-10-0-0-72 ossec]# du -hs /var/ossec/* | sort -h
8.0K /var/ossec/backup
28K  /var/ossec/integrations
60K  /var/ossec/agentless
92K  /var/ossec/active-response
2.4M /var/ossec/stats
3.9M /var/ossec/ruleset
4.2M /var/ossec/etc
8.3M /var/ossec/var
18M  /var/ossec/lib
39M  /var/ossec/api
40M  /var/ossec/tmp
41M  /var/ossec/bin
66M  /var/ossec/wodles
252M /var/ossec/framework
1.4G /var/ossec/queue
9.7G /var/ossec/logs


Worker

 

Filesystem      Size  Used Avail Use% Mounted on

/dev/nvme0n1p1   22G  7.3G   15G  34% / 
  • How many agents does the cluster manage?
 22
  • Do you have a load balancer configuration? Could you share it?
It's a standard AWS Network Load Balancer with a target group containing both the master and the worker. If it helps at all, I used the Wazuh-built CloudFormation template and modified it to suit my needs.
  • Are these machines fully dedicated to wazuh-manager service?
Yes 

Charl Jordan

Mar 16, 2020, 8:18:50 AM3/16/20
to Wazuh mailing list
I need to correct an earlier statement.
The logs we are dissecting are from 09/03/2020. I changed the worker to t3a.medium on 11/03/2020 and the service has not crashed since.
That being said, because I don't have a specific time frame for how often this happens, I am uncertain whether the issue has been resolved.

Daniel Ruiz

Mar 16, 2020, 8:59:24 AM3/16/20
to Wazuh mailing list
Hi Charl,

thank you so much for the information. Now I have a pretty clear idea of how your environment works.

The key point here is the type of EC2 instance you are using. The t2, t3 and t3a families are burstable instances based on CPU credits earned per hour. This can make your cluster's performance very unpredictable: if an instance runs out of CPU credits, it has to wait until it earns more (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html). Keep in mind that the wazuh-clusterd process performs some CPU-intensive operations, such as compressing and decompressing synchronized files. A t3a.medium may work most of the time, but I would recommend upgrading your instances to a family with dedicated virtual CPUs, such as m4.large, or having a look at unlimited mode, which has an extra cost (t3 instances are launched in unlimited mode by default). More information here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-unlimited-mode.html.

Another important thing, apart from your initial problem, is disk space. I recommend preventing your servers from running out of space, or you will end up with a highly unstable environment that is very hard to operate or recover.

Regarding memory consumption, be aware of this issue, https://github.com/wazuh/wazuh/issues/4444, which we fixed in version 3.11.2. Your server could run out of memory after some time, normally a few days, depending on how stressed your cluster is.
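If you want to keep an eye on that leak, clusterd's resident memory can be sampled over time. A rough Linux-only sketch (the pgrep hint in the comment is illustrative, not an official tool):

```python
# Rough sketch: sample a process's resident memory from /proc (Linux-only),
# e.g. to watch wazuh-clusterd grow over days if the 3.11.2 leak applies.
import os

def rss_kb(pid):
    """Resident set size of `pid` in KiB, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    # In practice you would look up the clusterd PID, e.g. with
    # `pgrep -f wazuh-clusterd`; here we just sample our own process.
    print(f"own RSS: {rss_kb(os.getpid())} KiB")
```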

I hope you find this information useful and get your cluster working smoothly.

Do not hesitate to ask any other questions you may have.

Regards,
Daniel Ruiz

Charl Jordan

Mar 19, 2020, 2:28:06 AM3/19/20
to Wazuh mailing list
Thank you so much for the time and assistance.
In the past I have paid for support that was not up to this excellent standard, and for this I am grateful.

With regard to other questions, on a separate issue rather than this topic: is it advisable to start a different thread?



Daniel Ruiz

Mar 19, 2020, 3:25:32 AM3/19/20
to Charl Jordan, Wazuh mailing list
Thank you for your feedback!

Yes, I would recommend starting a new thread for a different topic. That way you help us keep the forum well organized.

Thank you again!



--
Daniel Ruiz Capilla
IT Security Developer
Wazuh | The Open Source Security Platform
Wazuh's Github | Wazuh's Twitter