Hi All,
I have a very odd issue and I'm struggling to find the root cause.
The clusterd process on my worker node is crashing randomly; I cannot establish any time-frame pattern, so it seems random.
Extract of Worker node clusterd logs:
2020/03/09 07:21:42 wazuh-clusterd: DEBUG: [Worker wazuh-worker] [Integrity] Received 0 shared files to update from master.
2020/03/09 07:21:42 wazuh-clusterd: INFO: [Worker wazuh-worker] [Integrity] Updating files: End.
2020/03/09 07:23:13 wazuh-clusterd: INFO: [Worker wazuh-worker] [Agent info] Starting to send agent status files
2020/03/09 07:23:48 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculating.
2020/03/09 07:23:48 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculated.
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Main] Connection closed due to an unhandled error: [Errno 32] Broken pipe
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity:
2020/03/09 07:23:48 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Agent info] Error synchronizing agent status files:
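For reference, a broken pipe just means the process wrote to a connection the other side had already closed. It is easy to reproduce outside Wazuh; a minimal Python sketch (illustration only, not Wazuh code):

```python
import errno
import socket

# Once the peer has closed its end of a stream socket, a later write
# fails with EPIPE (errno 32), i.e. the "[Errno 32] Broken pipe" above.
worker, master = socket.socketpair()
master.close()  # the master drops the worker, e.g. after missed keep-alives

try:
    for _ in range(10):  # early writes may still land in the buffer
        worker.sendall(b"agent-info payload")
except BrokenPipeError as exc:
    print(f"Connection closed: [Errno {exc.errno}] Broken pipe")
    assert exc.errno == errno.EPIPE  # EPIPE is 32 on Linux
finally:
    worker.close()
```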
There is a gap of almost two minutes with zero activity in the logs.
Corresponding entries in the master's clusterd log:
2020/03/09 07:23:06 wazuh-clusterd: ERROR: [Master] [Keep alive] No keep alives have been received from wazuh-worker in the last minute. Disconnecting
2020/03/09 07:23:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculated.
2020/03/09 07:23:06 wazuh-clusterd: INFO: [Worker wazuh-worker] [Main] Disconnected.
2020/03/09 07:23:06 wazuh-clusterd: INFO: [Worker wazuh-worker] [Main] Cancelling pending tasks.
2020/03/09 07:23:09 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:09 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:17 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:17 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:25 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:25 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:33 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:33 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:41 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:41 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:49 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:49 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:57 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:23:57 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:23:58 wazuh-clusterd: INFO: [Worker] [Main] Connection from ('10.0.0.38', 41940)
2020/03/09 07:23:58 wazuh-clusterd: DEBUG: [Worker] [Main] Command received: b'hello'
2020/03/09 07:24:05 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculating
2020/03/09 07:24:05 wazuh-clusterd: DEBUG: [Master] [File integrity] Calculated.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculating.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Local Server] [Keep alive] Calculated.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculating.
2020/03/09 07:24:06 wazuh-clusterd: DEBUG: [Master] [Keep alive] Calculated.
This then seems to set off a chain reaction of events:
Master:
2020/03/09 10:47:55 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Main] File is not a zip file
Worker:
2020/03/09 07:24:37 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error asking for permission: Error sending request: timeout expired.
2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity:
2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Integrity] Error synchronizing integrity:
2020/03/09 11:15:20 wazuh-clusterd: ERROR: [Worker wazuh-worker] [Agent info] Error synchronizing agent status files:
2020/03/09 11:15:30 wazuh-clusterd: ERROR: [Cluster] [Main] Unhandled exception:
Master:
2020/03/09 11:11:07 wazuh-clusterd: ERROR: [Master] [Keep alive] No keep alives have been received from wazuh-worker in the last minute. Disconnecting
2020/03/09 11:15:32 wazuh-clusterd: ERROR: [Worker] [Main] Error during handshake with incoming connection.
The end ... of the process
If I stop and start the wazuh-manager service all is well for a while again.
Wazuh version: 3.10.2
Both master and worker are hosted in AWS in a single VPC and availability zone
Any assistance would be greatly appreciated.
Regards,
Charl
Hi Charl,

I'm sorry you are having problems with the Wazuh cluster. Let's do some troubleshooting and find the error.

Judging from the log extracts you provided, the first error is a consequence of the master node closing the connection because it did not receive a keep-alive from the worker, so the next time the worker tries to send a message over the network it gets a broken pipe.

It would be great if I could get the full log to see the lines below this one:

2020/03/09 11:15:30 wazuh-clusterd: ERROR: [Cluster] [Main] Unhandled exception:

I would also like to know more about your environment to narrow down the problem:
- What type of AWS EC2 instances are running clusterd?
- It would be interesting to monitor the free memory of your instances. For example, run the 'top' command and press 'M' (shift+m) to sort by memory consumption, and share the output with us.
top - 09:44:49 up 100 days, 18:14, 1 user, load average: 0.07, 0.06, 0.02
Tasks: 109 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.7 us, 0.2 sy, 0.0 ni, 97.8 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 59.5/2008484 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
KiB Swap: 0.0/0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31958 ossec 20 0 1236456 144308 2212 S 2.0 7.2 436:11.29 ossec-analysisd
2644 root 20 0 897380 17648 4668 S 0.7 0.9 172:10.35 filebeat
31978 ossecr 20 0 1165344 6784 2344 S 0.7 0.3 180:34.67 ossec-remoted
32009 root 20 0 860648 150300 4820 S 0.3 7.5 141:53.00 wazuh-modulesd
32036 ossec 20 0 724044 514880 5304 S 0.3 25.6 75:57.51 python3
1 root 20 0 43792 4436 2856 S 0.0 0.2 1:38.45 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:01.41 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
7 root 20 0 0 0 0 S 0.0 0.0 0:26.42 ksoftirqd/0
Tasks: 99 total, 1 running, 59 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 33.0/3990524 [||||||||||||||||||||||||||||||||| ]
KiB Swap: 0.0/0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2327 ossecr 20 0 1165308 7736 3544 S 0.7 0.2 15:27.56 ossec-remoted
2305 ossec 20 0 1275348 107744 3236 S 0.3 2.7 20:58.84 ossec-analysisd
19977 root 20 0 0 0 0 I 0.3 0.0 0:00.02 kworker/1:1
1 root 20 0 43716 5520 4016 S 0.0 0.1 0:06.52 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.08 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
- Related to the previous point, I would also check the CPU consumption; the machine could simply be under-resourced. In 'top', the 'P' key (shift+p) sorts by CPU usage, and 'c' toggles display of the full command line.
top - 09:45:25 up 100 days, 18:15, 1 user, load average: 0.03, 0.05, 0.02
Tasks: 109 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.2 us, 0.3 sy, 0.0 ni, 97.2 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 59.4/2008484 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
KiB Swap: 0.0/0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31958 ossec 20 0 1236456 144308 2212 S 4.0 7.2 436:12.40 /var/ossec/bin/ossec-analysisd -d
31978 ossecr 20 0 1165344 6784 2344 S 1.0 0.3 180:34.99 /var/ossec/bin/ossec-remoted -d
32009 root 20 0 860648 150300 4820 S 0.7 7.5 141:53.06 /var/ossec/bin/wazuh-modulesd -d
2644 root 20 0 897380 17648 4668 S 0.3 0.9 172:10.48 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data+
31910 ossecm 20 0 139404 44116 1756 S 0.3 2.2 9:44.51 /var/ossec/bin/ossec-integratord -d
1 root 20 0 43792 4436 2856 S 0.0 0.2 1:38.45 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
2 root 20 0 0 0 0 S 0.0 0.0 0:01.41 [kthreadd]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/0:0H]
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [mm_percpu_wq]
7 root 20 0 0 0 0 S 0.0 0.0 0:26.42 [ksoftirqd/0]
top - 09:58:38 up 4 days, 21:59, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 99 total, 1 running, 59 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 0.3 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem : 33.0/3990524 [||||||||||||||||||||||||||||||||| ]
KiB Swap: 0.0/0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2615 ossec 20 0 880064 675116 9512 S 0.7 16.9 18:33.62 /var/ossec/framework/python/bin/python3 /var/ossec/framework/scripts/wazuh-clusterd.py -d
2305 ossec 20 0 1275348 107744 3236 S 0.3 2.7 20:58.97 /var/ossec/bin/ossec-analysisd -d
2327 ossecr 20 0 1165308 7736 3544 S 0.3 0.2 15:27.65 /var/ossec/bin/ossec-remoted -d
19996 ec2-user 20 0 152740 4584 3256 S 0.3 0.1 0:00.03 sshd: ec2-user@pts/0
1 root 20 0 43716 5520 4016 S 0.0 0.1 0:06.52 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
2 root 20 0 0 0 0 S 0.0 0.0 0:00.08 [kthreadd]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/0:0H]
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [mm_percpu_wq]
7 root 20 0 0 0 0 S 0.0 0.0 0:00.69 [ksoftirqd/0]
8 root 20 0 0 0 0 I 0.0 0.0 0:27.98 [rcu_sched]
- The clusterd process also needs some disk space. You can check it with the command 'df -h /var/ossec'.
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 22G 17G 5.5G 75% /
Debug mode is killing the disk here :)
[root@ip-10-0-0-72 ossec]# du -hs /var/ossec/* | sort -h
8.0K /var/ossec/backup
28K /var/ossec/integrations
60K /var/ossec/agentless
92K /var/ossec/active-response
2.4M /var/ossec/stats
3.9M /var/ossec/ruleset
4.2M /var/ossec/etc
8.3M /var/ossec/var
18M /var/ossec/lib
39M /var/ossec/api
40M /var/ossec/tmp
41M /var/ossec/bin
66M /var/ossec/wodles
252M /var/ossec/framework
1.4G /var/ossec/queue
9.7G /var/ossec/logs
Worker
Filesystem Size Used Avail Use% Mounted on
- How many agents does the cluster manage?
- Do you have a load balancer configuration? Could you share it?
- Are these machines fully dedicated to the wazuh-manager service?
--
You received this message because you are subscribed to the Google Groups "Wazuh mailing list" group.