Hi there,
This morning I found that my Wazuh instance had crashed 😁
I tried to restart the Dashboard service, but it reported that the maximum number of shards had been reached (which is strange, since the limit has been 1500 for a while):
{"type":"log","@timestamp":"2024-06-13T05:39:32Z","tags":["error","opensearch","data"],"pid":5442,"message":"[validation_exception]: Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1499]/[1500] maximum shards open;"}
I checked the cluster health and the active shards and noticed that I still had a lot of shards belonging to 2022 indices.
I temporarily increased the max shards limit to regain access to the GUI and deleted all 2022 indices.
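For anyone wanting to reproduce the workaround, here is a minimal sketch of the two pieces involved: the body for a `PUT _cluster/settings` request that raises `cluster.max_shards_per_node`, and a filter that picks out the indices of a given year. The helper names are hypothetical, and the index naming (`-YYYY.MM.DD` or `-YYYY.WWw` suffixes) is an assumption based on what my Wazuh indices look like. Nothing here talks to the cluster; it only builds the data to send.

```python
import re

# Assumed suffixes: daily "-YYYY.MM.DD" or weekly "-YYYY.WWw" (hypothetical naming).
DATE_SUFFIX = re.compile(r"-(\d{4})\.\d{2}(?:\.\d{2}|w)$")


def max_shards_settings(limit, persistent=False):
    """Build the request body for PUT _cluster/settings.

    A transient setting is lost on a full cluster restart, which suits a
    temporary increase; pass persistent=True to keep it across restarts.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"cluster.max_shards_per_node": limit}}


def indices_for_year(index_names, year):
    """Return the index names whose date suffix falls in the given year."""
    selected = []
    for name in index_names:
        match = DATE_SUFFIX.search(name)
        if match and int(match.group(1)) == year:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = [
        "wazuh-alerts-4.x-2022.12.31",
        "wazuh-alerts-4.x-2023.01.01",
        "wazuh-statistics-2022.40w",
    ]
    print(max_shards_settings(1500))
    print(indices_for_year(names, 2022))
```

The resulting list can then be reviewed by hand before sending any `DELETE` request for those indices.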
Now I'm back to a normal number of shards and have lowered the limit back to the initial 1000, but I still have a problem: the retention policies don't seem to have been applied for a while.
Only 119 of my 690 indices are policy-managed. Apparently, policies stopped being applied on the 1st of April 2023 (good joke!).
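To get the managed/unmanaged split, a sketch like the following could be run over the JSON returned by the ISM explain API (for example `GET _plugins/_ism/explain/wazuh-*`). The response shape is an assumption on my side: one entry per index holding a key that ends in `index_state_management.policy_id`, set to null when no policy is attached, plus possibly some non-index summary fields. The function name is hypothetical.

```python
def split_managed(explain_response):
    """Split index names into (managed, unmanaged) lists.

    Assumes each index entry is a dict with a key ending in
    'index_state_management.policy_id' (null/absent when unmanaged).
    """
    managed, unmanaged = [], []
    for index_name, info in explain_response.items():
        if not isinstance(info, dict):
            # Skip summary fields such as a total count.
            continue
        policy_id = next(
            (value for key, value in info.items()
             if key.endswith("index_state_management.policy_id")),
            None,
        )
        if policy_id:
            managed.append(index_name)
        else:
            unmanaged.append(index_name)
    return sorted(managed), sorted(unmanaged)


if __name__ == "__main__":
    # Hand-written sample, not real cluster output.
    sample = {
        "wazuh-statistics-2023.10w": {
            "index.plugins.index_state_management.policy_id": "xxxxx_statistics_retention"
        },
        "wazuh-statistics-2023.20w": {
            "index.plugins.index_state_management.policy_id": None
        },
        "total_managed_indices": 1,
    }
    managed, unmanaged = split_managed(sample)
    print(len(managed), "managed,", len(unmanaged), "unmanaged")
```

Sorting the unmanaged names by date should show where the policies stopped being attached.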
I can't relate this to any crash.
Here is one of my policies (I have a similar one for each kind of index):
{
  "id": "xxxxx_statistics_retention",
  "seqNo": 1595,
  "primaryTerm": 6,
  "policy": {
    "policy_id": "xxxxx_statistics_retention",
    "description": "Wazuh index state management for OpenDistro to move indices into a cold state after 3 months and delete them after a year.",
    "last_updated_time": 1656601320673,
    "schema_version": 1,
    "error_notification": null,
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": {
              "count": 3,
              "backoff": "exponential",
              "delay": "1m"
            },
            "replica_count": {
              "number_of_replicas": 0
            }
          }
        ],
        "transitions": [
          {
            "state_name": "cold",
            "conditions": {
              "min_index_age": "92d"
            }
          }
        ]
      },
      {
        "name": "cold",
        "actions": [
          {
            "retry": {
              "count": 3,
              "backoff": "exponential",
              "delay": "1m"
            },
            "read_only": {}
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "366d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "retry": {
              "count": 3,
              "backoff": "exponential",
              "delay": "1m"
            },
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": [
          "wazuh-statistics*"
        ],
        "priority": 100,
        "last_updated_time": 1656229281151
      }
    ]
  }
}
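To make the intent of this policy explicit, here is a small sketch of what I expect it to do: an index matching `wazuh-statistics*` should be hot until 92 days, cold until 366 days, and deleted after that. The thresholds and the pattern are copied from the policy above; the function names are hypothetical, and the age is simply passed in as a number of days (ISM itself uses the index creation date).

```python
from fnmatch import fnmatchcase

# Pattern and thresholds taken from the policy above.
ISM_PATTERNS = ["wazuh-statistics*"]
# (state name, minimum index age in days to enter that state)
TRANSITIONS = [("hot", 0), ("cold", 92), ("delete", 366)]


def matches_template(index_name, patterns=ISM_PATTERNS):
    """True if the index name matches one of the ism_template patterns."""
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


def expected_state(age_days):
    """Return the state an index of this age should have reached."""
    state = TRANSITIONS[0][0]
    for name, min_age in TRANSITIONS:
        if age_days >= min_age:
            state = name
    return state


if __name__ == "__main__":
    for age in (10, 92, 200, 366):
        print(age, "days ->", expected_state(age))
    print(matches_template("wazuh-statistics-2023.14w"))
```

Comparing this expected state with the actual state of each index should show which ones the policy never touched. As far as I understand, `ism_template` only attaches the policy to indices created after the template was set, so older or unmatched indices would need the policy attached explicitly.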
Where can I start looking?
Thanks in advance for your help!
PS: my system is back in business, but it would be better to fix this, no? 😊