Retention policies not applied anymore


Franck Ehret

Jun 13, 2024, 3:18:39 AM
to Wazuh | Mailing List
Hi there,

This morning, I found a crashed Wazuh 😁
I tried to restart the Dashboard service, but it told me the maximum number of shards had been reached (which is strange, it had been set to 1500 for a while).

{"type":"log","@timestamp":"2024-06-13T05:39:32Z","tags":["error","opensearch","data"],"pid":5442,"message":"[validation_exception]: Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1499]/[1500] maximum shards open;"}

I checked the cluster health and active shards and noticed that I had a lot of shards from 2022 indices.
I temporarily increased the max shards setting to regain access to the GUI, then deleted all the 2022 indices.
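For reference, I did that roughly like this (the limit and index pattern below are examples, not my exact values; wildcard deletes may also require action.destructive_requires_name to be disabled):

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}

DELETE /wazuh-alerts-4.x-2022*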

Now I'm back to a normal number of shards and have lowered the max back to the initial 1000, but I still have a problem: the retention policies don't seem to have been applied for quite a while.

I have 119 policy-managed indices out of 690 indices in total. Apparently, the policies stopped applying on April 1, 2023 (good joke!).
I can't relate this to any crash.

Here is one of my policies (I have a similar one for each kind of index):

{
    "id": "xxxxx_statistics_retention",
    "seqNo": 1595,
    "primaryTerm": 6,
    "policy": {
        "policy_id": "xxxxx_statistics_retention",
        "description": "Wazuh index state management for OpenDistro to move indices into a cold state after 3 months and delete them after a year.",
        "last_updated_time": 1656601320673,
        "schema_version": 1,
        "error_notification": null,
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "replica_count": {
                            "number_of_replicas": 0
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "92d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "read_only": {}
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "366d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "wazuh-statistics*"
                ],
                "priority": 100,
                "last_updated_time": 1656229281151
            }
        ]
    }
}

Where can I start looking?
Thanks in advance for your help!

PS: my system is back in business, but it would be better to fix this, no? 😊


John E

Jun 13, 2024, 6:32:53 AM
to Wazuh | Mailing List
Hi Franck,
I would vote for understanding the root cause first; then deciding whether to fix it is up to you 😊.

First, we need to ensure the cluster has enough resources (CPU, memory, disk space) to handle ISM tasks. Overloaded resources can prevent ISM policies from executing properly.
It would also be great to check the overall health of your cluster. You can do that with:

GET _cluster/health
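To get a quick view of resource usage per node, something along these lines should work (these are standard _cat/nodes columns):

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,disk.used_percent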

Let me know what you find.

Regards.

Franck Ehret

Jun 13, 2024, 6:55:25 AM
to Wazuh | Mailing List
Hi there,

So I have 3 VMs: one for the dashboard, one for the manager, and one for the indexer.
The indexer has 16 GB of memory, 4 CPUs, and more than 100 GB of free disk, so hopefully that's not the issue.
I saw a big CPU peak earlier this morning, but that was probably the upgrade to 4.8 (that's when I lost the service); usually the CPU is pretty calm (it's my own private infra/lab).

This is the result of GET _cluster/health:
{
  "cluster_name" : "wazuh-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 721,
  "active_shards" : 721,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Now I have 721 shards, when I had 1499 this morning (the limit was 1500, which is why the dashboard didn't restart).
I put all the policies in place to keep "only" one year of data and avoid those crashes 😊

If you need any log, let me know which ones. Thanks in advance 😉
Franck

John E

Jun 13, 2024, 7:59:57 AM
to Wazuh | Mailing List
Hello Franck,

Those are solid specs, so now I'm convinced resources aren't the problem.

Now we can take a look at the OpenSearch logs, especially around April 1, 2023:

grep "2023-04-01" opensearch.log | grep -i "error"
grep "2023-04-01" opensearch.log | grep -i "warning"
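Since the policies are managed by ISM, it may also be worth asking ISM itself about the managed indices; something along these lines (the index pattern is just an example, adapt it to yours):

GET _plugins/_ism/explain/wazuh-statistics*

This should report the current state, action, and any failure information for each managed index.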

Regards

Franck Ehret

Jun 13, 2024, 8:23:01 AM
to Wazuh | Mailing List
Hi,

Sorry, I couldn't spot the opensearch.log file on my system; could you quickly tell me where it should be?

Thanks 😉

John E

Jun 13, 2024, 2:39:07 PM
to Wazuh | Mailing List
Hello Franck,

It simply means you don't have it as a separate component.
I will escalate this to the dashboard team and come back with their response.

Regards.

John E

Jun 18, 2024, 1:00:06 PM
to Wazuh | Mailing List
Hello Franck,

So sorry for the late reply; I had issues with my computer.
I tried testing your retention policy, but it's plagued with errors, so I would suggest recreating it.
Alternatively, you can achieve a similar result for the Wazuh log files using cron jobs:

# crontab -e
0 0 * * * find /var/ossec/logs/alerts/ -type f -mtime +365 -exec rm -f {} \;
0 0 * * * find /var/ossec/logs/archives/ -type f -mtime +365 -exec rm -f {} \;

Regards.

Franck Ehret

Jun 20, 2024, 5:00:29 AM
to Wazuh | Mailing List
Hello,

I've hit an issue: I can't remove the policies.
When I try to delete a policy, I get this message (which makes sense if the policy is still assigned):

Failed to delete the policy, [cluster_block_exception] index [.opendistro-ism-config] blocked by: [FORBIDDEN/8/index write (api)];

And if I try to remove the policy from the indices, I get this:

[index_management_exception] Failed to clean metadata for remove policy indices.

How should I proceed? Thanks in advance.

Kind regards
Franck

John E

Jun 20, 2024, 6:55:36 AM
to Wazuh | Mailing List
Hello Franck,

There are several possible reasons for this, such as permission issues or a locked index.
My guess is the latter, so to unlock the index you can follow the steps below.

[attached screenshots: dev-tools.png, setting.png]
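In case the screenshots don't come through, the gist is a Dev Tools request along these lines (applied to all indices here as an example; target a specific index if you know which one is blocked):

PUT _all/_settings
{
  "index.blocks.write": null,
  "index.blocks.read_only_allow_delete": null
}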

Regards.

Franck Ehret

Jun 21, 2024, 8:11:12 AM
to Wazuh | Mailing List
Hello John,

Even after running the command (I got the same result as yours), I still get the following when I try to remove the policy from the indices:
[index_management_exception] Failed to clean metadata for remove policy indices.

If I try to remove the policy itself, I get this:
Could not delete the policy "xxx_statistics_retention" : [cluster_block_exception] index [.opendistro-ism-config] blocked by: [FORBIDDEN/8/index write (api)];

And if I try to create a new policy from scratch (using the same values), I get this:
Failed to create policy: [cluster_block_exception] index [.opendistro-ism-config] blocked by: [FORBIDDEN/8/index write (api)];

Anything else I can try? Thanks in advance.
PS: I'm using the admin user.

Kind regards
Franck

John E

Jun 24, 2024, 8:54:15 AM
to Wazuh | Mailing List
Hello Franck,

I was able to find a similar issue being discussed here that I think might be helpful.

Regards.

Franck Ehret

Jun 24, 2024, 9:09:44 AM
to Wazuh | Mailing List
Hi John,

That might be the issue, because my disk was almost full since the policies stopped working. :-)
I increased the partition on my VM and deleted some indices in the meantime.

Can you help me spot the ones that are locked?
(Is there a command to list them?)

Thanks and kind regards
Franck

John E

Jun 25, 2024, 6:58:43 AM
to Wazuh | Mailing List
Hello Franck,

Below are some OpenSearch commands to help troubleshoot.

Get all indices and their settings (block status is included):
GET _all/_settings

Or get the block status directly:
GET _all/_settings/index.blocks.write?pretty

Regards.

Franck Ehret

Jul 19, 2024, 9:26:16 AM
to Wazuh | Mailing List
Hello,

The second command showed me one locked index: .opendistro-job-scheduler-lock
After running the following command, I could confirm that none were locked anymore:

PUT /.opendistro-job-scheduler-lock/_settings
{
  "index.blocks.read_only_allow_delete": null,
  "index.blocks.write": null
}


But unfortunately, this didn't change the result. Trying to create a very simple policy failed the same way, and the same goes for removing a policy from an existing index.
Any clue what to do next?

Thanks in advance
Franck

Franck Ehret

Jul 19, 2024, 10:03:14 AM
to Wazuh | Mailing List
PS: I tried deleting all the indices still managed by policies. No problem deleting them...
Normally I keep a year of data, but in this case I wanted to see if it would solve the issue.

Unfortunately, it didn't solve anything: I still can't remove a policy or create a new one.
The error messages are the same as before:

Create:
Failed to create policy: [cluster_block_exception] index [.opendistro-ism-config] blocked by: [FORBIDDEN/8/index write (api)];

Delete:

Could not delete the policy "xxx_statistics_retention" : [cluster_block_exception] index [.opendistro-ism-config] blocked by: [FORBIDDEN/8/index write (api)];

Thanks 🙏 

Franck Ehret

Jul 19, 2024, 12:23:53 PM
to Wazuh | Mailing List
Well, I had the answer written in the error message. I ran the same command against the .opendistro-ism-config index and it solved the issue: my policies started applying again without any further action (so they were fine from the beginning, I guess).

PUT /.opendistro-ism-config/_settings
{
  "index.blocks.read_only_allow_delete": null,
  "index.blocks.write": null
}


The weird thing is that the command used to display blocked indices did not list this one, so if you have an answer...
This issue is closed. Thanks for the help, you put me on the right track to solve it myself! 😉

K.R.
Franck