Cluster in red status


Francesc G

Sep 13, 2024, 2:11:32 AM
to Wazuh | Mailing List
Hi,

We have a Wazuh 4.7.2 single node with archive indexes activated in addition to the alert indexes.

Three days ago the daily archive index was not generated automatically as usual, and during that day the alert index stopped showing new events.

Checking cluster health showed the cluster in red status. There were 1480 shards, 44 of them in an unassigned state.

After searching a bit, it seemed to me that the maximum number of shards had been reached, so I closed and deleted several indexes to bring the shard count below 1000, and at the same time I increased the shard limit to 1500.

This seemed to work, as the alert and archive indexes started working again. However, the cluster status was still red, so I deleted the unassigned indexes, except the following ones, which remain unassigned:

.opendistro-anomaly-detection-state                  0 r UNASSIGNED
.opendistro-alerting-alert-history-2024.07.24-000016 0 r UNASSIGNED
.opendistro-anomaly-detectors                        0 r UNASSIGNED
.opendistro-ism-config                               0 p UNASSIGNED
.opendistro-ism-config                               0 r UNASSIGNED
.opensearch-control-center                           0 r UNASSIGNED
.opendistro-alerting-alerts                          0 r UNASSIGNED
.opendistro-anomaly-checkpoints                      0 r UNASSIGNED
.opendistro-alerting-config                          0 r UNASSIGNED
.opendistro-anomaly-detector-jobs                    0 r UNASSIGNED


I also deleted some ".opendistro-ism-managed-index-history-*" indexes, which I believe caused the ISM policies to stop working. In fact, ISM policies are no longer displayed in "State management policies" and this message appears:

[search_phase_execution_exception] all shards failed

The same message is displayed when I click on "Policy management indices", "Rollup jobs" and "Transform jobs".

The message {"error":{"root_cause":[{"type":"status_exception","reason":"all shards failed"}],"type":"status_exception","reason":"all shards failed"},"status":500} is displayed when I click on "Snapshot Policies".

The Aliases page only shows three aliases. I think there was a fourth one related to ISM:

.opendistro-anomaly-results
.opendistro-alerting-alert-history-write
.kibana

How can I get the ISM policies working again? By simply creating them again? Do I have to do anything else in "Rollup jobs" and "Transform jobs"? Why do "Snapshot policies" also fail? I had one, even though it was disabled.

Thanks for your help.


Stuti Gupta

Sep 13, 2024, 3:05:41 AM
to Wazuh | Mailing List
Hi Francesc,

The default shard limit is 1000, and increasing it is not recommended because each shard consumes significant resources (CPU, memory, and disk I/O), even if it holds minimal data. Too many small shards can lead to slower query performance, higher overhead, cluster instability, and resource exhaustion. Additionally, the current increased shard count is already nearing capacity. Instead of increasing the shard limit, we recommend the following solutions:

Solution 1: Manually Delete Indices
You should review the stored indices using the following API call:
GET _cat/indices
From there, you can delete unnecessary or old indices. Note that deleted indices cannot be retrieved unless backed up through snapshots or Wazuh alert backups. The API call to delete indices is:
DELETE <index_name>
Or via the CLI:
curl -k -u admin:admin -XDELETE https://<WAZUH_INDEXER_IP>:9200/wazuh-alerts-4.x-YYYY.MM.DD
You can also use wildcards (*) to delete multiple indices in one go.
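For example, assuming the default admin:admin credentials from above and that the daily wazuh-alerts indices from June 2024 are old enough to discard, a wildcard delete could look like this (verify the matching indices with GET _cat/indices first, since deletion is irreversible without a backup):

```shell
# Delete every daily wazuh-alerts index from June 2024 in one call.
# <WAZUH_INDEXER_IP>, the credentials, and the month pattern are placeholders;
# list the matching indices first to confirm what the wildcard will remove.
curl -k -u admin:admin -XDELETE "https://<WAZUH_INDEXER_IP>:9200/wazuh-alerts-4.x-2024.06.*"
```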

Solution 2: Index Management Policies
You can automate index deletion by setting up Index State Management (ISM) policies, as explained in this post: https://wazuh.com/blog/wazuh-index-management. Additionally, you can set up snapshots to automatically back up Wazuh indices to local or cloud storage for restoration when needed. More details can be found in the https://wazuh.com/blog/index-backup-management article.

Solution 3: Add an Indexer Node
Adding another indexer node will increase the capacity and resilience of your Wazuh monitoring infrastructure. For more information on how to do this, refer to the official guide: (https://documentation.wazuh.com/current/user-manual/upscaling/adding-indexer-node.html).

ISM Policies Not Working
If your ISM policies are failing, it's likely due to resource constraints. At least 20% free disk space is required to apply ISM policies. The recent surge in events may be contributing to this. If your storage usage is excessive, we recommend reviewing agent logs or syslogs to identify the types of events contributing to the high log volume. This analysis will help you fine-tune log generation.

Unassigned Shards (.opendistro Pattern)
This pattern relates to system indices, which are protected. To address this issue, you can remove the replica by setting the number of replicas to 0.

Solution 1: Change System Index Settings
- Modify the `/etc/wazuh-indexer/opensearch.yml` file by changing `plugins.security.system_indices.enabled: true` to `false`.
- Restart the indexer:
  systemctl restart wazuh-indexer
- From the Wazuh UI, go to Index Management > Indexes, select the unassigned indices, and set the number of replicas to 0.
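Equivalently, the replica count can be changed from the CLI with the index settings API. This is a sketch assuming the same admin certificates used below; adjust the index list to whatever _cat/shards reports as UNASSIGNED:

```shell
# Set number_of_replicas to 0 on two of the unassigned system indices.
# The admin-key.pem/admin.pem paths and the index list are assumptions.
curl --key admin-key.pem --cert admin.pem --insecure -XPUT \
  "https://127.0.0.1:9200/.opendistro-alerting-alerts,.opendistro-anomaly-detectors/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
```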

Solution 2: Use the CLI
Create index templates for all OpenSearch system indices and set the number of replicas to 0:
curl --key admin-key.pem --cert admin.pem --insecure -XPUT https://127.0.0.1:9200/_index_template/ism_history_indices -H 'Content-Type: application/json' -d'
{
  "index_patterns": [
    ".opendistro-ism-managed-index-history-*"
  ],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}'

You can also follow this thread for further guidance: https://groups.google.com/g/wazuh/c/tgaabFMiAL8. Afterward, make sure to revert `plugins.security.system_indices.enabled` to `true`.

Hope this helps 

Francesc G

Sep 13, 2024, 10:08:58 AM
to Wazuh | Mailing List
Hi,

Thanks for your reply.

I performed the following actions:

I increased the size of the hard drive. It is true that before there was only 10% (20 GB) free. Now there is 40% (115 GB) available.

I deleted some indexes and shrank about 200 of them so they use only one shard instead of the three they used before. I am now using a total of 663 shards and have set the maximum per node to 1100:

root@wazuh:~# curl -k -u user:password https://127.0.0.1:9200/_cat/shards?pretty  | wc -l
663
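For reference, the per-node shard limit mentioned above can be raised with a persistent cluster setting (a sketch; the user:password credentials are the same placeholders used in the commands above):

```shell
# Persistently raise the maximum number of shards per node to 1100.
curl -k -u user:password -XPUT "https://127.0.0.1:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.max_shards_per_node": 1100}}'
```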

I have configured 0 replicas for the indexes that were UNASSIGNED. The only index that remains UNASSIGNED is not a replica:

root@wazuh:~# curl -k -u user:password https://127.0.0.1:9200/_cat/shards?pretty | grep UNASSIGNED
.opendistro-ism-config 0 p UNASSIGNED

And this seems to be related to the errors in the dashboard when I click on some pages ("State management policies", "Policy management indices", "Rollup jobs" and "Transform jobs").

I have run the following command to see why the .opendistro-ism-config index is not allocated:

GET /_cluster/allocation/explain
{
  "index": ".opendistro-ism-config",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2024-09-13T11:42:41.594Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [-xcoLcBKQySoEFnceoAaRg]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/wazuh-indexer/nodes/0/indices/UOejjGTARCe_LEe-geBRHg/0/translog/translog-701.tlog] is corrupted, checksum verification failed - expected: 0x85318b1a, got: 0x0]; ",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions": [
    {
      "node_id": "-xcoLcBKQySoEFnceoAaRg",
      "node_name": "node-1",
      "transport_address": "10.1.1.1:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "j2QXko3_RtOtg3f-Ar80wg"
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-09-13T11:42:41.594Z], failed_attempts[5], failed_nodes[[-xcoLcBKQySoEFnceoAaRg]], delayed=false, details[failed shard on node [-xcoLcBKQySoEFnceoAaRg]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/wazuh-indexer/nodes/0/indices/UOejjGTARCe_LEe-geBRHg/0/translog/translog-701.tlog] is corrupted, checksum verification failed - expected: 0x85318b1a, got: 0x0]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
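As the max_retry decider in the output suggests, the failed allocation can be retried manually once the underlying problem is fixed (a sketch using the same placeholder credentials as above):

```shell
# Ask the cluster to retry allocations that exceeded the max-retry limit.
curl -k -u user:password -XPOST "https://127.0.0.1:9200/_cluster/reroute?retry_failed=true"
```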

Finally, when I try to create a new ISM policy I get the following error in wazuh-dashboard log:

sep 13 15:59:39 wazuh opensearch-dashboards[826]: Index Management - PolicyService - putPolicy: StatusCodeError: [search_phase_execution_exception] all shards failed
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at respond (/usr/share/wazuh-dashboard/node_modules/elasticsearch/src/lib/transport.js:349:15)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at checkRespForFailure (/usr/share/wazuh-dashboard/node_modules/elasticsearch/src/lib/transport.js:306:7)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at HttpConnector.<anonymous> (/usr/share/wazuh-dashboard/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at IncomingMessage.wrapper (/usr/share/wazuh-dashboard/node_modules/lodash/lodash.js:4991:19)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at IncomingMessage.emit (node:events:525:35)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at IncomingMessage.emit (node:domain:489:12)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at endReadableNT (node:internal/streams/readable:1358:12)
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     at processTicksAndRejections (node:internal/process/task_queues:83:21) {
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   status: 503,
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   displayName: 'ServiceUnavailable',
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   path: '/_plugins/_ism/policies/test?refresh=wait_for',
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   query: {},
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   body: {
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     error: {
sep 13 15:59:39 wazuh opensearch-dashboards[826]:       root_cause: [],
sep 13 15:59:39 wazuh opensearch-dashboards[826]:       type: 'search_phase_execution_exception',
sep 13 15:59:39 wazuh opensearch-dashboards[826]:       reason: 'all shards failed',
sep 13 15:59:39 wazuh opensearch-dashboards[826]:       phase: 'query',
sep 13 15:59:39 wazuh opensearch-dashboards[826]:       grouped: true,
sep 13 15:59:39 wazuh opensearch-dashboards[826]:       failed_shards: []
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     },
sep 13 15:59:39 wazuh opensearch-dashboards[826]:     status: 503
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   },
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   statusCode: 503,
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   response: '{"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[]},"status":503}',          
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   toString: [Function (anonymous)],
sep 13 15:59:39 wazuh opensearch-dashboards[826]:   toJSON: [Function (anonymous)]
sep 13 15:59:39 wazuh opensearch-dashboards[826]: }
sep 13 15:59:39 wazuh opensearch-dashboards[826]: {"type":"response","@timestamp":"2024-09-13T13:59:39Z","tags":[],"pid":826,"method":"put","statusCode":200,"req":{"url":"/api/ism/policies/test","method":"put","headers":{"host":"wazuh.domain.local","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0","accept":"*/*","accept-language":"es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3","accept-encoding":"gzip, deflate, br, zstd","referer":"https://wazuh.domain.local/app/opensearch_index_management_dashboards","content-type":"application/json","osd-version":"2.8.0","osd-xsrf":"osd-fetch","content-length":"458","origin":"https://wazuh.domain.local","dnt":"1","connection":"keep-alive","sec-fetch-dest":"empty","sec-fetch-mode":"cors","sec-fetch-site":"same-origin","priority":"u=0"},"remoteAddress":"192.168.1.10","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0","referer":"https://wazuh.domain.local/app/opensearch_index_management_dashboards"},"res":{"statusCode":200,"responseTime":18,"contentLength":9},"message":"PUT /api/ism/policies/test 200 18ms - 9.0B"}

It seems the index at /var/lib/wazuh-indexer/nodes/0/indices/UOejjGTARCe_LEe-geBRHg is corrupted. What should I do? Do I have to delete the folder in the filesystem? Or can I try to recover it with some tool or a backup?

Best regards.

Stuti Gupta

Sep 17, 2024, 4:55:05 AM
to Wazuh | Mailing List
Hi,

Index State Management (ISM) stores its configuration in the .opendistro-ism-config index, so don't modify this index without using the ISM API operations. Briefly, the "failed to recover from translog" error occurs when the indexer is unable to recover data from the transaction log. In some cases (a bad drive, user error) the translog can become corrupted. When this corruption is detected through mismatching checksums, the node fails the shard and refuses to allocate that copy of the data, recovering from a replica if available. See https://opster.com/es-errors/failed-to-recover-from-translog/ and https://www.elastic.co/guide/en/elasticsearch/reference/5.6/index-modules-translog.html#corrupt-translog-truncation for details.

So it is not advisable to delete .opendistro-ism-config, as this index was created when you created your ISM policies. To try to recover it, first restart the wazuh-indexer with the following command:
systemctl restart wazuh-indexer

If the error persists, consider restoring from a snapshot if available. If not, you may need to delete the corrupted translog files, but this could lead to data loss. Always keep a backup strategy in place to prevent such issues.
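If it does come to removing the corrupted translog, OpenSearch ships a shard repair tool that can strip corrupted shard data, which is usually safer than deleting translog files by hand. This is only a sketch: the tool's location under the Wazuh indexer packaging is an assumption, the command prompts interactively and discards unrecoverable data, and the indexer must be stopped first:

```shell
# Stop the indexer before touching shard data on disk.
systemctl stop wazuh-indexer

# Attempt to remove corrupted shard data (including a corrupted translog).
# The path to opensearch-shard under /usr/share/wazuh-indexer is an assumption;
# back up /var/lib/wazuh-indexer before running this.
/usr/share/wazuh-indexer/bin/opensearch-shard remove-corrupted-data \
  --index .opendistro-ism-config --shard-id 0

systemctl start wazuh-indexer
```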

Hope this helps 

Francesc G

Sep 17, 2024, 10:53:27 AM
to Wazuh | Mailing List
Hi,

Restarting wazuh-indexer did not solve the problem. Deleting the translog files did not either.

I finally recovered a copy of the corrupted index from a VM snapshot (not an index snapshot) that I took two weeks ago.

I stopped wazuh-indexer, then copied the entire directory containing the index, and started wazuh-indexer again. After a few moments the ISM policies appeared again, and I have been able to access "State management policies", "Policy management indices", "Rollup jobs" and "Transform jobs" without errors.
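The steps above can be sketched as follows (the /mnt/vm-snapshot mount point is a hypothetical placeholder for wherever the old disk copy is attached):

```shell
systemctl stop wazuh-indexer

# Replace the corrupted index directory with the copy from the VM snapshot.
# /mnt/vm-snapshot is a placeholder path; the index UUID comes from the
# allocation/explain output above.
rm -rf /var/lib/wazuh-indexer/nodes/0/indices/UOejjGTARCe_LEe-geBRHg
cp -a /mnt/vm-snapshot/var/lib/wazuh-indexer/nodes/0/indices/UOejjGTARCe_LEe-geBRHg \
      /var/lib/wazuh-indexer/nodes/0/indices/

# Make sure the restored files are owned by the indexer service account.
chown -R wazuh-indexer:wazuh-indexer \
      /var/lib/wazuh-indexer/nodes/0/indices/UOejjGTARCe_LEe-geBRHg

systemctl start wazuh-indexer
```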

Thanks for your help.

Stuti Gupta

Sep 20, 2024, 4:48:11 AM
to Wazuh | Mailing List
Glad to hear this works.