All shards failure after upgrade to 4.8.0


Karl DeBisschop

Jun 14, 2024, 5:30:46 AM
to Wazuh | Mailing List
I am getting repeated messages like this:

[2024-06-13T20:09:39,866][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:40,044][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:40,529][INFO ][o.o.p.PluginsService     ] [node-1] PluginService:onIndexModule index:[wazuh-alerts-4.x-2024.01.27/g31K9JWcQ6CoalcwA2lnkQ]
[2024-06-13T20:09:40,657][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:41,182][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:41,413][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:41,690][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:41,966][INFO ][o.o.p.PluginsService     ] [node-1] PluginService:onIndexModule index:[wazuh-alerts-4.x-2024.01.26/welKPHhjTnGoFUS_oW3aOw]
[2024-06-13T20:09:42,045][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node-1] Detected cluster change event for destination migration
[2024-06-13T20:09:42,071][WARN ][r.suppressed             ] [node-1] path: /.kibana/_count, params: {index=.kibana}
org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:677) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:373) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:716) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:485) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:274) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:351) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]


The cluster allocation explanation is:

$ curl -k -u user:pass https://127.0.0.1:9200/_cluster/allocation/explain

{
  "index": ".opendistro-ism-managed-index-history-2024.05.31-000554",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2024-06-13T20:28:55.996Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "QMuy1mhfTHy4GWXntDrbyA",
      "node_name": "node-1",
      "transport_address": "10.148.120.132:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro-ism-managed-index-history-2024.05.31-000554][0], node[QMuy1mhfTHy4GWXntDrbyA], [P], s[STARTED], a[id=NiwBAkSMQr2LLMcotioQOg]]"
        }
      ]
    }
  ]
}
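For reference, the full set of shards in this state can be listed with a `_cat` query (a sketch, reusing the same `user:pass` credentials and host as the command above; `unassigned.reason` is a standard `_cat/shards` column):

```shell
# List all shards, keeping the header row plus any UNASSIGNED rows,
# including the reason the shard is unassigned.
curl -k -u user:pass \
  "https://127.0.0.1:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  | grep -E "prirep|UNASSIGNED"
```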


Any thoughts appreciated.

Rafael Bailon Robles

Jun 17, 2024, 6:26:29 AM
to Wazuh | Mailing List

Hi, I have investigated your case. From the message "a copy of this shard is already allocated to this node", it seems that the cluster size has been reduced. The message indicates that a copy of that shard already exists on the node, and the replica count cannot be greater than or equal to the number of nodes in the cluster (each replica needs a node that does not already hold a copy).

How many nodes do you have? Is it the same number you had before upgrading to 4.8.0?

If this is the problem, the solution is to adjust the number of replicas. There is more information here: https://documentation.wazuh.com/current/user-manual/wazuh-indexer/wazuh-indexer-tuning.html#shards-and-replicas
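On a single-node cluster this can be done in one call. A sketch, assuming the same host and credentials as the earlier commands (verify against the linked documentation before running):

```shell
# Drop the replica count to 0 on all existing indices so replica
# shards no longer need a second node to be allocated to.
curl -k -u user:pass -X PUT -H "Content-Type: application/json" \
  -d '{"index":{"number_of_replicas":0}}' \
  "https://127.0.0.1:9200/_all/_settings"
```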

Karl DeBisschop

Jun 18, 2024, 4:12:11 AM
to Wazuh | Mailing List
Thanks Rafael.

I was finally able to resolve my issue. The cluster size was not reduced ... it had always been one. I did try setting replicas to zero:

  curl -k -u user:pass -X PUT -H "Content-Type: application/json" -d \
  '{"persistent":{"opendistro":{"index_state_management":{"history":{"number_of_replicas":"0"}}}}}' https://127.0.0.1:9200/_cluster/settings

Followed by this, to remove the indices holding the unassigned duplicate shards:

  curl -k -u user:pass https://127.0.0.1:9200/_cat/shards | grep UNASSIGNED | \
  awk '{print $1}' | xargs -I{} curl -k -XDELETE -u user:pass "https://127.0.0.1:9200/{}"
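Note that this pipeline deletes entire indices (the first `_cat/shards` column is the index name), not individual shards, so it is worth previewing the list first. A self-contained dry run of the same extraction, using illustrative sample rows in place of live `_cat/shards` output:

```shell
# Dry run: feed sample _cat/shards rows (illustrative, not real output)
# through the same grep/awk pipeline to preview which index names would
# be passed to DELETE. sort -u collapses duplicate index names.
printf '%s\n' \
  '.opendistro-ism-managed-index-history-2024.05.31-000554 0 r UNASSIGNED' \
  'wazuh-alerts-4.x-2024.01.27 0 p STARTED 120 1mb 10.1.2.3 node-1' \
  '.opendistro-ism-managed-index-history-2024.06.01-000555 0 r UNASSIGNED' \
  | grep UNASSIGNED | awk '{print $1}' | sort -u
```

Only the two unassigned history indices survive the filter; the healthy `wazuh-alerts` index is untouched.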

Even after that, the dashboard did not become ready. The Java error was the one from https://github.com/wazuh/wazuh-indexer/issues/171 ... I was able to resolve it with


But, by that time the damage had been done and the only way I could get back up and running again was:

  curl -k -u user:pass -XDELETE "https://127.0.0.1:9200/.kibana_*"

I cannot be sure, but I suspect the real problem was the "Wazuh-indexer error: index template ss4o_metrics_template has index patterns ss4o_metrics" issue, and that everything else was caused by trying to migrate and restart while that issue made it impossible to run cleanly.

