maximum shards open limit - looking for clues


filip faredge

Jul 5, 2021, 11:34:23 PM
to Wazuh mailing list
Hello,

I have recently encountered the "[1000]/[1000] maximum shards open" issue in the logs, which I briefly summarize below my questions.
It raises a few questions on my end that I am unable to answer.

Once I identified the problem, all the answers Google led me to were similar to:
Increase the shard limit per node temporarily (use with caution), or add more nodes.

1 - Why the caution? Are there any specific risks in having more than 1000 shards?
2 - Probably related to the above, but why temporarily?
3 - If temporarily, what should I do for a permanent fix?
4 - I have spare resources on my VM, so I am not currently considering adding more nodes, and yet the limit has been reached. Am I missing something?

My current setup (planning to triple the number of agents very soon and add more router logs):
wazuh 4.1.5 + opendistroforelasticsearch + single node
50 agents + 5 log files from routers
used RAM 8/16 GB
CPU load average low
used HDD 68G/520G
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
   873       39.5gb    67.2gb    452.7gb    519.9gb           12 127.0.0.1 127.0.0.1 node-1
   143                                                                               UNASSIGNED

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           48          99   3    0.10    0.21     0.25 dimr      *      node-1



Encountered problem:
No graphs/events 

Relevant log entry found via systemctl status filebeat -l | grep -i -E "err|warn":
[...] FileStateOS:file.StateOS{Inode:0x4199a8a, Device:0x801}, IdentifierName:"native"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(currently has [1000]/[1000] maximum shards open;"}

Solution applied:
curl -k -u admin:admin -XPUT https://localhost:9200/_cluster/settings -H 'Content-type: application/json' --data-binary $'{"transient":{"cluster.max_shards_per_node":3000}}'
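Side note: the command above uses "transient", which as far as I understand resets after a full cluster restart. If I read the Elasticsearch cluster settings docs correctly, the persistent form of the same setting would be:

curl -k -u admin:admin -XPUT https://localhost:9200/_cluster/settings -H 'Content-type: application/json' --data-binary $'{"persistent":{"cluster.max_shards_per_node":3000}}'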

Relevant link + best answer found so far (copied below):
https://groups.google.com/g/wazuh/c/U0MdHBfpiR8

Possible solutions from the link above are:
- Add more nodes to your Elasticsearch cluster.
- Delete old indices if they are no longer necessary.
- Increase the shards limit by nodes. Use with caution.


I would appreciate it if someone could enlighten me a little bit.
Thank you!
 /Filip

Alexander Bohorquez

Jul 6, 2021, 3:40:17 PM
to Wazuh mailing list
Hello Filip,

Thank you for using Wazuh!

As my colleague mentioned in the thread you linked, the default limit is 1,000 shards per node, and this issue happens when the cluster reaches that maximum.

As you mentioned, you have multiple options to fix this issue:
  • Delete indices. This frees shards. You could do it with old indices you don't want/need, or even automate it with index lifecycle policies that delete old indices after a period of time, as explained in this post: https://wazuh.com/blog/wazuh-index-management (a minimal example is sketched after this list).
  • Add more nodes to your Elasticsearch cluster.
  • Increase the max shards per node (not recommended). If you do choose this option, make sure you do not increase it too much, as it could cause instability and performance issues in your Elasticsearch cluster.
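For illustration, a minimal Index State Management policy of that kind could look like the following. This is only a sketch, assuming the OpenDistro ISM plugin; the policy name delete_after_1y and the 365d threshold are placeholders to adapt:

PUT _opendistro/_ism/policies/delete_after_1y
{
  "policy": {
    "description": "Delete indices older than one year",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "365d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ]
  }
}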
I also want to share this blog, which explains how shards work and, more importantly, how many you should have in your cluster:


Based on this, I recommend modifying the Filebeat template in your environment so that it generates only one shard per index (with a single node, one shard per index is generally recommended). The template is at /etc/filebeat/wazuh-template.json:

{
  "order": 0,
  "index_patterns": [
    "wazuh-alerts-4.x-*",
    "wazuh-archives-4.x-*"
  ],
  "settings": {
    "index.refresh_interval": "5s",
    "index.number_of_shards": "1",
    "index.number_of_replicas": "0",
    "index.auto_expand_replicas": "0-1",
    "index.mapping.total_fields.limit": 10000,

Modify the value from 3 (the default) to 1: "index.number_of_shards": "1"

Then, to load the template, execute:

# filebeat setup --index-management

Once this is done, the "wazuh-alerts-*" indices will be generated with just one shard each, and the shard count will grow much more slowly from now on.
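You can verify the loaded template afterwards (it is registered under the name "wazuh"):

curl -k -u admin:admin -XGET "https://localhost:9200/_template/wazuh?pretty"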

I hope this information helps. Please let us know if you have any other questions.

Regards,
Alexander

filip faredge

Jul 6, 2021, 11:44:06 PM
to Wazuh mailing list
Hi Alexander,
Thank you for your answer.

If you don't mind, I will dig a little bit deeper :-)

  • Delete indices. This frees shards.
  • -> I created a similar policy a few days ago based on that exact article (read-only after 30 days, delete after 1 year). To my understanding, deleted indices will only start freeing shards after one year, which is roughly 7 months from now in my case (first alert: wazuh-alerts-4.x-2021.02.15).

  • I have now re-read the second article more calmly and things are slowly becoming clearer, but it seems understanding the intricacies of Elasticsearch requires some training on my end...
    -> That still does not clarify for me where the limit of 1000 shards comes from. I guess it has something to do with the Java heap and memory limits.

  • Modify the value from 3 (the default) to 1: "index.number_of_shards": "1"
  • -> If I understand correctly, this will change from 3 shards to 1 per day, which will limit my total per year to 365 shards for wazuh-alerts:
     -XGET "https://127.0.0.1:9200/_cat/shards/wazuh-alerts-4.x-2021.07.07*"
    wazuh-alerts-4.x-2021.07.07 1 p STARTED 27854 42.6mb 127.0.0.1 node-1
    wazuh-alerts-4.x-2021.07.07 2 p STARTED 27706   42mb 127.0.0.1 node-1
    wazuh-alerts-4.x-2021.07.07 0 p STARTED 28031   43mb 127.0.0.1 node-1
    Did I get it right?

  • In the above we can see the size to be around 42 MB per shard. Is this the shard size mentioned in the article (Tip: aim to keep the average shard size between at least a few GB and a few tens of GB)?

  • Do you know the reasoning behind "index.number_of_shards": "3" as the default? Is it needed for huge deployments, to keep the shard size smaller? Or is it required when you have more nodes?
    In this link, https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html, it is mentioned that the number of primary shards an index should have "Defaults to 1".


    /regards
    Filip

Alexander Bohorquez

Jul 8, 2021, 2:35:28 PM
to Wazuh mailing list
Hello Filip,

I apologize for the delay in replying,

Answering your questions:

1.- "That still does not clarifies to me where the limit of 1000  shards comes from. I guess it has something to do with java heap and memory limit."  
I will start with the definition from the shared blog: "Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster." An index can be divided into several shards to distribute the data between nodes (when there is more than one Elasticsearch node); with only one node, this is not necessary. The limit itself exists to avoid performance problems: each open shard consumes heap memory and file handles, so an unbounded shard count can degrade or destabilize the cluster.
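As a quick way to see how close the cluster is to the limit, the cluster health API reports the current shard counts (using the same admin credentials as in your earlier command):

curl -k -u admin:admin -XGET "https://localhost:9200/_cluster/health?pretty"

The "active_shards" and "active_primary_shards" fields show the shard usage to compare against the limit.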

This link: https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html contains even more information/details that could answer all your doubts.

2.- "modify the value from 3 (By default) to 1: ""index.number_of_shards ":"1""
-> If I understand correctly, this will change from 3 shards to one per day which will limit my total per year to 365 shards for wazuh alerts :
wazuh-alerts-4.x-2021.07.07 1 p STARTED 27854 42.6mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.07 2 p STARTED 27706   42mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.07 0 p STARTED 28031   43mb 127.0.0.1 node-1
Did I get it right ?" Yes, it will change that they are generated instead of 3 to 1 shard per day and it will generate fewer shards over time.

3.- "Do you know the reasoning behind    "index.number_of_shards": "3" as default ?" Because this configuration is intended for a three-node Elasticsearch cluster. By default, it is the most common option to have this structure for cluster functionality reasons.

4.- "Is it needed for huge deployments to keep the  shard size  smaller ? or it is required when you have more nodes ?" 

In the blog, this point is mentioned and it is because the larger the size/quantity of shards, the more time the cluster will take to process the information and therefore affect the performance.

Here is the relevant passage from the blog:

How does shard size affect performance?

In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard.
This means that the minimum query latency, when no caching is involved, will depend on the data, the type of query, as well as the size of the shard. Querying lots of small shards will make the processing per shard faster, but as many more tasks need to be queued up and processed in sequence, it is not necessarily going to be faster than querying a smaller number of larger shards. Having lots of small shards can also reduce the query throughput if there are multiple concurrent queries.

Also, this Elasticsearch documentation lists the considerations to keep in mind: https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#shard-sizing-considerations

I hope this information helps. Please let me know if you have any other questions!

Regards,

Alexander Bohorquez

filip faredge

Jul 11, 2021, 11:43:45 PM
to Wazuh mailing list
Hi Alexander,
That was a pretty quick response in my book :-)

Thank you for all your answers.

I just applied the proposed modifications; let's see the results tomorrow.

Regards
Filip

filip faredge

Jul 18, 2021, 10:49:11 PM
to Wazuh mailing list
Hi Alexander,

I think my problem is not fully resolved.

Wazuh is working correctly, but I noticed I have a lot of unassigned shards:
security-auditlog-2021.07.19 0 r UNASSIGNED
security-auditlog-2021.07.10 0 r UNASSIGNED
security-auditlog-2021.07.16 0 r UNASSIGNED
security-auditlog-2021.07.14 0 r UNASSIGNED
security-auditlog-2021.07.15 0 r UNASSIGNED
security-auditlog-2021.07.13 0 r UNASSIGNED
security-auditlog-2021.07.17 0 r UNASSIGNED
security-auditlog-2021.07.12 0 r UNASSIGNED
security-auditlog-2021.07.18 0 r UNASSIGNED
security-auditlog-2021.07.11 0 r UNASSIGNED

I did a bit of googling/research to figure out the problem, and to my understanding these are all related to replicas, which are not possible with my one-node setup.
In /etc/filebeat/wazuh-template.json I have "index.number_of_replicas" set to 0:

{
  "order": 0,
  "index_patterns": [
    "wazuh-alerts-4.x-*",
    "wazuh-archives-4.x-*"
  ],
  "settings": {
    "index.refresh_interval": "5s",
    "index.number_of_shards": "1",
    "index.number_of_replicas": "0",
    "index.auto_expand_replicas": "0-1",


This is probably not related, but should I maybe change "index.auto_expand_replicas" to false?

Do you know how to disable replicas for security-auditlog?

Research led me to this option: -XDELETE "https://127.0.0.1:9200/security-auditlog-2021.02.16", but I suppose this will delete all the shards of that index (as opposed to only the UNASSIGNED one), and I would rather not break my environment.
Is it possible to delete only the UNASSIGNED shards?
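While researching, I also found that _cat/shards can print the reason a shard is unassigned, which at least seems to confirm my replica theory (the column names are taken from the Elasticsearch _cat documentation, so treat this as my best guess):

curl -k -u admin:admin -XGET "https://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason"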


Kind regards
Filip

Alexander Bohorquez

Jul 23, 2021, 3:17:57 PM
to Wazuh mailing list
Hi Filip,

Sorry for the delay in replying. Answering your questions:

"This is probably not related but should I maybe change" "index.auto_expand_replicas" to false?" Yes, this parameter is defined by default but the recommendation in this case if you have only one node, would be to change it to "false".

"Do you know how to disable replicas for security-auditlog?" For this, you'll first need to check the current configuration of the index.

You can use Elasticsearch Dev-tools to check this:

For example:

GET security-auditlog-2021.05.06/_settings

{
  "security-auditlog-2021.05.06" : {
    "settings" : {
      "index" : {
        "creation_date" : "1620316510169",
        "number_of_shards" : "3",
        "number_of_replicas" : "1",
        "uuid" : "yxzJFZk3QXWv_p5_POaCUw",
        "version" : {
          "created" : "7100099"
        },
        "provided_name" : "security-auditlog-2021.05.06"
      }
    }
  }
}

In this case, if you want to reduce the number of shards or replicas, you can create a template for it:

PUT _index_template/auditlog
{
  "index_patterns": [
    "security-auditlog-*"
  ],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "auto_expand_replicas": false
    }
  }
}

This will cause future indices matching that index pattern to have the new settings applied.

You can use GET _cat/templates to check whether an index pattern already has a template, to avoid overwriting your settings.

For the existing unassigned shards, the option you have is deleting the indices, which removes the unassigned shards with them. Once the template applies at index creation, new indices won't have this problem.
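As an aside, and only as a sketch since we have not tested it in this thread: Elasticsearch also allows lowering the replica count of existing indices in place through the index settings API, which drops the unassigned replica shards without deleting any data:

PUT security-auditlog-*/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}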

I hope this information helps. Please let me know if you have any other questions!

filip faredge

Jul 27, 2021, 11:15:34 PM
to Wazuh mailing list
Hi Alexander,
Thank you for your reply. Your help is much appreciated!

I have some good and bad news.


First the good:
As per your suggestion, I executed this yesterday:
-XPUT https://localhost:9200/_index_template/auditlog -H 'Content-type: application/json' --data-binary $' {"index_patterns": [ "security-auditlog-*" ], "template": {"settings": {"number_of_shards": 1, "number_of_replicas": 0, "auto_expand_replicas": false }}}'

I verified the results using -XGET "https://localhost:9200/_index_template?pretty":
{
  "index_templates" : [
    {
      "name" : "auditlog",
      "index_template" : {
        "index_patterns" : [
          "security-auditlog-*"
        ],
        "template" : {
          "settings" : {
            "index" : {
              "number_of_shards" : "1",
              "auto_expand_replicas" : "false",
              "number_of_replicas" : "0"
            }
          }
        },
        "composed_of" : [ ]

And today I have this result:
security-auditlog-2021.07.24 0 p STARTED     933 201.4kb 127.0.0.1 node-1
security-auditlog-2021.07.24 0 r UNASSIGNED
security-auditlog-2021.07.23 0 p STARTED     971   203kb 127.0.0.1 node-1
security-auditlog-2021.07.23 0 r UNASSIGNED
security-auditlog-2021.07.20 0 p STARTED     963 266.8kb 127.0.0.1 node-1
security-auditlog-2021.07.20 0 r UNASSIGNED
security-auditlog-2021.07.28 0 p STARTED     172 206.7kb 127.0.0.1 node-1      -> no more  replica 
security-auditlog-2021.07.25 0 p STARTED     953 208.9kb 127.0.0.1 node-1
security-auditlog-2021.07.25 0 r UNASSIGNED
security-auditlog-2021.07.27 0 p STARTED    1028 289.5kb 127.0.0.1 node-1
security-auditlog-2021.07.27 0 r UNASSIGNED
security-auditlog-2021.07.22 0 p STARTED     960 265.4kb 127.0.0.1 node-1
security-auditlog-2021.07.22 0 r UNASSIGNED
security-auditlog-2021.07.21 0 p STARTED     983 280.5kb 127.0.0.1 node-1
security-auditlog-2021.07.21 0 r UNASSIGNED
security-auditlog-2021.07.26 0 p STARTED     982 282.7kb 127.0.0.1 node-1
security-auditlog-2021.07.26 0 r UNASSIGNED

In other words, great success!



------------------------------------------------------------------------------------------------------------------------------------
Small side note:
    While doing my checks I also noticed this, which led to more confusion:
    -XGET "https://127.0.0.1:9200/_cat/shards/wazuh-monitoring-2021.07.28*"
    wazuh-monitoring-2021.07.28 1 p STARTED 497 192.3kb 127.0.0.1 node-1
    wazuh-monitoring-2021.07.28 0 p STARTED 427 176.4kb 127.0.0.1 node-1

    So for anyone reading this: there is a third place (yikes!) to configure shard settings...
    After logging into Wazuh, go to Settings -> Configuration.
    There you can change the shard settings for wazuh-monitoring and wazuh-statistics.
------------------------------------------------------------------------------------------------------------------------------------

Now the weird stuff:
wazuh-alerts-4.x-2021.07.14 0 p STARTED 596344 781.2mb 127.0.0.1 node-1   ->good
wazuh-alerts-4.x-2021.07.18 0 p STARTED 498060 642.8mb 127.0.0.1 node-1    ->good
wazuh-alerts-4.x-2021.07.12 2 p STARTED 236699 335.2mb 127.0.0.1 node-1    
wazuh-alerts-4.x-2021.07.12 1 p STARTED 236829 332.8mb 127.0.0.1 node-1   
wazuh-alerts-4.x-2021.07.12 0 p STARTED 237499 333.5mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.17 0 p STARTED 520561 690.8mb 127.0.0.1 node-1    ->good
wazuh-alerts-4.x-2021.07.16 0 p STARTED 556151 722.9mb 127.0.0.1 node-1    ->good
wazuh-alerts-4.x-2021.07.13 0 p STARTED 725358 948.5mb 127.0.0.1 node-1    -> after applying changes in /etc/filebeat/wazuh-template.json  -> expected results 1 shard/ no replica -> all good
wazuh-alerts-4.x-2021.07.15 0 p STARTED 596922 786.5mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.19 0 p STARTED 620122 807.6mb 127.0.0.1 node-1   ->good
wazuh-alerts-4.x-2021.07.10 2 p STARTED  60061  81.5mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.10 1 p STARTED  60189  82.2mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.10 0 p STARTED  60285  81.7mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.11 2 p STARTED 102471 140.3mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.11 1 p STARTED 102444 141.1mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.11 0 p STARTED 103179 141.6mb 127.0.0.1 node-1

And something happened on the 19th/20th, and now I have replicas created daily, which I can't explain:
wazuh-alerts-4.x-2021.07.22 0 p STARTED    646123 831.8mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.22 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.28 0 p STARTED    103449 153.1mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.28 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.26 0 p STARTED    662604 878.9mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.26 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.23 0 p STARTED    592761 773.6mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.23 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.21 0 p STARTED    634629 826.5mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.21 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.24 0 p STARTED    470985 599.7mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.24 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.20 0 p STARTED    676460 909.7mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.20 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.25 0 p STARTED    545639   710mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.25 0 r UNASSIGNED
wazuh-alerts-4.x-2021.07.27 0 p STARTED    674310   888mb 127.0.0.1 node-1
wazuh-alerts-4.x-2021.07.27 0 r UNASSIGNED

My first lines from /etc/filebeat/wazuh-template.json:
{
  "order": 0,
  "index_patterns": [
    "wazuh-alerts-4.x-*",
    "wazuh-archives-4.x-*"
  ],
  "settings": {
    "opendistro.index_state_management.policy_id": "wazuh_hot_cold_workflow",
    "index.refresh_interval": "5s",
    "index.number_of_shards": "1",
    "index.number_of_replicas": "0",
    "index.auto_expand_replicas": "false",
    "index.mapping.total_fields.limit": 10000,

When I apply the config I get:
filebeat setup --index-management
ILM policy and write alias loading not enabled.

Index setup finished.

I checked the config using -XGET "https://localhost:9200/_template?pretty" and the config from above is applied. Top lines:
 "wazuh" : {
    "order" : 0,
    "version" : 1,
    "index_patterns" : [
      "wazuh-alerts-4.x-*",
      "wazuh-archives-4.x-*"
    ],
    "settings" : {
      "index" : {
        "mapping" : {
          "total_fields" : {
            "limit" : "10000"
          }
        },
        "opendistro" : {
          "index_state_management" : {
            "policy_id" : "wazuh_hot_cold_workflow"
          }
        },
        "refresh_interval" : "5s",
        "number_of_shards" : "1",
        "auto_expand_replicas" : "false",
<--  Part of results omitted  -->
          ]
        },
        "number_of_replicas" : "0"
      }
    },


What am I missing? Did I change some setting by accident?
How can I disable these wazuh-alerts replicas?

Regards,
Filip



Alexander Bohorquez

Jul 28, 2021, 4:20:33 PM
to Wazuh mailing list
Hi Filip,

Hope you're well,

The detail I'm seeing is that you have a retention policy configured in the Wazuh template (the wazuh_hot_cold_workflow policy_id), so it is applied when the indices are created.

The problem may be that this policy is configured to assign a replica when the index is created.

You can verify this by looking at the "wazuh_hot_cold_workflow" policy and checking whether it contains something like:

    {
        "replica_count": {
            "number_of_replicas": 1
        }
    }
],

If this is the cause, you must change the policy and set this value to "0".
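To inspect it, you can retrieve the policy through the ISM API (assuming the standard OpenDistro ISM endpoints):

GET _opendistro/_ism/policies/wazuh_hot_cold_workflow

If it needs editing, the update is a PUT to the same path, passing the if_seq_no and if_primary_term values returned by the GET as query parameters.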

I hope this helps. Please let me know how it goes!

filip faredge

Jul 29, 2021, 8:58:17 PM
to Wazuh mailing list
Hi Alexander

You nailed it!
It completely slipped my mind to check the wazuh_hot_cold_workflow policy that I created myself...

I have to say the settings are a little bit "all over the place", but at least I learned a lot and now have a much better understanding of the intricacies of shards.


Now the only question left for me: is there a way to clean up my environment and delete all the UNASSIGNED shards gracefully?
I read multiple articles, but to my understanding, if I run e.g. -XDELETE "https://127.0.0.1:9200/security-auditlog-2021.07.27" I will delete both shards from that day, and I would rather not break my environment, so I am reluctant to test that.
Is there a way to delete only the UNASSIGNED shards?

Once again, thank you for all the help!
/regards
Filip

Alexander Bohorquez

Aug 4, 2021, 10:53:04 AM
to Wazuh mailing list
Hello Filip,

Sorry for the delay,

Answering your question, 

Unfortunately, it is not possible to delete only the unassigned shards. In an environment with more Elasticsearch nodes it might be possible to relocate them, but since this was caused by a detail in the template configuration, the alternative here is to configure the template correctly (as you already did) and, for the existing unassigned shards, delete the affected indices, which removes the unassigned shards with them. If this happens with the Wazuh alerts indices, they could be deleted and the data then reindexed with the correct template in place.
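For the reindex step, the _reindex API can copy the documents into a new index that picks up the corrected template (a sketch only; the index names here are illustrative):

POST _reindex
{
  "source": { "index": "wazuh-alerts-4.x-2021.07.20" },
  "dest": { "index": "wazuh-alerts-4.x-2021.07.20-reindexed" }
}

After verifying the new index, the old one (and its unassigned replica shard) can be removed with a DELETE on the old index name.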

I hope this information helps.

Best regards,

Alexander Bohorquez

filip faredge

Aug 5, 2021, 8:37:47 PM
to Wazuh mailing list
Hi Alexander,
Once again thank you for clearing things up!

Finally my shards are under control!

Kind regards
Filip