hspace says my clusters lack enough disk space to allocate any instances


Daniel Howard

Oct 26, 2023, 6:35:42 PM
to Ganeti Users list
Folks,

HAIL has seemed goofy these past years. Can someone help me out?

Ganeti Version: 3.0.1-1~ubuntu20.04+1

Our typical Ganeti node has:
 - 512GB of RAM
 - 4x 2TB SSDs

Our "medium" Ganeti instance:
 - 16GB RAM
 - 8 vcpus
 - 128GB disk via drbd

When I run hspace, I consistently get "FailDisk" but ... plenty of disk ... ?! Here's a smaller cluster:

15:23 djh@dr64-tomsk ~> sudo gnt-node list
Node       DTotal DFree MTotal  MNode  MFree Pinst Sinst
dr64-tomsk   7.0T  5.2T 376.5G 150.0G 220.5G    18    16
dr64-wuhan   7.0T  5.0T 376.5G 202.2G 237.4G    16    18
dr64-yalta   7.0T  5.2T 376.5G 140.4G 231.1G    16    16
15:25 djh@dr64-tomsk ~> sudo hspace --standard-alloc 128G,8G,8 --disk-template drbd -L -p

Initial cluster status:
 F Name        t_mem n_mem  i_mem  x_mem  f_mem  u_mem  r_mem t_dsk f_dsk pcpu vcpu pcnt scnt p_fmem p_fdsk r_cpu   lCpu   lMem   lDsk   lNet
   dr64-tomsk 385580  4096 241664 -86022 225842 139820 122880  7151  5333   48  168   18   16 0.3626 0.7457  3.50 18.000 18.000 34.000 18.000
   dr64-wuhan 385580  4096 192512 -54101 243073 188972 122880  7151  5157   48  142   16   18 0.4901 0.7211  2.96 16.000 16.000 34.000 16.000
   dr64-yalta 385580  4096 217088 -72213 236609 164396 122880  7151  5363   48  154   16   16 0.4264 0.7500  3.21 16.000 16.000 32.000 16.000

The cluster has 3 nodes and the following resources:
  MEM 1156740, DSK 21968640, CPU 144, VCPU 576.
There are 50 initial instances on the cluster.
Tiered (initial size) instance spec is:
  MEM 98304, DSK 1048576, CPU 32, using disk template 'drbd'.

Tiered allocation status:
 F Name        t_mem n_mem  i_mem  x_mem  f_mem  u_mem  r_mem t_dsk f_dsk pcpu vcpu pcnt scnt p_fmem p_fdsk r_cpu   lCpu   lMem   lDsk   lNet
   dr64-tomsk 385580  4096 241664 -86022 225842 139820 122880  7151  5333   48  168   18   16 0.3626 0.7457  3.50 18.000 18.000 34.000 18.000
   dr64-wuhan 385580  4096 192512 -54101 243073 188972 122880  7151  5157   48  142   16   18 0.4901 0.7211  2.96 16.000 16.000 34.000 16.000
   dr64-yalta 385580  4096 217088 -72213 236609 164396 122880  7151  5363   48  154   16   16 0.4264 0.7500  3.21 16.000 16.000 32.000 16.000

Tiered allocation results:
  - no instances allocated
  - most likely failure reason: FailDisk
  - initial cluster score: 4.31138397
  -   final cluster score: 4.31138397
  - memory usage efficiency: 56.30%
  -   disk usage efficiency: 26.10%
  -   vcpu usage efficiency: 80.56%
Standard (fixed-size) instance spec is:
  MEM 7629, DSK 122070, CPU 8, using disk template 'drbd'.

Standard allocation status:
 F Name        t_mem n_mem  i_mem  x_mem  f_mem  u_mem  r_mem t_dsk f_dsk pcpu vcpu pcnt scnt p_fmem p_fdsk r_cpu   lCpu   lMem   lDsk   lNet
   dr64-tomsk 385580  4096 241664 -86022 225842 139820 122880  7151  5333   48  168   18   16 0.3626 0.7457  3.50 18.000 18.000 34.000 18.000
   dr64-wuhan 385580  4096 192512 -54101 243073 188972 122880  7151  5157   48  142   16   18 0.4901 0.7211  2.96 16.000 16.000 34.000 16.000
   dr64-yalta 385580  4096 217088 -72213 236609 164396 122880  7151  5363   48  154   16   16 0.4264 0.7500  3.21 16.000 16.000 32.000 16.000

Normal (fixed-size) allocation results:
  -   0 instances allocated
  - most likely failure reason: FailDisk
  - initial cluster score: 4.31138397
  -   final cluster score: 4.31138397
  - memory usage efficiency: 56.30%
  -   disk usage efficiency: 26.10%
  -   vcpu usage efficiency: 80.56%


For a larger cluster:

djh@dr64-frown ~> sudo gnt-node list
Node       DTotal DFree MTotal  MNode  MFree Pinst Sinst
dr64-frown   7.0T  4.2T 503.3G 328.9G 220.2G    19    20
dr64-india   7.0T  4.6T 503.3G 305.8G 212.6G    22    18
dr64-macau   7.0T  4.4T 503.3G 188.0G 297.7G    12    28
dr64-malta   7.0T  4.1T 503.3G 293.9G 206.5G    23    16
dr64-mauve   7.0T  4.2T 503.3G 255.9G 214.5G    22    17
dr64-mocha   7.0T  4.5T 503.3G 263.2G 222.7G    27    12
dr64-nauru   7.0T  3.6T 503.3G 143.9G 314.2G    13    27
dr64-nepal   7.0T  3.7T 503.3G 209.2G 298.5G    14    25
dr64-samoa   7.0T  4.4T 503.3G 148.9G 327.4G    23    17
dr64-twice   7.0T  4.2T 503.4G 276.2G 208.6G    22    17
djh@dr64-frown ~> sudo hspace --standard-alloc 64G,8G,4 --disk-template drbd -L -p
Warning: cluster has inconsistent data:
  - node dr64-mocha is missing -76829 MB ram and 184 GB disk


Initial cluster status:
 F Name        t_mem n_mem  i_mem   x_mem  f_mem  u_mem  r_mem t_dsk f_dsk pcpu vcpu pcnt scnt p_fmem p_fdsk r_cpu   lCpu   lMem   lDsk   lNet
   dr64-frown 515429  4096 389120  -95135 217348 122213  81920  7154  4300   48  248   20   19 0.2371 0.6012  5.17 20.000 20.000 39.000 20.000
   dr64-india 515429  4096 368640  -74678 217371 142693  86016  7154  4696   48  226   22   18 0.2768 0.6565  4.71 22.000 22.000 40.000 22.000
   dr64-macau 515429  4096 290816  -91818 312335 220517 176128  7154  4554   48  190   11   29 0.4278 0.6367  3.96 11.000 11.000 40.000 11.000
   dr64-malta 515429  4096 372736  -73258 211855 138597  73728  7154  4230   48  231   23   16 0.2689 0.5913  4.81 23.000 23.000 39.000 23.000
   dr64-mauve 515429  4096 348160  -56503 219676 163173  81920  7154  4252   48  218   22   17 0.3166 0.5945  4.54 22.000 22.000 39.000 22.000
   dr64-mocha 515429  4096 360448  -76829 227714 150885  98304  7154  4440   48  226   27   12 0.2927 0.6207  4.71 27.000 27.000 39.000 27.000
   dr64-nauru 515429  4096 270336  -88236 329233 240997 131072  7154  3681   48  180   12   27 0.4676 0.5147  3.75 12.000 12.000 39.000 12.000
   dr64-nepal 515429  4096 286720  -81212 305825 224613 131072  7154  3752   48  182   14   25 0.4358 0.5246  3.79 14.000 14.000 39.000 14.000
   dr64-samoa 515429  4096 360448 -176053 326938 150885  73728  7154  4488   48  224   24   16 0.2927 0.6274  4.67 24.000 24.000 40.000 24.000
   dr64-twice 515442  4096 389120  -91463 213689 122226  73728  7154  4266   48  224   22   18 0.2371 0.5963  4.67 22.000 22.000 40.000 22.000

The cluster has 10 nodes and the following resources:
  MEM 5154303, DSK 73254400, CPU 480, VCPU 1920.
There are 197 initial instances on the cluster.
Tiered (initial size) instance spec is:
  MEM 98304, DSK 1048576, CPU 32, using disk template 'drbd'.

Tiered allocation status:
 F Name        t_mem n_mem  i_mem   x_mem  f_mem  u_mem  r_mem t_dsk f_dsk pcpu vcpu pcnt scnt p_fmem p_fdsk r_cpu   lCpu   lMem   lDsk   lNet
   dr64-frown 515429  4096 389120  -95135 217348 122213  81920  7154  4300   48  248   20   19 0.2371 0.6012  5.17 20.000 20.000 39.000 20.000
   dr64-india 515429  4096 368640  -74678 217371 142693  86016  7154  4696   48  226   22   18 0.2768 0.6565  4.71 22.000 22.000 40.000 22.000
   dr64-macau 515429  4096 290816  -91818 312335 220517 176128  7154  4554   48  190   11   29 0.4278 0.6367  3.96 11.000 11.000 40.000 11.000
   dr64-malta 515429  4096 372736  -73258 211855 138597  73728  7154  4230   48  231   23   16 0.2689 0.5913  4.81 23.000 23.000 39.000 23.000
   dr64-mauve 515429  4096 348160  -56503 219676 163173  81920  7154  4252   48  218   22   17 0.3166 0.5945  4.54 22.000 22.000 39.000 22.000
   dr64-mocha 515429  4096 360448  -76829 227714 150885  98304  7154  4440   48  226   27   12 0.2927 0.6207  4.71 27.000 27.000 39.000 27.000
   dr64-nauru 515429  4096 270336  -88236 329233 240997 131072  7154  3681   48  180   12   27 0.4676 0.5147  3.75 12.000 12.000 39.000 12.000
   dr64-nepal 515429  4096 286720  -81212 305825 224613 131072  7154  3752   48  182   14   25 0.4358 0.5246  3.79 14.000 14.000 39.000 14.000
   dr64-samoa 515429  4096 360448 -176053 326938 150885  73728  7154  4488   48  224   24   16 0.2927 0.6274  4.67 24.000 24.000 40.000 24.000
   dr64-twice 515442  4096 389120  -91463 213689 122226  73728  7154  4266   48  224   22   18 0.2371 0.5963  4.67 22.000 22.000 40.000 22.000

Tiered allocation results:
  - no instances allocated
  - most likely failure reason: FailDisk
  - initial cluster score: 43.11738389
  -   final cluster score: 43.11738389
  - memory usage efficiency: 66.67%
  -   disk usage efficiency: 40.36%
  -   vcpu usage efficiency: 111.93%
Standard (fixed-size) instance spec is:
  MEM 7629, DSK 61035, CPU 4, using disk template 'drbd'.

Standard allocation status:
 F Name        t_mem n_mem  i_mem   x_mem  f_mem  u_mem  r_mem t_dsk f_dsk pcpu vcpu pcnt scnt p_fmem p_fdsk r_cpu   lCpu   lMem   lDsk   lNet
   dr64-frown 515429  4096 389120  -95135 217348 122213  81920  7154  4300   48  248   20   19 0.2371 0.6012  5.17 20.000 20.000 39.000 20.000
   dr64-india 515429  4096 368640  -74678 217371 142693  86016  7154  4696   48  226   22   18 0.2768 0.6565  4.71 22.000 22.000 40.000 22.000
   dr64-macau 515429  4096 290816  -91818 312335 220517 176128  7154  4554   48  190   11   29 0.4278 0.6367  3.96 11.000 11.000 40.000 11.000
   dr64-malta 515429  4096 372736  -73258 211855 138597  73728  7154  4230   48  231   23   16 0.2689 0.5913  4.81 23.000 23.000 39.000 23.000
   dr64-mauve 515429  4096 348160  -56503 219676 163173  81920  7154  4252   48  218   22   17 0.3166 0.5945  4.54 22.000 22.000 39.000 22.000
   dr64-mocha 515429  4096 360448  -76829 227714 150885  98304  7154  4440   48  226   27   12 0.2927 0.6207  4.71 27.000 27.000 39.000 27.000
   dr64-nauru 515429  4096 270336  -88236 329233 240997 131072  7154  3681   48  180   12   27 0.4676 0.5147  3.75 12.000 12.000 39.000 12.000
   dr64-nepal 515429  4096 286720  -81212 305825 224613 131072  7154  3752   48  182   14   25 0.4358 0.5246  3.79 14.000 14.000 39.000 14.000
   dr64-samoa 515429  4096 360448 -176053 326938 150885  73728  7154  4488   48  224   24   16 0.2927 0.6274  4.67 24.000 24.000 40.000 24.000
   dr64-twice 515442  4096 389120  -91463 213689 122226  73728  7154  4266   48  224   22   18 0.2371 0.5963  4.67 22.000 22.000 40.000 22.000

Normal (fixed-size) allocation results:
  -   0 instances allocated
  - most likely failure reason: FailDisk
  - initial cluster score: 43.11738389
  -   final cluster score: 43.11738389
  - memory usage efficiency: 66.67%
  -   disk usage efficiency: 40.36%
  -   vcpu usage efficiency: 111.93%

Can anyone please give me a hint of what is going on here? Thank you!!

-danny

--

Daniel Howard

Oct 26, 2023, 8:47:08 PM
to ganeti
Spindles!

djh@dr64-frown ~> sudo gnt-cluster info | grep -- -ratio
  vcpu-ratio: 4
  spindle-ratio: 32


The obvious tell on the smaller cluster was that the nodes were maxing out at 32 primary/secondary instances. (We had forced a few more ...)

So:
  1. if you run out of "CPU" check the vcpu-ratio
  2. if you run out of "disk" then check disk ... and check the spindle-ratio.
Modify these values thus, for example:

gnt-cluster modify --ipolicy-spindle-ratio=64
gnt-cluster modify --ipolicy-vcpu-ratio=8
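
For anyone who wants to check their own numbers before touching the policy, here is a rough Python sketch of the arithmetic described above. It is not Ganeti's code. It assumes every primary or secondary instance uses one spindle and every node counts as one spindle, which matches the "maxing out at 32" observation but may not hold if spindle settings were customized. The function names (spindle_headroom, drbd_placeable, vcpu_efficiency) are made up for illustration; the figures are the pcnt, scnt, pcpu and vcpu columns from the smaller cluster's hspace output.

```python
# Figures copied from the smaller cluster's hspace output in this thread.
# name: (pcnt, scnt, pcpu, vcpu)
NODES = {
    "dr64-tomsk": (18, 16, 48, 168),
    "dr64-wuhan": (16, 18, 48, 142),
    "dr64-yalta": (16, 16, 48, 154),
}


def spindle_headroom(pcnt, scnt, spindle_ratio, node_spindles=1):
    """Instances a node could still take before reaching the spindle limit.

    Assumption: one spindle per primary or secondary instance, and
    `node_spindles` spindles per node (1 here).
    """
    return spindle_ratio * node_spindles - (pcnt + scnt)


def drbd_placeable(nodes, spindle_ratio):
    """A drbd instance needs a primary and a secondary node,
    so at least two nodes must have room for one more instance."""
    with_room = [
        name
        for name, (pcnt, scnt, _, _) in nodes.items()
        if spindle_headroom(pcnt, scnt, spindle_ratio) >= 1
    ]
    return len(with_room) >= 2


def vcpu_efficiency(nodes, vcpu_ratio):
    """Allocated vCPUs as a fraction of physical CPUs times vcpu_ratio."""
    used = sum(vcpu for _, _, _, vcpu in nodes.values())
    capacity = sum(pcpu for _, _, pcpu, _ in nodes.values()) * vcpu_ratio
    return used / capacity


if __name__ == "__main__":
    for ratio in (32, 64):
        for name, (pcnt, scnt, _, _) in NODES.items():
            print(ratio, name, spindle_headroom(pcnt, scnt, ratio))
        print(ratio, "drbd placeable:", drbd_placeable(NODES, ratio))
    print(f"vcpu usage: {vcpu_efficiency(NODES, 4):.2%}")
```

Under those assumptions, a ratio of 32 leaves the three nodes with headroom of -2, -2 and 0, so no drbd pair fits, which is consistent with the FailDisk result; a ratio of 64 leaves 30, 30 and 32. The vcpu figure comes out at 80.56%, the same as hspace reports for that cluster.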

-danny

c sights

Oct 26, 2023, 11:03:57 PM
to gan...@googlegroups.com
Thanks! We haven't run into this problem yet but it would be quite annoying to figure out! Hi future me!

C.

Rudolph Bott

Oct 27, 2023, 3:43:14 AM
to gan...@googlegroups.com
Hi Daniel,

we finally decided to set spindle-ratio to something like 1000 on all of our clusters because it has bitten us over and over again in the past. I think the underlying concept of spindles does not make any sense on today's systems (first it was made obsolete by RAID controllers hiding the real disk topology, and eventually we got SSDs, which do not have problems with concurrent access anyway).
I have not yet looked into this code-wise, but I think it would be good to drop the concept of spindles from Ganeti altogether. Until then, documenting this topic better (probably along with 'good default settings for your new cluster') should be the way to go.

Cheers,
Rudi



--
 Rudolph Bott - bo...@sipgate.de

 sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
 HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
 Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391
