Slow disk speed? and EtcdHighFsyncDurations alerts

1,065 views
Skip to first unread message

Lumír Jasiok

unread,
Nov 30, 2021, 10:00:04 AM11/30/21
to etcd-dev
Hi, 

we have ETCD 3.5.1 with three members running on k8s 1.19.9 (on VMware).

We are receiving alerts from Prometheus:

etcdHighNumberOfFailedGRPCRequests

and

etcdHighFsyncDurations

Also, I can see in the logs lines like this:

{"level":"warn","ts":"2021-11-30T14:50:22.374Z","caller":"etcdserver/raft.go:369","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"697a9b1567d7ec42","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"4.668974367s"}

Is all of this related to slow disk speed? We are not using SSD currently. 




Mike Spreitzer

unread,
Nov 30, 2021, 10:34:56 AM11/30/21
to Lumír Jasiok, etcd-dev

You would need more investigation to identify the root cause.

Here is an article about investigating your "disk" speed: https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd?mhsrc=ibmsearch_a&mhq=etcd%20fsync

Regards,
Mike


etcd...@googlegroups.com wrote on 11/30/2021 10:00:04 AM:

> From: "Lumír Jasiok" <lumir....@gmail.com>

>
> --
> You received this message because you are subscribed to the Google
> Groups "etcd-dev" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to etcd-dev+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/
> d/msgid/etcd-dev/bdab839d-5678-4b21-8158-3f32a01c29aen%40googlegroups.com.


Message has been deleted

Lumír Jasiok

unread,
Nov 30, 2021, 6:09:52 PM11/30/21
to etcd-dev
Hi Mike,

thanks for help. If I am interpreting fio output correctly, our disks are too slow and fsync/fdatasync is too slow (above 35ms). Also, our IOPS is horrible.


------------------------------------------------------------------------------------------------------------------------------------------------------------

Starting 1 process
Jobs: 1 (f=1): [W(1)][14.5%][w=1578KiB/s][w=702 IOPS][eta 17m:45s]
Jobs: 1 (f=1): [W(1)][100.0%][w=1745KiB/s][w=777 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=36: Tue Nov 30 22:36:54 2021
  write: IOPS=627, BW=1410KiB/s (1444kB/s)(2048MiB/1487253msec); 0 zone resets
    clat (usec): min=7, max=15271, avg=27.25, stdev=131.22
     lat (usec): min=7, max=15273, avg=27.65, stdev=131.25
    clat percentiles (usec):
     |  1.00th=[    9],  5.00th=[   10], 10.00th=[   10], 20.00th=[   12],
     | 30.00th=[   13], 40.00th=[   14], 50.00th=[   15], 60.00th=[   16],
     | 70.00th=[   19], 80.00th=[   21], 90.00th=[   28], 95.00th=[   42],
     | 99.00th=[  367], 99.50th=[  404], 99.90th=[ 1516], 99.95th=[ 3130],
     | 99.99th=[ 5538]
   bw (  KiB/s): min=  201, max= 2565, per=100.00%, avg=1411.22, stdev=529.56, samples=2971
   iops        : min=   89, max= 1142, avg=628.46, stdev=235.77, samples=2971
  lat (usec)   : 10=12.76%, 20=65.15%, 50=18.55%, 100=1.56%, 250=0.06%
  lat (usec)   : 500=1.70%, 750=0.07%, 1000=0.02%
  lat (msec)   : 2=0.04%, 4=0.06%, 10=0.03%, 20=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=456, max=208108, avg=1560.48, stdev=1913.38
    sync percentiles (usec):
     |  1.00th=[  553],  5.00th=[  586], 10.00th=[  603], 20.00th=[  635],
     | 30.00th=[  676], 40.00th=[ 1020], 50.00th=[ 1237], 60.00th=[ 1303],
     | 70.00th=[ 1352], 80.00th=[ 1467], 90.00th=[ 2704], 95.00th=[ 5014],
     | 99.00th=[ 9896], 99.50th=[11731], 99.90th=[16909], 99.95th=[20579],
     | 99.99th=[35390]
  cpu          : usr=0.63%, sys=3.45%, ctx=2091801, majf=0, minf=34
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,933689,0,0 short=933688,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1410KiB/s (1444kB/s), 1410KiB/s-1410KiB/s (1444kB/s-1444kB/s), io=2048MiB (2147MB), run=1487253-1487253msec

Disk stats (read/write):
  sdc: ios=17828/1945744, merge=0/0, ticks=9324/1797424, in_queue=1805990, util=94.29%
------------------------------------------------------------------------------------------------------------------------------------------------------------

Lumir 


Dne úterý 30. listopadu 2021 v 16:34:56 UTC+1 uživatel mspr...@us.ibm.com napsal:
Reply all
Reply to author
Forward
0 new messages