Hello team,
Setup details:
etcd Version: 3.5.4
Number of nodes/members in etcd cluster: 3
Deployment details: etcd pods are deployed on the same Kubernetes cluster and scheduled on different but co-located Kubernetes nodes. These nodes run no other workloads and showed no I/O, memory, or CPU stress at any point during the tests.
Resource requests/limits: CPU: 16/32 ; memory: 128GiB/128GiB
etcd data directory: memory-backed emptyDir mounted on the pods; the size of this emptyDir is set to 64000000000 bytes (64GB).
Environment variables: ETCD_QUOTA_BACKEND_BYTES = 64000000000 ; ETCD_SNAPSHOT_COUNT = 100000.
Configuration parameters/arguments passed: heartbeat-interval = 250 ; election-timeout = 2000 ; other etcd configuration params left at their default values
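For concreteness, the settings above correspond roughly to the invocation sketched below. This is only a sketch: the data directory path /var/etcd is assumed from the fio command used later in this post, and the member-specific flags (name, peer and client URLs, initial cluster) are left out.

```shell
# Values copied from the setup list above.
export ETCD_QUOTA_BACKEND_BYTES=64000000000   # backend quota, 64 GB
export ETCD_SNAPSHOT_COUNT=100000

# Assumption: the memory-backed emptyDir is mounted at /var/etcd.
# heartbeat-interval and election-timeout are given in milliseconds.
ETCD_ARGS="--data-dir=/var/etcd --heartbeat-interval=250 --election-timeout=2000"

# Printed rather than executed; member-specific flags are intentionally omitted.
echo "etcd ${ETCD_ARGS}"
```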
Expectations:
With a memory-backed etcd, we expect the benchmark results to be at least as fast as the official numbers, and possibly faster.
Benchmark Results:
We first ran the write benchmark using the same commands shared on the official benchmark results page. To our surprise we obtained very dismal results:
|---------|----------|------------|-------------|-----------|-------------|-----------|-----------------|
| Number | Key size | Value size | Number of | Number of | Target etcd | Average | Average latency |
| of keys | in bytes | in bytes | connections | clients | server | write QPS | per request |
|---------|----------|------------|-------------|-----------|-------------|-----------|-----------------|
| 10,000 | 8 | 256 | 1 | 1 | leader only | 1805 | 0.5ms |
| 100,000 | 8 | 256 | 100 | 1000 | leader only | 12,355 | 90.8ms |
| 100,000 | 8 | 256 | 100 | 1000 | all members | 12,351 | 95.3ms |
|---------|----------|------------|-------------|-----------|-------------|-----------|-----------------|
As can be seen above, while the QPS for the first test surpasses the official result (as expected), the other two tests report a QPS that is about a quarter of the official results. The average latency per request for these tests is also about 4x the official numbers.
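For reference, the three write tests in the table correspond to benchmark invocations of roughly the following form. This is a sketch reconstructed from the official benchmark page and should be checked against that page for the version in use. The endpoint addresses are placeholders, and the run helper is a hypothetical wrapper that only prints each command unless RUN=1 is set.

```shell
# Placeholder endpoints (assumptions); replace with the real member addresses.
HOST_1="${HOST_1:-http://etcd-0.etcd:2379}"
HOST_2="${HOST_2:-http://etcd-1.etcd:2379}"
HOST_3="${HOST_3:-http://etcd-2.etcd:2379}"

# Dry run by default: print each command. Set RUN=1 to execute it instead.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

write_benchmarks() {
    # Test 1: 10,000 keys, 1 connection, 1 client, leader only.
    run benchmark --endpoints="$HOST_1" --target-leader --conns=1 --clients=1 \
        put --key-size=8 --sequential-keys --total=10000 --val-size=256
    # Test 2: 100,000 keys, 100 connections, 1000 clients, leader only.
    run benchmark --endpoints="$HOST_1" --target-leader --conns=100 --clients=1000 \
        put --key-size=8 --sequential-keys --total=100000 --val-size=256
    # Test 3: 100,000 keys, 100 connections, 1000 clients, all members.
    run benchmark --endpoints="$HOST_1,$HOST_2,$HOST_3" --conns=100 --clients=1000 \
        put --key-size=8 --sequential-keys --total=100000 --val-size=256
}

write_benchmarks
```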
We didn't perform read tests since we were surprised by the results of the write tests and wanted to dig deeper. We therefore decided to first run the etcdctl check perf tool to check whether our cluster passes the tests for the various load sizes, and then ran fio against the etcd data dir to check the performance of our memory.
Results of etcdctl check perf:
|------|-------|-------------|----------|--------|
| load | QPS | Slowest | Stddev | Result |
| size | | request (s) | (s) | |
|------|-------|-------------|----------|--------|
| s | 151 | 0.003675 | 0.000225 | Pass |
| m | 997 | 0.006241 | 0.000226 | Pass |
| l | 7885 | 0.040049 | 0.001170 | Pass |
| xl | 14126 | 0.151048 | 0.008136 | Pass |
|------|-------|-------------|----------|--------|
The xl load size option uses 1000 clients to issue write requests with a key size of 256 bytes and a value size of 1024 bytes for 60s. The resulting QPS is similar to the QPS observed in the benchmark test with 1000 clients, and only barely exceeds the pass threshold (13500). So while the cluster passes every load test, it still falls well short of the official benchmark numbers.
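The four load sizes in the table can be exercised with a loop along these lines. This is a sketch, not necessarily the exact invocation used here: the endpoint is a placeholder, and run is a hypothetical helper that prints each command unless RUN=1 is set.

```shell
# Placeholder endpoint (assumption); replace with a real member address.
ENDPOINTS="${ENDPOINTS:-http://etcd-0.etcd:2379}"

# Dry run by default: print each command. Set RUN=1 to execute it instead.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

check_perf_all() {
    # s, m, l and xl are the load sizes accepted by the --load flag.
    for load in s m l xl; do
        run etcdctl --endpoints="$ENDPOINTS" check perf --load="$load"
    done
}

check_perf_all
```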
Results of running fio:
Before running fio, we ran strace during the first benchmark test to measure the average size of the write calls made to the WAL file, which came to 4767 bytes. We used that value as the block size (bs) for the fio tests, which we ran with the following command:
for i in 476700000 4767000000 47670000000; do
fio --rw=write --ioengine=sync --fdatasync=1 --size=${i}b --bs=4767 --filename=/var/etcd/fio-test --name=write_test
done
We obtained the following results:
|-------------|---------|----------|---------------|---------|
| size | IOPS | p99 clat | p99 fdatasync | BW |
| (bytes) | (avg) | (us) | (us) | (MiB/s) |
|-------------|---------|----------|---------------|---------|
| 476700000 | 2380000 | 5.472 | 0.596 | 1080 |
| 4767000000 | 2390000 | 5.408 | 0.612 | 1087 |
| 47670000000 | 2460000 | 5.344 | 0.652 | 1120 |
|-------------|---------|----------|---------------|---------|
The above results suggest that our memory is fast enough and can sustain a write throughput far greater than what we are seeing in the etcd benchmark tests.
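The average write size used as the fio block size can be derived from strace output along these lines. This is a minimal sketch and not necessarily the exact procedure used here: avg_write_bytes is a hypothetical helper, and the WAL file descriptor number has to be looked up separately (for example under /proc/PID/fd).

```shell
# avg_write_bytes FD: read strace output on stdin and print the mean number of
# bytes returned by successful write() calls on file descriptor FD.
avg_write_bytes() {
    awk -v fd="$1" '
        # Match lines such as: write(12, "..."..., 4767) = 4767
        # The last field is the return value, i.e. the bytes actually written.
        $0 ~ ("write\\(" fd ",") && $NF ~ /^[0-9]+$/ { sum += $NF; n++ }
        END { if (n) printf "%d\n", sum / n; else print "no matching writes" }
    '
}

# Example with canned strace lines: only the two writes on fd 12 are averaged.
printf 'write(12, "a"..., 4000) = 4000\nwrite(12, "b"..., 5000) = 5000\nwrite(7, "c"..., 10) = 10\n' \
    | avg_write_bytes 12
```

On a live system the input would come from a capture such as strace -f -e trace=write -o FILE -p PID taken while the benchmark is running, with FILE then fed to avg_write_bytes on stdin.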
Question:
Based on the above results, we believe that our memory is unlikely to be the bottleneck and that something else in our setup is suboptimal and causing the poor performance. Could you please help us with the following:
- What is the likely cause of the results shared above? What should we check or focus on?
- What are some suggestions to better tune our etcd cluster?
- Please share the etcd params/configurations used when performing the official benchmark tests. We would like to use the same config to replicate the results on our end.
Thanks.