A proposal to improve db_bench/benchmark.sh


Bernard Jiang

Apr 27, 2024, 1:32:18 PM

to rocksdb

Hi RocksDB community! I am Bernard Jiang, a master's student at SPAIL (System Performance Analytics and Intelligence Lab) at Zhejiang University. My current research direction is system performance analysis.

I'm doing performance testing on RocksDB via tools/benchmark.sh based on db_bench. I chose tools/benchmark.sh since it is the recommended benchmark tool in the RocksDB project, and it has been used to test the performance of several versions.

However, some basic analysis showed that the symbol rocksdb::Stats::FinishedOps in db_bench consumes a lot of cycles in multi-threaded randomread runs. As the number of threads increases, so does the share of cycles attributed to this symbol in the perf record samples.

I suppose there might be something wrong with that. Such an issue can cause db_bench to report inaccurate values in multi-threaded conditions that do not reflect the performance level of RocksDB.

I explored this and conducted a series of experiments. My current conclusion is that the default randomread run in db_bench incurs a significant performance overhead in a multi-threaded environment. The cause is the design of a feature, introduced nine years ago, that periodically writes QPS to a single CSV file.

Next, I'm going to introduce my experiments.

The Environment of Experiment

RocksDB: version 9.2.0
CPU: 2 * Intel(R) Xeon(R) Platinum 8383C CPU @ 2.70GHz
- HyperThreading ON
- 40 cores per socket
- 160 hardware threads
CPU Cache: 61440 KB
Memory: 512 GB
OS: Ubuntu 22.04 5.15.0-102-generic
Workload Description: randomread in tools/benchmark.sh
- I configured the CPU affinity of the task via taskset.
- I used the following parameters to run the randomread benchmark. The only parameter that differs across the experiments below is NUM_THREADS.

# the parameters of benchmark.sh
export DB_DIR="./db"
export WAL_DIR=./wal
export NUM_KEYS=900000000
export CACHE_SIZE=6442450944
export DURATION=300
export NUM_THREADS=1 # only this changed in the following different experiments

./tools/benchmark.sh randomread

The results are somewhat noisy, but should be enough to get a ballpark performance estimate.
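As a rough sketch of how such a thread sweep could be scripted, the helper below pins each run with taskset and overrides only NUM_THREADS. The function names `cpu_list_for` and `run_randomread`, the `DRY_RUN` switch, and the assumption that logical CPUs 0..N-1 are distinct physical cores on one socket are illustrative and not taken from my actual setup; check `lscpu -e` before relying on that numbering.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes DB_DIR, WAL_DIR, NUM_KEYS, CACHE_SIZE and DURATION
# are already exported as shown above.
# Assumption: logical CPUs 0..N-1 map to distinct physical cores of one
# socket; CPU numbering differs between machines.

# Print the CPU list to pass to `taskset -c` for a given thread count.
cpu_list_for() {
  local threads=$1
  if [ "$threads" -le 1 ]; then
    echo "0"
  else
    echo "0-$((threads - 1))"
  fi
}

# Run one randomread pass, or only print the command when DRY_RUN=1.
run_randomread() {
  local threads=$1
  local cpus
  cpus=$(cpu_list_for "$threads")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "NUM_THREADS=$threads taskset -c $cpus ./tools/benchmark.sh randomread"
  else
    NUM_THREADS=$threads taskset -c "$cpus" ./tools/benchmark.sh randomread
  fi
}
```

A sweep would then look like `for t in 1 2 4 8 16 20 40; do run_randomread "$t"; done`, where the thread counts are just example values.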

1 thread vs. 160 threads

I started by comparing the case where the number of threads is equal to 1 and 160. Note that a thread count of 160 means that all threads on the server (distributed over 2 CPU sockets) are allocated to db_bench. I got the data below.

We can see that at 160 threads, the Ops_Sec is only about 7.4 times that of 1 thread.

At the same time, the CPI is much higher than in the 1-thread case (7.59 vs. 0.36).
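For readers unfamiliar with the metric, CPI is cycles per instruction, so it can be derived from two hardware counters. The sketch below assumes the counts come from something like `perf stat -e cycles,instructions`; the helper name `cpi` is hypothetical.

```shell
# cpi CYCLES INSTRUCTIONS: print cycles per instruction with two decimals.
# Both arguments are raw counter values, for example as reported by
# `perf stat -e cycles,instructions` for the db_bench process.
cpi() {
  awk -v c="$1" -v i="$2" 'BEGIN { printf "%.2f\n", c / i }'
}
```

A higher CPI means each instruction takes more cycles on average, which is typical when threads stall on shared cache lines or locks.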

These numbers lead me to suspect that perhaps there are mutexes or global variables that limit performance in multithreaded scenarios.

To further determine the cause, I conducted the following experiment.

From 1 to 40 threads

I configured CPU affinity to distribute all the threads to different physical cores of the same processor, and increased the number of threads from 1 to 40.

Once the number of threads exceeds 8, the throughput metric ops_sec stops increasing steadily.

I also used perf record for observation (perf record -F 97). In the output of perf report, the share of cycle samples attributed to the symbol rocksdb::Stats::FinishedOps rises as the number of threads increases. With more than 20 threads, this symbol accounts for more than 80% of all cycle samples. Note that with 1 thread, this symbol accounts for only a small percentage (<5%).
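The per-symbol share can be pulled out of the text report with a small filter. This is a sketch, not the exact procedure I used: the helper name `symbol_share` is hypothetical, and it assumes the report is produced with `--stdio` so that each data line starts with an overhead percentage.

```shell
# symbol_share SYMBOL: read `perf report --stdio` text on stdin and print
# the summed overhead (in percent) of every line that mentions SYMBOL.
symbol_share() {
  awk -v sym="$1" '
    /^#/ { next }                       # skip perf header comments
    index($0, sym) && $1 ~ /%$/ {       # data line for the wanted symbol
      v = $1; sub(/%$/, "", v); total += v
    }
    END { printf "%.2f\n", total }
  '
}

# Example (requires perf and a running db_bench):
#   perf record -F 97 -p "$(pidof db_bench)" -- sleep 60
#   perf report --stdio --no-children --sort sym \
#     | symbol_share 'rocksdb::Stats::FinishedOps'
```

Passing `--no-children` keeps the first column as self overhead in case call graphs were recorded.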

Based on the experiments conducted, it is evident that performance improvement under multi-threading scenarios does not exhibit a linear growth pattern. This observation implies the presence of contention or blocking phenomena, which hinder the efficient utilization of parallel resources.

Remove Bottleneck

I did a series of analyses and finally found that the biggest performance bottleneck was the reporter_agent call in the function rocksdb::Stats::FinishedOps. After removing this call, performance in multi-threaded scenarios improved greatly.

The ops_sec with 160 threads is now about 61 times that with 1 thread (323795384 vs. 5293853).
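To make the arithmetic explicit, the sketch below computes the speedup and the parallel efficiency from the two ops_sec values above. The helper names `speedup` and `efficiency` are hypothetical, and efficiency is taken relative to ideal linear scaling over all 160 hardware threads.

```shell
# speedup BASE_OPS MULTI_OPS: throughput ratio with two decimals.
speedup() {
  awk -v b="$1" -v m="$2" 'BEGIN { printf "%.2f\n", m / b }'
}

# efficiency BASE_OPS MULTI_OPS THREADS: percent of ideal linear scaling.
efficiency() {
  awk -v b="$1" -v m="$2" -v t="$3" 'BEGIN { printf "%.1f\n", 100 * m / (b * t) }'
}

speedup 5293853 323795384          # prints 61.16
efficiency 5293853 323795384 160   # prints 38.2
```

By the same measure, the 7.4x speedup seen before the change corresponds to about 4.6% of ideal scaling.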

When I repeated the 1 to 40 threads randomread experiment above, ops_sec grew almost linearly as the number of threads increased.

image (2).png

I have two suggestions for improvement: a) add a db_bench parameter that controls this feature and leave it disabled by default, and b) modify the implementation of this feature to reduce its overhead.

I also have some questions: I want to analyze random read, random write, and mixed random read/write performance for RocksDB. Do you have a recommended benchmark, or recommended db_bench parameters?

If so, would it be possible to make them the default parameters for benchmark.sh?

Thanks in advance,

Bernard Jiang
