Why no linear scaling of Redis Cluster?


Bin He
Aug 15, 2016, 8:24:27 AM
to Redis DB
I'd like to build a horizontally scaling system based on Redis Cluster, so I have measured the throughput of Redis Cluster with different numbers of nodes. But the results don't seem to show the linear scaling for Redis Cluster stated in the cluster spec: "High performance and linear scalability up to 1000 nodes."
I am not sure what's wrong with Redis Cluster or my environment.

My test environment:
  • Server: 
    • HW: HP BL460c G9, 24 CPUs (E5-2620 v3 @ 2.40GHz), 64 GB memory and a 300 GB disk.
      I have two machines with the above specs. In order to know the capacity of a single machine, one machine runs all the master nodes in the Redis cluster and the other machine runs all the slave nodes.

    • OS: SLES 12
      I have updated some system configuration:
echo 65535 > /proc/sys/net/core/somaxconn           # raise the TCP listen backlog
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog # allow more half-open connections
echo never > /sys/kernel/mm/transparent_hugepage/enabled  # THP causes latency spikes and slow forks for Redis
sysctl vm.overcommit_memory=1   # let Redis fork for AOF rewrite even when memory is nearly full
sysctl vm.swappiness=0          # keep the kernel from swapping Redis memory

Furthermore, I turned off swap, because swapping caused very unstable throughput when an AOF rewrite happened. As observed, 15 million records occupy around 48 GB of memory.
    • Redis 3.0.6: To eliminate the bursts caused by RDB, I turned off RDB entirely and enabled only AOF (a sketch of the relevant settings follows below). All other settings in redis.conf were left at their default values.
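      The persistence-related part of redis.conf would then look roughly like this (a minimal sketch; the appendfsync policy and rewrite thresholds shown are the Redis defaults, assumed rather than taken from the actual test config):
# disable all RDB snapshot points
save ""
# persist through the append-only file instead
appendonly yes
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb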

  • Client:
    • HW: HP DL380 G7, 16 CPUs (E5620 @ 2.40GHz), 24 GB memory and a 600 GB disk
    • OS: SLES 12

    • YCSB (0.6.0) with jedis (2.8.0):
      I am using a Hash key to store each record (key and 21 fields) and N sorted sets to store all keys with random scores, where N is the number of master nodes in the cluster. The N sorted sets are evenly distributed across the master nodes (a sketch of this data model follows the workload configuration below).
      The YCSB workload is configured as follows.
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=15000000
    operationcount=150000000
    insertstart=0
    fieldcount=21
    fieldlength=188
    readallfields=true
    writeallfields=false
    fieldlengthdistribution=zipfian
    readproportion=0.0
    updateproportion=1.0
    insertproportion=0
    readmodifywriteproportion=0.0
    scanproportion=0
    maxscanlength=1000
    scanlengthdistribution=uniform
    insertorder=hashed
    requestdistribution=zipfian
    hotspotdatafraction=0.2
    hotspotopnfraction=0.8
    table=subscriber
    measurementtype=histogram
    histogram.buckets=1000
    timeseries.granularity=1000
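
      A minimal sketch of that data model in redis-cli terms (key names, port, and hash tags are illustrative; YCSB generates its own key names):
    # one hash per record: the record key plus its 21 fields (Redis 3.0, so HMSET for multiple fields)
    redis-cli -c -p 7000 HMSET subscriber:user0001 field0 v0 field1 v1 field20 v20
    # one of the N sorted sets, holding record keys with random scores;
    # the {tag3} hash tag pins this sorted set to whichever master owns that slot
    redis-cli -c -p 7000 ZADD keyindex:{tag3} 0.42 subscriber:user0001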

    Finally, I've measured the results for clusters of (3+3), (4+4), (5+5), (6+6), (8+8), (10+10) and (12+12) instances and got the Create and Update results shown in the following picture. There is always plenty of CPU left (60~70% idle), and I/O is not very busy (30~40% at peak). Only memory usage can reach almost 100% at peak, i.e. when the AOF rewrite happens; apart from the rewrite, memory usage is around 80%.

    I'd like to know why these results don't show linear scalability. How can I achieve linear scalability up to 1000 nodes?


The Real Bill
Aug 16, 2016, 1:14:21 AM
to Redis DB
So what you are describing is testing one node from one other node. That won't allow you to see linear scalability across multiple nodes, or the lack thereof. Unless you put in a lot more fine-grained, system-level effort, you won't even see consistent performance with multiple Redis instances on a single node. Your artificial benchmark isn't a good route to go either. You want to profile your application, not a generic combination of factors.

Bin He
Aug 16, 2016, 10:02:19 PM
to Redis DB
Hi Bill,

Thank you for your answer, but I can't fully catch your points.

First of all, let me clarify some terms in this conversation to avoid any confusion. Yes, in my last post, "node" is equal to "instance". If my understanding is right, your "node" is more like one server or machine. In those terms, I've created one Redis Cluster on two nodes with the redis-trib.rb script. What's more, I forced all master instances to be deployed on one node and all slave instances on the other node.

Do you mean the throughput of this cluster deployment can't be increased linearly with an increasing number of Redis instances on a single node? Do I have to deploy them on multiple nodes to achieve linear scalability?

I thought Redis is a single-threaded program and there are plenty of resources available, especially CPU capacity, so running multiple instances on a single node is quite straightforward. If it's NOT, are there any other resources/factors on a single node that limit its scalability?

Bill Anderson
Aug 17, 2016, 1:31:46 AM
to redi...@googlegroups.com


On Aug 16, 2016, at 21:02, Bin He <heb...@gmail.com> wrote:

Hi Bill,

Thank you for your answer, but I can't fully catch your points.

First of all, let me clarify some terms in this conversation to avoid any confusion. Yes, in my last post, "node" is equal to "instance". If my understanding is right, your "node" is more like one server or machine. In those terms, I've created one Redis Cluster on two nodes with the redis-trib.rb script. What's more, I forced all master instances to be deployed on one node and all slave instances on the other node.

Do you mean the throughput of this cluster deployment can't be increased linearly with an increasing number of Redis instances on a single node? Do I have to deploy them on multiple nodes to achieve linear scalability?

There are two aspects to this test. First, yes, you are dealing with contention for resources when running multiple instances on a single node. Memory bandwidth, core caches, cores, network interrupts, network bandwidth, etc. all come into play. Without very precise and nontrivial care taken to avoid these, you cannot see full linear scalability. A single Redis instance can swamp gigabit networking almost trivially, for example.

This is especially true when forcing all masters onto a single node. Even if you could see it, you wouldn't reasonably deploy it to production, so what would be the point beyond benchmarketing?

Next up is the client side. Remember all of those items I mentioned for why you can't just run more instances on a single node and see linear scalability? They all apply here - possibly more so. I've seen countless cases where beefing up the test machine, or even better expanding to multiple machines, produces far higher throughput with absolutely no change on the servers.


I thought Redis is a single-threaded program and there are plenty of resources available, especially CPU capacity, so running multiple instances on a single node is quite straightforward.


Running it is simple. But at that level you need to be an expert in system administration to give it any real chance of being highly performant. CPU isn't the simple resource you are treating it as. It is a complex mistress.

For just one small example, do you know what happens when data goes into a network card in either direction? Interrupts. Lots and lots of interrupts. Those interrupts *alone* can kill scalability.


If it's NOT, are there any other resources/factors on a single node that limit its scalability?

Yes, see above. Cluster can't increase the bandwidth of your memory or network cards. It doesn't manage where your IRQs go, doesn't do CPU pinning for you, etc. That is hardware and system expertise you need to have to do it well.
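
For illustration, this is roughly the kind of system-level work that takes; a sketch, assuming the NIC's RX queue sits on IRQ 60 and one redis-server process has PID 12345 (both numbers are made up):
# see which CPUs are servicing the NIC interrupts
grep -i eth /proc/interrupts
# steer IRQ 60 to CPU 2 only (bitmask 0x4); irqbalance must not be allowed to overwrite it
echo 4 > /proc/irq/60/smp_affinity
# pin the redis-server process to a different core so those interrupts don't preempt it
taskset -cp 5 12345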

And we haven't even delved into the test itself and such items as the read-to-write ratio, true concurrency, pipelining, batch size, internal resource contention, how well it actually supports cluster mode, etc. The last I saw of YCSB, it doesn't properly support cluster at all anyway. Supporting the protocol/API and supporting cluster mode are two different things, and YCSB itself appears to have neither.

As I mentioned at Redisconf, I achieved linear scalability using Sentinel and hundreds of instances. But they were all isolated rather than sharing everything, and I used a veritable bank of client-side testing machines. Cluster, in that regard, is no different.

Your setup can't even tell you where the bottleneck is. You are assuming it's the server. You are probably mostly wrong. Nearly every case of "slow benchmarketing" I've seen has been caused client-side. Anyone who has spent a lot of time actually trying to write useful and generic benchmarks for systems, be they distributed or not, will tell you it is not easy. Configuring the server is the easy part by comparison.

This is why I am telling you to drop the benchmark, write your application code, and test/profile your actual code and conditions with actual data. Anything less is spinning your wheels. Leave the benchmarketing to the sales teams.

Cheers,
Bill



Bin He
Aug 19, 2016, 5:52:56 AM
to Redis DB

Hi Bill,

Thank you very much for your further explanation.

Resource contention

For your comments about resources: yes, there are more resources/factors besides CPU/memory/IO, etc. As you point out, my problem is that I can't find the bottleneck. I suppose something is limiting the scaling. That's why I came here to seek help.

Bandwidth should not be the bottleneck. I have measured the max in/out bps on the master node, as shown in the following figure. Insert could reach around 1 Gbps and Update never exceeded 600 Mbps. They are far from our LAN bandwidth of 10 Gbps.


To me, network interrupts might be an issue, but I am not sure. As I observed, the interrupts are mainly distributed over 8 of the 24 CPUs, all belonging to the same physical processor (there are 2 physical processors in total). The measurement shows around 15~25% softirq time consumed on each of those CPUs, which is only a small share of the whole machine.


                                                  top - 15:44:11 up 49 days,  4:30,  4 users,  load average: 7.52, 5.47, 2.84 
                                                  Tasks: 287 total,   8 running, 279 sleeping,   0 stopped,   0 zombie
                                                  %Cpu0  : 24.2 us, 26.6 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi, 20.5 si,  0.0 st
                                                  %Cpu1  : 29.0 us, 33.3 sy,  0.0 ni, 17.0 id,  0.0 wa,  0.0 hi, 20.7 si,  0.0 st
                                                  %Cpu2  : 15.8 us, 21.2 sy,  0.0 ni, 62.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu3  : 26.8 us, 33.9 sy,  0.0 ni, 19.1 id,  0.0 wa,  0.0 hi, 20.1 si,  0.0 st
                                                  %Cpu4  : 20.9 us, 25.2 sy,  0.0 ni, 53.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu5  : 27.5 us, 30.3 sy,  0.0 ni, 19.5 id,  0.0 wa,  0.0 hi, 22.6 si,  0.0 st
                                                  %Cpu6  :  0.7 us,  0.3 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  4.6 si,  0.0 st
                                                  %Cpu7  :  0.0 us,  0.7 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st
                                                  %Cpu8  :  0.7 us,  1.3 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu9  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu10 :  0.3 us,  1.0 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu11 :  0.3 us,  2.3 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu12 :  9.6 us, 10.6 sy,  0.0 ni, 79.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu13 : 27.7 us, 32.4 sy,  0.0 ni, 22.3 id,  0.0 wa,  0.0 hi, 17.6 si,  0.0 st
                                                  %Cpu14 : 19.0 us, 27.2 sy,  0.0 ni, 29.6 id,  0.0 wa,  0.0 hi, 24.1 si,  0.0 st
                                                  %Cpu15 : 24.5 us, 30.6 sy,  0.0 ni, 21.1 id,  0.0 wa,  0.0 hi, 23.8 si,  0.0 st
                                                  %Cpu16 :  7.2 us,  9.9 sy,  0.0 ni, 82.6 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu17 : 27.6 us, 29.7 sy,  0.0 ni, 19.5 id,  0.0 wa,  0.0 hi, 23.2 si,  0.0 st
                                                  %Cpu18 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu19 :  0.0 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
                                                  %Cpu20 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  %Cpu23 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
                                                  KiB Mem:  65865712 total, 65598776 used,   266936 free,    37316 buffers
                                                  KiB Swap:        0 total,        0 used,        0 free. 17710152 cached Mem

                                                   

As for memory bandwidth and core caches, I didn't look into them and honestly don't know how.

                                                   

I am thinking of one scenario. If a single node can't hold so many Redis instances at higher throughput, what if we deploy the Redis instances on multiple cloud/VM nodes, and it happens that all these VM nodes are located on the same physical node? Here, let's limit the biggest scale to 8 VM nodes similar to EC2 m4.large (2 CPUs, 8 GB memory). Will the cloud/hypervisor help to distribute all of these resources, e.g. CPU pinning, memory/network bandwidth assignment, IRQ distribution, etc.? Can we see linear scalability?


Client: YCSB (0.6.0) + Jedis cluster (2.8.0)

In my tests, I didn't use pipelining; I just want to know the throughput without pipelining. Furthermore, as far as I know, Redis Cluster can't support pipelining.

Actually, the client is a YCSB DB binding implemented with jedis (JedisCluster), which manages the clients' interworking with the Redis cluster. I believe JedisCluster works well with the cluster.

As I understand it, Sentinel manages failover in an HA solution and doesn't handle any data sharding. However, sharding could be one of the performance killers. I am not questioning the linear scalability of Redis Cluster; I just want to know what the linear relationship looks like, i.e. the ratio, steep or not, which could be different from the case of Sentinel + independent instances. But now, obviously, I've got a problem and want to know what's wrong. Thanks!


Best Regards,

//Bin

Bill Anderson
Aug 19, 2016, 10:43:09 AM
to redi...@googlegroups.com
On Aug 19, 2016, at 4:52 AM, Bin He <heb...@gmail.com> wrote:

Hi Bill,

Thank you very much for your further explanation.


Comments inline to preserve context.

Resource contention

For your comments about resources: yes, there are more resources/factors besides CPU/memory/IO, etc. As you point out, my problem is that I can't find the bottleneck. I suppose something is limiting the scaling. That's why I came here to seek help.

Bandwidth should not be the bottleneck. I have measured the max in/out bps on the master node, as shown in the following figure. Insert could reach around 1 Gbps and Update never exceeded 600 Mbps. They are far from our LAN bandwidth of 10 Gbps.


How are you measuring this? You say in/out bandwidth, but that isn't the only bandwidth to consider. For replicated setups you have to double it, because the masters are also sending data back out to the slaves. It isn't just the "listed bandwidth" of the network cards but packet distribution, window size, kernel, latency, activity of *other* applications on the network, etc. Unless you are using a crossover cable between two computers, or the *only* computers on the physical network are the ones in your test, never assume you have the full theoretical max of a network's listed bandwidth. Similarly, CPU load isn't the sole determinant of CPU utilization.
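
(For reference, a couple of quick ways to watch both directions on the wire; sar assumes the sysstat package is installed:)
# per-second rx/tx throughput on every interface, which includes the replication traffic going out to the slaves
sar -n DEV 1
# raw per-interface byte counters since boot
cat /proc/net/dev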


To me, network interrupts might be an issue, but I am not sure. As I observed, the interrupts are mainly distributed over 8 of the 24 CPUs, all belonging to the same physical processor (there are 2 physical processors in total). The measurement shows around 15~25% softirq time consumed on each of those CPUs, which is only a small share of the whole machine.



                                                  Interrupts you don’t measure as CPU usage. Interrupt are a concern because Redis is single threaded so any other usage of the core a Redis instance is on will cause a relatively massive delay. It isn’t apparent in other, slower, datastore because they don’t operate at the microsecond scale which Redis does. A burst of interrupts won’t show as even a blip on load, but will cause a drop in performance for the Redis process(es) running on the interrupted cores.

As for memory bandwidth and core caches, I didn't look into them and honestly don't know how.


                                                  Most don’t. ;) But if you’re going to pound a single machine into performance submission you need to.

I am thinking of one scenario. If a single node can't hold so many Redis instances at higher throughput, what if we deploy the Redis instances on multiple cloud/VM nodes, and it happens that all these VM nodes are located on the same physical node? Here, let's limit the biggest scale to 8 VM nodes similar to EC2 m4.large (2 CPUs, 8 GB memory). Will the cloud/hypervisor help to distribute all of these resources, e.g. CPU pinning, memory/network bandwidth assignment, IRQ distribution, etc.? Can we see linear scalability?


You *could* see better scalability, and it would help, but you are unlikely to see 100% linear. It will depend heavily on the hypervisor used. With EC2 it is difficult to say. I was able to do it using Rackspace's OnMetal nodes with Docker Swarm powering the setup. On AWS you become *really* dependent on the time of day on various networks.

Client: YCSB (0.6.0) + Jedis cluster (2.8.0)

In my tests, I didn't use pipelining; I just want to know the throughput without pipelining. Furthermore, as far as I know, Redis Cluster can't support pipelining.

It can, but you need to ensure all requests being pipelined go to the same node.
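
A tiny sketch of what that can look like using hash tags (the port and key names are illustrative, and the target port must be the master that owns the slot for {user42}, since pipe mode does not follow MOVED redirections):
# keys sharing the same {hash tag} hash to the same slot, hence the same node,
# so these two commands can safely be pipelined together against that node
printf 'HSET {user42}:profile name bin\nZADD {user42}:scores 1 a\n' | redis-cli -p 7000 --pipe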

Actually, the client is a YCSB DB binding implemented with jedis (JedisCluster), which manages the clients' interworking with the Redis cluster. I believe JedisCluster works well with the cluster.


It isn't (primarily) a matter of Jedis supporting the cluster protocol, but of the test design being proper for the cluster setup. YCSB isn't designed for it - it doesn't know how to evenly distribute work across multiple nodes. Recall a moment ago the comment about pipelining? Here the same issue comes into play. The hashing is done by key name, and I'd guarantee YCSB doesn't ensure each thread is working on a different part of the keyspace of the cluster. From here it only gets worse. As you try to increase throughput by adding more threads/concurrency, and you are not ensuring you are working on different nodes, you will hit the limit of a single instance. I would feel quite confident that, given YCSB itself doesn't supply cluster support, it doesn't do this kind of work. This is what I mean by YCSB not supporting redis-cluster, not merely using a driver which understands the protocol.
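
(To see the effect being described, you can check where individual keys land; the key below is just an illustrative YCSB-style key name and the port is arbitrary:)
# which of the 16384 hash slots does a given key map to?
redis-cli -p 7000 CLUSTER KEYSLOT user6284781860667377211
# which master owns which slot ranges?
redis-cli -p 7000 CLUSTER SLOTS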

As I understand it, Sentinel manages failover in an HA solution and doesn't handle any data sharding.

Yes, which is why I designed a method to have Sentinel be my shard lookup, and did the benchmarking appropriately. There is little end-level difference to how I did it with Sentinel; the key pieces were multiple nodes (it was done with Docker to run multiple instances per node) and the design of a benchmark which leveraged the sharded setup rather than ignoring it. This is probably the most significant problem in your benchmarketing. Your performance chart looks pretty much like what you see on a single instance when increasing concurrency. Which means your instances are either hitting the same bottleneck due to the client test implementation, suffering individual latency drops, or - most likely - all of the above.

This is why I am saying your setup isn't a good one - you can't isolate the issues. There are simply too many of them to isolate and identify.

However, sharding could be one of the performance killers.


Not sharding itself, but I wouldn't be surprised if the fact that it is a sharded solution fundamentally "breaks" YCSB, which isn't designed to be smart about it.
In the real world your data will not be uniformly split across shards. In benchmarketing you need to ensure that the data is not only uniformly distributed but that the test is also uniformly distributing all access, all the time.

I appreciate that you are trying to figure it out, I do. But I am telling you that the basic setup you are using simply will not let you. The range of known causes is too broad, and within the framework you are describing it cannot be reduced simply. This is why I recommend you stop the above and move to profiling your actual system. Even if you wrote a test which properly knew about and leveraged the sharded nature of Redis Cluster, and perfectly coordinated and balanced the resources, you still wouldn't be testing how your application will work - and unless you are trying to write marketing for a Redis offering, what you care about is how your application actually performs. How many instances of Redis you need in a cluster is driven by the distribution of your data, not by linear scalability.

Cheers,
Bill
