Hi Yuhe,
We are using open source Redis Cluster at scale: around 2 TB of master data, 150+ master nodes, and a replication factor of 1.
We feel pretty confident with Redis Cluster.
We don't normally run at millions of ops per second, but one of our clusters has reached ~2 million QPS with pipelining.
A few things I would suggest:
1) Use a cluster-aware client (we use Jedis/Lettuce in Java).
2) Some of the cluster commands are slow. We had ~60 master nodes in one cluster, and the slowest command was CLUSTER NODES at around 3 ms, which Lettuce was issuing continuously to refresh its view of the cluster topology. We ended up configuring Lettuce to refresh its topology cache only every 2 hours, or whenever it receives a MOVED or ASK redirection. Make sure your client does not run commands like CLUSTER NODES or CLUSTER SLOTS too frequently (see the Lettuce sketch after this list).
3) We found that, more often than not, the number of clients was the issue: cluster throughput increased almost linearly as we added clients (although in our case the client count was small).
4) Keep an eye on INFO CPU. If you only do basic operations and no server-side processing (no Lua), then an instance is fully utilized only if the growth of used_cpu_sys + used_cpu_user equals N over an N-second interval. We use N = 300 because we sample and graph the data every 300 seconds. Most probably your CPUs will not be fully utilized, and the issue is not Redis but the network, or more likely clients that are not fast enough (see the utilization check after this list).
5) You could also try running the Redis instances of a cluster on separate machines with good network connectivity.
6) Also, disable automatic saves if you run multiple instances on the same machine, so that they don't all start saving at the same time and slow Redis down. Instead, issue a BGSAVE to each instance one by one, waiting a few seconds after each save completes (though I think you have already disabled saves). A sketch of that loop is at the end of this list.
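For point 2, here is a rough sketch of the Lettuce configuration, assuming the Lettuce 5.x API; the host, port and the 2-hour period are placeholders for your own setup:

import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyRefreshConfig {
    public static void main(String[] args) {
        // Refresh the cached topology only every 2 hours, plus adaptively on
        // MOVED/ASK redirections, instead of running CLUSTER NODES all the time.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(true)
                .refreshPeriod(Duration.ofHours(2))
                .enableAdaptiveRefreshTrigger(
                        ClusterTopologyRefreshOptions.RefreshTrigger.MOVED_REDIRECT,
                        ClusterTopologyRefreshOptions.RefreshTrigger.ASK_REDIRECT)
                .build();

        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://127.0.0.1:7000"));
        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        // client.connect() and use the connection as usual from here.
    }
}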
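For point 4, a minimal sketch of the utilization check with Jedis against a single instance; the host, port and the 300-second window are assumptions, and in a cluster you would run this per node:

import redis.clients.jedis.Jedis;

public class RedisCpuCheck {

    // Pull a numeric field such as "used_cpu_sys:123.45" out of the INFO CPU output.
    static double field(String info, String name) {
        for (String line : info.split("\n")) {
            line = line.trim();
            if (line.startsWith(name + ":")) {
                return Double.parseDouble(line.substring(name.length() + 1));
            }
        }
        throw new IllegalArgumentException(name + " not found in INFO CPU output");
    }

    public static void main(String[] args) throws InterruptedException {
        int windowSeconds = 300; // the same 300 s interval we use for graphing

        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            String before = jedis.info("cpu");
            double sysBefore = field(before, "used_cpu_sys");
            double userBefore = field(before, "used_cpu_user");

            Thread.sleep(windowSeconds * 1000L);

            String after = jedis.info("cpu");
            double cpuSeconds = (field(after, "used_cpu_sys") - sysBefore)
                    + (field(after, "used_cpu_user") - userBefore);

            // Close to 1.0 means the single-threaded instance is CPU-bound;
            // much lower means the bottleneck is the network or the clients.
            System.out.printf("approx CPU utilization: %.2f%n", cpuSeconds / windowSeconds);
        }
    }
}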
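For point 6, a rough sketch of running the saves one after another with Jedis; the ports and the pause are made up for illustration:

import java.util.Arrays;
import java.util.List;

import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.Jedis;

public class SerializedBgsave {

    // Instances co-located on the same machine; hypothetical ports.
    static final List<HostAndPort> INSTANCES = Arrays.asList(
            new HostAndPort("127.0.0.1", 7000),
            new HostAndPort("127.0.0.1", 7001),
            new HostAndPort("127.0.0.1", 7002));

    public static void main(String[] args) throws InterruptedException {
        for (HostAndPort node : INSTANCES) {
            try (Jedis jedis = new Jedis(node.getHost(), node.getPort())) {
                jedis.bgsave();
                // Wait until this instance's background save finishes before
                // touching the next one, so the forks never overlap.
                while (jedis.info("persistence").contains("rdb_bgsave_in_progress:1")) {
                    Thread.sleep(1000);
                }
            }
            // Breathing room between saves, as in point 6.
            Thread.sleep(5000);
        }
    }
}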
I presented our use case, running open source Redis Cluster with our dataset and multiple stacks, at Redis Conf 2017. The presentation is
here.