Read/Connect timeouts to Jedis Cluster


Alan Roche

Jul 20, 2020, 9:54:59 AM
to Jedis
Hi,

Just trying to understand timeouts with regards to Jedis Cluster.
We have tight requirements around timing out quickly to avoid cascading failures etc. 
I have set all the timeouts I can think of to low values, i.e. less than 100ms, but no matter what I do, the quickest I can make a GET on a key time out is 5 seconds.
The time to get an exception back is always very consistent, usually 5006 or 5007 milliseconds.
I am wondering what is happening here.
I have set attempts to 1 and all of the following to 50ms, and it still takes approx 5 seconds to time out:
poolConfig.setMaxWaitMillis
JedisCluster.connectionTimeout
JedisCluster.soTimeout

Any ideas what is happening here? Any way to make it time out faster?

Thanks!

Allan Wax

Jul 21, 2020, 3:40:12 PM
to Jedis
If the server making the request were directly connected on the same internal network as the Redis server fielding it, you might get sub-millisecond responses.  In a normal environment you have multiple servers making requests to a particular Redis machine in the cluster.  Since Redis processes commands one at a time, if there are ten requests outstanding to a particular server, your particular request might be processed 10th, maybe 10 milliseconds later.  On top of that there is network latency, garbage collection, and a number of other things that factor into response time.  So in short, your expectations are too high.  I've found in a moderately loaded system that I was getting 1-5 millisecond responses on average.

If you truly need to handle the case where you do not receive a reply from the cluster within some number of milliseconds, then I suggest making the request in an Executor, using invokeAll() with a timeout.  That call returns Futures which can be used to see whether the call succeeded or timed out.  If it timed out, take some recovery action.
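A minimal sketch of that pattern. The class and method names are mine, and a Thread.sleep stands in for the real Jedis GET (so the example is self-contained); in real code the Callable would wrap jedis.get(key), and the 200 ms delay and 50 ms budget are purely illustrative:

```java
import java.util.List;
import java.util.concurrent.*;

public class TimeoutGuard {
    // Runs a simulated slow call under a hard time budget and reports
    // whether invokeAll() had to cancel it.  In real code the Callable
    // would wrap jedis.get(key) instead of sleeping.
    static boolean timedOut(long taskMillis, long budgetMillis) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<String> slowGet = () -> {
                Thread.sleep(taskMillis); // stand-in for a hung network call
                return "value";
            };
            // invokeAll() cancels any task still running when the budget expires.
            List<Future<String>> results =
                    executor.invokeAll(List.of(slowGet), budgetMillis, TimeUnit.MILLISECONDS);
            return results.get(0).isCancelled();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        if (timedOut(200, 50)) {
            System.out.println("timed out -- take recovery action");
        }
    }
}
```

The key point is that the time budget is enforced on the caller's side, independently of whatever socket timeouts the client library honors.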

Alan Roche

Jul 22, 2020, 7:31:29 AM
to Jedis
Hi Allan, 
Thanks for your response. 
In reality, I just want to set the timeout to 1 or 2 seconds, to fail fast on any transient network glitches to Redis; however, I can never make a GET time out faster than 5 seconds.

I reduced the timeouts down as far as 50ms just for testing, and I also tested with 1- and 2-second timeouts.
However, with all of these, it only ever timed out at slightly over 5 seconds, never below.

So, my question I guess, is whether there is some other limit/timeout at play other than the ones I listed below when dealing with Redis Cluster?

Thanks....

Allan Wax

Jul 22, 2020, 4:51:59 PM
to Jedis
So first, having more than 5 second response times is way too long even in a networked environment.  Something else is going on.  How many simultaneous requests are you making?  Do you use multiple threads?  Do you use a pool of connections or a single connection?  In all cases you should be using a JedisPool to acquire and release connections, something like:

JedisPool pool = new JedisPool(...);
try (Jedis jedis = pool.getResource()) {
    // ... do something with jedis
}

Note that no release is necessary: since you are using try-with-resources, it all happens automagically for you.

Second possibility is your keys.  Redis Clusters are sharded.  The key is used to pick the master within the cluster that contains your key.  It's possible the cluster is not balanced, so a great many calls go to the same server.  You can use redis-trib to rebalance the cluster if that is the case.
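For reference, the key-to-master mapping in the Redis Cluster spec is CRC16 of the key (XModem variant) modulo 16384 hash slots, and each master owns a range of slots; Jedis does this internally, but a sketch of the computation (class and method names are mine, and this sketch ignores the {...} hash-tag rule the real mapping also honors) looks like:

```java
public class HashSlot {
    // CRC16-CCITT (XModem): polynomial 0x1021, initial value 0x0000 --
    // the variant the Redis Cluster spec uses for key-to-slot mapping.
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0)
                        ? ((crc << 1) ^ 0x1021) & 0xFFFF
                        : (crc << 1) & 0xFFFF;
            }
        }
        return crc;
    }

    // Slot = CRC16(key) mod 16384; the cluster assigns slot ranges to masters.
    static int hashSlot(String key) {
        return crc16(key.getBytes(java.nio.charset.StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // Same value CLUSTER KEYSLOT foo reports on the server side.
        System.out.println(hashSlot("foo"));
    }
}
```

If many hot keys happen to land in slots owned by one master, that server takes a disproportionate share of the traffic even when the slot ranges themselves are evenly split.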

Last, if you can, send me a sample of your code that demonstrates the problem.  I'll see if it is something obvious.

Allan Wax

Alan Roche

Jul 27, 2020, 11:27:27 AM
to Jedis
Hi, 

I have been simulating transient network delays by blocking the ports on the firewall, then sending requests and logging how long it takes for an Exception to appear to be caught from Jedis.
Usually the exception says something to the effect of "no nodes/slots available".
I am not saying that it takes 5 seconds for a response.  I am saying that it takes 5 seconds to time out when I make the network hang, simulating a network glitch.
We wish to set relatively low timeouts of 1 second so that we fail-fast to our clients. This does not appear to happen, as the shortest time I can make it time-out is 5 seconds.
I am using a Jedis Pool. I have also set the wait timeout on the Jedis Pool. In my tests I set this to various values including 1 second, 500ms and 100ms.

The code looks like this:

final JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setMaxWaitMillis(config.getPoolMaxWaitMillis());
poolConfig.setMaxTotal(200);
poolConfig.setMaxIdle(20);
poolConfig.setMinIdle(5);
poolConfig.setTestOnCreate(true);
poolConfig.setTestOnBorrow(false);
poolConfig.setTestOnReturn(false);
poolConfig.setTestWhileIdle(true);
poolConfig.setNumTestsPerEvictionRun(30); // number of connections tested for eviction per run
poolConfig.setMinEvictableIdleTimeMillis(60000); // a connection must be idle this long to be eligible for eviction
poolConfig.setTimeBetweenEvictionRunsMillis(30000);

final Set<HostAndPort> jedisClusterNode = Arrays.stream(hosts.split(","))
        .map(host -> new HostAndPort(host, port))
        .collect(Collectors.toSet());

final JedisCluster cluster = new JedisCluster(jedisClusterNode,
        1000, // connectionTimeout (ms)
        500,  // soTimeout (ms)
        1,    // maxAttempts
        poolConfig);

Allan Wax

Jul 28, 2020, 4:44:31 PM
to Jedis
Thanks for the updated information.

I have only minor comments on the code you supplied.

Try setting testOnCreate to false and testOnBorrow to true.  You only care about the connection when you use it, not when you create it, and you're creating 200 of them.  Also set testWhileIdle to false, for the same reason.
Extremely minor: change the related eviction settings to prime numbers.  This reduces the chance that all the eviction-related processes kick off at the same time (or pushes that moment into the future).

I have always found for production purposes that you need to 'prime the pump' before using the connections.  Create a little startup task that in multiple concurrent threads (200 in this case) makes a simple call to the server.  This will force the creation of a connection and put it into the pool.  Then start your test when all the threads complete.  This is easy to do with an Executor configured to more than 200 threads and invokeAll().
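A sketch of that warm-up step. The class and method names are mine, and a Runnable placeholder stands in for the real Jedis call so the example is self-contained; in practice each task would borrow a connection and do something trivial like cluster.get("warmup-key"):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolPrimer {
    // Fires N concurrent warm-up requests so the pool creates its connections
    // up front; returns how many warm-up calls actually ran.
    static int prime(int connections, Runnable warmupCall) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(connections);
        try {
            AtomicInteger completed = new AtomicInteger();
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < connections; i++) {
                tasks.add(() -> {
                    warmupCall.run(); // in real code: borrow a connection and PING/GET
                    completed.incrementAndGet();
                    return null;
                });
            }
            executor.invokeAll(tasks); // blocks until every warm-up task finishes
            return completed.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int warmed = prime(200, () -> { /* e.g. cluster.get("warmup-key") */ });
        System.out.println(warmed + " warm-up calls completed");
    }
}
```

Because every task runs concurrently, the pool cannot satisfy them by reusing one connection, which is what forces it to create the full complement before the real test starts.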

You seem to use the same port for each instance.  There is nothing wrong with that, but in production you usually specify both host and port in the config file, since that stuff can change.  A more tolerant split string is "\\s*,\\s*".
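For example, the more tolerant split absorbs stray whitespace around the commas (the class name and host names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class HostListParser {
    // Splits a comma-separated host list, tolerating whitespace around commas.
    static List<String> parseHosts(String hosts) {
        return Arrays.asList(hosts.trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        // A config value with sloppy spacing still parses cleanly.
        System.out.println(parseHosts("redis-a, redis-b ,redis-c"));
        // prints [redis-a, redis-b, redis-c]
    }
}
```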

A newbie error which I've fixed for people many times (and for which I submitted a Jedis patch for HostAndPort a long time ago): NEVER USE 'localhost' as the name of the host.  This applies only to Redis Cluster.  Always configure with the IP address of the machine (not 127.0.0.1).  Bad things happen in a cluster if you use localhost.  It's quite fine to run an entire cluster on the same machine for test purposes, but use the machine's address on your local network, e.g. 192.168.100.231, not the string localhost.

Please at least try to 'prime the pump' before running your test and see if that helps.

Alan Roche

Jul 30, 2020, 6:26:38 PM
to Jedis
Thanks for the suggestions; I will take them on board and do a test run with primed threads and pool, but typically the test does continuous GETs in a loop for a while before I block the port, so it should already be "warmed up".

On the substantive question, it is still unclear why it consistently takes 5+ seconds to time out (quite long for the requirements of a low-latency application) when all of the timeouts (socket connect, socket so/read, and pool wait) are set to under 100 milliseconds.  It is far from ideal, as it delays us failing fast and querying another backend instead.

Also, just FYI: all of those values are normally pulled from the config object; I temporarily changed them to magic numbers for this thread so the values could be seen.  Agreed, such values should come from config.

Allan Wax

Aug 12, 2020, 3:17:17 PM
to Jedis
I just saw your reply today.

So, ignoring all the other advice I've given (except 'prime the pump'), there are some issues that appear unrelated to Redis.  Possible threading issues: you're spending all your time context-switching.  Also possible machine issues: what else is running on your machine?  (The more cores you have the better you can do, but that is bounded by the I/O going on in the machine.)

If you can, send me your complete test program (I will modify ports and addresses accordingly) along with your config file(s), and I can play around and see what's going on.  I have a pretty good machine that's lightly loaded, so the results from the test should indicate, or at least hint at, where the issue might be.

Allan Wax