io.lettuce.core.cluster.PartitionSelectorException: Cannot determine a partition to read for slot

varun....@capitalone.com

Sep 26, 2018, 5:37:37 PM

to lettuce-redis-client-users

We use Lettuce 5.1.0.M1 as a client for connecting to Redis managed by AWS/Elasticache, in clustered mode, encryption on, 3 shards, 3 nodes each.

When we run this setup at around 2000 transactions per second (TPS), we get a trickle (~80 TPS) of the following error, always on a get(String) operation:

io.lettuce.core.cluster.PartitionSelectorException: Cannot determine a partition to read for slot nnnn.
    at io.lettuce.core.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:163)
    at io.lettuce.core.cluster.PooledClusterConnectionProvider.getConnectionAsync(PooledClusterConnectionProvider.java:108)
    at io.lettuce.core.cluster.ClusterDistributionChannelWriter.doWrite(ClusterDistributionChannelWriter.java:122)
    at io.lettuce.core.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:72)
    at io.lettuce.core.RedisChannelHandler.dispatch(RedisChannelHandler.java:167)
    at io.lettuce.core.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:207)
    at io.lettuce.core.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:461)
    at io.lettuce.core.AbstractRedisAsyncCommands.get(AbstractRedisAsyncCommands.java:635)
    at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:114)
    at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)

-----

We do not see anything abnormal in the Redis shards themselves. Can somebody help debug this, please?

Mark Paluch

Sep 27, 2018, 12:44:59 PM

to lettuce-redis-client-users

This exception basically tells you that no node was found that is capable of accepting Redis commands for the slot in question. This can be caused by a crashing node or by Redis reporting an unhealthy cluster node.

Please check the output of CLUSTER NODES on each cluster node when this issue happens. You should be able to find hints regarding the cause in there.
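As a rough aid for that check, here is a minimal sketch in plain Java (no Lettuce dependency) that scans saved CLUSTER NODES output for nodes flagged fail, fail?, noaddr or handshake, or whose link state is not "connected", and lists slots that no healthy master claims. The class and method names are made up for illustration, the sample node IDs are placeholders, and treating a master flagged fail? as "not covering" its slots is an assumption of this sketch, not necessarily how Lettuce decides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper: inspects the text returned by CLUSTER NODES.
class ClusterNodesCheck {

    static final int SLOT_COUNT = 16384;

    /** Node IDs whose flags or link state suggest a problem. */
    static List<String> suspectNodes(String clusterNodesOutput) {
        List<String> suspects = new ArrayList<>();
        for (String line : clusterNodesOutput.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 8) continue; // blank or malformed line
            List<String> flags = Arrays.asList(f[2].split(","));
            boolean badFlag = flags.contains("fail") || flags.contains("fail?")
                    || flags.contains("noaddr") || flags.contains("handshake");
            boolean badLink = !f[7].equals("connected");
            if (badFlag || badLink) suspects.add(f[0]);
        }
        return suspects;
    }

    /** Slots not claimed by any master that is free of fail / fail? flags. */
    static BitSet uncoveredSlots(String clusterNodesOutput) {
        BitSet covered = new BitSet(SLOT_COUNT);
        for (String line : clusterNodesOutput.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 9) continue; // node serves no slots
            List<String> flags = Arrays.asList(f[2].split(","));
            if (!flags.contains("master") || flags.contains("fail") || flags.contains("fail?")) continue;
            for (int i = 8; i < f.length; i++) {
                if (f[i].startsWith("[")) continue; // migrating/importing entry
                String[] range = f[i].split("-");   // "5" or "0-5460"
                int from = Integer.parseInt(range[0]);
                int to = Integer.parseInt(range[range.length - 1]);
                covered.set(from, to + 1);
            }
        }
        BitSet missing = new BitSet(SLOT_COUNT);
        missing.set(0, SLOT_COUNT);
        missing.andNot(covered);
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder sample: two masters, one of them suspected of failing.
        String sample =
            "aaa 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-8191\n"
          + "bbb 10.0.0.2:6379@16379 master,fail? - 0 0 2 connected 8192-16383\n";
        System.out.println("Suspect nodes: " + suspectNodes(sample));
        System.out.println("Uncovered slot count: " + uncoveredSlots(sample).cardinality());
    }
}
```

Running it against the output captured from each node at the time of the error makes it easier to compare the nodes' views of the cluster side by side.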

Cheers,

Mark

varun....@capitalone.com

Sep 27, 2018, 2:18:27 PM

to lettuce-redis-client-users

Hi Mark,

Thanks for your reply. We suspect crashed nodes too, which we hope to investigate further with AWS Support. In the meantime, can periodic topology refresh (described here: https://github.com/lettuce-io/lettuce-core/wiki/Client-options#cluster-specific-options) alleviate at least some of this?

Are there any other viable options on the client side?

With Regards.

Varun.

varun....@capitalone.com

Oct 1, 2018, 3:43:36 PM

to lettuce-redis-client-users

Hi Mark,

We've investigated with AWS and found that no Redis instances were crashing during our tests. However, there is always a trickle of requests (some 50 out of 3000 per second, intermittently) that fail with this error.

The stack trace says this error always occurs on the first Redis 'get' on an io.lettuce.core.cluster.api.StatefulRedisClusterConnection. We reuse this connection. On one particular client, this error occurred 16343 times out of 531725 requests.

The other interesting thing is that this error is always localized to a few Docker instances (on which the Java/Lettuce client runs). Most of the Docker instances run error-free, but a few don't.

We have also tried ClusterTopologyRefreshOptions - refresh every 10 minutes.
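For reference, a periodic refresh like the one described could be wired up roughly as follows. This is a sketch, not our exact code: the class name and the URI parameter are hypothetical, it needs Lettuce 5.x on the classpath, and the adaptive refresh triggers are an extra option shown for completeness rather than something we have confirmed helps here.

```java
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

import java.time.Duration;

// Hypothetical factory: builds a cluster client with topology refresh enabled.
class TopologyRefreshConfig {

    static RedisClusterClient create(String clusterUri) {
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                // Re-read the cluster topology every 10 minutes.
                .enablePeriodicRefresh(Duration.ofMinutes(10))
                // Optional (assumption): also refresh on MOVED/ASK redirects and reconnects.
                .enableAllAdaptiveRefreshTriggers()
                .build();

        ClusterClientOptions clientOptions = ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build();

        // clusterUri is a placeholder, e.g. a rediss:// URI for a TLS-enabled endpoint.
        RedisClusterClient client = RedisClusterClient.create(clusterUri);
        client.setOptions(clientOptions);
        return client;
    }
}
```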

Does this help diagnose the problem?

Many thanks.

Varun.

Mark Paluch

Oct 2, 2018, 6:03:51 AM

to lettuce-redis-client-users

Hi Varun,

Thanks for the further details. It would help to isolate the issue by narrowing it down to a particular request or scenario that reproduces the problem. Disabling topology updates (if you are just running a test) is the first step towards having a stable Partitions object. If you run into this error, please dump the Partitions object (the nodes along with the slots they serve) for later inspection, something like:

for (RedisClusterNode partition : partitions) {
    System.out.println("Node:" + partition);
    System.out.println(" Served slots: " + partition.getSlots());
}

This way you capture most details, optionally retrieving CLUSTER NODES output for all nodes. What really puzzles me is that these errors only become evident above a certain throughput. Is it always the same slot that is affected?
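To answer the slot question from application logs, it can help to map each failing key to its slot. Redis Cluster defines the slot as CRC16 (XMODEM variant) of the key modulo 16384, hashing only the hash tag between the first "{" and the next "}" when that tag is non-empty. Lettuce ships this as io.lettuce.core.cluster.SlotHash.getSlot(...); the stand-alone sketch below (class name made up for illustration) does the same calculation without any dependency, which is handy in a log-analysis script.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: computes the Redis Cluster hash slot of a key.
class KeySlot {

    static final int SLOT_COUNT = 16384;

    /** CRC16/XMODEM (polynomial 0x1021, initial value 0) over data[from, to). */
    static int crc16(byte[] data, int from, int to) {
        int crc = 0;
        for (int i = from; i < to; i++) {
            crc ^= (data[i] & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    private static int indexOf(byte[] data, char c, int start) {
        for (int i = start; i < data.length; i++) {
            if (data[i] == (byte) c) return i;
        }
        return -1;
    }

    /** Slot of a key, honoring a non-empty {hash tag} if present. */
    static int slot(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int start = 0;
        int end = bytes.length;
        int open = indexOf(bytes, '{', 0);
        if (open >= 0) {
            int close = indexOf(bytes, '}', open + 1);
            if (close > open + 1) { // tag must contain at least one byte
                start = open + 1;
                end = close;
            }
        }
        return crc16(bytes, start, end) % SLOT_COUNT;
    }

    public static void main(String[] args) {
        for (String key : args) {
            System.out.println(key + " -> slot " + slot(key));
        }
    }
}
```

If the failing keys cluster in a handful of slots, comparing those slots with the dumped Partitions should show which node the client believes is responsible for them.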

Also make sure to upgrade to Lettuce 5.1.0.RELEASE. There shouldn't be a direct relation between the version and the bug; however, running a GA release instead of a milestone is the right thing to do.