Single Failed Node Results in RedisConnectionException

Eliot

Apr 7, 2017, 3:55:11 PM
to lettuce-redis-client-users
Hi all,

I have an application that needs to run some expensive reads which I do not want to send to the master, as they would block other operations. The application is currently on Lettuce 4.2.2, so I ended up writing my own slave-preferred ReadFrom implementation based on MASTER_PREFERRED. I have looked at 4.4.0 and my implementation matches it, and all of the testing below was done on Lettuce 4.4.0 as well.
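
For reference, here is roughly what my ReadFromSlavePreferred looks like (a from-memory sketch of the slave-first ordering, so the exact code may differ slightly from what I have checked in):

import java.util.ArrayList;
import java.util.List;

import com.lambdaworks.redis.ReadFrom;
import com.lambdaworks.redis.models.role.RedisInstance;
import com.lambdaworks.redis.models.role.RedisNodeDescription;

public class ReadFromSlavePreferred extends ReadFrom {

    @Override
    public List<RedisNodeDescription> select(Nodes nodes) {
        List<RedisNodeDescription> result = new ArrayList<>(nodes.getNodes().size());

        // List the slaves first so reads go to a replica whenever one is available...
        for (RedisNodeDescription node : nodes) {
            if (node.getRole() == RedisInstance.Role.SLAVE) {
                result.add(node);
            }
        }

        // ...and fall back to the master only when no slave can serve the read.
        for (RedisNodeDescription node : nodes) {
            if (node.getRole() == RedisInstance.Role.MASTER) {
                result.add(node);
            }
        }

        return result;
    }
}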

The issue I am running into is that a single failed slave node causes commands whose keys hash to that slave's master to throw a RedisConnectionException, even when there are other slaves that could serve the read.

Here is my current test cluster setup. The most important detail is that the command I am running (get hello) hashes to a slot owned by the node on port 7380. Nodes 7383, 7384, and 7385 have all been set up as slaves of that master. So with a SLAVE_PREFERRED connection, I would expect read commands to use any of the three slaves, falling back to the master only if all of them are down.

f9c5d7155976e509bee3d614b2b8bca8d160e37d HOSTNAME:7380 myself,master - 0 0 1 connected 0-5460
250dda526cce9f88ac64ed11c448ec57ac4c22be HOSTNAME:7381 master - 0 1491593806043 2 connected 5461-10922
7d78ecbc1acfd2091fcfbb108aef986e8c2c51ea HOSTNAME:7382 master - 0 1491593810055 3 connected 10923-16383
874807965ee64a3f193d8b73ea4cb667e30c7501 HOSTNAME:7383 slave f9c5d7155976e509bee3d614b2b8bca8d160e37d 0 1491593809053 4 connected
6cf0f2df36efbca159bd3510a02e7b59843843a3 HOSTNAME:7384 slave f9c5d7155976e509bee3d614b2b8bca8d160e37d 0 1491593811058 5 connected
6f4abd117524a6dee1d560b5209ab31cd1679c3f HOSTNAME:7385 slave f9c5d7155976e509bee3d614b2b8bca8d160e37d 0 1491593808153 6 connected

Here is my test code:

RedisClusterClient client = RedisClusterClient.create(RedisURI.create(HOSTNAME, 7380));
StatefulRedisClusterConnection<String, String> conn = client.connect();

conn.setTimeout(2, TimeUnit.SECONDS);
conn.setReadFrom(ReadFrom.SLAVE_PREFERRED);

System.in.read(); // Kill slave 7385 here

try {
    logger.debug(conn.sync().get("hello"));
} catch (RedisException e) {
    logger.error(e);
}

client.reloadPartitions();

StatefulRedisClusterConnection<String, String> newConn = client.connect();
newConn.setTimeout(2, TimeUnit.SECONDS);
newConn.setReadFrom(new ReadFromSlavePreferred());

try {
    logger.debug(newConn.sync().get("hello"));
} catch (RedisException e) {
    logger.error(e);
}

When I run this (with node 7385 killed at the line indicated), both get commands throw a RedisConnectionException (stack trace at the bottom of this post). Stepping through the code, I see that the PooledClusterConnectionProvider correctly selects the valid nodes (code). However, it then calls getReadFromConnections(selection), which loops through all of the selected nodes and calls connectionProvider.getConnection for each of them (code).

That eventually ends up in SynchronizingClusterConnectionProvider.InProgress::getConnection (code), where the RedisConnectionException is thrown for the failed slave node 7385. Nothing catches this exception, and it is thrown all the way back up to my Main class. So even though two healthy slaves and the master were still available, the single failed node killed the entire command.

Is this the expected behavior? Is there something I can do to achieve the behavior I described at the beginning of the post? I have tried enabling all adaptive refresh triggers via ClusterTopologyRefreshOptions (enableAllAdaptiveRefreshTriggers), but that doesn't help.
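
For completeness, this is roughly how I enabled the triggers (again a from-memory sketch, so the exact builder calls I used may differ slightly):

// ClusterTopologyRefreshOptions and ClusterClientOptions live in com.lambdaworks.redis.cluster
ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAllAdaptiveRefreshTriggers() // MOVED/ASK redirects and persistent reconnects
        .build();

// Set on the RedisClusterClient before connecting in the test above.
client.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(refreshOptions)
        .build());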

Thanks,
Eliot

15:25:32.070 [main] ERROR Main - Unexpected exception:
com.lambdaworks.redis.RedisConnectionException: Unable to connect to HOSTNAME:7385
at com.lambdaworks.redis.cluster.SynchronizingClusterConnectionProvider$InProgress.getConnection(SynchronizingClusterConnectionProvider.java:246) ~[classes/:?]
at com.lambdaworks.redis.cluster.SynchronizingClusterConnectionProvider.getConnection(SynchronizingClusterConnectionProvider.java:88) ~[classes/:?]
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadFromConnections(PooledClusterConnectionProvider.java:182) ~[classes/:?]
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:151) ~[classes/:?]
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:86) ~[classes/:?]
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:128) ~[classes/:?]
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:133) ~[classes/:?]
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:206) ~[classes/:?]
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1959) ~[classes/:?]
at com.lambdaworks.redis.AbstractRedisAsyncCommands.get(AbstractRedisAsyncCommands.java:398) ~[classes/:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_51]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_51]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_51]
at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_51]
at com.lambdaworks.redis.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:143) ~[classes/:?]
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80) ~[classes/:?]
at com.sun.proxy.$Proxy8.get(Unknown Source) ~[?:?]
at com.eliotwong.Main.main(Main.java:69) [classes/:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /HOSTNAME:7385
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_51]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_51]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:612) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:559) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:476) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:438) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[netty-common-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) ~[netty-common-4.1.9.Final.jar:4.1.9.Final]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_51]
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_51]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_51]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:612) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:559) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:476) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:438) ~[netty-transport-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[netty-common-4.1.9.Final.jar:4.1.9.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) ~[netty-common-4.1.9.Final.jar:4.1.9.Final]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_51]


migr...@gmail.com

Apr 10, 2017, 10:08:58 AM
to lettuce-redis-client-users, migr...@gmail.com
It looks like, when the PooledClusterConnectionProvider is looking for read candidates, there is information available on the RedisClusterNodes indicating whether they are connected. Would it make sense to only include connected nodes as read candidates?

https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/cluster/PooledClusterConnectionProvider.java#L196

return partition.isConnected() && (master.getNodeId().equals(partition.getNodeId()) || master.getNodeId().equals(partition.getSlaveOf()));

This change allows me to continue reading from the cluster after RedisClusterClient::reloadPartitions has been called. However, before reloadPartitions() is called, all nodes are still considered connected and a RedisConnectionException is thrown from https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/cluster/PooledClusterConnectionProvider.java#L167. What is the recommended approach for reloading the partitions immediately after a failed node has been discovered? It doesn't seem like this case is covered by the cluster topology refresh triggers.

Could a try/catch be added at https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/cluster/PooledClusterConnectionProvider.java#L182 so that, if an exception is caught, a call is made to RedisClusterClient::reloadPartitions?
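
In the meantime, the application-level workaround I have been experimenting with (together with my local isConnected() change above) looks roughly like this:

String value;
try {
    value = conn.sync().get("hello");
} catch (RedisConnectionException e) {
    // A read candidate was unreachable: reload the topology so the failed
    // slave is no longer considered connected, then retry the read once.
    client.reloadPartitions();
    value = conn.sync().get("hello");
}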

mpa...@pivotal.io

Apr 10, 2017, 12:57:36 PM
to lettuce-redis-client-users, migr...@gmail.com
Hi Eliot,

thanks for your detailed explanation. You're right: a single unavailable node can currently prevent command dispatching. I think we can change the behavior to cope better with degradation. PooledClusterConnectionProvider.getReadConnection already checks for a functional connection, so a slight change to the exception behavior (from "fail on any failed connect" to "fail only when all connection attempts fail") should do the job.
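
Conceptually something along these lines (just a sketch of the intended behavior, not the actual implementation):

// Sketch: try each read candidate in ReadFrom order and fail only
// when none of the candidates can be connected.
RedisConnectionException lastFailure = null;
for (RedisClusterNode candidate : readCandidates) {
    try {
        return getConnection(candidate); // stands in for the provider's connect call
    } catch (RedisConnectionException e) {
        lastFailure = e; // remember the failure and keep trying the remaining candidates
    }
}
throw new RedisConnectionException("Cannot connect to any read candidate", lastFailure);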

It could also make sense to record connection failures as an adaptive refresh trigger. We wouldn't trigger a partition refresh immediately, but rather collect signals, so that a refresh isn't fired on every I/O problem; otherwise your Redis cluster and Redis client would be kept busy refreshing.


Cheers, 
Mark

migr...@gmail.com

Apr 10, 2017, 5:22:37 PM
to lettuce-redis-client-users, migr...@gmail.com, mpa...@pivotal.io
Hi Mark,

Thanks for your reply. Here is the issue I created: https://github.com/lettuce-io/lettuce-core/issues/503

Thanks,
Eliot