Getting RedisCommandTimeoutException after updating to 5.1.7.RELEASE

vleu...@gmail.com

Jun 25, 2019, 9:33:39 AM6/25/19
to lettuce-redis-client-users
Heya! I'm scratching my head over a new problem on our production servers.

We recently added some more nodes to our Redis (5.0.5) cluster (it now has 12x r3.large instances with 16 GB of RAM each, plus 12 slaves for them). I noticed a noticeable CPU increase, traced it to 
https://github.com/lettuce-io/lettuce-core/issues/1011, and decided to update to 5.1.7.RELEASE (from 5.1.6.RELEASE).

We have 16 application nodes. At some point (after a couple of hours of running in production) we started getting RedisCommandTimeoutException (on one node, then on another):

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 10 second(s)
    at io.lettuce.core.ExceptionFactory.createTimeoutException(ExceptionFactory.java:51)
    at io.lettuce.core.protocol.CommandExpiryWriter.lambda$potentiallyExpire$0(CommandExpiryWriter.java:167)
    at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
    at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:835)

I'm not sure how to begin troubleshooting the problem; it feels like something trips and all communication to Redis is lost. What logging (or anything else) can I enable to get more information about what is getting lost or timing out?

Also, we are using custom ClusterClientOptions:
        return ClusterClientOptions.builder()
                .topologyRefreshOptions(ClusterTopologyRefreshOptions.builder()
                        .enablePeriodicRefresh(Duration.ofMinutes(3))
                        .enableAllAdaptiveRefreshTriggers()
                        .build())
                .timeoutOptions(TimeoutOptions.builder()
                        .timeoutCommands(true)
                        .fixedTimeout(Duration.of(10, ChronoUnit.SECONDS))
                        .build())
                .build();
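
For context, these options are wired into the cluster client roughly like this (a simplified sketch rather than our exact setup; the class name and the host in the URI are placeholders):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class RedisClusterSetup {

    // Sketch only: the URI is a placeholder; 'options' is the ClusterClientOptions built above.
    public static StatefulRedisClusterConnection<String, String> connect(ClusterClientOptions options) {
        ClientResources resources = DefaultClientResources.create();
        RedisClusterClient clusterClient = RedisClusterClient.create(resources,
                RedisURI.create("redis://redis-node-1:6379"));
        clusterClient.setOptions(options);
        return clusterClient.connect();
    }
}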

Also, I noticed that if I remove enableAllAdaptiveRefreshTriggers(), there is no noticeable CPU increase with 5.1.6.RELEASE.

Thanks in advance and best regards,
Valery

vleu...@gmail.com

Jun 25, 2019, 9:35:27 AM6/25/19
to lettuce-redis-client-users
Also, I forgot to mention that the Redis cluster is healthy and there are no topology changes happening in it.

Mark Paluch

Jun 26, 2019, 2:15:54 AM6/26/19
to lettuce-redis-client-users
Hi Valery, 

You're already the second person to report CPU spikes [0] with Redis Cluster. We need to investigate what's causing these.

Command timeouts are a quite broad category of issue. You can enable debug logging for io.lettuce.core.protocol. Logging settings are cached once Lettuce has started, so you need to restart your JVM for the change to take effect. Timeouts happen when Lettuce does not see a response to a command in time. This can be due to real network issues, a slow Redis, usage of blocking commands, high CPU consumption on your application server, or protocol issues where requests are no longer correlated with responses.
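
For example, assuming Logback as the SLF4J backend (an assumption on my side; your setup may differ), enabling DEBUG for just the protocol package could look like this sketch:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LettuceDebugLogging {

    // Raise only the Lettuce protocol logger to DEBUG. Do this at application startup,
    // before any Lettuce client is created, because the logging settings are cached.
    public static void enableProtocolDebugLogging() {
        Logger protocolLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core.protocol");
        protocolLogger.setLevel(Level.DEBUG);
    }
}

The equivalent static configuration is a single logger entry for io.lettuce.core.protocol at DEBUG level in whatever logging backend you use.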

Cheers, 
Mark


vleu...@gmail.com

Jun 26, 2019, 2:56:21 AM6/26/19
to lettuce-redis-client-users
Hey Mark!

So far I have managed to isolate the problem to switching Lettuce to 5.1.7; 5.1.6 works fine, but with higher CPU (disabling enableAllAdaptiveRefreshTriggers reduced CPU noticeably).
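
As a possible middle ground (just a sketch on my side, I have not tried this yet), the topology refresh builder also allows enabling only specific adaptive triggers instead of all of them:

import java.time.Duration;

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions.RefreshTrigger;

public class PartialAdaptiveRefresh {

    // Keep periodic refresh, but react adaptively only to MOVED redirects and
    // persistent reconnects instead of every trigger.
    public static ClusterClientOptions buildOptions() {
        return ClusterClientOptions.builder()
                .topologyRefreshOptions(ClusterTopologyRefreshOptions.builder()
                        .enablePeriodicRefresh(Duration.ofMinutes(3))
                        .enableAdaptiveRefreshTrigger(RefreshTrigger.MOVED_REDIRECT,
                                RefreshTrigger.PERSISTENT_RECONNECTS)
                        .build())
                .build();
    }
}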

The problem with Lettuce debug logging is that it is too verbose for production. Our Graylog servers get overwhelmed within a couple of minutes (literally millions of messages from Lettuce; Graylog can keep up with only around 10k messages per second, the rest are queued, and that queue grows very large -- hours of backlog if debug logging is kept on for something like 30 minutes).

And the timeout exceptions only manifest after a couple of hours of running in production, so we would not notice them in time.

Maybe I could turn on debug logging only for some specific classes, like those responsible for topology updates?

Also, I noticed that at INFO level Lettuce is very quiet; there are almost no log messages.

Any thoughts?

Mark Paluch

Jun 26, 2019, 6:28:02 AM6/26/19
to lettuce-redis-client-users
I already feared that debug logging, given the long time until you can reproduce the issue, is not the best approach.
I pushed two changes to master to reduce object allocations during cluster topology refresh (see Lettuce 5.2.0.BUILD-SNAPSHOT, [0] [1]).

Reproducing that state is somewhat challenging. Is it an option for you to enable the debug port and interrupt your application at the moment you run into timeouts?
If so, you could have a look at CommandHandler.stack to see which commands are in the queue. There are not many options beyond that, though, besides looking at the wire and figuring out why commands weren't completed.

vleu...@gmail.com

Jun 26, 2019, 10:37:14 AM6/26/19
to lettuce-redis-client-users
Thanks! I will try the new library tomorrow morning.

Interrupting is quite challenging because Kubernetes will kill the pod if it does not respond to health checks. But I will look into it and will probably come up with something. My current option is to add logic on the server that I can invoke at any time to print the mentioned CommandHandler.stack. It is likely well encapsulated, but I can copy the necessary Lettuce classes into our code base and make access to them public (hacky, but helpful in difficult situations); I just have to know what to look for. I will come back tomorrow when I have more data.

vleu...@gmail.com

Jun 27, 2019, 8:07:08 AM6/27/19
to lettuce-redis-client-users
I tried the 5.2.0.BUILD-SNAPSHOT library today. I got a RedisCommandTimeoutException within 20 minutes. No significant CPU usage. I will proceed with stack logging.

Mark Paluch

Jun 27, 2019, 8:12:53 AM6/27/19
to lettuce-redis-client-users, vleu...@gmail.com
Thanks a lot. That is at least one piece of good news. We reduced object allocations; those allocations had likely been increasing GC pressure.

vleu...@gmail.com

Jul 1, 2019, 5:38:33 AM7/1/19
to lettuce-redis-client-users
I've managed to reproduce the problem in our dev environment, without any considerable load. It has Lettuce debug logs enabled, and here is how they look:

https://gist.github.com/vleushin/f46f8b6fd53358fe55f8ea2417f07928

The command stack seems to be mostly empty. Any thoughts?

vleu...@gmail.com

Jul 5, 2019, 4:26:27 AM7/5/19
to lettuce-redis-client-users
Sharing some of my findings.

I ran a profiler and took some memory snapshots.

[Attachment: lettuce-1.png]

It was amazing how much memory Lettuce metrics took. Also, we had this piece of code which consumed those metrics:


// Subscribes to Lettuce's CommandLatencyEvent on the client event bus and forwards
// per-command latency percentiles to our stats collector.
public static void collectRedisMetrics(IStatsEventCollector statsEventCollector, ClientResources clientResources, String storage) {
    EventBus eventBus = clientResources.eventBus();
    eventBus.get()
            .filter(redisEvent -> redisEvent instanceof CommandLatencyEvent)
            .cast(CommandLatencyEvent.class)
            .subscribe(e -> {
                Set<Map.Entry<CommandLatencyId, CommandMetrics>> commandLatencies = e.getLatencies().entrySet();
                for (Map.Entry<CommandLatencyId, CommandMetrics> commandLatency : commandLatencies) {
                    String keyword = commandLatency.getKey().commandType().name();
                    CommandMetrics commandMetric = commandLatency.getValue();
                    long count = commandMetric.getCount();
                    CommandMetrics.CommandLatency completion = commandMetric.getCompletion();
                    Map<Double, Long> percentiles = completion.getPercentiles();

                    statsEventCollector.durationEvent("RedisLatency", Map.of("storage", storage, "command", keyword, "measure", "count"), count);
                    statsEventCollector.durationEvent("RedisLatency", Map.of("storage", storage, "command", keyword, "measure", "min"), completion.getMin());
                    statsEventCollector.durationEvent("RedisLatency", Map.of("storage", storage, "command", keyword, "measure", "90"), percentiles.get(90d));
                    statsEventCollector.durationEvent("RedisLatency", Map.of("storage", storage, "command", keyword, "measure", "95"), percentiles.get(95d));
                    statsEventCollector.durationEvent("RedisLatency", Map.of("storage", storage, "command", keyword, "measure", "99"), percentiles.get(99d));
                    statsEventCollector.durationEvent("RedisLatency", Map.of("storage", storage, "command", keyword, "measure", "max"), completion.getMax());
                }
            });
}

Not sure if it mattered; anyway, I tried disabling metrics:

@Provides
ClientResources provideClientResources() {
    return DefaultClientResources.builder()
            .commandLatencyCollectorOptions(DefaultCommandLatencyCollectorOptions.disabled())
            .build();
}

And both CPU and memory went down drastically. This is still on Lettuce 5.1.6. I will try 5.1.7 soon, after I run some more experiments.

Just to give you some perspective on how drastically latency changed (admittedly, this is still a sort of overall average):

[Attachment: Screenshot 2019-07-05 at 11.15.01.png]

And memory:

[Attachment: Screenshot 2019-07-05 at 11.16.36.png]

Also, I recall that we had a discussion about client resources here: https://github.com/lettuce-io/lettuce-core/issues/705

And yes, we still create multiple instances of them, because:
* we had issues where, if one of the ~30 microservices we run is written poorly, it affects all the other microservices on the JVM, and it is very hard to diagnose which service is responsible
* we had an issue where one microservice issued a command of the wrong type and it resulted in a complete node crash. That command belonged to an admin service and was triggered rarely, so it was very hard to diagnose

Still, interested in your thoughts.

Mark Paluch

Jul 5, 2019, 5:50:52 AM7/5/19
to lettuce-redis-client-users, vleu...@gmail.com
Thanks for sharing your insights.

I do not understand the relationship between service instances (JVMs) and the number of ClientResource instances (objects in a single JVM). Care to elaborate?

ClientResources encapsulates the I/O thread pool and other expensive resources; therefore, you should keep only a single instance, which you can then share with one or many RedisClient/RedisClusterClient instances.
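
For illustration, the sharing pattern looks roughly like this (a sketch; the class name and URIs are placeholders, not your actual hosts):

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class SharedClientResources {

    // One ClientResources per JVM: it owns the I/O and computation thread pools.
    private static final ClientResources SHARED = DefaultClientResources.create();

    // Any number of clients can reuse the shared resources.
    public static RedisClusterClient newClusterClient() {
        return RedisClusterClient.create(SHARED, RedisURI.create("redis://redis-cluster-node:6379"));
    }

    public static RedisClient newStandaloneClient() {
        return RedisClient.create(SHARED, RedisURI.create("redis://redis-standalone:6379"));
    }

    // Shut the shared resources down once, on application shutdown.
    public static void shutdown() {
        SHARED.shutdown();
    }
}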

Typically, keep a single ClientResources instance per JVM. I'm wondering why the latency collector is so bloated. There should be at most (number of commands * number of remote Redis instances) objects. 50 MB is a bit much, though.

Can you share the heap dump with me so I can investigate the root cause?

Cheers, 
Mark


vleu...@gmail.com

Jul 8, 2019, 6:58:51 AM7/8/19
to lettuce-redis-client-users
I'm afraid sharing the dump would leak a lot of sensitive information. But I will look into it myself and try to figure out why. We have 68 Redis instances (five clusters of up to 12 nodes each, plus four master-slave setups) -- maybe that's a lot.

App functionality is separated into small, independent gRPC microservices, and most of them use Redis as storage. I have nothing against sharing and using fewer resources (I actually tried deploying with a single ClientResources -- so far it seems to work fine, but I'm still evaluating). Those services are intended to be isolated from each other, but with a shared ClientResources, if anything happens to the command handler (a bogus command issued by some service, or one service issuing a lot of commands -- e.g. one used for admin/internal or migration purposes), all services that use Redis are affected. We have had those issues before.

We have 16 nodes, each running all of those 40-50 gRPC services.

Regarding 

trisha...@gmail.com

Jul 22, 2019, 7:02:37 PM7/22/19
to lettuce-redis-client-users
Hi Valery,

Were you able to figure out the root cause of the CommandTimeoutExceptions? We're facing the same issue after upgrading to 5.1.7.RELEASE.