Jedis SocketTimeoutException to Redis


Shawn Clark

Sep 16, 2013, 5:01:07 PM
to redi...@googlegroups.com
Hi,

I have been tracking an issue with Jedis / Redis over the last two weeks while running load tests. Our load testbed has 2 physical servers for Redis, 4 physical servers for our service, and a bunch of servers for other things that aren't directly related to this topic. During load tests we make REST calls to the frontend of the service, which in turn uses Redis as a shared cache for fairly static data that is initially loaded from a database. The keys and values are strings up to 15 characters. The amount of cached data is under 10 GB. Load is 15k TPS (spread across the 4 service nodes), which translates roughly into 40k TPS to Redis.

Initially we were running with only two Redis nodes (5 GB memory each), one instance per physical server. What we found, though, is that when Redis hit this blocking issue our TPS would drop very significantly, ultimately because of the primarily single-threaded nature of Redis. We have mitigated this by running 5 instances per physical box, for a total of 10 Redis instances using ~1 GB of memory each. This has helped reduce the issue, but we still haven't found the root cause. We went from seeing thousands of errors per minute to under 50 errors per minute.

The attached infographic shows a sample of the transactions from our AppDynamics monitoring. When the system is running fine, each request responds in under 10 ms. When Redis starts to error we first see a few requests in a warning state (they complete without error but are slower than normal), and after that a number of requests in error (the Redis call fails and throws a JedisException, which triggers a database call for that data, but the request still completes).

Due to this issue the service has been set up with aggressive timeouts to Redis, so that when Redis is having a problem we aren't stuck waiting for it. The socket timeout for Redis is 50 ms, configured in the Jedis connection. Below are some exception stack traces showing the issue.

redis.clients.jedis.exceptions.JedisConnectionException : java.net.SocketTimeoutException
      at redis.clients.jedis.Connection.connect(Connection.java:134)
      at redis.clients.jedis.BinaryClient.connect(BinaryClient.java:69)
      at redis.clients.jedis.Connection.sendCommand(Connection.java:79)
      at redis.clients.jedis.Connection.sendCommand(Connection.java:75)
      at redis.clients.jedis.Jedis.get(Jedis.java:66)
      at redis.clients.jedis.ShardedJedis.get(ShardedJedis.java:43)
      ...

redis.clients.jedis.exceptions.JedisConnectionException : java.net.SocketTimeoutException: Read timed out
      at redis.clients.jedis.Protocol.process(Protocol.java:79)
      at redis.clients.jedis.Protocol.read(Protocol.java:131)
      at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:182)
      at redis.clients.jedis.Connection.getBulkReply(Connection.java:171)
      at redis.clients.jedis.Jedis.get(Jedis.java:67)
      at redis.clients.jedis.ShardedJedis.get(ShardedJedis.java:43)
      ...
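
For reference, a 50 ms socket timeout on a sharded Jedis setup like the one described above is normally set when the shard infos are created. A minimal sketch follows, using roughly the Jedis 2.x-era API; host names, ports and pool settings are purely illustrative and not the actual setup:

    import java.util.ArrayList;
    import java.util.List;
    import redis.clients.jedis.JedisPoolConfig;
    import redis.clients.jedis.JedisShardInfo;
    import redis.clients.jedis.ShardedJedis;
    import redis.clients.jedis.ShardedJedisPool;

    public class ShardedCacheExample {
        public static void main(String[] args) {
            // One JedisShardInfo per Redis instance; the third argument is the
            // socket/connect timeout in milliseconds (50 ms in the setup above).
            List<JedisShardInfo> shards = new ArrayList<JedisShardInfo>();
            shards.add(new JedisShardInfo("redis-host-1", 6379, 50));
            shards.add(new JedisShardInfo("redis-host-1", 6380, 50));
            shards.add(new JedisShardInfo("redis-host-2", 6379, 50));

            ShardedJedisPool pool = new ShardedJedisPool(new JedisPoolConfig(), shards);
            ShardedJedis jedis = pool.getResource();
            try {
                String value = jedis.get("some-key");   // shard picked by hashing the key
            } finally {
                pool.returnResource(jedis);             // Jedis 2.x-era pool API
            }
        }
    }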

We have run Redis benchmarks and read over the documentation at http://redis.io/topics/latency. We see similar issues during the benchmark where there are blips of increased response time.

Unfortunately we still haven't found a root cause and are running out of options to test. We are hoping to find the root cause in the next few days. Any assistance would be appreciated.

Thanks

Mauricio Gomez

Dec 19, 2013, 7:40:22 PM
to redi...@googlegroups.com
Hi there,

I just want to know if you found a solution for this?

Dmitriy Dumanskiy

Feb 7, 2014, 5:32:11 AM
to redi...@googlegroups.com
Hello Clark,

we have exactly the same case: load to Redis is about 40k req/sec, the timeout is 50 ms, and we are on the latest Redis and the latest Jedis. We periodically get timeout connection errors. Have you found a solution?

On Tuesday, September 17, 2013 at 00:01:07 UTC+3, Shawn Clark wrote:

Salvatore Sanfilippo

Feb 7, 2014, 5:47:54 AM
to Redis DB
Hello Shawn, Mauricio, Dmitriy,

the main sources of this kind of issue are documented here, as you already know: http://redis.io/topics/latency
If you can't find anything, I would try the software watchdog, though care should be taken, since running with the watchdog enabled may cause crashes of the server.

The obvious things to check are: disable persistence for a moment (both AOF and RDB) and check whether this is a persistence-related issue. If so, check whether there is a busy-disk problem, or whether you have big fork times because of EC2 instances that have the slow-fork issue. Check with SLOWLOG whether you are calling slow commands such as KEYS, or whether you are DELeting large keys, which is a blocking operation as well.
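
As a rough sketch, those checks map onto a handful of standard commands run from redis-cli (values here are illustrative):

    CONFIG SET save ""          (temporarily disable RDB snapshots)
    CONFIG SET appendonly no    (temporarily disable AOF)
    SLOWLOG RESET               (clear old entries before the next spike)
    SLOWLOG GET 10              (inspect the ten most recent slow commands)
    INFO stats                  (latest_fork_usec shows the duration of the last fork, in microseconds)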

Basically it is possible to provide some help if there is some ongoing investigation from your side that we can inspect.

For example you could use redis-cli --latency-history and correlate latency spikes to events in the Redis log.

Btw, since these issues are so common, in the next weeks I'll introduce a new LATENCY command inside Redis. The command is actually an internal framework that measures the latency of different operations: fork(), write(), command executions, and so forth. It can provide the latest N samples, or directly provide an analysis. For example, you'll be able to run "LATENCY DOCTOR" and it will tell you the likely source of your issues.

Just as an example, if it detects that your fork-time / data-size ratio is high, it will tell you there is an issue.
A more interesting example: if it detects slow command executions in the middle of O(1) or O(log(N)) commands, it will tell you that you have some problem with the scheduling of the process, that is, you are running in a crap VM with noisy neighbors or something like that.

Basically: a single command that gives you the picture.

Salvatore 




Dmitriy Dumanskiy

Feb 7, 2014, 7:13:41 AM
to redi...@googlegroups.com
Thank you for the reply, Salvatore. Yes, I have read the docs you provided.

I want to update this thread with more info.

Right now we are using the Amazon ElastiCache service on a c3.XLarge instance (previously we used Redis on EC2; it had the same problem). The current version of Redis is 2.6.13.
All persistence is turned off:

# Persistence
loading:0
rdb_changes_since_last_save:273926035
rdb_bgsave_in_progress:0
rdb_last_save_time:1391595182
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok

Our performance metrics show that the average response time from Redis is < 1 ms (this is from one server):
             count = 1389167
         mean rate = 314.84 calls/second
     1-minute rate = 759.36 calls/second
     5-minute rate = 613.73 calls/second
    15-minute rate = 453.61 calls/second
               min = 0.04 milliseconds
               max = 44.18 milliseconds
              mean = 0.44 milliseconds
            median = 0.32 milliseconds
              75% <= 0.83 milliseconds
              95% <= 5.94 milliseconds
              98% <= 10.55 milliseconds
              99% <= 14.96 milliseconds
            99.9% <= 44.09 milliseconds

We do not use slow operations; in this particular case it is just "GET" commands, at ~15k req/sec. All keys are 42 chars. Every 5 seconds there is a batch of "SADD" commands in a pipeline (~1000 "SADD" commands within one pipeline, with an average of 15 values per set). So ~30k operations/sec in total.
We have 10 servers and every server uses a pool of 24 connections.
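
For context, the pipelined SADD batch described above looks roughly like the sketch below in Jedis (the member values and host are made up, and a single plain connection is shown instead of the real pool, purely for illustration):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Pipeline;

    public class SaddBatchExample {
        public static void main(String[] args) {
            // Illustrative batch accumulated over ~5 seconds: set key -> members to add.
            Map<String, List<String>> batch = new HashMap<String, List<String>>();
            batch.put("288:1391706000000", Arrays.asList("member-a", "member-b"));

            Jedis jedis = new Jedis("redis-host", 6379);   // host/port are placeholders
            Pipeline p = jedis.pipelined();
            for (Map.Entry<String, List<String>> e : batch.entrySet()) {
                p.sadd(e.getKey(), e.getValue().toArray(new String[0]));
            }
            p.sync();          // flush the whole batch and read all replies in one round trip
            jedis.disconnect();
        }
    }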

The problem is that those timeouts are reproducible even with one server that generates only 800 req/sec of "GET" plus ~800 "SADD" commands within one pipeline, with an average of one value per set.

I'll continue the investigation and let you know if I find something.

On Friday, February 7, 2014 at 12:47:54 UTC+2, Salvatore Sanfilippo wrote:

Salvatore Sanfilippo

Feb 7, 2014, 9:56:49 AM
to Redis DB
Hello Dmitriy, debugging this while running on ElastiCache is going to be non-trivial, as you don't have full control over the system.
One thing you can try is the following:

config set slowlog-log-slower-than 10000

Then when the delay happens, try to run SLOWLOG GET to see if there are entries.

Basically, I'm not at all sure that under EC2 the process is guaranteed to never have 50-millisecond delays that are outside its own control.
If you see slowlog entries for constant-time or logarithmic commands, you know that it is likely not Redis or your setup that is to blame, but that the process was stopped for 50 milliseconds.

Another thing: did you verify with redis-cli --latency-history that the latency is observable outside the Java application?
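
For example (the endpoint is a placeholder):

    redis-cli -h <your-redis-endpoint> -p 6379 --latency-history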

Salvatore

Dmitriy Dumanskiy

Feb 7, 2014, 12:08:03 PM
to redi...@googlegroups.com
Thank you, it looks like I found the problem:

redis-cli --latency
min: 0, max: 15, avg: 0.33 (17558 samples)
The latency is very close to what we have in our server metrics, so it looks like the network is not the problem.


SLOWLOG GET
 1) 1) (integer) 36
    2) (integer) 1391709993
    3) (integer) 10835
    4) 1) "GET"
       2) "2ads10percent_ihgydxhzfodzscqmovif"
 2) 1) (integer) 35
    2) (integer) 1391709950
    3) (integer) 34155
    4) 1) "GET"
       2) "2ads10percent_rmywqesssitmfksetzvj"
 3) 1) (integer) 34
    2) (integer) 1391709830
    3) (integer) 34863
    4) 1) "GET"
       2) "2ads10percent_tteeoomiimcgdzcocuqs"
 4) 1) (integer) 33
    2) (integer) 1391709710
    3) (integer) 35754
    4) 1) "GET"
       2) "2ads10percent_txewdysvbluimspsirgo"
5) 1) (integer) 32
    2) (integer) 1391709566
    3) (integer) 30654
    4) 1) "SADD"
       2) "288:1391706000000"
       3) "2ads10percent_conlonuorzbzlplinmum"
       4) "2ads10percent_oqtlcyhnbmefudzwzzmb"
6) 1) (integer) 31
    2) (integer) 1391709086
    3) (integer) 18924
    4) 1) "GET"
       2) "2ads10percent_sypzzfpbimhqkbtobzbn"
7) 1) (integer) 30
    2) (integer) 1391707881
    3) (integer) 36753
    4)  1) "SADD"
        2) "246:1391706000000"
        3) "2ads10percent_rjflkrypmrkupdtiezer"
        4) "2ads10percent_togkxdhsvtsrcedqylql"
        5) "2ads10percent_rejrwiqdowkiggieectq"
        6) "2ads10percent_mmkmwceogqqbkwrbqzxt"
        7) "2ads10percent_tmbjunuqtpmuiuwqqgrx"
        8) "2ads10percent_mruirkjwvnonexfqybcp"
        9) "2ads10percent_yjtsbkwjzygiqfxucttx"
       10) "2ads10percent_fcmzteyojtzjkffiudhq"
8) 1) (integer) 27
    2) (integer) 1391702958
    3) (integer) 32322
    4)  1) "SADD"
        2) "826:1391702400000"
        3) "2ads10percent_llxlofycjqxdlljqzjnx"
        4) "2ads10percent_yjiogzjcdtsbgfwjepwv"
        5) "2ads10percent_nptdkryxyifkslwopuvt"
        6) "2ads10percent_ghbxhhcoolhhospxlfxn"
        7) "2ads10percent_yppzohrykdccgtlrknbs"
        8) "2ads10percent_bkncehcpogwiobmbwjsd"
        9) "2ads10percent_gzslzbhnlmwpqtjrrkjd"
       10) "2ads10percent_jesbjzsunbxgeyfsztyc"
       11) "2ads10percent_szwykwkevzzmcjfyvuij"
       12) "2ads10percent_vwhorisrvxqupxnetjpw"
9) 1) (integer) 25
    2) (integer) 1391700982
    3) (integer) 20682
    4)  1) "SADD"
        2) "779:1391698800000"
        3) "2ads10percent_zwsnscrtwkqbtbtrgrcn"
        4) "2ads10percent_tqretchglfjwibdflfik"
        5) "2ads10percent_ulcgwjkxbmmlpywugblk"
        6) "2ads10percent_pckrqncymrkvxhkyxtec"
        7) "2ads10percent_cfxndtgzduydxiixkdio"
        8) "2ads10percent_nkbxeghuvxkpsufrzdwc"
        9) "2ads10percent_vqbqwdnzcrmnekulsffy"
       10) "2ads10percent_defzcizrcimpknszprro"
       11) "2ads10percent_qxvtciuyjqumtbmdkhow"
       12) "2ads10percent_yvngkozlqbqbejhtvyre"
10) 1) (integer) 23
    2) (integer) 1391698040
    3) (integer) 15948
    4) 1) "EXPIRE"
       2) "581:1391695200000"
       3) "10800"

It seems that when those batches of "SADD" commands are performed, CPU usage increases and "GET" commands become too slow, which is causing the SocketTimeout.
I'll check again to be sure, but thanks for the help.

On Friday, February 7, 2014 at 16:56:49 UTC+2, Salvatore Sanfilippo wrote:

Salvatore Sanfilippo

Feb 7, 2014, 12:16:30 PM
to Redis DB

On Fri, Feb 7, 2014 at 6:08 PM, Dmitriy Dumanskiy <doo...@gmail.com> wrote:
> It seems that when those batches of "SADD" commands are performed, CPU usage increases and "GET" commands become too slow, which is causing the SocketTimeout.
> I'll check again to be sure, but thanks for the help.

Great investigation, wrong analysis :-)

Redis is single threaded, so the SLOWLOG is reporting that the Redis instance took 34 milliseconds to *execute* the GET:

3) 1) (integer) 34
    2) (integer) 1391709830
    3) (integer) 34863
    4) 1) "GET"
       2) "2ads10percent_tteeoomiimcgdzcocuqs"

This means that the Redis process was stopped in the middle of the computation by the kernel, out of its control, while it was not doing any syscall.

This means that there are two possibilities:

A) There are swapped-out pages that need to be loaded when accessed. You can verify this as explained in the "latency" page at redis.io, except that in this case you can't, since it is Amazon's business.
B) The hypervisor is not letting the VM run in a decent way for some reason.

There are people here who are much more expert than me and can comment on why an EC2 instance might stop in the middle of a computation for 50 milliseconds.

Long story short: it is not a problem of Redis or your application.

Salvatore

Nate Wiger

Feb 7, 2014, 1:03:28 PM
to redi...@googlegroups.com
Amazon ElastiCache doesn't currently support C3; could you double-check what type of instance this is?  M3?  C1?

Dmitriy Dumanskiy

Feb 10, 2014, 4:15:56 AM
to redi...@googlegroups.com
Nate,

Sorry for the confusion. Right now we are using m2.XLarge; before that we were using c1.XLarge. The problems occurred with both instance types.

Salvatore, thanks for the correction =). I'll check with my team what can be done here.

On Friday, February 7, 2014 at 20:03:28 UTC+2, Nate Wiger wrote:

Neelesh Korade

Feb 24, 2014, 2:06:11 AM
to redi...@googlegroups.com
Hi All,

First of all, thanks to Salvatore for giving such a detailed response on how latency-related issues can be debugged and investigated in Redis. It was really helpful.

We are seeing some intermittent SocketTimeoutExceptions too. I tried to use the output from --latency-history, INFO and the Redis log, and correlate them to figure out what the problem could be. However, I haven't been able to conclude anything yet. At the end of this email, please find the output of the relevant commands run on one of our Redis instances that is showing SocketTimeouts.

Note that the Redis timeout is set to 0 (to never time out) and the client timeout is set to 250 ms. The RDB save config is "720 10000" (that is, save every 12 minutes if there are at least 10k changes). From the Redis logs, I see that the snapshot process kicks in about every 12 minutes and takes a little over 3 minutes to finish. Also, as can be seen in the INFO output, almost all our keys expire.
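
In redis.conf terms, the server-side settings described above correspond roughly to the following (an illustrative excerpt, not the actual file; the 250 ms timeout lives on the client side):

    save 720 10000    # RDB snapshot every 720 seconds if at least 10000 keys changed
    timeout 0         # never disconnect idle clients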

I am unable to deduce what might be the reason behind the latency spikes and the resulting SocketTimeouts, or whether there is a way to fix them. There seems to be some correlation between the latency spikes and the forking of the snapshot process, but since the fork time is within 50 ms, I am a little unsure whether this could lead to the latency spikes.

Would appreciate any help with this. 

Thank you
Neelesh
  • Slowlog (showing one representative query's stats):
    1. (integer) 16253
    2. (integer) 1393221057
    3. (integer) 81780
      1. "HGET"
      2. "720503294"
      3. "pl3h"
  • Redis log (showing snapshot time, COW memory usage etc. for two back-to-back snapshot runs):
    [9532] 24 Feb 04:33:26.007 * 10000 changes in 720 seconds. Saving...
    [9532] 24 Feb 04:33:26.052 * Background saving started by pid 13762
    [13762] 24 Feb 04:36:41.353 * DB saved on disk
    [13762] 24 Feb 04:36:41.374 * RDB: 6986 MB of memory used by copy-on-write
    [9532] 24 Feb 04:36:41.454 * Background saving terminated with success
    [9532] 24 Feb 04:48:42.079 * 10000 changes in 720 seconds. Saving...
    [9532] 24 Feb 04:48:42.117 * Background saving started by pid 16276
    [16276] 24 Feb 04:52:05.121 * DB saved on disk
    [16276] 24 Feb 04:52:05.141 * RDB: 7017 MB of memory used by copy-on-write
    [9532] 24 Feb 04:52:05.293 * Background saving terminated with success

  • INFO output:
                    # Server
                    redis_version:2.6.14
                    redis_git_sha1:00000000
                    redis_git_dirty:0
                    redis_mode:standalone
                    os:Linux 2.6.32-358.18.1.el6.YAHOO.20131009.x86_64 x86_64
                    arch_bits:64
                    multiplexing_api:epoll
                    gcc_version:4.4.7
                    process_id:9532
                    run_id:c8fcda6acd67517944729f39c20dc8c7d5cd6b58
                    tcp_port:6379
                    uptime_in_seconds:3606035
                    uptime_in_days:41
                    hz:10
                    lru_clock:910196

                    # Clients
                    connected_clients:6
                    client_longest_output_list:0
                    client_biggest_input_buf:0
                    blocked_clients:0

                    # Memory
                    used_memory:7480050936
                    used_memory_human:6.97G
                    used_memory_rss:8104726528
                    used_memory_peak:7754724248
                    used_memory_peak_human:7.22G
                    used_memory_lua:31744
                    mem_fragmentation_ratio:1.08
                    mem_allocator:jemalloc-3.2.0

                    # Persistence
                    loading:0
                    rdb_changes_since_last_save:232700
                    rdb_bgsave_in_progress:0
                    rdb_last_save_time:1393222243
                    rdb_last_bgsave_status:ok
                    rdb_last_bgsave_time_sec:237
                    rdb_current_bgsave_time_sec:-1
                    aof_enabled:0
                    aof_rewrite_in_progress:0
                    aof_rewrite_scheduled:0
                    aof_last_rewrite_time_sec:-1
                    aof_current_rewrite_time_sec:-1
                    aof_last_bgrewrite_status:ok

                    # Stats
                    total_connections_received:3760
                    total_commands_processed:6375092026
                    instantaneous_ops_per_sec:1289
                    rejected_connections:0
                    expired_keys:645850045
                    evicted_keys:0
                    keyspace_hits:4551557333
                    keyspace_misses:671409325
                    pubsub_channels:0
                    pubsub_patterns:0
                    latest_fork_usec:45864

                    # Replication
                    role:master
                    connected_slaves:0

                    # CPU
                    used_cpu_sys:49000.62
                    used_cpu_user:44309.82
                    used_cpu_sys_children:15537.03
                    used_cpu_user_children:146168.59

                    # Keyspace
                    db0:keys=35753864,expires=35753842

  • redis-cli --latency-history (partial output; the spikes repeat every 15 minutes or so):
                    min: 0, max: 1, avg: 0.14 (1463 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.11 (1467 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.09 (1470 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1470 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1471 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.12 (1466 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.11 (1467 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.13 (1467 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.12 (1464 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.11 (1469 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.13 (1464 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.10 (1470 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.12 (1468 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1473 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.11 (1469 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1473 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1473 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1472 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1471 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.10 (1472 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.11 (1472 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.09 (1472 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.08 (1472 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.10 (1472 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.10 (1468 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.08 (1472 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.10 (1468 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.07 (1475 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.08 (1474 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.08 (1474 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.09 (1472 samples) -- 15.00 seconds range
                    min: 0, max: 1, avg: 0.10 (1468 samples) -- 15.00 seconds range
                    min: 0, max: 373, avg: 0.62 (1402 samples) -- 15.05 seconds range
                    min: 0, max: 346, avg: 1.12 (1339 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.07 (1474 samples) -- 15.01 seconds range
                    min: 0, max: 1, avg: 0.08 (1477 samples) -- 15.01 seconds range





Salvatore Sanfilippo

Feb 24, 2014, 10:12:27 AM
to Redis DB
Hello Dmitry,

you may want to try this program: https://gist.github.com/antirez/8913533

Basically you have slow log entries in the context of HGET, which is an O(1)
command, so the system's intrinsic latency is pretty bad.
Note that fork times are not related to HGET slowing down, since the
two happen sequentially, and fork will never happen in the middle of a
command.
My guess is that this may be related to intrinsic latency that gets
even worse when you do something like fork(), so the two will add up, or
worse.

The output of a 100-second run of the above program may tell us
something more about the intrinsic latency and provide further info,
but since we are seeing this happening in O(1) commands there really
aren't many options: either swapped pages (which you can check, and
is unlikely), or the scheduler not running your server for a long
time.

Salvatore

Neelesh Korade

Feb 24, 2014, 11:30:25 AM
to redi...@googlegroups.com
Thank you Salvatore (I am assuming your response was to my email and you inadvertently addressed it to Dmitry).

One quick question: if an HGET call is received by the server and is in the queue, but in between the server starts a fork, doesn't the execution time of HGET include the time it takes for the server to finish the fork, plus the time to execute the commands that are ahead of HGET in the queue? (Or is my understanding totally off!)

Thank you
Neelesh


Salvatore Sanfilippo

Feb 24, 2014, 11:35:56 AM
to Redis DB
On Mon, Feb 24, 2014 at 5:30 PM, Neelesh Korade
<neelesh...@gmail.com> wrote:
> Thank you Salvatore (I am assuming your response was to my email and you
> inadvertently addressed it to Dmitry).

Sorry Neelesh... I always reply in a rush to the ML :-)

> One quick question: if an HGET call is received by the server and is in
> the queue, but in between the server starts a fork, doesn't the execution time
> of HGET include the time it takes for the server to finish the fork, plus the
> time to execute the commands that are ahead of HGET in the queue? (Or is my
> understanding totally off!)

No, the slow log accounts *only* for the command execution, minus all
the time for network I/O, dispatch, ...

This is the relevant source code:

    /* Call the command. */
    c->flags &= ~(REDIS_FORCE_AOF|REDIS_FORCE_REPL);
    redisOpArrayInit(&server.also_propagate);
    dirty = server.dirty;
    start = ustime();
    c->cmd->proc(c);
    duration = ustime()-start;
    dirty = server.dirty-dirty;

As you can see, we are just timing c->cmd->proc(c), that is, the
function implementing the command.
So the reply to the client is not counted either (but the
time needed to *populate* the reply buffer is).

Salvatore

Keith Frost

Feb 24, 2014, 2:19:27 PM
to redi...@googlegroups.com
Hello Dmitry,

In my experience, long-tail latency problems for O(1) commands in Redis on Linux, when not caused by extreme load, are usually related to disk I/O.  Are you using RDB snapshots, or AOF?  If the former, look to see whether the timeouts correlate with when the RDB snapshots are taking place.  If the latter, check the same for AOF rewrites.  As a test, you can turn off persistence entirely for some period of time (on non-production systems, of course!) and see if your timeouts go away for the duration of the test.  What I've found helpful in the past is to (1) prefer AOF persistence to RDB snapshots, and (2) deliberately trigger background AOF rewrites during reduced-traffic times, to minimize the impact.  In any case, you need to engineer your Redis client systems to be robust against occasional latency spikes.
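
One way to do the deliberate off-peak rewrite mentioned above is to schedule it externally, for example from cron (the schedule and host are illustrative):

    # trigger a background AOF rewrite at 04:00, a low-traffic hour
    0 4 * * * redis-cli -h redis-host -p 6379 BGREWRITEAOF

This is typically combined with turning off the automatic trigger (auto-aof-rewrite-percentage 0) so rewrites only happen on the chosen schedule.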

Hope this is helpful,

Keith Frost

Salvatore Sanfilippo

Feb 24, 2014, 4:45:33 PM
to Redis DB
Hello Keith,

good suggestions, but while slow O(1) commands in general can be caused by
what you said, an O(1) command *logged into* the SLOWLOG can't be related to
slow disks, since there is no write(2) in the code path executing the
commands: it is just memory business, access some data, write a
reply into a buffer, append some Redis protocol to the AOF output
buffer. So slow O(1) commands that get logged are almost surely related to
something outside Redis.

Salvatore

Keith Frost

Feb 24, 2014, 4:50:53 PM
to redi...@googlegroups.com
Can't that something outside the Redis process be the system managing e.g. copy-on-write memory for the forked Redis background process which is dumping the dataset to disk, though?

Salvatore Sanfilippo

Feb 25, 2014, 1:26:48 AM
to Redis DB
On Mon, Feb 24, 2014 at 10:50 PM, Keith Frost <keith....@gmail.com> wrote:
> Can't that something outside the Redis process be the system managing e.g.
> copy-on-write memory for the forked Redis background process which is
> dumping the dataset to disk, though?

Copy-on-write happens on a per-page basis and can't be responsible for
measurable latency: what it affects is overall performance, as it uses
CPU.
What instead can block the Redis process, from the point of view of
operating system paging, is swapped-out pages, but to check for this
there is a guide in the "latency" page at redis.io
(and it is very unlikely to happen in most workloads).

Salvatore


Keith Frost

Feb 25, 2014, 9:43:43 AM
to redi...@googlegroups.com

Aha, I see.  Thank you for illuminating this point.  It is perplexing, because I know I have often seen high-latency read-only operations even on dedicated Redis machines, especially when AOF rewrites or RDB snapshots are occurring.

However, those machines were running many Redis instances (approximately one per core) with only one slow disk.  So perhaps the Linux kernel scheduler or disk buffering subsystem performs sub-optimally under heavy I/O backlog, and starts swapping out pages it really should not when there are many buffers to flush, or interrupts perfectly good CPU-bound processes to block other processes on I/O. Something like that must be happening to Dmitry as well.


Salvatore Sanfilippo

Feb 25, 2014, 9:54:03 AM
to Redis DB
On Tue, Feb 25, 2014 at 3:43 PM, Keith Frost <keith....@gmail.com> wrote:
> It is perplexing, because I know I have often seen high-latency read-only
> operations even on dedicated Redis machines, especially when AOF rewrites or
> RDB snapshots are occurring.

Yep, this is entirely possible, but you don't usually see the entry in
the slow log.

What happens in this case, if the disk can't cope, is:

1) Redis receives a command.
2) Redis executes the command; the execution of the command is fast and
does not get logged.
3) Now the command has to be written to the AOF file (so far it was
accumulated in a buffer). If the disk can't cope, the kernel blocks the
write instead of buffering it.

So you see a latency spike with AOF + busy disks.

Btw, the plan is to soon have Redis able to log the latency history of the
different syscalls, to debug this kind of issue easily.

Keith Frost

Feb 25, 2014, 10:06:30 AM
to redi...@googlegroups.com
The scenario you outline is the case for a mutating operation, which makes perfect sense.  But we would see high-latency outliers even on commands like ZRANGEBYSCORE (with a limit of 50 items returned), so there was no need for that process to write to the AOF file.  This is the perplexing case.  I suppose it is possible that the I/O backlog grew large enough that active memory pages got swapped to disk.  I expect the additional logging may help provide some insight.

Thank you,

Keith Frost


Salvatore Sanfilippo

Feb 25, 2014, 10:51:42 AM
to Redis DB
On Tue, Feb 25, 2014 at 4:06 PM, Keith Frost <keith....@gmail.com> wrote:
> I suppose it is possible that the I/O backlog grew large enough that active
> memory pages got swapped to disk. I expect the additional logging may help
> provide some insight.

That's possible, but fortunately also observable:

    cd /proc/<redis-server-pid>
    cat smaps | grep 'Swap:'

However, this is one thing that the "LATENCY" command will be able to
do automatically when the user asks for a report.

saurabh gupta

Apr 17, 2016, 4:02:41 AM
to Redis DB
Hi Shawn,

Although it has been a long time since you reported the issue, it is still reproducible on our side. We are using Redis 3.0.1 and Jedis 2.7.2 and are facing the SocketTimeout issue even after trying all the steps mentioned in this thread, but we are not able to understand the reason for it.

Did you find the solution or root cause for this?

Please help, as we are getting this issue in production.

Regards
Saurabh