Redis Error: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis

2,196 views
Skip to first unread message

ravi beli

unread,
Dec 15, 2017, 2:25:33 AM12/15/17
to Redis DB
I have run below Test-1 and Test-2 for longer run for performance test with redis configuration values specified, still we see the highlighted error-1 & 2 message and cluster is failing for some time, few of our processing failing. How to solve this problem.

please anyone have suggestion to avoid cluster fail which is goes longer than 10seconds, cluster is not coming up within 3 retry attempts (spring retry template we are using for retry mechanism try count is set to 3, and retry after 5sec, its exponential way next attempts) using Jedis client.

Error-1: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

Error-2: Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
    
Test-1:

    Run the test with Redis Setting: 

    "appendfsync"="yes" 
    "appendonly"="no"

    [root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6379
    
    127.0.0.1:6379> CONFIG GET aof

    1) "auto-aof-rewrite-percentage"
    2) "30"
    3) "auto-aof-rewrite-min-size"
    4) "67108864"
    5) "aof-rewrite-incremental-fsync"
    6) "yes"
    7) "aof-load-truncated"
    8) "yes"
    127.0.0.1:6379> exit


    [root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6380

    127.0.0.1:6380> CONFIG GET **aof***

    1) "auto-aof-rewrite-percentage"
    2) "30"
    3) "auto-aof-rewrite-min-size"
    4) "67108864"
    5) "aof-rewrite-incremental-fsync"
    6) "yes"
    7) "aof-load-truncated"
    8) "yes"
    127.0.0.1:6380> clear
    
Observation:

      1. The redis failover occurred for ~40 sec.
      2. There are around 20 documents failed on FX and OCR level. Due to inability to write/read the files to Redis.
      3. This has been happened when ~50% of RAM got utilized.
      4. The Master-slave configuration has be reshuffled as below after this failover.
      5. Below are the few highlights of the redis log, plese refer that attached log for more detail.
      6. I have logs for this, for more details: 30Per_AofRW_2.zip

Redis2 Master log:

    5306:M 03 Apr 09:02:36.947 * Background saving terminated with success
    5306:M 03 Apr 09:02:49.574 * Starting automatic rewriting of AOF on 3% growth
    5306:M 03 Apr 09:02:49.583 * Background append only file rewriting started by pid 12864
    5306:M 03 Apr 09:02:54.050 * AOF rewrite child asks to stop sending diffs.
    12864:C 03 Apr 09:02:54.051 * Parent agreed to stop sending diffs. Finalizing AOF...
    12864:C 03 Apr 09:02:54.051 * Concatenating 0.00 MB of AOF diff received from parent.
    12864:C 03 Apr 09:02:54.051 * SYNC append only file rewrite performed
    12864:C 03 Apr 09:02:54.058 * AOF rewrite: 2 MB of memory used by copy-on-write
    5306:M 03 Apr 09:02:54.098 * Background AOF rewrite terminated with success
    5306:M 03 Apr 09:02:54.098 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
    5306:M 03 Apr 09:02:54.098 * Background AOF rewrite finished successfully
    5306:M 03 Apr 09:04:01.843 * Starting automatic rewriting of AOF on 3% growth
    5306:M 03 Apr 09:04:01.853 * Background append only file rewriting started by pid 12867
    5306:M 03 Apr 09:04:11.657 * AOF rewrite child asks to stop sending diffs.
    12867:C 03 Apr 09:04:11.657 * Parent agreed to stop sending diffs. Finalizing AOF...
    12867:C 03 Apr 09:04:11.657 * Concatenating 0.00 MB of AOF diff received from parent.
    12867:C 03 Apr 09:04:11.657 * SYNC append only file rewrite performed
    12867:C 03 Apr 09:04:11.664 * AOF rewrite: 2 MB of memory used by copy-on-write
    5306:M 03 Apr 09:04:11.675 * Background AOF rewrite terminated with success
    5306:M 03 Apr 09:04:11.675 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
    5306:M 03 Apr 09:04:11.675 * Background AOF rewrite finished successfully
    5306:M 03 Apr 09:04:48.054 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
    5306:M 03 Apr 09:05:28.571 * Starting automatic rewriting of AOF on 3% growth
    5306:M 03 Apr 09:05:28.581 * Background append only file rewriting started by pid 12873
    5306:M 03 Apr 09:05:33.300 * AOF rewrite child asks to stop sending diffs.
    12873:C 03 Apr 09:05:33.300 * Parent agreed to stop sending diffs. Finalizing AOF...
    12873:C 03 Apr 09:05:33.300 * Concatenating 2.09 MB of AOF diff received from parent.
    12873:C 03 Apr 09:05:33.329 * SYNC append only file rewrite performed
    12873:C 03 Apr 09:05:33.336 * AOF rewrite: 11 MB of memory used by copy-on-write
    5306:M 03 Apr 09:05:33.395 * Background AOF rewrite terminated with success
    5306:M 03 Apr 09:05:33.395 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
    5306:M 03 Apr 09:05:33.395 * Background AOF rewrite finished successfully
    5306:M 03 Apr 09:07:37.082 * 10 changes in 300 seconds. Saving...
    5306:M 03 Apr 09:07:37.092 * Background saving started by pid 12875
    12875:C 03 Apr 09:07:47.016 * DB saved on disk
    12875:C 03 Apr 09:07:47.024 * RDB: 5 MB of memory used by copy-on-write
    5306:M 03 Apr 09:07:47.113 * Background saving terminated with success
    5306:M 03 Apr 09:07:51.622 * Starting automatic rewriting of AOF on 3% growth
    5306:M 03 Apr 09:07:51.632 * Background append only file rewriting started by pid 12876
    5306:M 03 Apr 09:07:56.559 * AOF rewrite child asks to stop sending diffs.
    12876:C 03 Apr 09:07:56.559 * Parent agreed to stop sending diffs. Finalizing AOF...
    12876:C 03 Apr 09:07:56.559 * Concatenating 0.00 MB of AOF diff received from parent.
    12876:C 03 Apr 09:07:56.559 * SYNC append only file rewrite performed
    12876:C 03 Apr 09:07:56.567 * AOF rewrite: 2 MB of memory used by copy-on-write
    5306:M 03 Apr 09:07:56.645 * Background AOF rewrite terminated with success
    5306:M 03 Apr 09:07:56.645 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
    5306:M 03 Apr 09:07:56.645 * Background AOF rewrite finished successfully
    5306:M 03 Apr 09:12:48.071 * 10 changes in 300 seconds. Saving...
    5306:M 03 Apr 09:12:48.081 * Background saving started by pid 12882
    12882:C 03 Apr 09:12:58.381 * DB saved on disk
    12882:C 03 Apr 09:12:58.389 * RDB: 5 MB of memory used by copy-on-write
    5306:M 03 Apr 09:12:58.403 * Background saving terminated with success
    5306:M 03 Apr 10:17:33.005 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
    5306:M 03 Apr 10:22:42.042 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
    5306:M 03 Apr 10:27:51.039 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
    5306:M 03 Apr 11:10:10.606 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
    5306:M 03 Apr 11:10:10.607 # Cluster state changed: fail
    5306:M 03 Apr 11:10:10.608 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about ac630108d1556786a4df74945cfe35db981d15fa
    5306:M 03 Apr 11:10:11.594 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 7

Redis1 Master log:
    
    2515:C 05 May 11:06:30.343 * DB saved on disk
    2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
    837:S 05 May 11:06:30.429 * Background saving terminated with success
    837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
    837:S 05 May 11:11:31.067 * Background saving started by pid 2534
    837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
    837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
    2534:C 05 May 11:12:31.110 * DB saved on disk

Redis2 Master log:

    837:M 05 May 10:30:22.216 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
    837:M 05 May 10:30:22.216 # Cluster state changed: fail
    837:M 05 May 10:30:23.148 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
    837:M 05 May 10:30:23.188 # Cluster state changed: ok
    837:M 05 May 10:30:27.227 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
    837:M 05 May 10:35:22.017 * 10 changes in 300 seconds. Saving...
    .
    .
    .
    837:M 05 May 11:12:23.592 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
    837:M 05 May 11:12:27.045 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
    
Redis3 Master Log:
    
    833:M 05 May 10:30:22.217 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba
    833:M 05 May 10:30:22.217 # Cluster state changed: fail
    833:M 05 May 10:30:23.149 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
    833:M 05 May 10:30:23.189 # Cluster state changed: ok
    1822:C 05 May 10:30:27.397 * DB saved on disk
    1822:C 05 May 10:30:27.428 * RDB: 8 MB of memory used by copy-on-write
    833:M 05 May 10:30:27.528 * Background saving terminated with success
    833:M 05 May 10:30:27.828 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
    
    MASTER SLAVE INFO
    hashCode master slave hashSlot
    81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
    6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
    83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
    6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
    81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
    ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
    ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
    83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
    a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

Test-2:

    Run the test with Redis Setting: 

    "appendfsync"="no" 
    "appendonly"="yes"


Observation:

    1. The redis failover occurred for ~40 sec.
    2. There are around 20 documents failed on FX and OCR level. Due to inability to write/read the files to Redis.
    3. This has been happened when ~50% of RAM got utilized.
    4. The Master-slave configuration has be reshuffled as below after this failover.

Redis1 Master log:

    2515:C 05 May 11:06:30.343 * DB saved on disk
    2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
    837:S 05 May 11:06:30.429 * Background saving terminated with success
    837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
    837:S 05 May 11:11:31.067 * Background saving started by pid 2534
    837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
    837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
    2534:C 05 May 11:12:31.110 * DB saved on disk


MASTER SLAVE INFO

    hashCode master slave hashSlot
    81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
    6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
    83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
    6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
    81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
    ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
    ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
    83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
    a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

hva...@gmail.com

unread,
Dec 18, 2017, 2:53:06 AM12/18/17
to Redis DB
The first error you noticed asks if the disk that holds your AOF file is busy.  You posted a lot of detail, but none of it seems to answer that question.  Is that disk busy at those times?  What do your performance graphs for that disk show for how busy it is during the times that the Redis log shows the trouble?

The Redis log you posted has entries for both an AOF file and an RDB snapshot file.  Is this intentional, that you're using both methods of persistence instead of just one?  Are both saving to the same disk?  If they are saving to the same disk, have you explored the question of whether the RDB dump to disk is triggering these error messages in the log file?
Reply all
Reply to author
Forward
0 new messages