Redis Error: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis

2,211 views

Skip to first unread message

ravi beli

unread,

Dec 15, 2017, 2:25:33 AM12/15/17

to Redis DB

I have run below Test-1 and Test-2 for longer run for performance test with redis configuration values specified, still we see the highlighted error-1 & 2 message and cluster is failing for some time, few of our processing failing. How to solve this problem.

please anyone have suggestion to avoid cluster fail which is goes longer than 10seconds, cluster is not coming up within 3 retry attempts (spring retry template we are using for retry mechanism try count is set to 3, and retry after 5sec, its exponential way next attempts) using Jedis client.

Error-1: Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

Error-2: Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).

Test-1:

Run the test with Redis Setting:

"appendfsync"="yes"

"appendonly"="no"

[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6379

127.0.0.1:6379> CONFIG GET aof

1) "auto-aof-rewrite-percentage"

2) "30"

3) "auto-aof-rewrite-min-size"

4) "67108864"

5) "aof-rewrite-incremental-fsync"

6) "yes"

7) "aof-load-truncated"

8) "yes"

127.0.0.1:6379> exit

[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6380

127.0.0.1:6380> CONFIG GET **aof***

1) "auto-aof-rewrite-percentage"

2) "30"

3) "auto-aof-rewrite-min-size"

4) "67108864"

5) "aof-rewrite-incremental-fsync"

6) "yes"

7) "aof-load-truncated"

8) "yes"

127.0.0.1:6380> clear

Observation:

1. The redis failover occurred for ~40 sec.

2. There are around 20 documents failed on FX and OCR level. Due to inability to write/read the files to Redis.

3. This has been happened when ~50% of RAM got utilized.

4. The Master-slave configuration has be reshuffled as below after this failover.

5. Below are the few highlights of the redis log, plese refer that attached log for more detail.

6. I have logs for this, for more details: 30Per_AofRW_2.zip

Redis2 Master log:

5306:M 03 Apr 09:02:36.947 * Background saving terminated with success

5306:M 03 Apr 09:02:49.574 * Starting automatic rewriting of AOF on 3% growth

5306:M 03 Apr 09:02:49.583 * Background append only file rewriting started by pid 12864

5306:M 03 Apr 09:02:54.050 * AOF rewrite child asks to stop sending diffs.

12864:C 03 Apr 09:02:54.051 * Parent agreed to stop sending diffs. Finalizing AOF...

12864:C 03 Apr 09:02:54.051 * Concatenating 0.00 MB of AOF diff received from parent.

12864:C 03 Apr 09:02:54.051 * SYNC append only file rewrite performed

12864:C 03 Apr 09:02:54.058 * AOF rewrite: 2 MB of memory used by copy-on-write

5306:M 03 Apr 09:02:54.098 * Background AOF rewrite terminated with success

5306:M 03 Apr 09:02:54.098 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)

5306:M 03 Apr 09:02:54.098 * Background AOF rewrite finished successfully

5306:M 03 Apr 09:04:01.843 * Starting automatic rewriting of AOF on 3% growth

5306:M 03 Apr 09:04:01.853 * Background append only file rewriting started by pid 12867

5306:M 03 Apr 09:04:11.657 * AOF rewrite child asks to stop sending diffs.

12867:C 03 Apr 09:04:11.657 * Parent agreed to stop sending diffs. Finalizing AOF...

12867:C 03 Apr 09:04:11.657 * Concatenating 0.00 MB of AOF diff received from parent.

12867:C 03 Apr 09:04:11.657 * SYNC append only file rewrite performed

12867:C 03 Apr 09:04:11.664 * AOF rewrite: 2 MB of memory used by copy-on-write

5306:M 03 Apr 09:04:11.675 * Background AOF rewrite terminated with success

5306:M 03 Apr 09:04:11.675 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)

5306:M 03 Apr 09:04:11.675 * Background AOF rewrite finished successfully

5306:M 03 Apr 09:04:48.054 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

5306:M 03 Apr 09:05:28.571 * Starting automatic rewriting of AOF on 3% growth

5306:M 03 Apr 09:05:28.581 * Background append only file rewriting started by pid 12873

5306:M 03 Apr 09:05:33.300 * AOF rewrite child asks to stop sending diffs.

12873:C 03 Apr 09:05:33.300 * Parent agreed to stop sending diffs. Finalizing AOF...

12873:C 03 Apr 09:05:33.300 * Concatenating 2.09 MB of AOF diff received from parent.

12873:C 03 Apr 09:05:33.329 * SYNC append only file rewrite performed

12873:C 03 Apr 09:05:33.336 * AOF rewrite: 11 MB of memory used by copy-on-write

5306:M 03 Apr 09:05:33.395 * Background AOF rewrite terminated with success

5306:M 03 Apr 09:05:33.395 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)

5306:M 03 Apr 09:05:33.395 * Background AOF rewrite finished successfully

5306:M 03 Apr 09:07:37.082 * 10 changes in 300 seconds. Saving...

5306:M 03 Apr 09:07:37.092 * Background saving started by pid 12875

12875:C 03 Apr 09:07:47.016 * DB saved on disk

12875:C 03 Apr 09:07:47.024 * RDB: 5 MB of memory used by copy-on-write

5306:M 03 Apr 09:07:47.113 * Background saving terminated with success

5306:M 03 Apr 09:07:51.622 * Starting automatic rewriting of AOF on 3% growth

5306:M 03 Apr 09:07:51.632 * Background append only file rewriting started by pid 12876

5306:M 03 Apr 09:07:56.559 * AOF rewrite child asks to stop sending diffs.

12876:C 03 Apr 09:07:56.559 * Parent agreed to stop sending diffs. Finalizing AOF...

12876:C 03 Apr 09:07:56.559 * Concatenating 0.00 MB of AOF diff received from parent.

12876:C 03 Apr 09:07:56.559 * SYNC append only file rewrite performed

12876:C 03 Apr 09:07:56.567 * AOF rewrite: 2 MB of memory used by copy-on-write

5306:M 03 Apr 09:07:56.645 * Background AOF rewrite terminated with success

5306:M 03 Apr 09:07:56.645 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)

5306:M 03 Apr 09:07:56.645 * Background AOF rewrite finished successfully

5306:M 03 Apr 09:12:48.071 * 10 changes in 300 seconds. Saving...

5306:M 03 Apr 09:12:48.081 * Background saving started by pid 12882

12882:C 03 Apr 09:12:58.381 * DB saved on disk

12882:C 03 Apr 09:12:58.389 * RDB: 5 MB of memory used by copy-on-write

5306:M 03 Apr 09:12:58.403 * Background saving terminated with success

5306:M 03 Apr 10:17:33.005 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

5306:M 03 Apr 10:22:42.042 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

5306:M 03 Apr 10:27:51.039 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

5306:M 03 Apr 11:10:10.606 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).

5306:M 03 Apr 11:10:10.607 # Cluster state changed: fail

5306:M 03 Apr 11:10:10.608 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about ac630108d1556786a4df74945cfe35db981d15fa

5306:M 03 Apr 11:10:11.594 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 7

Redis1 Master log:

2515:C 05 May 11:06:30.343 * DB saved on disk

2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write

837:S 05 May 11:06:30.429 * Background saving terminated with success

837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...

837:S 05 May 11:11:31.067 * Background saving started by pid 2534

837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61

837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.

2534:C 05 May 11:12:31.110 * DB saved on disk

Redis2 Master log:

837:M 05 May 10:30:22.216 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).

837:M 05 May 10:30:22.216 # Cluster state changed: fail

837:M 05 May 10:30:23.148 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12

837:M 05 May 10:30:23.188 # Cluster state changed: ok

837:M 05 May 10:30:27.227 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.

837:M 05 May 10:35:22.017 * 10 changes in 300 seconds. Saving...

837:M 05 May 11:12:23.592 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61

837:M 05 May 11:12:27.045 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.

Redis3 Master Log:

833:M 05 May 10:30:22.217 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba

833:M 05 May 10:30:22.217 # Cluster state changed: fail

833:M 05 May 10:30:23.149 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12

833:M 05 May 10:30:23.189 # Cluster state changed: ok

1822:C 05 May 10:30:27.397 * DB saved on disk

1822:C 05 May 10:30:27.428 * RDB: 8 MB of memory used by copy-on-write

833:M 05 May 10:30:27.528 * Background saving terminated with success

833:M 05 May 10:30:27.828 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.

MASTER SLAVE INFO

hashCode master slave hashSlot

81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383

6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460

83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922

6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460

81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383

ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected

ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected

83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922

a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

Test-2:

Run the test with Redis Setting:

"appendfsync"="no"

"appendonly"="yes"

Observation:

1. The redis failover occurred for ~40 sec.

2. There are around 20 documents failed on FX and OCR level. Due to inability to write/read the files to Redis.

3. This has been happened when ~50% of RAM got utilized.

4. The Master-slave configuration has be reshuffled as below after this failover.

Redis1 Master log:

2515:C 05 May 11:06:30.343 * DB saved on disk

2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write

837:S 05 May 11:06:30.429 * Background saving terminated with success

837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...

837:S 05 May 11:11:31.067 * Background saving started by pid 2534

837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61

837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.

2534:C 05 May 11:12:31.110 * DB saved on disk

MASTER SLAVE INFO

hashCode master slave hashSlot

81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383

6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460

83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922

6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460

81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383

ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected

ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected

83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922

a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

hva...@gmail.com

unread,

Dec 18, 2017, 2:53:06 AM12/18/17

to Redis DB

The first error you noticed asks if the disk that holds your AOF file is busy. You posted a lot of detail, but none of it seems to answer that question. Is that disk busy at those times? What do your performance graphs for that disk show for how busy it is during the times that the Redis log shows the trouble?

The Redis log you posted has entries for both an AOF file and an RDB snapshot file. Is this intentional, that you're using both methods of persistence instead of just one? Are both saving to the same disk? If they are saving to the same disk, have you explored the question of whether the RDB dump to disk is triggering these error messages in the log file?

Reply all

Reply to author

Forward

0 new messages