Hi Good People,
We have been using Redis-Sentinel cluster for almost 2 years. We have 14 clients (sentinel running in each client) and configurations of sentinel is like below:
bind 0.0.0.0
port 26379
sentinel myid ***************************************
sentinel monitor redis-cluster IP Port 11
sentinel down-after-milliseconds redis-cluster 10000
sentinel failover-timeout redis-cluster 10000
sentinel auth-pass redis-cluster *******************************************
logfile "/path/to/redis-sentinel.log"
# Generated by CONFIG REWRITE
daemonize yes
pidfile "/var/run/redis/sentinel.pid"
dir "/"
sentinel config-epoch redis-cluster 635
sentinel leader-epoch redis-cluster 635
Redis has AOF enabled (persec). It is being used for key-value database and also we are maintaining a queue through this. For last 1 month we are facing master switch almost everyday. Initially we had seen below error:
12646:M 04 Jul 01:21:24.082 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:21:31.100 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:21:40.046 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:21:48.095 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:21:57.039 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:22:07.094 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:22:52.005 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:24:42.066 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:24:57.047 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:25:39.076 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
12646:M 04 Jul 01:25:57.022 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Our Redis server was in our private data center (VM) & was sharing disk with other servers. After observing those, we migrated both master & slave in a separate LUN where no other VM is mounted. But still we were facing those. We checked network, memory, disk & everything. Disk is HP 3PAR with SSD and disk performance is attached.
Current configuration of each server is:
CPU: 12 vCPU
Memory: 72GB
Disk: 300GB (SSD)
Redis Max_memory: 20GB
Redis over-commit: False
THP: Enabled
vNIC capacity: 10G
We have not enabled over-commit has the virtual machine has sufficient memory. The latest logs before last switches are something like below:
Master Log:
6308:C 19 Jul 22:22:00.040 * RDB: 1153 MB of memory used by copy-on-write
28654:M 19 Jul 22:22:00.857 * Background saving terminated with success
28654:M 19 Jul 22:23:01.055 * 10000 changes in 60 seconds. Saving...
28654:M 19 Jul 22:23:01.740 * Background saving started by pid 6760
28654:M 19 Jul 22:23:38.081 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
6760:C 19 Jul 22:25:59.717 * DB saved on disk
6760:C 19 Jul 22:26:00.205 * RDB: 1159 MB of memory used by copy-on-write
28654:M 19 Jul 22:26:01.017 * Background saving terminated with success
28654:M 19 Jul 22:27:02.016 * 10000 changes in 60 seconds. Saving...
28654:M 19 Jul 22:27:02.679 * Background saving started by pid 7270
28654:S 19 Jul 22:27:12.875 # Connection with slave IP:6379 lost.
28654:S 19 Jul 22:27:12.998 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
28654:S 19 Jul 22:27:12.998 * SLAVE OF IP:6379 enabled (user request from 'id=75176616 addr=IP:23816 fd=626 name=sentinel-5c7721e9-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28654:S 19 Jul 22:27:13.000 # CONFIG REWRITE executed with success.
28654:S 19 Jul 22:27:13.482 * Connecting to MASTER IP:6379
28654:S 19 Jul 22:27:13.482 * MASTER <-> SLAVE sync started
28654:S 19 Jul 22:27:13.482 * Non blocking connect for SYNC fired the event.
28654:S 19 Jul 22:27:13.482 * Master replied to PING, replication can continue...
28654:S 19 Jul 22:27:13.483 * Trying a partial resynchronization (request ****************).
28654:S 19 Jul 22:27:38.429 * Full resync from master: ********************
28654:S 19 Jul 22:27:38.429 * Discarding previously cached master state.
7270:C 19 Jul 22:29:51.022 * DB saved on disk
7270:C 19 Jul 22:29:51.477 * RDB: 3642 MB of memory used by copy-on-write
28654:S 19 Jul 22:29:52.462 * Background saving terminated with success
Slave Log:
17194:C 19 Jul 22:23:49.183 * RDB: 1339 MB of memory used by copy-on-write
11795:S 19 Jul 22:23:49.942 * Background saving terminated with success
11795:S 19 Jul 22:24:50.093 * 10000 changes in 60 seconds. Saving...
11795:S 19 Jul 22:24:50.651 * Background saving started by pid 17634
11795:M 19 Jul 22:26:16.869 # Setting secondary replication ID to ce5d1d734a007a2b96183d48ede899c174e83f1b, valid up to offset: 162712239579. New replication ID is 15793e555c6080779d18e1c8f876bf0c1b2d5bfd
11795:M 19 Jul 22:26:16.869 # Connection with master lost.
11795:M 19 Jul 22:26:16.869 * Caching the disconnected master state.
11795:M 19 Jul 22:26:16.869 * Discarding previously cached master state.
11795:M 19 Jul 22:26:16.869 * MASTER MODE enabled (user request from 'id=1432442 addr=IP:38994 fd=28 name=sentinel-e1402af9-cmd age=69213 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
11795:M 19 Jul 22:26:16.870 # CONFIG REWRITE executed with success.
11795:M 19 Jul 22:26:16.923 * Replication backlog freed after 3600 seconds without connected slaves.
11795:M 19 Jul 22:27:13.483 * Slave IP:6379 asks for synchronization
11795:M 19 Jul 22:27:13.483 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'ce5d1d734a007a2b96183d48ede899c174e83f1b', my replication IDs are '68b45810acaf15a79c8d4bd6ee98d2759f1bb74b' and '0000000000000000000000000000000000000000')
11795:M 19 Jul 22:27:13.483 * Can't attach the slave to the current BGSAVE. Waiting for next BGSAVE for SYNC
17634:C 19 Jul 22:27:36.683 * DB saved on disk
17634:C 19 Jul 22:27:37.116 * RDB: 1361 MB of memory used by copy-on-write
11795:M 19 Jul 22:27:37.840 * Background saving terminated with success
11795:M 19 Jul 22:27:37.840 * Starting BGSAVE for SYNC with target: disk
11795:M 19 Jul 22:27:38.428 * Background saving started by pid 18056
11795:M 19 Jul 22:28:11.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
11795:M 19 Jul 22:29:21.001 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
18056:C 19 Jul 22:30:27.528 * DB saved on disk
18056:C 19 Jul 22:30:27.958 * RDB: 1350 MB of memory used by copy-on-write
11795:M 19 Jul 22:30:28.729 * Background saving terminated with success
11795:M 19 Jul 22:31:29.034 * 10000 changes in 60 seconds. Saving...
11795:M 19 Jul 22:31:29.619 * Background saving started by pid 18583
18583:C 19 Jul 22:34:16.692 * DB saved on disk
18583:C 19 Jul 22:34:17.160 * RDB: 1283 MB of memory used by copy-on-write
11795:M 19 Jul 22:34:17.883 * Background saving terminated with success
11795:M 19 Jul 22:35:18.074 * 10000 changes in 60 seconds. Saving...
11795:M 19 Jul 22:35:18.657 * Background saving started by pid 19043
19043:C 19 Jul 22:38:04.105 * DB saved on disk
19043:C 19 Jul 22:38:04.546 * RDB: 1305 MB of memory used by copy-on-write
11795:M 19 Jul 22:38:05.276 * Background saving terminated with success
11795:M 19 Jul 22:39:06.056 * 10000 changes in 60 seconds. Saving...
11795:M 19 Jul 22:39:06.642 * Background saving started by pid 19501
19501:C 19 Jul 22:41:52.293 * DB saved on disk
19501:C 19 Jul 22:41:52.735 * RDB: 2424 MB of memory used by copy-on-write
11795:M 19 Jul 22:41:53.639 * Background saving terminated with success
11795:M 19 Jul 22:42:54.074 * 10000 changes in 60 seconds. Saving...
11795:M 19 Jul 22:42:54.656 * Background saving started by pid 19983
19983:C 19 Jul 22:45:37.980 * DB saved on disk
Redis-info is also attached. CPU, Memory utilization is less than 50% during incidents. Can anyone help me to figure out what to do?