Redis switches intermittently (Connection with master lost)

divay nagpal

May 20, 2016, 11:00:04 AM
to Redis DB
Hi All,

I am sorry if this duplicates another issue, but I couldn't find a solution anywhere, so I am posting.
Issue: We have 4 DB boxes running 5 Redis master/slave pairs (1 master, 1 slave each): 2 masters on box1, 3 masters on box2, 2 slaves on box3, and 3 slaves on box4.
We also have 5 Sentinels distributed across these machines: 2 on one machine and 1 on each of the remaining 3.

The problem is that the Redis slave intermittently loses its connection to the master, and vice versa (error: Connection with master lost). There is no discernible pattern. I don't understand why Redis loses the connection and then reconnects some time later. The boxes have enough free space for our Redis instances, so space is not the constraint. Below are the Redis and Sentinel configs for one of the instances.

Sentinel Config.

sentinel monitor name <ip> <port> 2
# Other important configuration, but not modified here.
sentinel auth-pass name <auth key>
sentinel down-after-milliseconds name 5000
sentinel failover-timeout name 60000
sentinel parallel-syncs name 1
# END OF CONFIGURATION

Redis config

###############################################################################
#########PLEASE BE SURE ABOUT THE PARAMETERS BEFORE EDITING THIS FILE##########
################################ GENERAL  #####################################
daemonize yes
pidfile "<pid file location>.pid"
port <port>
tcp-backlog 511
timeout 0
tcp-keepalive 60
loglevel notice
logfile "<log file location>.log"

################################ SNAPSHOTTING  ################################
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "<data dir location>"

################################# REPLICATION #################################
slave-serve-stale-data yes
slave-read-only no
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay yes
repl-backlog-size 128mb
repl-backlog-ttl 3600
slave-priority 100
min-slaves-to-write 0
min-slaves-max-lag 10

################################## SECURITY ###################################
masterauth "<auth key>"
requirepass "<auth key>"

################################### LIMITS ####################################
maxmemory 12gb
maxmemory-policy allkeys-lru
maxmemory-samples 5

############################## APPEND ONLY MODE ###############################
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes

################################ LUA SCRIPTING  ###############################
lua-time-limit 5000

################################## SLOW LOG ###################################
slowlog-log-slower-than 10000
slowlog-max-len 128

################################ LATENCY MONITOR ##############################
latency-monitor-threshold 0

############################# EVENT NOTIFICATION ##############################
notify-keyspace-events ""

############################### ADVANCED CONFIG ###############################
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
# Generated by CONFIG REWRITE

maxclients 4064

Please help if you see anything missing in the config, or if anyone else has faced the same issue and has a solution.

Thanks,

José Antonio Quiles Follana

May 23, 2016, 1:51:44 AM
to Redis DB
What is the master database size?
Could you show master and slave logs?
If the database is too big, the master's buffer sometimes fills up while it is synchronizing...
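One possible reading of "master buffer" here is the slave output buffer governed by `client-output-buffer-limit slave 256mb 64mb 60` in the config posted above. The Python sketch below is a simplified model of how that hard/soft limit is commonly described, not Redis's actual code; `disconnect_time` and the sample data are hypothetical.

```python
MB = 1024 * 1024

def disconnect_time(samples, hard_bytes, soft_bytes, soft_seconds):
    """Return the time at which the slave would be dropped, or None.

    samples is a list of (time_in_seconds, buffer_bytes) in time order.
    Simplified model: drop at once when the hard limit is reached, or when
    the buffer stays at or above the soft limit for more than soft_seconds.
    A limit of 0 means "disabled".
    """
    soft_since = None  # when the buffer first reached the soft limit
    for now, used in samples:
        if hard_bytes and used >= hard_bytes:
            return now
        if soft_bytes and used >= soft_bytes:
            if soft_since is None:
                soft_since = now
            elif now - soft_since > soft_seconds:
                return now
        else:
            soft_since = None  # dropped below the soft limit, reset the timer
    return None

# Hypothetical buffer sizes sampled during a slow full resync.
slow_sync = [(0, 10 * MB), (10, 70 * MB), (40, 80 * MB), (71, 90 * MB)]
print(disconnect_time(slow_sync, 256 * MB, 64 * MB, 60))
```

With a database of only a few megabytes, neither limit would come close to being reached, so this mechanism would not explain the disconnects in that case.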

David Geller

Oct 24, 2016, 2:30:35 PM
to Redis DB
I'm having this same problem intermittently.  The two slaves lost connection to the one master.  The size of the database is less than 4MB.  I've seen at least one other complaint on this out on the Internets but no replies.

Anyone?

I'd be happy to supply more info.

Tuco

Oct 24, 2016, 10:27:36 PM
to Redis DB
Can you post the logs for both master and slave from when the error occurred?

David Geller

Oct 25, 2016, 10:40:38 AM
to Redis DB


On the master.  There is that one AOF message, but that usually doesn't cause a failure.  Before that, the last message was the same, but it was at 03:14.  At 04:39:46, system.io.w_await was averaging 3.91ms, with an actual value of 0.81ms at 04:39:40.  My redis data volume is RAID0 on AWS EBS storage with 2500 IOPS subscribed.  There's nothing going on on those disks at that time, and the system I/O latency is being affected by a different volume altogether.  I'm not trying to be proactively defensive about it :-)  I've just been trying to get rid of those messages, without success!  (And they don't seem to bother redis too much in practice, unless this is the problem here.)

16327:M 23 Oct 04:39:46.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
16327:M 23 Oct 04:39:53.845 # Connection with slave 10.10.21.109:6379 lost.
16327:M 23 Oct 04:39:53.845 # Connection with slave 10.10.30.129:6379 lost.
16327:M 23 Oct 04:39:53.869 # CONFIG REWRITE executed with success.
16327:M 23 Oct 04:39:53.943 * 1 changes in 10800 seconds. Saving...
16327:M 23 Oct 04:39:53.944 * Background saving started by pid 29650
29650:C 23 Oct 04:39:54.008 * DB saved on disk
29650:C 23 Oct 04:39:54.008 * RDB: 0 MB of memory used by copy-on-write
16327:M 23 Oct 04:39:54.044 * Background saving terminated with success
16327:S 23 Oct 04:40:03.869 * SLAVE OF 10.10.21.109:6379 enabled (user request from 'id=432 addr=10.10.20.161:40138 fd=72 name=sentinel-ba425363-cmd age=17 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')
16327:S 23 Oct 04:40:03.870 # CONFIG REWRITE executed with success.
16327:S 23 Oct 04:40:04.356 * Connecting to MASTER 10.10.21.109:6379
16327:S 23 Oct 04:40:04.375 * MASTER <-> SLAVE sync started
16327:S 23 Oct 04:40:04.375 * Non blocking connect for SYNC fired the event.
16327:S 23 Oct 04:40:04.376 * Master replied to PING, replication can continue...
16327:S 23 Oct 04:40:04.376 * Partial resynchronization not possible (no cached master)
16327:S 23 Oct 04:40:04.377 * Full resync from master: 83703af33e7f4783591e2fa7e6ccd950db5cdd10:197921726
16327:S 23 Oct 04:40:04.404 * MASTER <-> SLAVE sync: receiving 245028 bytes from master
16327:S 23 Oct 04:40:04.406 * MASTER <-> SLAVE sync: Flushing old data
16327:S 23 Oct 04:40:04.407 * MASTER <-> SLAVE sync: Loading DB in memory
16327:S 23 Oct 04:40:04.410 * MASTER <-> SLAVE sync: Finished with success
16327:S 23 Oct 04:40:04.425 * Background append only file rewriting started by pid 29671
16327:S 23 Oct 04:40:04.464 * AOF rewrite child asks to stop sending diffs.
29671:C 23 Oct 04:40:04.464 * Parent agreed to stop sending diffs. Finalizing AOF...
29671:C 23 Oct 04:40:04.464 * Concatenating 0.00 MB of AOF diff received from parent.
29671:C 23 Oct 04:40:04.464 * SYNC append only file rewrite performed
29671:C 23 Oct 04:40:04.464 * AOF rewrite: 0 MB of memory used by copy-on-write
16327:S 23 Oct 04:40:04.475 * Background AOF rewrite terminated with success
16327:S 23 Oct 04:40:04.475 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
16327:S 23 Oct 04:40:04.475 * Background AOF rewrite finished successfully


On the slave:

17755:M 23 Oct 04:39:52.174 # Connection with master lost.
17755:M 23 Oct 04:39:52.174 * Caching the disconnected master state.
17755:M 23 Oct 04:39:52.174 * Discarding previously cached master state.
17755:M 23 Oct 04:39:52.184 * MASTER MODE enabled (user request from 'id=12 addr=10.10.30.129:39611 fd=13 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')
17755:M 23 Oct 04:39:52.195 # CONFIG REWRITE executed with success.
17755:M 23 Oct 04:39:52.655 # CONFIG REWRITE executed with success.
17755:M 23 Oct 04:39:53.337 # CONFIG REWRITE executed with success.
17755:M 23 Oct 04:39:53.588 * Slave 10.10.30.129:6379 asks for synchronization
17755:M 23 Oct 04:39:53.588 * Full resync requested by slave 10.10.30.129:6379
17755:M 23 Oct 04:39:53.588 * Starting BGSAVE for SYNC with target: disk
17755:M 23 Oct 04:39:53.588 * Background saving started by pid 30298
30298:C 23 Oct 04:39:53.598 * DB saved on disk
30298:C 23 Oct 04:39:53.598 * RDB: 0 MB of memory used by copy-on-write
17755:M 23 Oct 04:39:53.686 * Background saving terminated with success
17755:M 23 Oct 04:39:53.689 * Synchronization with slave 10.10.30.129:6379 succeeded
17755:M 23 Oct 04:40:04.377 * Slave 10.10.10.227:6379 asks for synchronization
17755:M 23 Oct 04:40:04.377 * Full resync requested by slave 10.10.10.227:6379
17755:M 23 Oct 04:40:04.377 * Starting BGSAVE for SYNC with target: disk
17755:M 23 Oct 04:40:04.377 * Background saving started by pid 30324
30324:C 23 Oct 04:40:04.384 * DB saved on disk
30324:C 23 Oct 04:40:04.385 * RDB: 0 MB of memory used by copy-on-write
17755:M 23 Oct 04:40:04.404 * Background saving terminated with success
17755:M 23 Oct 04:40:04.405 * Synchronization with slave 10.10.10.227:6379 succeeded

On the other slave:

16335:S 23 Oct 04:39:53.017 # Connection with master lost.
16335:S 23 Oct 04:39:53.018 * Caching the disconnected master state.
16335:S 23 Oct 04:39:53.018 * Discarding previously cached master state.
16335:S 23 Oct 04:39:53.023 * SLAVE OF 10.10.21.109:6379 enabled (user request from 'id=5 addr=10.10.30.129:36444 fd=10 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=14 qbuf-free=32754 obl=36 oll=0 omem=0 events=rw cmd=exec')
16335:S 23 Oct 04:39:53.049 # CONFIG REWRITE executed with success.
16335:S 23 Oct 04:39:53.311 # CONFIG REWRITE executed with success.
16335:S 23 Oct 04:39:53.584 * Connecting to MASTER 10.10.21.109:6379
16335:S 23 Oct 04:39:53.584 * MASTER <-> SLAVE sync started
16335:S 23 Oct 04:39:53.586 * Non blocking connect for SYNC fired the event.
16335:S 23 Oct 04:39:53.587 * Master replied to PING, replication can continue...
16335:S 23 Oct 04:39:53.590 * Partial resynchronization not possible (no cached master)
16335:S 23 Oct 04:39:53.591 * Full resync from master: 83703af33e7f4783591e2fa7e6ccd950db5cdd10:197918675
16335:S 23 Oct 04:39:53.690 * MASTER <-> SLAVE sync: receiving 245124 bytes from master
16335:S 23 Oct 04:39:53.694 * MASTER <-> SLAVE sync: Flushing old data
16335:S 23 Oct 04:39:53.696 * MASTER <-> SLAVE sync: Loading DB in memory

David Geller

Oct 25, 2016, 12:27:37 PM
to Redis DB
redis config

pidfile "/var/run/redis/redis.pid"
port 6379
tcp-backlog 65535
unixsocket "/tmp/redis.sock"
timeout 0
tcp-keepalive 0
loglevel notice
logfile "/var/log/apps/redis.log"
syslog-enabled no
syslog-ident "redis"
syslog-facility local0
databases 16
save 86400 1000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "/apps/redis"
slave-serve-stale-data yes
slave-read-only yes
repl-ping-slave-period 10
repl-timeout 60
repl-disable-tcp-nodelay no
slave-priority 100
maxclients 10000
maxmemory 64012kb
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
notify-keyspace-events ""

hva...@gmail.com

Oct 25, 2016, 1:06:31 PM
to Redis DB

Your master server's log includes this line:


16335:S 23 Oct 04:39:53.023 * SLAVE OF 10.10.21.109:6379 enabled (user request from 'id=5 addr=10.10.30.129:36444 fd=10 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=14 qbuf-free=32754 obl=36 oll=0 omem=0 events=rw cmd=exec')

Note the "name=sentinel" portion, part of the explanation of where the SLAVEOF command came from.  This indicates you're running Sentinel to control your Redis master and slaves.  Sentinel's function is to detect a master that has become unavailable and change one of the slaves to be the new master (and when the old master is available, change it to be a slave of the new master).

So the action of master changing to a different server is what Sentinel is designed to do, and your logs indicate your Sentinel is doing it.  The reasons why Sentinel is changing the master should be found in the Sentinel logs.  That's your best lead to the cause of the problem, and from there, the solution.
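To spot such lines across a large logfile, the "user request from" section can be pulled apart programmatically. This is a minimal Python sketch; `parse_requester` and `issued_by_sentinel` are hypothetical helper names, and treating a connection name that starts with "sentinel-" as proof of Sentinel is an assumption, since any client can set its own name.

```python
import re

# Matches the parenthesised "user request from '...'" part of a Redis log line.
REQUEST_RE = re.compile(r"\(user request from '([^']*)'\)")

def parse_requester(log_line):
    """Return the client-info fields of the connection that issued the command.

    Gives a dict such as {"id": "5", "addr": "10.10.30.129:36444", ...},
    or None when the line has no "user request from" section.
    """
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    fields = {}
    for pair in match.group(1).split():
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields

def issued_by_sentinel(log_line):
    """True when the requesting connection's name starts with "sentinel-"."""
    fields = parse_requester(log_line)
    return bool(fields) and fields.get("name", "").startswith("sentinel-")

# The line quoted above, shortened to the fields used here.
line = ("16335:S 23 Oct 04:39:53.023 * SLAVE OF 10.10.21.109:6379 enabled "
        "(user request from 'id=5 addr=10.10.30.129:36444 fd=10 "
        "name=sentinel-21af3b90-cmd age=494042 idle=0 cmd=exec')")
info = parse_requester(line)
print(info["name"], info["addr"], issued_by_sentinel(line))
```

The `addr` field also shows which Sentinel host sent the command, which tells you whose Sentinel log to read first.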

hva...@gmail.com

Oct 25, 2016, 1:09:06 PM
to Redis DB
My post was intended to be in reply to David Geller instead of the original start of the thread.

David Geller

Oct 25, 2016, 1:09:23 PM
to Redis DB
Yes, thank you for your reply.  I understand that.  Sentinel is working beautifully in this case.  The failover was quick and nearly painless.  However, the problem I'm trying to address is why the slaves and the master lost connectivity in the first place.

hva...@gmail.com

Oct 25, 2016, 1:50:56 PM
to Redis DB

From the IP addresses in the logfile, it looks like your Redis and Sentinel daemons are on separate machines, which is good.  They may even be in different subnets, which is fine.

The best info is still in the Sentinel logs.  Does it say it tried to connect to the master and the connection was refused?  Does it say it connected and sent a command to the master, but didn't get a reply back before the timeout?

There are usually just a few categories of problems that can make a monitoring system like Sentinel, running on a different server, believe the master has become unavailable:
  1. The server becomes busy or frozen so the software running on it can't communicate.  This could happen on Redis servers or Sentinel servers.
  2. The network between the Sentinel servers and the master becomes clogged so the Sentinel commands and Redis replies are delayed or blocked.

Your server performance graphing/monitoring will be able to confirm or deny that your Redis or Sentinel servers are experiencing slowness or brief freeze-ups.  You'll have to add some monitoring to find out if the network between the Sentinels and Redis servers gets clogged or has hiccups.
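As one way to add that monitoring, a small probe run from each Sentinel host could time a PING round trip to the Redis port and record slow or failed attempts. The Python sketch below uses only the standard library and Redis's inline command form; `ping_latency_ms` is a hypothetical helper, and the 5-second default timeout is simply chosen to mirror a down-after-milliseconds of 5000.

```python
import socket
import time

def ping_latency_ms(host, port, timeout=5.0):
    """Time one TCP connect plus inline PING round trip, in milliseconds.

    Returns None if the connect or the reply does not complete within
    `timeout` seconds. Any reply counts: an instance protected by
    requirepass answers with an error, which still shows it is responsive.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            if not sock.recv(64):
                return None  # peer closed the connection without replying
    except OSError:  # covers refused connections and timeouts
        return None
    return (time.monotonic() - start) * 1000.0
```

Calling this once a second and logging every result above, say, a few hundred milliseconds (or None) gives a timeline that can be lined up against the Sentinel +sdown events. It measures the network and the Redis event loop together, so a slow result alone does not say which of the two categories above is at fault.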

David Geller

Oct 25, 2016, 4:42:56 PM
to Redis DB
Ok, you're right, the sentinels decided.  I had to interleave all of the logs to get the timeframe.

So, looking at the combined logs, I get an AOF is taking too long message at 04:39:46 on redis and 5 seconds later (down-after-milliseconds value), the sentinels all go into +sdown and +odown immediately, since they all agree.

However, I can confirm that there is nothing going on, on that box, at the time.  Not CPU, not IO, nothing.  So it appears that the aof fsync is causing the master to show unavailable for greater than 5s.  I can either set the down-after-ms time to 10s to (hopefully) avoid this or I can set fsync to no.  :-/ 
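A quick check of that timing, using two timestamps from the logs in this thread: the master's AOF warning and the first +sdown, reported by the Sentinel at 10.10.20.161. This is only a sketch; the year is assumed because the logs omit it, and `seconds_between` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y %d %b %H:%M:%S.%f"

def seconds_between(first, second, year=2016):
    """Seconds elapsed between two 'DD Mon HH:MM:SS.mmm' log timestamps.

    The year is an assumption: the log lines do not carry one.
    """
    start = datetime.strptime(f"{year} {first}", FMT)
    end = datetime.strptime(f"{year} {second}", FMT)
    return (end - start).total_seconds()

DOWN_AFTER_MS = 5000  # down-after-milliseconds value mentioned in the thread

aof_warning = "23 Oct 04:39:46.012"  # master: "Asynchronous AOF fsync is taking too long"
first_sdown = "23 Oct 04:39:51.260"  # sentinel: "+sdown master rails 10.10.10.227 6379"

gap = seconds_between(aof_warning, first_sdown)
print(f"gap = {gap:.3f}s, threshold = {DOWN_AFTER_MS / 1000:.3f}s")
```

The gap comes out at about 5.25 s, just past the 5000 ms threshold. That is consistent with the master having stopped answering Sentinel around the time of the AOF warning, but it does not by itself prove that fsync was the cause.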

The setup is four servers.  Three run redis and sentinel and the fourth just runs sentinel because I need a sentinel for when I do upgrades (then I have three new servers and kill the three old ones).

Here's the top of the interleaved logs for posterity:


redis16    - 10.10.10.227 -- 16327:M 23 Oct 04:39:46.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:51.260 # +sdown master rails 10.10.10.227 6379
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:51.342 # +sdown master rails 10.10.10.227 6379
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.764 # +sdown master rails 10.10.10.227 6379
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.823 # +odown master rails 10.10.10.227 6379 #quorum 2/2
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.823 # +new-epoch 3
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.823 # +try-failover master rails 10.10.10.227 6379
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.859 # +vote-for-leader 21af3b909123f02a883c6b42dcbe868b7b449c59 3
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:51.904 # +new-epoch 3
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:51.909 # +vote-for-leader 21af3b909123f02a883c6b42dcbe868b7b449c59 3
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.912 # 10.10.21.109:26379 voted for 21af3b909123f02a883c6b42dcbe868b7b449c59 3
sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:51.959 # +new-epoch 3
sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:51.964 # +vote-for-leader 21af3b909123f02a883c6b42dcbe868b7b449c59 3
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.967 # 10.10.10.227:26379 voted for 21af3b909123f02a883c6b42dcbe868b7b449c59 3
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.015 # +elected-leader master rails 10.10.10.227 6379
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.015 # +failover-state-select-slave master rails 10.10.10.227 6379
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.077 # +new-epoch 3
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.092 # +selected-slave slave 10.10.21.109:6379 10.10.21.109 6379 @ rails 10.10.10.227 6379
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.092 * +failover-state-send-slaveof-noone slave 10.10.21.109:6379 10.10.21.109 6379 @ rails
redis15    - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.174 # Connection with master lost.
redis15    - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.174 * Caching the disconnected master state.
redis15    - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.174 * Discarding previously cached master state.
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.175 * +failover-state-wait-promotion slave 10.10.21.109:6379 10.10.21.109 6379 @ rails
redis15    - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.184 * MASTER MODE enabled (user request from 'id=12 addr=10.10.30.129:39611 fd=13 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')
redis15    - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.195 # CONFIG REWRITE executed with success.
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.407 # +odown master rails 10.10.10.227 6379 #quorum 3/2
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.408 # +new-epoch 4
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.408 # +try-failover master rails 10.10.10.227 6379
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.410 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4
sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:52.420 # +new-epoch 4
sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:52.425 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.427 # +new-epoch 4
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.427 # 10.10.10.227:26379 voted for ba42536356be4fe8848b23bdaa8cb1053084aaa5 4
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.434 # +new-epoch 4
sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.437 # 10.10.30.129:26379 voted for ba42536356be4fe8848b23bdaa8cb1053084aaa5 4
sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.438 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.448 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.448 # +odown master rails 10.10.10.227 6379 #quorum 2/2
sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.448 # Next failover delay: I will not start a failover before Sun Oct 23 04:40:53 2016
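
For anyone who wants to build a similar interleaved view, the merge above can be sketched in Python roughly as follows. It assumes each per-host log is already in time order, that all hosts have reasonably synchronized clocks, and that the lines use the "pid:role DD Mon HH:MM:SS.mmm" prefix seen in this thread; `parse_ts` and `interleave` are hypothetical names, and the year is assumed because the logs omit it.

```python
import heapq
import re
from datetime import datetime

# Timestamp prefix of lines such as
# "16327:M 23 Oct 04:39:46.012 * Asynchronous AOF fsync is taking too long ..."
TS_RE = re.compile(r"^\d+:[A-Z] (\d{1,2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) ")

def parse_ts(line, year=2016):
    """Return the line's timestamp; the year is assumed, the log omits it."""
    match = TS_RE.match(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %d %b %H:%M:%S.%f")

def interleave(sources):
    """Merge per-host logs into one timeline.

    sources maps a label (for example "redis16 - 10.10.10.227") to a list of
    log lines that are already sorted by time.
    """
    def tagged(label, lines):
        for line in lines:
            yield parse_ts(line), label, line.rstrip("\n")

    streams = [tagged(label, lines) for label, lines in sources.items()]
    merged = heapq.merge(*streams, key=lambda item: item[0])
    return [f"{label} -- {line}" for _, label, line in merged]

# Three lines taken from the logs in this thread.
demo = {
    "redis16 - 10.10.10.227": [
        "16327:M 23 Oct 04:39:46.012 * Asynchronous AOF fsync is taking too long (disk is busy?).",
        "16327:M 23 Oct 04:39:53.845 # Connection with slave 10.10.21.109:6379 lost.",
    ],
    "sentinel72 - 10.10.20.161": [
        "14162:X 23 Oct 04:39:51.260 # +sdown master rails 10.10.10.227 6379",
    ],
}
for entry in interleave(demo):
    print(entry)
```

Lines without a recognizable timestamp raise an error instead of being silently dropped, so a malformed line cannot quietly distort the timeline.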

hva...@gmail.com

Oct 26, 2016, 2:13:43 PM
to Redis DB


David Geller wrote:

So, looking at the combined logs, I get an AOF is taking too long message at 04:39:46 on redis and 5 seconds later (down-after-milliseconds value), the sentinels all go into +sdown and +odown immediately, since they all agree.

However, I can confirm that there is nothing going on, on that box, at the time.  Not CPU, not IO, nothing.  So it appears that the aof fsync is causing the master to show unavailable for greater than 5s.  I can either set the down-after-ms time to 10s to (hopefully) avoid this or I can set fsync to no.  :-/ 


On the master server you are indeed graphing the disk I/O statistics and those graphs show no writes during the AOF fsync?
  • If you're really not seeing disk writes during the AOF fsync, that suggests a very severe problem between the virtual machine and its disk storage.  Time to test the disk i/o.
  • If you are seeing disk writes in the graphs, then Redis flushing all its data to disk is probably hitting the maximum I/O rate that's allowed to that disk storage.  I.e., the disk performance is a bad enough bottleneck that Redis switches to the "impatient" form of flushing data to disk, and as the log message indicates, this has a domino effect of making Redis respond very slowly to Sentinel's checks, which in turn triggers failover.  Time to check whether you're using a virtual server and storage type that's appropriate for the AOF rewrites.  You may need to upgrade.  An alternative might be to move the AOF file to an additional slave machine that your app won't use and Sentinel won't promote to master.





David Geller

Oct 27, 2016, 5:49:56 AM
to Redis DB
Yeah.  There's nothing going on wrt IO on the instances.  IDK.  When these AWS EBS volumes work, they work pretty well.  It just seems like, at times, they falter inexplicably.  I benchmarked the setup and squeezed the most I could out of them with RAID0.  To upgrade these, I have to go with much larger drives and/or much larger instances.  Yet, I've been able to get 300MB/s write throughput.  AWS docs are sparse when it comes to actual guaranteed latency.  I've tried every different storage option and these io1 SSDs are the best... and yet maybe not good enough.

Anyway, thanks for your time.  I don't really need fsync on the master anyway.  And, I'll take it up with Amazon.