On Thursday, April 23, 2015 at 10:43:02 PM UTC-7, Josiah Carlson wrote:
> You are probably going to need to take down your servers for a few minutes to upgrade Redis versions to 2.8.19 at least, and maybe even the most recent 3.0.0 (you can run in non-cluster mode). Maybe even getting a bigger machine, or look into whether you can shard your data and either run Redis cluster or Twemproxy + sharded Redis, or just sharded Redis controlled/managed by your clients. You have options and a way forward.
> Some questions:
> How much main memory does the system have?
For the master:
(10:32:00) cdodrill@AF002171:~
$ cat /proc/meminfo | head
MemTotal: 528351876 kB
MemFree: 260988440 kB
For the slave:
(10:34:30) cdodrill@AF002170:~
$ cat /proc/meminfo | head
MemTotal: 528351876 kB
MemFree: 485304736 kB
> Is it a VM, colo, self-hosted?
They are both servers in our colo cage.
> How fast is the network?
We have one gigabit ethernet between machines in our production cluster.
> What is your buffer set at?
For the master:
$ redis-cli -h AF002171 config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
For the slave:
$ redis-cli -h AF002170 config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 365072220160 365072220160 60 pubsub 33554432 8388608 60"
> How big is your most recent successful dump?
(15:35:32) cdodrill@AF002171:/var/lib/redis
$ ls -lh
total 50G
-rw-rw---- 1 redis redis 34G 2015-04-24 15:26 dump.rdb
-rw-rw---- 1 redis redis 195M 2015-03-27 15:10 temp-1349.rdb
-rw-r----- 1 redis redis 0 2014-07-02 22:22 temp-1404364952.24370.rdb
-rw-rw---- 1 redis redis 241M 2014-07-17 12:31 temp-14247.rdb
-rw-rw---- 1 redis redis 11G 2015-04-24 15:35 temp-22838.rdb
-rw-rw---- 1 redis redis 185M 2014-07-02 21:09 temp-31301.rdb
> Do you have free space?
(10:39:16) cdodrill@AF002171:~
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 856G 52G 804G 7% /
(10:34:32) cdodrill@AF002170:~
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 856G 37G 819G 5% /
A significant amount on both hosts
> What does the system log on the master say?
[25352] 22 Apr 11:55:09.059 # Client addr=
10.2.8.122:51808 fd=26 name= age=578 idle=578 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16384 oll=3176 omem=77967424 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
[12364] 22 Apr 12:13:01.285 * DB saved on disk
[12364] 22 Apr 12:13:06.637 * RDB: 15411 MB of memory used by copy-on-write
[25352] 22 Apr 12:13:16.028 * Background saving terminated with success
[25352] 22 Apr 12:13:17.569 * Slave asks for synchronization
[25352] 22 Apr 12:13:17.569 * Full resync requested by slave.
[25352] 22 Apr 12:13:17.569 * Starting BGSAVE for SYNC
[25352] 22 Apr 12:13:22.667 * Background saving started by pid 17275
[25352] 22 Apr 12:20:54.093 # Client addr=
10.2.8.122:52598 fd=38 name= age=457 idle=457 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16368 oll=3067 omem=74985592 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
On the slave:
[31824] 22 Apr 11:45:31.324 # I/O error reading bulk count from MASTER: Resource temporarily unavailable
[31824] 22 Apr 11:45:31.512 * Connecting to MASTER AF002171:6379
[31824] 22 Apr 11:45:31.513 * MASTER <-> SLAVE sync started
[31824] 22 Apr 11:45:31.513 * Non blocking connect for SYNC fired the event.
[31824] 22 Apr 11:45:31.513 * Master replied to PING, replication can continue...
[31824] 22 Apr 11:45:31.514 * Partial resynchronization not possible (no cached master)
[31824] 22 Apr 11:45:31.514 * Full resync from master: 9db318a6a93e7e1137800d67405ce2a0cbc610fa:1411857424636
[31824] 22 Apr 12:13:15.955 # I/O error reading bulk count from MASTER: Resource temporarily unavailable
[31824] 22 Apr 12:13:16.887 * Connecting to MASTER AF002171:6379
[31824] 22 Apr 12:13:16.888 * MASTER <-> SLAVE sync started
[31824] 22 Apr 12:13:16.889 * Non blocking connect for SYNC fired the event.
[31824] 22 Apr 12:13:17.567 * Master replied to PING, replication can continue...
[31824] 22 Apr 12:13:17.568 * Partial resynchronization not possible (no cached master)
[31824] 22 Apr 12:13:17.569 * Full resync from master: 9db318a6a93e7e1137800d67405ce2a0cbc610fa:1412015033050
> Can you show the INFO output on the master, a working, and non-working slave?
On the master:
(10:56:09) root@AF002171:~
# redis-cli
# Server
redis_version:2.8.7
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:aeb6a1447e4d3a87
redis_mode:standalone
os:Linux 3.2.0-26-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.3
process_id:25352
run_id:9db318a6a93e7e1137800d67405ce2a0cbc610fa
tcp_port:6379
uptime_in_seconds:2393850
uptime_in_days:27
hz:10
lru_clock:383481
config_file:/etc/redis/redis.conf
# Clients
connected_clients:37
client_longest_output_list:1
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:205034900680
used_memory_human:190.95G
used_memory_rss:220557266944
used_memory_peak:214827623272
used_memory_peak_human:200.07G
used_memory_lua:33792
mem_fragmentation_ratio:1.08
mem_allocator:jemalloc-3.5.1
# Persistence
loading:0
rdb_changes_since_last_save:4835284
rdb_bgsave_in_progress:1
rdb_last_save_time:1429896593
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:1606
rdb_current_bgsave_time_sec:1579
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
# Stats
total_connections_received:80929824
total_commands_processed:6354434567
instantaneous_ops_per_sec:3926
rejected_connections:0
sync_full:529
sync_partial_ok:0
sync_partial_err:0
expired_keys:46522303
evicted_keys:0
keyspace_hits:1864201300
keyspace_misses:115381101
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:5400114
# Replication
role:master
connected_slaves:0
master_repl_offset:1426657393650
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1426656345075
repl_backlog_histlen:1048576
# CPU
used_cpu_sys:103776.85
used_cpu_user:149847.14
used_cpu_sys_children:96818.01
used_cpu_user_children:2168143.75
# Keyspace
db0:keys=120137760,expires=430857,avg_ttl=1944976
db12:keys=119245,expires=119143,avg_ttl=342950357
We do not have a working slave. This is info from the slave that keeps failing to start replicating:
(10:57:19) cdodrill@AF002170:~
$ redis-cli
# Server
redis_version:2.8.19
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:20fd63bba41f903c
redis_mode:standalone
os:Linux 3.2.0-26-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.3
process_id:9147
run_id:9c57482cd8bc884edbda5827ee49f6068677f8f5
tcp_port:6379
uptime_in_seconds:76066
uptime_in_days:0
hz:10
lru_clock:3834882
config_file:/etc/redis/redis.conf
# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:1850256
used_memory_human:1.76M
used_memory_rss:3067904
used_memory_peak:1908000
used_memory_peak_human:1.82M
used_memory_lua:35840
mem_fragmentation_ratio:1.66
mem_allocator:jemalloc-3.6.0
# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1429822176
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
# Stats
total_connections_received:11695
total_commands_processed:81377
instantaneous_ops_per_sec:0
total_net_input_bytes:2806868
total_net_output_bytes:21339157
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
# Replication
role:slave
master_host:AF002171
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
slave_repl_offset:1
master_sync_left_bytes:-1
master_sync_last_io_seconds_ago:3
master_link_down_since_seconds:1429898242
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:51.41
used_cpu_user:16.81
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
> What is the network bandwidth?
> How much data is Redis handling/second?
> Is Redis swapping (can find this out from INFO output)?
It does not appear to be swapping on the main server
> Can you provide the first few lines of 'top' all the way to and including the line starting with 'Swap:'?
For the slave (AF002170):
top - 15:14:31 up 296 days, 22:30, 2 users, load average: 0.07, 0.08, 0.10
Tasks: 168 total, 1 running, 167 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 528351876k total, 43048076k used, 485303800k free, 996k buffers
Swap: 39085052k total, 0k used, 39085052k free, 36359084k cached
For the master (AF002171):
top - 15:15:23 up 281 days, 1:24, 2 users, load average: 1.22, 1.28, 1.31
Tasks: 171 total, 3 running, 168 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.9%us, 1.1%sy, 0.0%ni, 92.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 528351876k total, 288765976k used, 239585900k free, 1316k buffers
Swap: 39085052k total, 0k used, 39085052k free, 55631636k cached
> Are you using AOF? RDB? Spinning disks? SSDs? RAID?
We are using RDB persistence. The machines each have a raid of 3 solid state drives.
> Do you have any utilization/CPU graphs of around the time that this happened? What does it look like?
> Any other weird stuff going on?
There shouldn’t be anything weird going on.