Redis replication issue for a very large data set (180 GB)

5,965 views
Skip to first unread message

Christine Dodrill

unread,
Apr 23, 2015, 3:53:26 PM4/23/15
to redi...@googlegroups.com

Hello,


We have a rather large (180 GB) data set in redis that is having trouble replicating. It is causing errors after about 30-40 minutes of working (depending on buffer sizes) and returns an error similar to:


[31824] 22 Apr 12:13:15.955 # I/O error reading bulk count from MASTER: Resource temporarily unavailable


This has been working fine until the last failover, when the slave started to fail syncing like this.


Does anyone have any ideas as to what could be causing this? We are on Redis 2.8.7.


Thanks,


--

Christine Dodrill

Site Reliability Engineer @ IMVU

Josiah Carlson

unread,
Apr 24, 2015, 1:43:02 AM4/24/15
to redi...@googlegroups.com
You are probably going to need to take down your servers for a few minutes to upgrade Redis versions to 2.8.19 at least, and maybe even the most recent 3.0.0 (you can run in non-cluster mode). Maybe even getting a bigger machine, or look into whether you can shard your data and either run Redis cluster or Twemproxy + sharded Redis, or just sharded Redis controlled/managed by your clients. You have options and a way forward.

Some questions:
How much main memory does the system have? Is it a VM, colo, self-hosted? How fast is the network? What is your buffer set at? How big is your most recent successful dump? Do you have free space? What does the system log on the master say? Can you show the INFO output on the master, a working, and non-working slave? What is the network bandwidth? How much data is Redis handling/second? What is the network utilization? What is the disk IO? Is Redis swapping (can find this out from INFO output)? Can you provide the first few lines of 'top' all the way to and including the line starting with 'Swap:'? Are you using AOF? RDB? Spinning disks? SSDs? RAID? Do you have any utilization/CPU graphs of around the time that this happened? What does it look like? Any other weird stuff going on?

 - Josiah


--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.

Christine Dodrill

unread,
Apr 24, 2015, 6:37:24 PM4/24/15
to redi...@googlegroups.com
On Thursday, April 23, 2015 at 10:43:02 PM UTC-7, Josiah Carlson wrote:
> You are probably going to need to take down your servers for a few minutes to upgrade Redis versions to 2.8.19 at least, and maybe even the most recent 3.0.0 (you can run in non-cluster mode). Maybe even getting a bigger machine, or look into whether you can shard your data and either run Redis cluster or Twemproxy + sharded Redis, or just sharded Redis controlled/managed by your clients. You have options and a way forward.

> Some questions:
> How much main memory does the system have?

For the master:

(10:32:00) cdodrill@AF002171:~
$ cat /proc/meminfo | head    
MemTotal:       528351876 kB  
MemFree:        260988440 kB  

For the slave:

(10:34:30) cdodrill@AF002170:~
$ cat /proc/meminfo  | head   
MemTotal:       528351876 kB  
MemFree:        485304736 kB  
 
> Is it a VM, colo, self-hosted?

They are both servers in our colo cage.
 
> How fast is the network?

We have one gigabit ethernet between machines in our production cluster.
 
> What is your buffer set at?

For the master:
$ redis-cli -h AF002171 config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"


For the slave:
$ redis-cli -h AF002170 config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 365072220160 365072220160 60 pubsub 33554432 8388608 60"

> How big is your most recent successful dump?

(15:35:32) cdodrill@AF002171:/var/lib/redis
$ ls -lh
total 50G
-rw-rw---- 1 redis redis  34G 2015-04-24 15:26 dump.rdb
-rw-rw---- 1 redis redis 195M 2015-03-27 15:10 temp-1349.rdb
-rw-r----- 1 redis redis    0 2014-07-02 22:22 temp-1404364952.24370.rdb
-rw-rw---- 1 redis redis 241M 2014-07-17 12:31 temp-14247.rdb
-rw-rw---- 1 redis redis  11G 2015-04-24 15:35 temp-22838.rdb
-rw-rw---- 1 redis redis 185M 2014-07-02 21:09 temp-31301.rdb

> Do you have free space?

(10:39:16) cdodrill@AF002171:~                          
$ df -h /                                               
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sda5             856G   52G  804G   7% /           

(10:34:32) cdodrill@AF002170:~                         
$ df -h /                                              
Filesystem            Size  Used Avail Use% Mounted on 
/dev/sda5             856G   37G  819G   5% /          

A significant amount on both hosts

> What does the system log on the master say?

[25352] 22 Apr 11:55:09.059 # Client addr=10.2.8.122:51808 fd=26 name= age=578 idle=578 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16384 oll=3176 omem=77967424 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.             
[25352] 22 Apr 11:55:09.060 # Connection with slave 10.2.8.122:6379 lost.                                                                
[12364] 22 Apr 12:13:01.285 * DB saved on disk                                                                                           
[12364] 22 Apr 12:13:06.637 * RDB: 15411 MB of memory used by copy-on-write                                                              
[25352] 22 Apr 12:13:16.028 * Background saving terminated with success                                                                  
[25352] 22 Apr 12:13:17.569 * Slave asks for synchronization                                                                             
[25352] 22 Apr 12:13:17.569 * Full resync requested by slave.                                                                            
[25352] 22 Apr 12:13:17.569 * Starting BGSAVE for SYNC                                                                                   
[25352] 22 Apr 12:13:22.667 * Background saving started by pid 17275                                                                     
[25352] 22 Apr 12:20:54.093 # Client addr=10.2.8.122:52598 fd=38 name= age=457 idle=457 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16368 oll=3067 omem=74985592 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.             
[25352] 22 Apr 12:20:54.093 # Connection with slave 10.2.8.122:6379 lost.

On the slave:

[31824] 22 Apr 11:45:31.324 # I/O error reading bulk count from MASTER: Resource temporarily unavailable
[31824] 22 Apr 11:45:31.512 * Connecting to MASTER AF002171:6379
[31824] 22 Apr 11:45:31.513 * MASTER <-> SLAVE sync started
[31824] 22 Apr 11:45:31.513 * Non blocking connect for SYNC fired the event.
[31824] 22 Apr 11:45:31.513 * Master replied to PING, replication can continue...
[31824] 22 Apr 11:45:31.514 * Partial resynchronization not possible (no cached master)
[31824] 22 Apr 11:45:31.514 * Full resync from master: 9db318a6a93e7e1137800d67405ce2a0cbc610fa:1411857424636
[31824] 22 Apr 12:13:15.955 # I/O error reading bulk count from MASTER: Resource temporarily unavailable
[31824] 22 Apr 12:13:16.887 * Connecting to MASTER AF002171:6379
[31824] 22 Apr 12:13:16.888 * MASTER <-> SLAVE sync started
[31824] 22 Apr 12:13:16.889 * Non blocking connect for SYNC fired the event.
[31824] 22 Apr 12:13:17.567 * Master replied to PING, replication can continue...
[31824] 22 Apr 12:13:17.568 * Partial resynchronization not possible (no cached master)
[31824] 22 Apr 12:13:17.569 * Full resync from master: 9db318a6a93e7e1137800d67405ce2a0cbc610fa:1412015033050
 
> Can you show the INFO output on the master, a working, and non-working slave?

On the master:
(10:56:09) root@AF002171:~                       
# redis-cli                                      
127.0.0.1:6379> info                             
# Server                                         
redis_version:2.8.7                              
redis_git_sha1:00000000                          
redis_git_dirty:0                                
redis_build_id:aeb6a1447e4d3a87                  
redis_mode:standalone                            
os:Linux 3.2.0-26-generic x86_64                 
arch_bits:64                                     
multiplexing_api:epoll                           
gcc_version:4.4.3                                
process_id:25352                                 
run_id:9db318a6a93e7e1137800d67405ce2a0cbc610fa  
tcp_port:6379                                    
uptime_in_seconds:2393850                        
uptime_in_days:27                                
hz:10                                            
lru_clock:383481                                 
config_file:/etc/redis/redis.conf                
                                                 
# Clients                                        
connected_clients:37                             
client_longest_output_list:1                     
client_biggest_input_buf:0                       
blocked_clients:0                                
                                                 
# Memory                                         
used_memory:205034900680                         
used_memory_human:190.95G                        
used_memory_rss:220557266944                     
used_memory_peak:214827623272                    
used_memory_peak_human:200.07G                   
used_memory_lua:33792                            
mem_fragmentation_ratio:1.08                     
mem_allocator:jemalloc-3.5.1                     
                                                 
# Persistence                                    
loading:0                                        
rdb_changes_since_last_save:4835284              
rdb_bgsave_in_progress:1                         
rdb_last_save_time:1429896593                    
rdb_last_bgsave_status:ok                        
rdb_last_bgsave_time_sec:1606                    
rdb_current_bgsave_time_sec:1579                 
aof_enabled:0                                    
aof_rewrite_in_progress:0                        
aof_rewrite_scheduled:0                          
aof_last_rewrite_time_sec:-1                     
aof_current_rewrite_time_sec:-1                  
aof_last_bgrewrite_status:ok                     
aof_last_write_status:ok                         
                                                 
# Stats                                          
total_connections_received:80929824              
total_commands_processed:6354434567              
instantaneous_ops_per_sec:3926                   
rejected_connections:0                           
sync_full:529                                    
sync_partial_ok:0                                
sync_partial_err:0                               
expired_keys:46522303                            
evicted_keys:0                                   
keyspace_hits:1864201300                         
keyspace_misses:115381101                        
pubsub_channels:0                                
pubsub_patterns:0                                
latest_fork_usec:5400114                         

# Replication                                       
role:master                                         
connected_slaves:0                                  
master_repl_offset:1426657393650                    
repl_backlog_active:1                               
repl_backlog_size:1048576                           
repl_backlog_first_byte_offset:1426656345075        
repl_backlog_histlen:1048576                        
                                                    
# CPU                                               
used_cpu_sys:103776.85                              
used_cpu_user:149847.14                             
used_cpu_sys_children:96818.01                      
used_cpu_user_children:2168143.75                   
                                                    
# Keyspace                                          
db0:keys=120137760,expires=430857,avg_ttl=1944976   
db12:keys=119245,expires=119143,avg_ttl=342950357   


We do not have a working slave.  This is info from the slave that keeps failing to start replicating:

(10:57:19) cdodrill@AF002170:~                 
$ redis-cli                                    
127.0.0.1:6379> info                           
# Server                                       
redis_version:2.8.19                           
redis_git_sha1:00000000                        
redis_git_dirty:0                              
redis_build_id:20fd63bba41f903c                
redis_mode:standalone                          
os:Linux 3.2.0-26-generic x86_64               
arch_bits:64                                   
multiplexing_api:epoll                         
gcc_version:4.4.3                              
process_id:9147                                
run_id:9c57482cd8bc884edbda5827ee49f6068677f8f5
tcp_port:6379                                  
uptime_in_seconds:76066                        
uptime_in_days:0                               
hz:10                                          
lru_clock:3834882                              
config_file:/etc/redis/redis.conf              
                                               
# Clients                                      
connected_clients:1                            
client_longest_output_list:0                   
client_biggest_input_buf:0                     
blocked_clients:0                              
                                               
# Memory                                       
used_memory:1850256                            
used_memory_human:1.76M                        
used_memory_rss:3067904                        
used_memory_peak:1908000                       
used_memory_peak_human:1.82M                   
used_memory_lua:35840                          
mem_fragmentation_ratio:1.66                   
mem_allocator:jemalloc-3.6.0                   
                                               
# Persistence                                  
loading:0                                      
rdb_changes_since_last_save:0                  
rdb_bgsave_in_progress:0                       
rdb_last_save_time:1429822176                  
rdb_last_bgsave_status:ok                      
rdb_last_bgsave_time_sec:-1                    
rdb_current_bgsave_time_sec:-1                 
aof_enabled:0                                  
aof_rewrite_in_progress:0                      
aof_rewrite_scheduled:0                        
aof_last_rewrite_time_sec:-1                   
aof_current_rewrite_time_sec:-1                
aof_last_bgrewrite_status:ok                   
aof_last_write_status:ok                       
                                               
# Stats                                        
total_connections_received:11695               
total_commands_processed:81377                 
instantaneous_ops_per_sec:0                    
total_net_input_bytes:2806868                  
total_net_output_bytes:21339157                
instantaneous_input_kbps:0.00                  
instantaneous_output_kbps:0.00                 
rejected_connections:0                         
sync_full:0                                    
sync_partial_ok:0                              
sync_partial_err:0                             
expired_keys:0                                 
evicted_keys:0                                 
keyspace_hits:0                                
keyspace_misses:0                              
pubsub_channels:0                              
pubsub_patterns:0                              
latest_fork_usec:0                             

# Replication                                 
role:slave                                    
master_host:AF002171                          
master_port:6379                              
master_link_status:down                       
master_last_io_seconds_ago:-1                 
master_sync_in_progress:1                     
slave_repl_offset:1                           
master_sync_left_bytes:-1                     
master_sync_last_io_seconds_ago:3             
master_link_down_since_seconds:1429898242     
slave_priority:100                            
slave_read_only:1                             
connected_slaves:0                            
master_repl_offset:0                          
repl_backlog_active:0                         
repl_backlog_size:1048576                     
repl_backlog_first_byte_offset:0              
repl_backlog_histlen:0                        
                                              
# CPU                                         
used_cpu_sys:51.41                            
used_cpu_user:16.81                           
used_cpu_sys_children:0.00                    
used_cpu_user_children:0.00                   
 
> What is the network bandwidth?

During the replication attempts, this was the network bandwidth use: http://i.imgur.com/U5Kv5Pp.png 

> How much data is Redis handling/second?

  
> What is the disk IO?

 
> Is Redis swapping (can find this out from INFO output)?

It does not appear to be swapping on the main server

> Can you provide the first few lines of 'top' all the way to and including the line starting with 'Swap:'?

For the slave (AF002170):

top - 15:14:31 up 296 days, 22:30,  2 users,  load average: 0.07, 0.08, 0.10
Tasks: 168 total,   1 running, 167 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  528351876k total, 43048076k used, 485303800k free,      996k buffers
Swap: 39085052k total,        0k used, 39085052k free, 36359084k cached

For the master (AF002171):

top - 15:15:23 up 281 days,  1:24,  2 users,  load average: 1.22, 1.28, 1.31
Tasks: 171 total,   3 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.9%us,  1.1%sy,  0.0%ni, 92.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  528351876k total, 288765976k used, 239585900k free,     1316k buffers
Swap: 39085052k total,        0k used, 39085052k free, 55631636k cached

> Are you using AOF? RDB? Spinning disks? SSDs? RAID?

We are using RDB persistence.  The machines each have a raid of 3 solid state drives.

> Do you have any utilization/CPU graphs of around the time that this happened? What does it look like?

Yep: http://i.imgur.com/VhbDBdp.png left is the broken slave and the right is the master.

> Any other weird stuff going on?

There shouldn’t be anything weird going on.

Josiah Carlson

unread,
Apr 25, 2015, 12:04:19 AM4/25/15
to redi...@googlegroups.com
Looking at your master log:

[25352] 22 Apr 12:20:54.093 # Client addr=10.2.8.122:52598 fd=38 name= age=457 idle=457 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16368 oll=3067 omem=74985592 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits. 

... the slave is disconnected after 7 1/2 minutes and has a backlog of almost 75 megs, which is about 165k/second in writes. This trips up the 64 meg 60 second soft limit. According to your INFO output:

rdb_last_bgsave_time_sec:1606                    
rdb_current_bgsave_time_sec:1579

...It takes ~27 minutes to create a snapshot after a slave connects, which induces a 267 meg backlog total. This is the replication backlog just to create the snapshot. You add the 36 gig of the snapshot, and you are looking at 36 gigs in change of backlog in total. Now, a 1 gigabit connection can move 100 megs/second, so the snapshot + backlog gets passed over in ~6 minutes. Then the slave stops what it is doing, and loads the snapshot, which will take another 13-30 minutes, then catches up to the master 8-30 minutes later.

Your backlog config...

For the master:
$ redis-cli -h AF002171 config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"

That's a 64 meg soft limit over 60 seconds, or a 256 meg hard limit. As mentioned earlier, that is hit 7 1/2 minutes in. On the master, set the slave limit to something like 64 gigs, just to have some headroom. You can run the following command directly:

CONFIG SET client-output-buffer-limit slave 96gb 64gb 60

And to save yourself some headache:

CONFIG SET repl-backlog-size 1gb

That will let your slave be disconnected due to network issues for up to about 90 minutes, and not require a full resync. Previously, you only got 6-7 seconds before a slave would require a full resync.

And also update the config file.

Do those things, and you should be set and have a working slave again.

Regards,
 - Josiah

Sabai

unread,
Jul 10, 2018, 1:35:35 PM7/10/18
to Redis DB
This helps to solve my issue. I have 100Gb of data i my master Redis and slave keeps trying to do full sync, after updating the below in master slave can sync with master sucessfully.

login to master through redis-cli -h <masternode> -p <port> .   **Below command its temp until host reboot, it is not permanent.**

"config get client-output-buffer-limit"  mostly it will be (  "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60" ) now

# CONFIG SET client-output-buffer-limit "slave 10737418240 10737418240 60"  <--- this is to update from 256 mb to 10gb as my physical memory is 512 gb

"config get repl-backlog-size"  < mostly it will be 1mb >

# CONFIG SET repl-backlog-size 1gb <-- increases to 1gb

Юрий Соколов

unread,
Jul 10, 2018, 2:39:40 PM7/10/18
to Redis DB
Had the same issue recently with much smaller size: ~7GB of data. And also fixed by increasing client-output-buffer-limit and repl-backlog-size.

Other in-memory database with disk persistence through snapshot+log, sovles the issue by feeding replica from disk: both snapshot and log is feeded into replica from disk until replica is close enough to current master's state.

Reply all
Reply to author
Forward
0 new messages