Redis replication not consistent

1,472 views
Skip to first unread message

Антон Рубцов

unread,
Aug 16, 2013, 6:12:16 AM8/16/13
to redi...@googlegroups.com
Hello.
Need your help.

We have 3 redis servers. 1 master and 2 slaves. We are trying to write to master and read from slaves. On high load slaves start being not consistent, but lag is 0.

Some times we get such problem:

We get some key from master. It's ok and returns value.
Wait 10 mins.
We get some key from slave. It's not ok and returns nil, although slave lag is zero.
We restart slave. It's resyncing and getting same key returns rights value. Slave becomes consistent.

Need your advice.

Thanks. 






Here is redis-cli info. Some info is hidden by &&& sign.

# Server
redis_version:2.7.102
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:2142a9a177d852c
redis_mode:standalone
os:Linux 3.2.28+ x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.5.4
process_id:6874
run_id:3e242c4d5127d98589058a1e8c6aa3e064f92107
tcp_port:6379
uptime_in_seconds:159344
uptime_in_days:1
hz:10
lru_clock:1349735
config_file:/etc/redis.conf

# Clients
connected_clients:2959
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:149025024
used_memory_human:142.12M
used_memory_rss:144429056
used_memory_peak:222115720
used_memory_peak_human:211.83M
used_memory_lua:33792
mem_fragmentation_ratio:0.97
mem_allocator:tcmalloc-1.8

# Persistence
loading:0
rdb_changes_since_last_save:2557
rdb_bgsave_in_progress:0
rdb_last_save_time:1376646146
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:1
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok

# Stats
total_connections_received:191002
total_commands_processed:33418660
instantaneous_ops_per_sec:420
rejected_connections:0
sync_full:4
sync_partial_ok:0
sync_partial_err:2
expired_keys:0
evicted_keys:0
keyspace_hits:16035323
keyspace_misses:677089
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:2541

# Replication
role:master
connected_slaves:2
slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0
master_repl_offset:1193173847
repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760

# CPU
used_cpu_sys:1071.59
used_cpu_user:855.44
used_cpu_sys_children:50.28
used_cpu_user_children:443.55

# Keyspace
db0:keys=115318,expires=0
db1:keys=7019,expires=0

Josiah Carlson

unread,
Aug 16, 2013, 2:51:10 PM8/16/13
to redi...@googlegroups.com
Are there any particular types of keys that are missing? Any specific operations that you are performing that might not be replicated? Are you using scripting? Multi/exec?

 - Josiah


--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/groups/opt_out.

Yiftach Shoolman

unread,
Aug 16, 2013, 3:12:17 PM8/16/13
to redi...@googlegroups.com
How do you know that lag is 0 ? your app might be writing at a higher rate than your replication links and therefore your new writes are accumulated in the client buffer.

To look at the replication buffer at the master side use CLIENT LIST and then look at the 'omem' of your slave clients. If it is greater than 0, that can explain your problem  
--

Yiftach Shoolman
+972-54-7634621

Greg Andrews

unread,
Aug 16, 2013, 4:24:51 PM8/16/13
to redi...@googlegroups.com
Your master's INFO output has these lines:

# Replication
role:master
connected_slaves:2
slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0
master_repl_offset:1193173847
repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760

The info on the "slave1:" and "repl_backlog_*" lines suggests that slave1's replication has stopped for some reason.  Are there no errors or warnings in slave1's logfile?

  -Greg

Антон Рубцов

unread,
Aug 19, 2013, 6:13:29 AM8/19/13
to redi...@googlegroups.com
In CLIENT LIST omem is 0 everywhere.

Setting by HMSET command.

All 2 slaves have same delay from master.

пятница, 16 августа 2013 г., 23:12:17 UTC+4 пользователь Yiftach написал:

Антон Рубцов

unread,
Aug 19, 2013, 11:00:50 AM8/19/13
to redi...@googlegroups.com
I will try to reproduce this in test env. In production the problem it's very stable.

пятница, 16 августа 2013 г., 14:12:16 UTC+4 пользователь Антон Рубцов написал:

Антон Рубцов

unread,
Aug 20, 2013, 10:29:52 AM8/20/13
to redi...@googlegroups.com
No errors in log file. How did you figure out that slave1's replication has stopped for some reason.

суббота, 17 августа 2013 г., 0:24:51 UTC+4 пользователь GregA написал:

Greg Andrews

unread,
Aug 20, 2013, 10:57:01 AM8/20/13
to redi...@googlegroups.com
I mentioned them in my previous message, but I'll go over them again.

I quoted some of the lines from your message where you posted the output your master Redis gave to the INFO command.  Those lines were:


# Replication
role:master
connected_slaves:2
slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0
master_repl_offset:1193173847
repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760


Then I pointed out that the contents of the repl_backlog_* lines:


repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760

suggest that a backlog exists with one of the slaves.  There is 1 active backlog, and it has a size that's larger than 0.

Also the "slave1:" line suggests it is the slave with the backlog.  When you compare the "slave0:" and "slave1:" lines, you see the "offset=" field is different:

slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0

Yes, the "lag" value is the same between the two lines, but the "offset" value is not the same.  The "slave0:" line has an offset value that's close to the "master_repl_offset:" value on the line below.  The "slave1:" line has an offset value that's far from the values in both the "master_repl_offset:" line and the "slave0:" line.  I.e., slave1 was the one that was not keeping up with replication.

Your description of the problem was that replication stops rather than pauses and catches up again.  Those are the reasons I said the INFO output suggests slave1's replication had stopped.

  -Greg



--

Salvatore Sanfilippo

unread,
Aug 20, 2013, 11:02:03 AM8/20/13
to Redis DB
Hello, please as a first step try upgrading to 2.8 RC3 to check if
this still happens.

Cheers,
Salvatore
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to redis-db+u...@googlegroups.com.
> To post to this group, send email to redi...@googlegroups.com.
> Visit this group at http://groups.google.com/group/redis-db.
> For more options, visit https://groups.google.com/groups/opt_out.



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.
— David Gelernter

Антон Рубцов

unread,
Aug 21, 2013, 5:16:09 AM8/21/13
to redi...@googlegroups.com
Hello,
we upgraded all nodes, but got the same problem.

вторник, 20 августа 2013 г., 19:02:03 UTC+4 пользователь Salvatore Sanfilippo написал:

Salvatore Sanfilippo

unread,
Aug 21, 2013, 5:49:43 AM8/21/13
to Redis DB
Hello again,

how long are your HMSET commands?
Also what is the INFO output of both master and slaves when the
desynchronization happened (when slave is not returning the right
data?)

Do you have time to chat in real-time in order to investigate the
issue? Maybe IRC? Send me a private email at antirez - gmail when you
are able to chat please.

Cheers,
Salvatore
>> -- David Gelernter
>
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to redis-db+u...@googlegroups.com.
> To post to this group, send email to redi...@googlegroups.com.
> Visit this group at http://groups.google.com/group/redis-db.
> For more options, visit https://groups.google.com/groups/opt_out.



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.
-- David Gelernter

Salvatore Sanfilippo

unread,
Aug 21, 2013, 6:21:37 AM8/21/13
to Redis DB
Update:

I just tried to flood a Redis 2.8 RC3 server with:

require 'rubygems'
require 'redis'

r = Redis.new
while true
key = rand(1000000)
data = (1...(rand(10000)|1)).to_a
data.shuffle!
r.hmset(key,*data)
end

At the same time I had two slaves connected, and at the end of the
flooding I asked for DEBUG DIGEST obtaining exactly the same hash.
So no easy way to reproduce this apparently.

Cheers,
Salvatore

Антон Рубцов

unread,
Aug 21, 2013, 7:02:29 AM8/21/13
to redi...@googlegroups.com
I've got same hash on slaves but master is different

Keyspaces from servers. 
master
db0:keys=120270,expires=0,avg_ttl=0

slaves
db0:keys=120082,expires=0,avg_ttl=0

after restart hash of slave equals master's hash.

slaves are going synchronously and keyspaces change. So I think that replication is working. But with a lag.

среда, 21 августа 2013 г., 14:21:37 UTC+4 пользователь Salvatore Sanfilippo написал:

Salvatore Sanfilippo

unread,
Aug 21, 2013, 7:08:05 AM8/21/13
to Redis DB
The lag is perfectly fine, that is, Redis replication is asynchronous,
so there is a delay between the time you write data to a master, and
the time you can read the data from the slave. This delay usually is
in the sub millisecond figure. The "lag" that you read in the INFO
output is the amount of seconds elapsed since we received the offset,
so unless you read 15 seconds there, this is not the real lag. I mean,
a lag=1 does not mean that data has really a delay of one second, just
that we received the latest offset one second ago.

Here the problem is to understand if there is a real desync. That is,
if you stop the clients writing data to your master, is the key count
and content exactly the same?

Also, what are the commands are you using? You can easily obtain a
list wiht INFO commandstats.

Cheers,
Salvatore

Антон Рубцов

unread,
Aug 21, 2013, 7:48:30 AM8/21/13
to redi...@googlegroups.com
Thank you. I'll try to stop writes at night for 5 minutes and show info. 

среда, 21 августа 2013 г., 15:08:05 UTC+4 пользователь Salvatore Sanfilippo написал:
Reply all
Reply to author
Forward
0 new messages