Redis replication not consistent

Антон Рубцов

unread,

Aug 16, 2013, 6:12:16 AM8/16/13

to redi...@googlegroups.com

Hello.

Need your help.

We have 3 redis servers. 1 master and 2 slaves. We are trying to write to master and read from slaves. On high load slaves start being not consistent, but lag is 0.

Some times we get such problem:

We get some key from master. It's ok and returns value.

Wait 10 mins.

We get some key from slave. It's not ok and returns nil, although slave lag is zero.

We restart slave. It's resyncing and getting same key returns rights value. Slave becomes consistent.

Need your advice.

Thanks.

Here is redis-cli info. Some info is hidden by &&& sign.

# Server

redis_version:2.7.102

redis_git_sha1:00000000

redis_git_dirty:0

redis_build_id:2142a9a177d852c

redis_mode:standalone

os:Linux 3.2.28+ x86_64

arch_bits:64

multiplexing_api:epoll

gcc_version:4.5.4

process_id:6874

run_id:3e242c4d5127d98589058a1e8c6aa3e064f92107

tcp_port:6379

uptime_in_seconds:159344

uptime_in_days:1

hz:10

lru_clock:1349735

config_file:/etc/redis.conf

# Clients

connected_clients:2959

client_longest_output_list:0

client_biggest_input_buf:0

blocked_clients:0

# Memory

used_memory:149025024

used_memory_human:142.12M

used_memory_rss:144429056

used_memory_peak:222115720

used_memory_peak_human:211.83M

used_memory_lua:33792

mem_fragmentation_ratio:0.97

mem_allocator:tcmalloc-1.8

# Persistence

loading:0

rdb_changes_since_last_save:2557

rdb_bgsave_in_progress:0

rdb_last_save_time:1376646146

rdb_last_bgsave_status:ok

rdb_last_bgsave_time_sec:1

rdb_current_bgsave_time_sec:-1

aof_enabled:0

aof_rewrite_in_progress:0

aof_rewrite_scheduled:0

aof_last_rewrite_time_sec:-1

aof_current_rewrite_time_sec:-1

aof_last_bgrewrite_status:ok

# Stats

total_connections_received:191002

total_commands_processed:33418660

instantaneous_ops_per_sec:420

rejected_connections:0

sync_full:4

sync_partial_ok:0

sync_partial_err:2

expired_keys:0

evicted_keys:0

keyspace_hits:16035323

keyspace_misses:677089

pubsub_channels:0

pubsub_patterns:0

latest_fork_usec:2541

# Replication

role:master

connected_slaves:2

slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0

slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0

master_repl_offset:1193173847

repl_backlog_active:1

repl_backlog_size:10485760

repl_backlog_first_byte_offset:1182688088

repl_backlog_histlen:10485760

# CPU

used_cpu_sys:1071.59

used_cpu_user:855.44

used_cpu_sys_children:50.28

used_cpu_user_children:443.55

# Keyspace

db0:keys=115318,expires=0

db1:keys=7019,expires=0

Josiah Carlson

unread,

Aug 16, 2013, 2:51:10 PM8/16/13

to redi...@googlegroups.com

Are there any particular types of keys that are missing? Any specific operations that you are performing that might not be replicated? Are you using scripting? Multi/exec?

- Josiah

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/groups/opt_out.

Yiftach Shoolman

unread,

Aug 16, 2013, 3:12:17 PM8/16/13

to redi...@googlegroups.com

How do you know that lag is 0 ? your app might be writing at a higher rate than your replication links and therefore your new writes are accumulated in the client buffer.

To look at the replication buffer at the master side use CLIENT LIST and then look at the 'omem' of your slave clients. If it is greater than 0, that can explain your problem

--

Yiftach Shoolman
+972-54-7634621

Greg Andrews

unread,

Aug 16, 2013, 4:24:51 PM8/16/13

to redi...@googlegroups.com

Your master's INFO output has these lines:

# Replication
role:master
connected_slaves:2
slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0

master_repl_offset:1193173847
repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760

The info on the "slave1:" and "repl_backlog_*" lines suggests that slave1's replication has stopped for some reason. Are there no errors or warnings in slave1's logfile?

-Greg

Антон Рубцов

unread,

Aug 19, 2013, 6:13:29 AM8/19/13

to redi...@googlegroups.com

In CLIENT LIST omem is 0 everywhere.

Setting by HMSET command.

All 2 slaves have same delay from master.

пятница, 16 августа 2013 г., 23:12:17 UTC+4 пользователь Yiftach написал:

Антон Рубцов

unread,

Aug 19, 2013, 11:00:50 AM8/19/13

to redi...@googlegroups.com

I will try to reproduce this in test env. In production the problem it's very stable.

пятница, 16 августа 2013 г., 14:12:16 UTC+4 пользователь Антон Рубцов написал:

Антон Рубцов

unread,

Aug 20, 2013, 10:29:52 AM8/20/13

to redi...@googlegroups.com

No errors in log file. How did you figure out that slave1's replication has stopped for some reason.

суббота, 17 августа 2013 г., 0:24:51 UTC+4 пользователь GregA написал:

Greg Andrews

unread,

Aug 20, 2013, 10:57:01 AM8/20/13

to redi...@googlegroups.com

I mentioned them in my previous message, but I'll go over them again.

I quoted some of the lines from your message where you posted the output your master Redis gave to the INFO command. Those lines were:

# Replication
role:master
connected_slaves:2
slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0
master_repl_offset:1193173847
repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760

Then I pointed out that the contents of the repl_backlog_* lines:

repl_backlog_active:1
repl_backlog_size:10485760
repl_backlog_first_byte_offset:1182688088
repl_backlog_histlen:10485760

suggest that a backlog exists with one of the slaves. There is 1 active backlog, and it has a size that's larger than 0.

Also the "slave1:" line suggests it is the slave with the backlog. When you compare the "slave0:" and "slave1:" lines, you see the "offset=" field is different:

slave0:ip=&&&&&&&&,port=&&&,state=online,offset=1193170878,lag=0
slave1:ip=&&&&&&&&,port=&&&,state=online,offset=1193168900,lag=0

Yes, the "lag" value is the same between the two lines, but the "offset" value is not the same. The "slave0:" line has an offset value that's close to the "master_repl_offset:" value on the line below. The "slave1:" line has an offset value that's far from the values in both the "master_repl_offset:" line and the "slave0:" line. I.e., slave1 was the one that was not keeping up with replication.

Your description of the problem was that replication stops rather than pauses and catches up again. Those are the reasons I said the INFO output suggests slave1's replication had stopped.

-Greg

--

Salvatore Sanfilippo

unread,

Aug 20, 2013, 11:02:03 AM8/20/13

to Redis DB

Hello, please as a first step try upgrading to 2.8 RC3 to check if
this still happens.

Cheers,
Salvatore

> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to redis-db+u...@googlegroups.com.
> To post to this group, send email to redi...@googlegroups.com.
> Visit this group at http://groups.google.com/group/redis-db.
> For more options, visit https://groups.google.com/groups/opt_out.

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.
— David Gelernter

Антон Рубцов

unread,

Aug 21, 2013, 5:16:09 AM8/21/13

to redi...@googlegroups.com

Hello,

we upgraded all nodes, but got the same problem.

вторник, 20 августа 2013 г., 19:02:03 UTC+4 пользователь Salvatore Sanfilippo написал:

Salvatore Sanfilippo

unread,

Aug 21, 2013, 5:49:43 AM8/21/13

to Redis DB

Hello again,

how long are your HMSET commands?
Also what is the INFO output of both master and slaves when the
desynchronization happened (when slave is not returning the right
data?)

Do you have time to chat in real-time in order to investigate the
issue? Maybe IRC? Send me a private email at antirez - gmail when you
are able to chat please.

Cheers,
Salvatore

>> -- David Gelernter

>
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to redis-db+u...@googlegroups.com.
> To post to this group, send email to redi...@googlegroups.com.
> Visit this group at http://groups.google.com/group/redis-db.
> For more options, visit https://groups.google.com/groups/opt_out.

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.

-- David Gelernter

Salvatore Sanfilippo

unread,

Aug 21, 2013, 6:21:37 AM8/21/13

to Redis DB

Update:

I just tried to flood a Redis 2.8 RC3 server with:

require 'rubygems'
require 'redis'

r = Redis.new
while true
key = rand(1000000)
data = (1...(rand(10000)|1)).to_a
data.shuffle!
r.hmset(key,*data)
end

At the same time I had two slaves connected, and at the end of the
flooding I asked for DEBUG DIGEST obtaining exactly the same hash.
So no easy way to reproduce this apparently.

Cheers,
Salvatore

Антон Рубцов

unread,

Aug 21, 2013, 7:02:29 AM8/21/13

to redi...@googlegroups.com

I've got same hash on slaves but master is different

Keyspaces from servers.

master

db0:keys=120270,expires=0,avg_ttl=0

slaves

db0:keys=120082,expires=0,avg_ttl=0

after restart hash of slave equals master's hash.

slaves are going synchronously and keyspaces change. So I think that replication is working. But with a lag.

среда, 21 августа 2013 г., 14:21:37 UTC+4 пользователь Salvatore Sanfilippo написал:

Salvatore Sanfilippo

unread,

Aug 21, 2013, 7:08:05 AM8/21/13

to Redis DB

The lag is perfectly fine, that is, Redis replication is asynchronous,
so there is a delay between the time you write data to a master, and
the time you can read the data from the slave. This delay usually is
in the sub millisecond figure. The "lag" that you read in the INFO
output is the amount of seconds elapsed since we received the offset,
so unless you read 15 seconds there, this is not the real lag. I mean,
a lag=1 does not mean that data has really a delay of one second, just
that we received the latest offset one second ago.

Here the problem is to understand if there is a real desync. That is,
if you stop the clients writing data to your master, is the key count
and content exactly the same?

Also, what are the commands are you using? You can easily obtain a
list wiht INFO commandstats.

Cheers,
Salvatore

Антон Рубцов

unread,

Aug 21, 2013, 7:48:30 AM8/21/13

to redi...@googlegroups.com

Thank you. I'll try to stop writes at night for 5 minutes and show info.

среда, 21 августа 2013 г., 15:08:05 UTC+4 пользователь Salvatore Sanfilippo написал:

Reply all

Reply to author

Forward