Thank you,
- Josiah
Here's the output of INFO from today. I've also created a page with
graphs covering the 20-minute BGSAVE outage. You'll find system metrics
(CPU, memory, processes) as well as Redis counters:
Graphs from 2011-Dec-31:
https://s3.amazonaws.com/redis-bgsave-crash-graphs/index.html
redis INFO from today
--------------------
redis_version:2.4.1
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:31239
uptime_in_seconds:1143630
uptime_in_days:13
lru_clock:442739
used_cpu_sys:232277.86
used_cpu_user:98715.13
used_cpu_sys_children:16887.25
used_cpu_user_children:141665.44
connected_clients:186
connected_slaves:3
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:2625362720
used_memory_human:2.45G
used_memory_rss:2670764032
used_memory_peak:5861321848
used_memory_peak_human:5.46G
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-2.2.1
loading:0
aof_enabled:0
changes_since_last_save:17834
bgsave_in_progress:0
last_save_time:1325633123
bgrewriteaof_in_progress:0
total_connections_received:24560750
total_commands_processed:8111961900
expired_keys:15069
evicted_keys:98253446
keyspace_hits:3795854023
keyspace_misses:3684036115
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1125426
vm_enabled:0
role:master
db0:keys=12593757,expires=12593757
Dane
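For anyone following along, a few ratios can be read straight off the INFO dump above. What follows is only a minimal Python sketch, not something from the original report: parse_info is a hypothetical helper name, and the text it parses is a hand-copied subset of the fields pasted above.

```python
# Subset of the INFO output pasted above (values copied verbatim).
INFO_TEXT = """\
used_memory:2625362720
used_memory_rss:2670764032
used_memory_peak:5861321848
expired_keys:15069
evicted_keys:98253446
keyspace_hits:3795854023
keyspace_misses:3684036115
"""


def parse_info(text):
    """Parse 'key:value' lines into a dict, converting integers where possible."""
    info = {}
    for line in text.splitlines():
        if ":" not in line or line.startswith("#"):
            continue  # skip blanks, section headers, anything malformed
        key, _, value = line.partition(":")
        try:
            info[key] = int(value)
        except ValueError:
            info[key] = value  # leave non-integer fields as strings
    return info


info = parse_info(INFO_TEXT)

# Fraction of lookups that found a key.
hit_ratio = info["keyspace_hits"] / (
    info["keyspace_hits"] + info["keyspace_misses"]
)

# RSS relative to what Redis thinks it allocated (cf. mem_fragmentation_ratio).
fragmentation = info["used_memory_rss"] / info["used_memory"]

# How many keys left via maxmemory eviction for each key that expired by TTL.
evicted_per_expired = info["evicted_keys"] / info["expired_keys"]

print(f"hit ratio:           {hit_ratio:.3f}")
print(f"fragmentation:       {fragmentation:.2f}")
print(f"evicted per expired: {evicted_per_expired:.0f}")
```

On these numbers, evictions outnumber TTL expirations by several thousand to one, which at least suggests that maxmemory eviction, not expiry, is how most keys normally leave this instance.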
Oops, my mistake. We're using Ubuntu 11.04, not 10.04 as I indicated.
What kernel are you using with 11.04?
Dane
In our redis.conf, maxmemory is set to 4000mb on an EC2 m1.large
instance with 7.5GB memory. Given this maxmemory setting, why would
peak memory usage, as reported by redis INFO, show 5.4G?
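To put a number on that gap, here is a small arithmetic sketch in Python. It uses only the figures quoted in this thread and assumes redis.conf's unit convention, in which "mb" means 1024*1024 bytes; the variable names are made up for illustration.

```python
# maxmemory as configured: "4000mb" (redis.conf treats 1mb as 1024*1024 bytes).
MAXMEMORY_BYTES = 4000 * 1024 * 1024

# Values copied from the INFO output earlier in the thread.
USED_MEMORY_PEAK = 5861321848
USED_MEMORY = 2625362720

GIB = 1024 ** 3

peak_gib = USED_MEMORY_PEAK / GIB
limit_gib = MAXMEMORY_BYTES / GIB
overshoot_bytes = USED_MEMORY_PEAK - MAXMEMORY_BYTES

print(f"maxmemory:        {limit_gib:.2f} GiB")
print(f"used_memory_peak: {peak_gib:.2f} GiB")
print(f"peak over limit:  {overshoot_bytes / GIB:.2f} GiB")
print(f"current usage:    {USED_MEMORY / GIB:.2f} GiB")
```

So the reported peak sits roughly 1.5 GiB above the configured limit. The arithmetic only quantifies the discrepancy; it does not explain it.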
More importantly, I'm still somewhat mystified that something caused
redis to lose 22M keys when the BGSAVE stalled.
Dane
I have some ideas, but nothing that I've checked the code to verify.
Someone else may have an idea.
> More importantly, I'm still somewhat mystified that something caused
> redis to lose 22M keys when the BGSAVE stalled.
What are the TTLs on your keys, typically? If they are relatively
short, the following scenario is possible:
1. Redis forked for the BGSAVE
2. the child process started scanning through items, some of them had
expired, so the child Redis evicted them on its way through
3. as the child process was churning through memory, expiring keys,
etc., the shared memory between the two processes dropped to 0
4. the parent process is now stalled while the child process fills the
remaining memory, inducing swap, etc.
5. at some point the parent process is able to continue, performs a
sampling of keys to evict, notices that more than a few should be
evicted, then scans the entire keyspace for keys to expire, expiring
the majority of the 22M keys.
Regards,
- Josiah
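The "sample some keys, and keep going if many of them were expired" behaviour in step 5 can be illustrated with a toy model. This is a rough Python sketch of the general idea only, not the Redis 2.4.1 code: the function name, the sample size of 20, and the 25% repeat threshold are all assumptions chosen for illustration.

```python
import random


def active_expire_cycle(expiry, now, sample_size=20, repeat_threshold=0.25,
                        rng=random):
    """Toy model of sampling-based expiry.

    expiry maps key -> absolute expire time. Each round samples up to
    sample_size keys, deletes the ones already expired, and repeats only
    while more than repeat_threshold of the sample was expired.
    Returns the number of keys removed.
    """
    removed = 0
    while expiry:
        sample = rng.sample(list(expiry), min(sample_size, len(expiry)))
        expired = [k for k in sample if expiry[k] <= now]
        for k in expired:
            del expiry[k]
        removed += len(expired)
        if len(expired) <= repeat_threshold * len(sample):
            break  # few expired keys in the sample: stop for this cycle
    return removed


# If (nearly) every key is past its TTL when the server resumes,
# a single cycle keeps looping until the keyspace is drained.
stale = {f"key:{i}": 100 for i in range(10_000)}
print(active_expire_cycle(stale, now=200), "removed,", len(stale), "left")

# With long-lived keys, the first sample finds nothing and the cycle stops.
fresh = {f"key:{i}": 10**9 for i in range(10_000)}
print(active_expire_cycle(fresh, now=200), "removed,", len(fresh), "left")
```

In this model, a mass deletion only happens when most sampled keys are already past their expire time, which is why the typical TTL matters for this theory.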
Dane
I see 4 temp rdb files in the redis dir -- all dating from about 1
week prior to the outage. I'm guessing this indicates that BGSAVE
failed several times in the past?
But in response to your question about whether BGSAVE has been
succeeding: is redis.log a reliable source for this info?
In the 15 minutes leading up to the outage, redis.log shows successful
BGSAVE events...
[31239] 31 Dec 20:46:33 * 10 changes in 300 seconds. Saving...
[31239] 31 Dec 20:46:34 * Background saving started by pid 29126
[29126] 31 Dec 20:47:49 * DB saved on disk
[31239] 31 Dec 20:47:49 * Background saving terminated with success
[31239] 31 Dec 20:52:50 * 10 changes in 300 seconds. Saving...
[31239] 31 Dec 20:52:51 * Background saving started by pid 29180
[29180] 31 Dec 20:54:08 * DB saved on disk
[31239] 31 Dec 20:54:09 * Background saving terminated with success
[31239] 31 Dec 20:59:10 * 10 changes in 300 seconds. Saving...
[31239] 31 Dec 20:59:11 * Background saving started by pid 29229
[29229] 31 Dec 21:00:27 * DB saved on disk
... 20 min outage + data loss here ...
[31239] 31 Dec 21:19:30 * Background saving terminated with success
[31239] 31 Dec 21:19:47 * Slave ask for synchronization
Does this offer any clues why the BGSAVE outage would have lost data?
Dane
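One way to make the stall stand out is to pair each "Background saving started" line with the next "terminated" line and compute the elapsed time. Below is a minimal Python sketch over the parent-process lines quoted above; bgsave_durations is a hypothetical helper, the year 2011 is taken from the graph date earlier in the thread, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Parent-process (pid 31239) lines copied from the redis.log excerpt above.
LOG = """\
[31239] 31 Dec 20:46:34 * Background saving started by pid 29126
[31239] 31 Dec 20:47:49 * Background saving terminated with success
[31239] 31 Dec 20:52:51 * Background saving started by pid 29180
[31239] 31 Dec 20:54:09 * Background saving terminated with success
[31239] 31 Dec 20:59:11 * Background saving started by pid 29229
[31239] 31 Dec 21:19:30 * Background saving terminated with success
"""

# "[pid] DD Mon HH:MM:SS * message"
LINE_RE = re.compile(r"^\[(\d+)\] (\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) \* (.*)$")


def bgsave_durations(log_text, year=2011):
    """Return (start_time, seconds) for each started/terminated pair."""
    durations = []
    started = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        # The log has no year, so one is supplied by the caller.
        ts = datetime.strptime(f"{year} {m.group(2)}", "%Y %d %b %H:%M:%S")
        message = m.group(3)
        if message.startswith("Background saving started"):
            started = ts
        elif message.startswith("Background saving terminated") and started:
            durations.append((started, int((ts - started).total_seconds())))
            started = None
    return durations


for start, seconds in bgsave_durations(LOG):
    print(f"{start:%H:%M:%S}  {seconds:5d} s")
```

The first two saves take a little over a minute each. The third runs for more than 20 minutes, even though the child logged "DB saved on disk" at 21:00:27, so most of that time passed after the child had finished writing.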
Yes.
> But in response to your question about whether BGSAVE has been
> succeeding: is redis.log a reliable source for this info?
Yes.
> In the 15 minutes leading up to the outage, redis.log shows successful
> BGSAVE events...
> [31239] 31 Dec 20:46:33 * 10 changes in 300 seconds. Saving...
> [31239] 31 Dec 20:46:34 * Background saving started by pid 29126
> [29126] 31 Dec 20:47:49 * DB saved on disk
> [31239] 31 Dec 20:47:49 * Background saving terminated with success
> [31239] 31 Dec 20:52:50 * 10 changes in 300 seconds. Saving...
> [31239] 31 Dec 20:52:51 * Background saving started by pid 29180
> [29180] 31 Dec 20:54:08 * DB saved on disk
> [31239] 31 Dec 20:54:09 * Background saving terminated with success
> [31239] 31 Dec 20:59:10 * 10 changes in 300 seconds. Saving...
> [31239] 31 Dec 20:59:11 * Background saving started by pid 29229
> [29229] 31 Dec 21:00:27 * DB saved on disk
> ... 20 min outage + data loss here ...
> [31239] 31 Dec 21:19:30 * Background saving terminated with success
> [31239] 31 Dec 21:19:47 * Slave ask for synchronization
>
>
> Does this offer any clues why the BGSAVE outage would have lost data?
It looks like there was a problem removing the old file and replacing
it with the new one. If you're in EC2, and you are storing data
locally, I bet you are using EBS-backed storage, right? Sometimes EBS
can hang. Sometimes EBS can disappear. Sometimes EBS is... well...
crap. Most of the time it's not bad, sometimes it's even quite good.
But sometimes, it's horrible. You should check some of your other
system logs for that time period.
It seems possible to me that during the replacement of the old file,
your EBS hung. When Redis was finally able to continue, it started
sampling your volatile keys, and evicting the hell out of them. Hence
22M -> 3k. You never answered the question about your TTLs. What do
your TTLs look like?
Regards,
- Josiah
We're using local ephemeral storage, not EBS devices. There was
definitely a burst of disk+swap IO at that time. Unfortunately the
system logs don't show anything unusual.
> It seems possible to me that during the replacement of the old file,
> your EBS hung. When Redis was finally able to continue, it started
> sampling your volatile keys, and evicting the hell out of them. Hence
> 22M -> 3k. You never answered the question about your TTLs. What do
> your TTLs look like?
Our keys are long-lived -- TTLs are 32 days or greater. So I don't
think normal key expiration is to blame. But perhaps
maxmemory-policy=volatile-ttl kicked in and went crazy evicting keys
because it thought it was out of memory due to a slow BGSAVE child
process. A bug?
More importantly, how should we configure redis to survive slow AWS
disk IO? Set maxmemory-policy=noeviction?
Is there a recommended way to run redis on EC2?
Dane