New issue 425 by barthazi.andras: 100% CPU usage, very slow response/timeout
http://code.google.com/p/redis/issues/detail?id=425
I'm not sure I can give you a proper bug report, but let me create this
issue, and if you have no idea at all, just close it.
What steps will reproduce the problem?
1. we have 2 SET-based queues and 7 SET-based caches (a rough sketch of the access pattern is below)
2. the queues are quite small, typically <10 items (each item <10 bytes)
3. the caches total about 800 MB, with values typically between 100 bytes and 8 KB
4. we do fewer than about 1000 queries per second according to the MONITOR command
5. sometimes the CPU usage of Redis goes up to 100% and stays there for 30-60s
6. while the CPU usage is around 100%, response times are very slow, sometimes over 30s, sometimes timing out; the CLI benchmark tool reports about 50 GET queries per second, whereas it typically reports 45,000
7. I thought the issue was related to background saving, but it does not seem to be
8. we're using the default factory settings, with no virtual memory or anything like that
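A hypothetical reconstruction of that access pattern (the key names and exact
commands below are made up; the report does not give them) would be something
like:

$ redis-cli SADD queue:jobs "job-1"       # SET-based queue, producer side
$ redis-cli SPOP queue:jobs               # SET-based queue, consumer side
$ redis-cli SADD cache:fragment:42 "<value of 100 bytes to 8 KB>"   # cache write
$ redis-cli SMEMBERS cache:fragment:42    # cache read

The INFO output from the affected instance: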
redis_version:2.0.2
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:29610
uptime_in_seconds:6006
uptime_in_days:0
connected_clients:20
connected_slaves:0
blocked_clients:0
used_memory:904343768
used_memory_human:862.45M
changes_since_last_save:715
bgsave_in_progress:0
last_save_time:1294150490
bgrewriteaof_in_progress:0
total_connections_received:18036
total_commands_processed:3992865
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
db0:keys=15,expires=0
It's an Ubuntu Linux machine with an 8-core processor. Redis' memory usage is
about 10% of system memory.
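A cheap way (not part of the original report) to correlate the CPU spikes with
saving activity or memory growth is to watch a few of the INFO fields above
while a spike is happening:

$ watch -n 1 'redis-cli info | egrep "used_memory_human|changes_since_last_save|bgsave_in_progress|connected_clients"'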
Hello, are you sure you don't run commands like KEYS from time to time? Or
other similarly slow commands like big set intersections or the like?
Btw, the simplest way to figure out what is happening is the following. When
it blocks, use the script you'll find at:
And post the output here. Run it a few times while Redis is stuck. This
script is able to tell us where Redis is spending so much time.
Thank you!
Salvatore
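One quick way (not mentioned in the thread) to verify whether KEYS or other
similarly heavy commands ever reach the server is to watch MONITOR output for
a few seconds; the pattern below is only a rough filter and will also match
key names containing those words:

$ redis-cli monitor | grep -iE 'keys|sinter|sort'

Press Ctrl-C after a short while; any match is a candidate O(N) command.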
I confirm the issue.
Redis 2.0.4
Linux RHEL5 2.6.18 SMP, 16 cores
CPU usage ~ 100%
==> redis-benchmark:
====== PING ======
10002 requests completed in 30.17 seconds
50 parallel clients
3 bytes payload
keep alive: 1
......
331.57 requests per second (60343.38 requests per second - when we disable
new services that use Redis intensively)
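For reference, the run above looks like a stock benchmark invocation along
these lines (the exact flags are an assumption, inferred from the "50 parallel
clients / 3 bytes payload" header):

$ redis-benchmark -n 10000 -c 50 -d 3 -q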
==> poormansprofiler output:
3 options:
1) 1 at,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
2) 1 "results:all:qZv9GAOK"),keysCommand,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
3) 1 dictNext,keysCommand,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
also it produces the following warnings:
warning: .dynamic section for "/lib64/libm.so.6" is not at the expected
address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib64/libpthread.so.0" is not at the
expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib64/libc.so.6" is not at the expected
address (wrong library or version mismatch?)
warning: .dynamic section for "/lib64/ld-linux-x86-64.so.2" is not at the
expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: no loadable sections found in added symbol-file system-supplied
DSO at 0x7fff177fc000
When Redis works fine, the profiler output is as follows:
1 __epoll_wait_nocancel,aeApiPoll,flags=3),aeMain,main
with the only warning
warning: no loadable sections found in added symbol-file system-supplied
DSO at 0x7fffa4ded000
As expected, your poor man's profiler output shows that you are using the
KEYS command, which should never be used in production code by clients, but
only for debugging.
As you can see in the documentation, KEYS is O(N) in both space and time,
and will trash Redis performance.
You can also check our documentation here:
You can read there: "Warning: consider KEYS as a command that should only
be used in production environments with extreme care. It may ruin
performance when it is executed against large databases. This command is
intended for debugging and special operations, such as changing your
keyspace layout. Don't use KEYS in your regular application code. If you're
looking for a way to find keys in a subset of your keyspace, consider using
sets.".
Therefore the issue is not valid.
P.S. Thank you for reporting anyway. While in this case the problem is
clear, in other cases the same macroscopic behavior may be due to something
actually wrong in the server.
Cheers,
Salvatore
Hi Salvatore,
Thanks for the prompt reply. We'll get rid of the KEYS command and report
back here with the results.
Cheers
Eugene
I was the original reporter. We had not used KEYS or commands like that,
but to be sure, I checked for it when I was trying to debug the problem.
Anyway, we had only about 10 keys, so I guess KEYS would be quite fast in
this situation. :)
Unfortunately we have replaced the code that was based on Redis, so I can't
profile it. I'll try to find some time to reproduce it, but I can't say
that it will be soon.
I feel bad speaking contrary to the OP's statements, but it really smells
like a BGSAVE operation is overloading the (otherwise already fairly
heavily loaded) box and causing other commands to run slowly.
How many processors in the machine? How much memory does it have? What
operations are being performed? Anything else running on that box?
Barthazi, @Josiah: BGSAVE can't overload the server, as it runs in a
different process. The only concern is that you run out of memory because of
BGSAVE and the instance starts swapping; this of course will lead to a much
slower server, but it's a different problem.
Anyway, the only reliable way to debug these problems is to use a profiler to
check where the time is spent, together with the other system tools to
check whether the system is out of memory, whether there is OS buffer/page
I/O, and so forth. So far, *every single time* it was possible to understand
what was really going on, it was either an out-of-memory condition or the
user doing something strange like KEYS, big set operations like
intersections, huge MULTI/EXEC calls with hundreds of commands inside, and
so forth.
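For the memory side of that check, a couple of standard Linux commands (not
Redis-specific, and not quoted from the thread) are usually enough:

$ free -m       # is the box out of RAM or dipping into swap?
$ vmstat 1 5    # the si/so columns show swap-in/swap-out activity during the stall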
I added the poor man's profiler script to the utils directory in the unstable
branch, file: utils/whatisdoing.sh
This is so far the best we can do to debug this kind of issue.
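For reference, a minimal sketch of that kind of gdb-based sampler (not
necessarily the exact contents of utils/whatisdoing.sh) looks like this:

# Grab a few stack traces from the running redis-server while it is stuck at 100% CPU.
pid=$(pidof redis-server)
for i in 1 2 3 4 5; do
    echo "=== sample $i ==="
    gdb -batch -n -ex 'thread apply all bt' -p "$pid" 2>/dev/null
    sleep 1
done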
Cheers,
Salvatore
I didn't see that it was an 8 core server, or that Redis was using 10% of
the system memory.
If it's using EBS at Amazon (or some other network storage wherever it is),
operations may be slow during a BGSAVE due to network contention...
Works fine without the KEYS command.
I would also like to ask for help in finding out why our Redis is using
100% CPU. Apologies if I'm resurrecting an old thread.
The output for the poor man's profiler with 100 samples is
warning: Selected architecture i386:x86-64 is not compatible with reported
target architecture i386
warning: Architecture rejected target-supplied description
(the two warnings above were repeated six times, once per sample)
101
 22 __accept_nocancel,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
 20 open64,_IO_file_fopen,??,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
 16 __write_nocancel,sendReplyToClient,aeProcessEvents,aeMain,main
 12 epoll_wait,aeApiPoll,flags=<value,aeMain,main
  8 ??,??
  7 gettimeofday,gettimeofday,aeGetTime,aeProcessEvents,aeMain,main
  3 __read_nocancel,read,optimized,aeProcessEvents,aeMain,main
  3 gettimeofday,gettimeofday,aeGetTime,processTimeEvents,flags=<value,aeMain,main
  2 vfprintf,__vsnprintf_chk,vsnprintf,"",,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
  1 __vsnprintf_chk,vsnprintf,"",,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
  1 __vsnprintf_chk,vsnprintf,"\001\200\255\373\nsg20\ae",,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
  1 time,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
  1 ??,??,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
  1 epoll_ctl,aeApiAddEvent,fd=750,,_installWriteEvent,addReply,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
  1 epoll_ctl,aeApiAddEvent,fd=117,,_installWriteEvent,addReply,lookupKeyReadOrReply,getGenericCommand,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
  1 ??,??,??
We are using redis 2.2.11.
I'm seeing similar behavior on redis 2.4.7. Everything's going fine, and
then suddenly redis is using 95% (jumps up from 10%) of the CPU and
performance slows to a crawl. The only commands running against the DB are
GET, SET, EXPIRE, and DELETE:
https://github.com/martinrusev/django-redis-sessions/blob/master/redis_sessions/session.py
At the point of failure there are lots of open connections:
$ netstat -tn|grep 6379|wc -l
1184
even though I stopped sending traffic to the redis server. In addition,
the following command just hangs and doesn't return:
$ redis-cli info
Is there a way to log all commands to see when it dies?
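For what it's worth, two built-in options (sketched here, not quoted from the
thread): MONITOR streams every command the server executes, and SLOWLOG,
available since 2.2.12 and therefore present on 2.4.7, records the slowest
recent commands with much less overhead:

$ redis-cli monitor > /tmp/redis-commands.log           # Ctrl-C once the stall has been captured
$ redis-cli config set slowlog-log-slower-than 10000    # log commands taking longer than 10 ms (value is in microseconds)
$ redis-cli slowlog get 10                              # show the ten slowest recorded commands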
This appears to be a duplicate of #648:
http://code.google.com/p/redis/issues/detail?id=648