In this *exact* context, maybe. ;)
I see that you aren't using pipelining because it's not working right
for you. There are a few different libraries that wrap hiredis that
include pipelining, you may want to check how they do things.
> libev too many sockets to look at in the loop. So, I made a multi-
> threaded implementation, each thread having its own libev loop, and a
> number of hiredis clients unique to that thread/loop, still sharing
> (mutex-style) the queue of requests to process...
>
> But I don't see very much performance gain from multi-threading it. So
> my questions are:
> - What way is the best way to implement a high performance app using
> Redis as "back-end" (single-threaded, multi-threaded, event-based,
> libev or alternative, synchronous/async)?
It depends on the context. Remember, Redis is a data structure server.
If your runtime has horrible data structures, horrible mechanisms to
interface with data structures, or if you need *shared* data
structures, Redis is not a bad answer. But if your runtime has access
to rich data structures, or doesn't need to be able to share the data
at all times (like during computation), using Redis may be a very bad
answer.
Every operation that you are performing is fundamentally limited by
the ability of your processor to construct buffers and shuffle those
buffers from process A, via messages, to process B, where they are
then parsed and executed upon. Process B then responds (in the
reverse direction), at which point process A *may* perform some other
operations and start the interaction all over again.
The construction of the buffers, shuffling data between processes,
waiting for a response, and doing all of this via an event-based
socket library on top of threads is all ... slow. Redis has been
benchmarked to do some 100-200k operations/second on high end hardware
and good clients. Because you need 2-3 round trips per "operation",
you are going to be limited by latency, and even at peak performance,
probably limited at the upper end to roughly 50k queue items processed
per second.
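Back-of-the-envelope, using only the figures above (nothing newly
measured):

ops_per_sec = 150000        # midpoint of the 100-200k ops/second figure
round_trips_per_item = 3    # the 2-3 round trips per queue item
print ops_per_sec // round_trips_per_item   # -> 50000 queue items/second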
In 10 minutes I hacked a Python script whose only purpose was to churn
through a random set of queue items similar in structure to what you
described in your other post. It processes those items in a
single-threaded fashion, incrementing/decrementing as necessary:
import random
import time

def churn(queue, X_map, Y_current):
    # For each queue item, tentatively decrement every Y value the item
    # maps to; commit the batch only if none would go negative.
    for item, count in queue:
        update = {}
        for y in X_map[item]:
            result = Y_current[y] - count
            if result < 0:
                break
            update[y] = result
        else:
            Y_current.update(update)

def test():
    # 100k random queue items, a 64-entry item -> Ys mapping, and 100
    # separate "current value" tables of 1000 Y counters each.
    queue = [('X%i'%(random.randrange(64),), random.randrange(1, 10))
             for i in xrange(100000)]
    X_map = dict(('X%i'%(i,), list(set('Y%i'%(random.randrange(1000),)
                 for j in xrange(random.randrange(3, 11))))) for i in xrange(64))
    currents = [dict(('Y%i'%(i,), random.randrange(5000))
                for i in xrange(1000)) for i in xrange(100)]
    t = time.time()
    for y_current in currents:
        churn(queue, X_map, y_current)
    return time.time() - t
Running test() in Python 2.6 on an Intel Core2 Duo at 3GHz, each of
10 trials finished in 9.85-10.69 seconds. That's about 1M queue items
processed every second, involving roughly 6M-7M decrement operations,
or roughly 30-70x the performance that can be attained using current
Redis 2.2, all using a language generally considered to be slow. Does
it have limitations? Yes. Primarily that you need to feed queue items
and system state in, and decide what to do with the data after the
fact. But since you didn't say how the data was going to be accessed
generally, it's hard for me to suggest a solution to give you really
what you want.
TL;DR If your goal really is throughput, and you are going to have
even a moderately large amount of queue items, use Redis as a cache
where you store the source and final result data, but not where you
actually process your data.
> - Is there a better client than hiredis (usable in ANSI C) I should
> use (even though I like the simplicity in hiredis) for this purpose,
> when performance is key?
At present, and as far as I know, hiredis is about the best available.
> - Is it possible to calculate (how?) the ultimate number of hiredis
> clients to use with a Redis server instance?
Keep adding clients until it stops getting faster. Generally though,
the more threads, the more thread context swaps, and the faster you
will hit the upper limit. You may gain some throughput by running
multiple processes (instead of threads) and assigning them to specific
processors, but then you have a distributed queue problem.
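A rough sketch of the multi-process variant, assuming Linux and
Python 3.4+ for os.sched_setaffinity/os.cpu_count, with the actual
queue consumption left as a stub:

import multiprocessing
import os

def worker(cpu):
    # Pin this worker to a single core, then connect to Redis and
    # consume this worker's share of the (now distributed) queue.
    os.sched_setaffinity(0, {cpu})
    ...  # per-process queue consumption goes here

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=worker, args=(cpu,))
             for cpu in range(os.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()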
> - Is it possible to calculate (how?) the client/thread ratio to gain
> the best performance?
Only through experimentation and looking at the curve of performance
vs. threads/clients. Every set of operations will have a different
ratio, because each will have a different (average) number of
request/response cycles per queue item, and depending on the kinds of
requests, potentially different latencies per request (this is usually
only applicable in the case of set/zset intersections/unions, sorts,
long list manipulation, etc., as most simple operation costs in Redis
are hidden in network latency).
> - If I max out Redis-server (100% CPU utilization), is it pointless
> for me to chase performance, other than optimizing data structure/
> redis commands invoked?
It depends on what Redis is doing. Unless you are doing set/zset
operations on large sets/zsets, it is unlikely that you are actually
taxing Redis itself; more likely you are taxing the socket calls
to/from Redis. Check the kernel (system) times to find out whether
it's Redis or the sockets.
> - Does anyone know if "pipelining", let's say 5 HINCRBY, is possible
> using hiredis async context? Can't get pipelining to work with async
> hiredis at all...
I don't know, someone else will have to answer this part.
> Or, as I fear, can it only be bench'ed from server to server,
> measuring throughput and avg. response time in different setups??
> I've tried benching, but I get some strange results... Running single-
> threaded, testing with 10 to 60 hiredis clients in a libev loop, 27,
> 38 and 49 (e.g.) clients in a libev loop can give the best results,
> while anything above/below/between those numbers give lower
> throughput. And that is on a dedicated server only running redis-
> server and my app. Having a range (e.g. 26-28) giving the top results
> would make me more happy, having all numbers above/below giving lower
> throughput. At least then I could choose 27 as my client count...
If it is reproducible, then you've probably hit some sweet spot in
terms of cache sizes, size of messages transferred between processes,
etc.
> As a last note, putting Redis on a separate server, or having a Redis
> client communicating using something slower than a UNIX socket (e.g.
> TCP) is not an option. Network latency will degrade response times too
> much... I've tried it, and that's why I use UNIX socket on the same
> server...
As an alternative to using your chatty version as it stands, or the
Python solution I give above, you could consider checking out the
Redis scripting branch. Everything you do could likely be accomplished
with a relatively short blurb of Lua passed to Redis for each
operation provided from your queue. Even better, because it is
running in Redis itself, each operation would be atomic (at present
there are some non-destructive race conditions in your code that may
make some items return False that could have returned True had they
been processed in a different order).
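To make that concrete, here is a rough sketch, not a drop-in
implementation: it assumes the scripting branch's EVAL command, a
client that exposes it (shown with redis-py), that every field already
exists in the hash, and made-up key/field names. The logic mirrors the
churn() function above:

import redis

# Hypothetical script: check-then-decrement for one queue item, run
# inside Redis, so the whole operation is atomic and one round trip.
# KEYS[1] is the hash of current Y values; ARGV[1] is the count,
# ARGV[2..] are the hash fields this queue item maps to.
DECR_IF_ALL_NONNEGATIVE = """
local count = tonumber(ARGV[1])
for i = 2, #ARGV do
    if tonumber(redis.call('HGET', KEYS[1], ARGV[i])) - count < 0 then
        return 0
    end
end
for i = 2, #ARGV do
    redis.call('HINCRBY', KEYS[1], ARGV[i], -count)
end
return 1
"""

r = redis.Redis(unix_socket_path='/tmp/redis.sock')
ok = r.eval(DECR_IF_ALL_NONNEGATIVE, 1, 'Y_current', 5, 'Y1', 'Y7', 'Y42')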
I suspect that using scripting alone could get you a nice 10-100x
performance improvement, and you could replace your fancy threaded C
client with a simpler controller that is written in whatever the rest
of your system is written in.
Regards,
- Josiah
You are quite welcome; problems are fun to solve, or at least fun as a
way to help yourself and others understand things more deeply.
> My fancy threaded C client *is* the simple controller, it won't be
> replaced ;)
Whatever you are prepared to write and maintain :)
> It's a matter of many things, competence being one. I will look at the
> scripting branch, see if it might help. I'll be doing both UDP and TCP
I'll answer your questions below assuming you didn't use it. That said, the
scripting branch will more than likely max out Redis with regards to
throughput given your set of operations. Basically, you should be able
to get 100-200k queue items processed every second from a couple of
queue processing threads, and those threads won't need to be async
with hiredis, etc. to get that performance.
> communication, packet parsing, some logging and statistics, etc,
> etc... I've thought about using memcached, or just an in-process-
> memory structure, but I do need the possibility to persist (dump to
> disk), and I need the Master->Slave replication (I will also have fail-
> over, but that's another story). As you know what my data structure
> looks like, what data I need to store, and it being up to 10M
> "keys" (from my other post), would you say Redis is bad choice?
It's okay. But like I offered before, you could process some 1M queue
items/second if you cached whatever you have in Redis inside a Python
runtime (you could probably get 10-20x that with a custom C version).
With a small amount of work, you could even sync back out to Redis
every once in a while in bulk to update state that the rest of your
application wants to see with one round trip.
> You mentioned there might be libraries wrapping hiredis, providing
> pipelining, do you have any examples (usable in C)??
Not really usable in C, but both the standard Ruby and Python clients
support pipelining, and use hiredis.
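For example, a minimal sketch with redis-py (the socket path, hash
name, and fields here are made up):

import redis

r = redis.Redis(unix_socket_path='/tmp/redis.sock')

# Queue several HINCRBY calls client-side, then flush them to Redis in
# a single round trip; execute() returns the replies in order.
pipe = r.pipeline(transaction=False)
for field in ('Y1', 'Y7', 'Y42'):
    pipe.hincrby('Y_current', field, -5)
replies = pipe.execute()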
> Your estimate of 50k operations is spot-on, at best I almost measured
> 60k "requests", with the structure/commands described in my other
> post. Intel i5 quad at 3GHz (desktop CPU), running 3 separate single-
> threaded processes (with libev/async hiredis).
Good to hear that my back-of-the-envelope calculator is as precise as ever ;)
> I'm not really sure I know what you mean with the following sentence:
> "If your goal really is throughput, and you are going to have even a
> moderately large amount of queue items, use Redis as a cache where you
> store the source and final result data, but not where you actually
> process your data."
> Well, I get info from Redis, thus "source", I compute in my app, and
> store changes in Redis, thus "final result"... Actually process? In
> Redis? Or am I missing something? Processing is done within my own
> app...
With your current setup, processing is done in Redis. You pass
commands to Redis to manipulate its state, and depending on how it
replies, you may manipulate it further. Basically what I'm suggesting
is:
1. your application starts up
2. your application copies everything from Redis to local memory
3. your application processes items as fast as it can
4. your application optionally syncs 'current' or 'max' values if they
are updated by queue items every once in a while
5. your application syncs any final changes back to Redis and exits
Assuming that your application will be constantly running (it will
always have queue items to process, or will have some shortly, so it
can sleep for a short period whenever the queue is empty), it would
fetch some work from a shared queue, process it, then sync the
changes back to Redis for replication, etc.
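A minimal sketch of steps 2-5, assuming redis-py and reusing churn(),
queue, and X_map from the script above (the key name 'Y_current' is
made up):

import redis

r = redis.Redis(unix_socket_path='/tmp/redis.sock')

# 2. copy the shared state out of Redis into a plain local dict
Y_current = dict((k, int(v)) for k, v in r.hgetall('Y_current').items())

# 3. process queue items at memory speed, with zero round trips
churn(queue, X_map, Y_current)

# 4./5. push the updated values back in one pipelined round trip
pipe = r.pipeline(transaction=False)
for field, value in Y_current.items():
    pipe.hset('Y_current', field, value)
pipe.execute()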
> Regarding the threading, well of course I don't want to cause too much
> context switching, but leaving cores doing nothing just seems like a
> waste. I've seen having 2 threads/core giving very good performance in
Unless you are doing expensive operations involving sorting, set/zset
intersection/union, etc., Redis is not going to be truly CPU bound;
it's going to be network bound. Look at all of the operations you are
doing and what they involve inside of Redis. Each time you increment a
number in a hash, you are doing 2 hash lookups and an increment,
followed by some buffer construction/parsing. If you didn't have to
pass those commands through the network stack from process to process,
and instead used a plain in-process hash table (C++'s
std::unordered_map, say), you could potentially perform tens or
hundreds of millions of those operations every second, instead of the
hundreds of thousands that Redis can do now.
Even going with the scripting branch so that every operation per queue
item is only a single round trip, you are limited to the number of
scripting commands you can stream between your client and Redis
(100-200k/second).
To look at it another way: don't believe the hype of going wide until
you've exhausted the performance ability of doing it the right way.
Like I showed before, with a few lines of Python, you can get 20x your
current performance. If you rewrote my hacked script in a faster
language, you could gain another 10x or better without breaking a
sweat.
> other apps... OT: How can Redis benefit from a multicore CPU? I've
> seen benchmarks on the net, and the better the CPU, the faster Redis
> performs. Redis of course benefits from the actual speed, but being
> single-threaded, how can the number of cores affect performance??
Run multiple copies of Redis, and use the patch that tells the system
to bind the network processing for a particular process to the same
processor that process runs on (Jak Sprats talked about this late
last summer). This requires a specially patched Linux kernel. Of
course, for applications that require all of their data to be in the
same Redis, there isn't currently a way. It feels like a waste not to
use the processors, I understand. If you don't want to waste them,
mine some bitcoins ;)
> Assuming the system is idle besides Redis... Is it at kernel-level,
> socket-level, epoll or something else that the benefit of multicore
> comes into play??
Not that I can think of right now, sorry.
> I know CPU isn't Everything, but not maxing out the CPU when I have a
> high performance server dedicated to the solution gives me kind of a
> "what more can I do"-feeling... And memory is no issue from a "storage
> size" POV, RAM is cheap, but if it's crucial to keep keys and values
> as small as possible in regard to performance, I'll look into the
> possibility of changing my structure. Segmenting some stuff into more
> hashes, may make it possible to reduce the size of the keys/values
> (from 4 bytes to 2 bytes, e.g. instead of having one hash with 4 byte
> keys, I may let 2 byte identify the hash, and 2 byte be the key within
> that hash. Don't know if that would improve something)...
None of that will have any effect at all. Your performance problem
related to Redis isn't a data size problem (unless you are getting to
the point where your requests are passing the ethernet frame size and
you are communicating over the network, or unless you are using more
than a single page for messages over unix domain sockets, etc.); it's
a limitation of how many messages can pass from one process to another
through the kernel via the available socket APIs.
Remove the socket API from the equation, or minimize its involvement,
and performance will go up.
Regards,
- Josiah
> I've seen many do-this, dont-do-that discussions regarding event-based
> vs thread-based solutions, though I haven't actual used event-based
> (not counting basic socket-selecting) socket programming. I usually go
> for a leader/follower MT solution. I've ASSUMED, however, that libev
> or equal should give the best performance in this case, it is what
> Redis is using (well AE, but still), and it is what's provided as
> example with hiredis.
>
> Regards
> //MJ
This will work, but the latency for a cache miss until you get to your
standard working set (which may be your entire data set) will be
painful. Streaming the data all at once will be fast (1 million lists
would take about 10 seconds to stream to the client), at which point
you will never have another cache miss.
> 3. Well, should I even propose this ;) Linking to Redis itself,
> running it as part of my process, and inventing a third way of "Redis
> client access", in-proc (besides the TCP and UNIX socket ways), thus
> removing the socket from the equation. Maybe licensing becomes an
> issue, I don't know... Would/should work (guessing right now), but #1
> or #2 is probably preferred.
The scripting branch gives you this in reverse. Rather than linking
off to Redis, you instead run your code inside Redis. Actually, if
your queue was in Redis, you could have your code loop over a few
hundred or a thousand queue items per call, and you probably wouldn't
have any problem sustaining at least a few hundred thousand
updates/second; all synced live. You'd want to keep the time/call down
to a reasonable level, however, as you'd still want to be receiving
queue updates, and a running script blocks all other incoming calls.
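Extending the earlier EVAL sketch under the same assumptions (the
queue key name is made up), a script could pop and process a bounded
batch per call so each invocation stays short:

import redis

# Pop at most ARGV[1] items per call, so each script invocation
# finishes quickly and other clients aren't blocked for long.
PROCESS_BATCH = """
local n = tonumber(ARGV[1])
local done = 0
for i = 1, n do
    local item = redis.call('LPOP', KEYS[1])
    if not item then break end
    -- decode the item and apply the check-then-decrement logic here
    done = done + 1
end
return done
"""

r = redis.Redis(unix_socket_path='/tmp/redis.sock')
while r.eval(PROCESS_BATCH, 1, 'queue', 500):
    pass  # repeat until the queue is drained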
- Josiah
> I'll do some more prototyping, and definitely a "10 minute
> version" (probably a 2 hour version) in C...
>
> //MJ
>