better multi-core functionality via per-process shared memory queues


Jak Sprats

May 18, 2010, 5:10:26 PM
to Redis DB
It would be possible to use redis on multi-core very efficiently if each
redis instance could also read requests in from a shared memory queue
(it currently reads requests in from tcp packets).

The current multi-core, multiple node solution is:
"Simply start multiple instances of Redis in different ports in the
same box and treat them as different servers! "

If each instance of Redis has its own SharedMemoryQueue to read in
requests and its own separate SharedMemoryQueue to write responses,
the overhead of local tcp traffic can be avoided.
This would still be a single threaded approach and would not need any
locks or mutexes.
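
A rough sketch of what such a per-instance request queue could look like: a fixed-size ring buffer in a shared memory segment with exactly one writer and one reader, so neither side takes a lock. Python stands in for C here. `ShmQueue`, `SLOT_SIZE` and `NUM_SLOTS` are made-up names, not anything from the Redis source, and a real cross-process version in C would also need memory barriers around the head/tail updates.

```python
import struct
from multiprocessing import shared_memory

SLOT_SIZE = 64   # assumed fixed slot size in bytes (2-byte length + payload)
NUM_SLOTS = 8    # assumed capacity of the ring
HEADER = struct.Struct("<QQ")  # head (next slot to read), tail (next slot to write)


class ShmQueue:
    """Hypothetical single-producer/single-consumer ring buffer in shared memory."""

    def __init__(self, name=None, create=False):
        size = HEADER.size + SLOT_SIZE * NUM_SLOTS
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)
        if create:
            HEADER.pack_into(self.shm.buf, 0, 0, 0)

    def _offset(self, index):
        return HEADER.size + (index % NUM_SLOTS) * SLOT_SIZE

    def put(self, payload):
        """Producer side: returns False if the ring is full or the payload is too big."""
        head, tail = HEADER.unpack_from(self.shm.buf, 0)
        if tail - head >= NUM_SLOTS or len(payload) > SLOT_SIZE - 2:
            return False
        off = self._offset(tail)
        struct.pack_into("<H", self.shm.buf, off, len(payload))
        self.shm.buf[off + 2:off + 2 + len(payload)] = payload
        # Publish the slot only after its contents are written (only the producer writes tail).
        struct.pack_into("<Q", self.shm.buf, 8, tail + 1)
        return True

    def get(self):
        """Consumer side: returns the next payload, or None if the ring is empty."""
        head, tail = HEADER.unpack_from(self.shm.buf, 0)
        if head == tail:
            return None
        off = self._offset(head)
        (length,) = struct.unpack_from("<H", self.shm.buf, off)
        payload = bytes(self.shm.buf[off + 2:off + 2 + length])
        # Free the slot (only the consumer writes head).
        struct.pack_into("<Q", self.shm.buf, 0, head + 1)
        return payload

    def close(self, unlink=False):
        self.shm.close()
        if unlink:
            self.shm.unlink()
```

One such queue would carry requests into an instance and a second one would carry responses back out, with the other process attaching by passing the segment's `name`.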

This would improve Redis' SMP performance and would even open up the
door for SMP on MPP, meaning many cores on many nodes.

I think there is also a way to introduce ids w/in said
SharedMemoryQueues which would allow multi-threading when the client
writes to the queue (the actual write would have to be wrapped in a
mutex). This is a break from the architecture, and Cache Coherency
issues might mean that each SharedMemoryQueue needs to be
broken up into num_cores pieces (each of which resides on a different
4K page), but it may be not-the-worst place to introduce
multithreading to clients.

My first suggestion seems good, the second is maybe too much
complexity. Thoughts?

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

Jak Sprats

May 19, 2010, 11:50:02 AM
to Redis DB

I will reply to my own post just so hopefully someone pays attention
to it and also because I am not a good tech writer and often struggle
stating what I mean.

My suggestions aim to push sharding from per client (which means logic
in 15 languages and on machines w/ no state) into server side logic.
For Instance: Apache has a single listener thread whose sole task is
to listen on a port and forward requests to multiple worker threads,
which perform the work of serving each request. In redis if the
listener process were to handle tcp communication w/ clients and then
hand off requests (via key hashing) to worker threads (which were
pinned to single cores and communicated w/ the listener via shared-
memory-queues) performance would be enhanced (because requests are
being served from 4 cores in parallel) ... unless redis' primary
bottleneck is tcp communication.

This way a machine w/ 4 cores and 16GB RAM could have 4 redis workers,
each pinned to a specific core, each using 3GB RAM, plus a single
listener process which would just serve clients' tcp requests,
listening on a single port and pass them off to workers via shared
memory queues, then poll said shared memory queues, forwarding the
workers' responses to clients via tcp.

Shared memory queues are not absolutely needed (possibly a premature
optimisation), this could all be done via localhost tcp traffic (which
would require minimal code changes), but I still feel this logic/
functionality should be a simple config option server-side. This
simplifies the SMP (Symmetric multi processing {e.g. one machine many
cores}) setup.

An immediate counter argument to my suggestion is a setup w/ 4
machines, each running 4 cores. Here the client has to hash to one of
16 redis processes which are spread across 4 different ip addresses.
This case (MPP) is optimally served by client-side hashing, but MPP is
probably not as popular/needed as SMP.

Thoughts?

surfman

May 19, 2010, 1:53:03 PM
to Redis DB
Interesting thoughts!

I have been thinking about the same topic for a while. One way to get
there is probably a cloud setup. A lot of hosting companies offer cloud
environments based on Xen, where each instance gets isolated CPU and
memory resources. A typical node would be a single CPU core with 2G of
memory plus 200G of hard disk storage. Users can grab as many nodes as
they like, and each node could run a single instance of REDIS that may
hold pre-defined key-values.

The above is my 2 cents for anyone who is interested in taking advantage
of multi-core for REDIS.

Jak Sprats

May 19, 2010, 3:46:16 PM
to Redis DB
I did a very simple test of my proposal modifying the benchmark.c
(attached) w/ some hacky #ifdefs and using taskset to assign
redis_server's to specific cores (look in test_all.sh).
The four binaries (./redis-server_*) were created by doing 4 makes w/
various #ifdef commented in/out (e.g. comment in #ifdef DUAL_CORE)

NOTE: I dont know how to post files here, so i have a server and will
post links until someone tells me how to post files here
http://www.allinram.info/redis/May_19_2010/ (links to benchmark.c,
output, and test_all.sh)

I am running Ubuntu 9.10 w/ Linux 2.6.31 on a 64bit Quad Core 3.0 GHz
AMD

The results were not what I expected. As the number of cores being
used goes up, the thruput goes down.
Single core: SET 111K, GET 110K -> current redis setup, redis-server
runs on one core, redis-benchmark on another
Dual core: SET 99K, GET 98K
Three core: SET 96K, GET 95K
Four core is dumb, because redis-benchmark and one redis-server run on
a single core (result: SET 94K GET 94K).

Looking thru the file "output" which has top's output running the
whole time, the only explanation I can find for this is a slight
(2-4%) increase in cpu_soft-interrupts, but that does not seem like a
real explanation ...

Using Shared Memory Queues would help w/ the software interrupts
(which is most likely from tcp traffic being given to different
cores), but maybe the architecture of this software is such that it
does not really benefit from being run on multi-core .... but that
just seems wrong ....

Anyone there?

Jak Sprats

May 19, 2010, 4:07:36 PM
to Redis DB
Separate cloud instances are akin to separate physical machines,
meaning there is no chance to communicate via SharedMemory between
cloud instances (hypervisors have to disallow it for security reasons).

With cloud instances the best way to shard is client-side hashing as
tcp is a must in any client-server communication.

But redis is custom made for cloud instances. I just did a whole bunch
of research on virtualisation: www.allinram.info (not a plug, dont
look at it, i dont care) and the way redis treats hardware is perfect
for the cloud as HardDiskDrive's virtualise poorly, but redis treats
them like a tape-drive, which is correct .... this is actually what
got me using redis.

I am proposing an architecture that exploits separate redis-server's
running on a single physical machine. Each redis-server can be easily
assigned a cpu core via "taskset" and then the question can be posed:
where should the sharding logic be (client or server).

The arguments for client side sharding logic would be it is dead-
simple to program and implement on the client side. Client side
sharding can also be used to hash to a specific core on a specific
physical machine very easily. Cons of client-side sharding are
supporting it in 15+ languages and when redis becomes a cluster (where
nodes can go up and down) supporting client side hashing in 15+
languages will be a pain in the ass, as adding-nodes and dropping-
nodes adds considerable complexity to the codebase.

The only real argument for server side sharding is that the SMP use case
can be optimised: One core for tcp/ip, 3 cores for separate
redis-servers, communication done via SharedMemory. This is not uncommon in
super computer architectures and seeing as how 8 cores are just around
the corner and 32GB RAM machines are cheap .... SMP machines may get
big enough to fit most datasets and will be cheaper than running a 4
node MPP cluster.

Of course my benchmarks (which used localhost tcp communication, not
SharedMemory) showed that running redis-server's on multiple cores
DECREASED thruput ... will dig deeper

Jak Sprats

May 31, 2010, 7:16:02 PM
to Redis DB
So I went ahead and hacked the proposal I made (single
redis-server-listener that hands off to multiple redis-server-worker's,
everybody running on dedicated cores, in hopes of utilising multi-core better):

Code is here (it is hardly worth reading, a terrible hack job, and I
tried like 6 different approaches w/o starting anew)
http://allinram.info/redis/May_31_2010/

It did NOT improve performance (got about 105K SET/GET w/ 3 cores at
100% usage). I had one listener redis-server process waiting on
connections which handed off via SharedMemory to 2 redis-server's
which serviced the requests and responded via SharedMemory.

Basically redis is network bottlenecked. So handing off the actual
lookup to another process wins basically nothing.

Observing redis-server under high load it is about 16% cpu.user, 33%
cpu.system and 50% cpu.software-interrupt ... so distributing the 16%
cpu.user to 2 different cores didn't help.

So I brushed off my ego and tried the following tests (on a single 4
core machine {AMD Phenom II X4 @ 3.0Ghz})
2 redis-server's on cores 0 and 1 (ports 6380, 6381)
2 redis-benchmark's on cores 2 and 3
and I got 2*110K SET/GET ... so 220K SET/GET

I got happy, I then tried the 2 machine setup, which did not scale
linearly, but was also not bad (every process is on a dedicated core)
Machine A: 2 redis-servers Machine B: 2 redis-benchmark's .....
thruput 2*110K -> 220K (SET/GET)
Machine A: 3 redis-servers Machine B: 3 redis-benchmark's .....
thruput 3*70K -> 210K (SET/GET)
Machine A: 4 redis-servers Machine B: 4 redis-benchmark's .....
thruput (70K, 70K, 70K, {20K,50K}) -> 230/260K (SET/GET)

The 2 machine 4 benchmark->server setup seemed to have problems
because maybe one core is needed for soft-irqs.

2 benchmark's against 2 servers seems to be the sweetspot.

Has anyone tried this on a 6-core machine or an 8-core machine? (220K/s
is faster than anything I have seen posted.)


Josiah Carlson

Jun 1, 2010, 12:01:27 AM
to redi...@googlegroups.com
If you're using linux and you have your socket implementation still
around, you may want to give unix domain sockets a shot. I remember
getting about 75% better performance than localhost TCP/IP
connections... about 5 years ago.

- Josiah
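
For anyone who has not used them, here is a minimal sketch of a request/reply round trip over a Unix domain socket. The only change from TCP is the address family and that the address is a filesystem path. Python is used for brevity; `serve_once`, `ping_over_unix_socket` and the `demo.sock` path are illustrative names, and the PING/+PONG exchange only mimics the shape of a Redis request, it does not talk to a real redis-server.

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock):
    """Accept one connection and answer a single PING with +PONG."""
    conn, _ = server_sock.accept()
    with conn:
        if conn.recv(64) == b"PING\r\n":
            conn.sendall(b"+PONG\r\n")


def ping_over_unix_socket():
    # Hypothetical socket path inside a private temp directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)
        server.listen(1)
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        # Client side: same calls as TCP, but the address is a path.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(path)
            client.sendall(b"PING\r\n")
            reply = client.recv(64)
        worker.join()
        server.close()
        return reply
```

Because nothing here goes through the IP stack, this only works between processes on the same machine.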

Jak Sprats

Jun 1, 2010, 12:46:47 AM
to Redis DB

Unix domain sockets are faster than localhost TCP.
The thing I learned trying to hack redis into using SharedMemoryQueues
was: the initial connection to the machine must be done in TCP (as
this is a server) and after this initial TCP to server communication
is complete, the quickest thing to do is process the request in-
process ... it does not help to hand off after this point, because it
is basically only a hash table lookup.

So using unix domain sockets and doing 2 clients against 2 servers on
a single machine would be faster, but it wouldnt be a proper server
that can communicate w/ the outside world via TCP.

It took me like 4 days hacking to figure out that the current
implementation, which utilizes multi-cores by putting a single redis-
server on each core is a VERY good implementation (retains single
process w/ no threads architecture, which is lockless, uncomplicated,
and FAST) and it also seems to play well on multi-cores. As far as I
know, the problem that is always encountered when trying to make a
program scale on multiple cores is the propagation of software
interrupts across cores. So redis-server does this pretty well for 2
cores, and ok for 3 cores, and a little bit unpredictably w/ 4 cores
(running on a 4 core machine).

NOTE: I did all of these tests on linux 2.6.31 which is current enough
to have had some changes that improve software interrupts over multi-
cores (2.6.18 is bad at this).

Another thing I learned doing these tests is the client (redis-
benchmark) can bottleneck. So if you run one client against 2 servers
(via consistent-hashing) you will not see 2*110K performance, you will
see 100K, so less than half the thruput as using two clients against
two servers ... most protocols bottleneck strongly on server-side.

So if someone has a 6 core or 8 core machine and could simply run my 2
machine tests using 8 clients and 8 servers (this will not saturate a
1Gb Ethernet connection - i think)
I would be interested in what kind of numbers they would post,
possibly in excess of 500K SET/GETs, maybe only 250K ???

Then redis can say "on 8 cores we can do 500K SET/GETs" and 8 cores
should be commodity CPUs w/in 18 months.
It would be nice to have some numbers on how well redis scales across
multiple cores.

*Another interesting thing would be to see how redis performs when run
on HyperThreaded CPUs, running a redis-server per logical CPU.

Josiah Carlson

Jun 1, 2010, 3:32:41 AM
to redi...@googlegroups.com
What I meant was: if you needed to do more than just gets, like say a
set intersection, union, ..., you know, nontrivial operations (which I
end up needing to do 1000x more than plain gets... except for hgetall,
I do that a lot too), those operations can be handed off to worker
processes along with the original socket (passing file handles between
processes is quite nifty). For some operations, it may be a huge
win... but both would need access to the same dataset.

- Josiah
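
The descriptor-passing trick mentioned above can be sketched in a few lines. This assumes Python 3.9+ on a Unix system, where `socket.send_fds` and `socket.recv_fds` wrap the SCM_RIGHTS mechanism. A socketpair stands in for the listener-to-worker channel and a pipe stands in for the client connection; both "sides" run in one process here purely to keep the example short, and `pass_fd_demo` is a made-up name.

```python
import os
import socket


def pass_fd_demo():
    # A socketpair stands in for the listener<->worker control channel.
    listener_end, worker_end = socket.socketpair()
    # A pipe stands in for the client connection being handed off.
    read_fd, write_fd = os.pipe()
    try:
        # "Listener": send the descriptor along with a one-byte message.
        socket.send_fds(listener_end, [b"x"], [read_fd])
        # "Worker": receive a new descriptor that refers to the same pipe.
        _msg, fds, _flags, _addr = socket.recv_fds(worker_end, 16, 1)
        received_fd = fds[0]
        # Data written to the pipe is readable through the received descriptor.
        os.write(write_fd, b"SINTER a b")
        data = os.read(received_fd, 64)
        os.close(received_fd)
        return data
    finally:
        os.close(read_fd)
        os.close(write_fd)
        listener_end.close()
        worker_end.close()
```

Passing the descriptor only moves the connection. As the post says, the worker would still need access to the same dataset.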

Jak Sprats

Jun 1, 2010, 4:39:35 AM
to Redis DB
Ah OK, I get what you mean: Expensive queries could be passed to
workers.

I like the idea, hey I even tried to do it :)

I already know this will get shot down by the community, for the
following reasons:
1.) how do you share the memory (for the dataset)
    A.) threads - not gonna happen, too many code changes
    B.) shared memory - would be great in an ideal world, but shared
        memory is a pain to do anything big with
2.) it introduces complexity:
    i.e. UnixDomainSockets or PosixMessageQueues (my fav) are another
    code path

The question of how to handle long-running operations is a tough one.
If your non trivial ops are all READs, you can replicate a bunch of
times, on a bunch of cores, or a bunch of nodes, load-balance your
READ reqs, and if you can live with not exactly realtime data, this
effectively does what you want.

What I am aiming for w/ this thread is a discussion on the vertical
scalability of redis, how it degrades w/ additional cores, and if
hyperthreading has an effect. Unfortunately I only have 2 Quad Core
machines at home, cant do 8 core tests myself.


Tim Lossen

Jun 1, 2010, 4:45:57 AM
to redi...@googlegroups.com
On 2010-06-01, at 10:39 , Jak Sprats wrote:
> What I am aiming for w/ this thread is a discussion on the vertical
> scalability of redis, how it degrades w/ additional cores, and if
> hyperthreading has an effect. Unfortunately I only have 2 Quad Core
> machines at home, cant do 8 core tests myself.

jak, how about using EC2? an extra-large high-cpu instance
with 8 (virtual) cores is only $0.68 per hour ...

--> http://aws.amazon.com/ec2/#instance

cheers
tim

--
http://tim.lossen.de

Jak Sprats

Jun 1, 2010, 5:45:16 AM
to Redis DB
not a bad suggestion, just buy 2 and see what I get ... wait, just
checked the benchmarks, someone did this and got about 60K SET/GET. A
virtual core is 1.0GHz.

I am a big fan of running this stuff on steel tho (i.e. 2 cores got me
220K :). Putting a hypervisor into the equation complicates matters
and you never know what type of network lies inbetween 2 instances or
really who else is running on your physical machine.

Also I have worked w/ EC2 a lot and they upsell RAM considerably and
CPU somewhat and if you ever run production on them you will think
they are great until at some random time your machine slows to a near
crawl (which I believe personally to be caused mostly by HardDisk
Contention).

Despite redis being a perfect architecture to run on EC2 (mostly
CPU<->RAM w/ snapshots to disk, which works well w/ virtualised HDDs), I
would caution anyone running redis on EC2 about EC2's consistency (nice
and fast mostly, but sometimes dog slow for no real reason, and you cant
monitor why because it's at the hypervisor level). But EC2 is, to their
credit, always improving on this front.

Tim Lossen

Jun 1, 2010, 6:25:55 AM
to redi...@googlegroups.com
On 2010-06-01, at 11:45 , Jak Sprats wrote:
> On Jun 1, 1:45 am, Tim Lossen <t...@lossen.de> wrote:
>> On 2010-06-01, at 10:39 , Jak Sprats wrote:
>>
>>> What I am aiming for w/ this thread is a discussion on the vertical
>>> scalability of redis, how it degrades w/ additional cores, and if
>>> hyperthreading has an effect. Unfortunately I only have 2 Quad Core
>>> machines at home, cant do 8 core tests myself.
>>
>> jak, how about using EC2? an extra-large high-cpu instance
>> with 8 (virtual) cores is only $0.68 per hour ...
>>
> not a bad suggestion, just buy 2 and see what I get ... wait, just
> checked the benchmarks, someone did this and got about 60K SET/GET. A
> virtual core is 1.0GHz.
>
> I am a big fan of running this stuff on steel tho (i.e. 2 cores got me
> 220K :). Putting a hypervisor into the equation complicates matters
> and you never know what type of network lies inbetween 2 instances or
> really who else is running on your physical machine.

jak, i totally agree, bare metal is of course much preferable for
benchmarking. for on-demand "steel" i would recommend

--> http://newservers.com

"The only server cloud that delivers dedicated servers instead of
virtual instances on shared servers."

a dual intel E5504 quadcore (2.00 GHz) box with 48 gigs DDR3 ECC
RAM is only $0.60 per hour -- cheaper than amazon, even.

the only problem though is that their signup process is a total
PITA -- you have to prepay $20, you have to fax (!) them a copy of
your credit card, and then it takes one or two days for your account
to be approved -- so i suggested EC2 as an easier alternative first.

send me a direct email if you want to go this route, i still have
about $10 left in my newservers eval account ...

Jak Sprats

Jun 1, 2010, 7:47:06 PM
to Redis DB
Hi Tim,

thanks for the offer and for the tip on newservers.com, I can use them
for some tasks, great tip. On a related note VoxCloud http://tinyurl.com/2f3q235
can be used for SSD cloud instances, which are also nice for some use-
cases.

I already have 2 boxes w/ Phenom II X4's @ 3.0GHz, so none of the
newservers.com servers are that fast (4 core @2.0 GHz), and none are 6
or 8 core.

I am gonna ask some guys I know with big machines(8-12 cores) at their
work if they are willing to do this, if no one from the community
volunteers

- Jak

Jeremy Zawodny

Jun 1, 2010, 7:50:34 PM
to redi...@googlegroups.com
I might be able to run some tests.  We have some dual proc, quad core (with HT) boxes.  In other words: 16 cores (as far as Linux sees).

I can't give you access to them but could build and benchmark if you can point me at some instructions.

Jeremy


Jak Sprats

Jun 1, 2010, 8:45:07 PM
to Redis DB

OK, so you have 4 cores per box w/ HT, so logically 8 cores, and
physically 4 cores per box ... and 2 of these boxes connected via
1GbE, correct? How many GHz?

Worth a test, just to see if HT brings anything.

So I will write some very simple bash scripts that will run the tests
on 1,2,3,4,5,6,7,8 cores and then we can look at the results ... need
like 1-2 hours to write the scripts.

I am guessing on HT boxes that
physical core 0 has virtual cores 0,1
physical core 1 has virtual cores 2,3 ... etc....

This is important to test physical core against virtual core
performance.
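
Rather than guessing the numbering, on Linux the physical-to-logical mapping can be read from sysfs: each `cpuN/topology/thread_siblings_list` file lists the logical CPUs that share one physical core. A small sketch follows; the function names are made up for this example, and the parser only handles the usual comma/range list format such as "0,4" or "0-1".

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return one tuple of logical CPU ids per physical core (empty if sysfs is absent)."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for topo_file in Path(sysfs_root).glob(pattern):
        groups.add(tuple(parse_cpu_list(topo_file.read_text())))
    return sorted(groups)
```

On some HT machines the siblings are adjacent numbers and on others they are offset by the physical core count, which is why it is worth checking before picking "physical only" cores for a test.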

back in 1-2 hours and thanks

Salvatore Sanfilippo

Jun 1, 2010, 8:58:25 PM
to redi...@googlegroups.com
Hey, just to tell you guys that I'm following this thread and I think
it is pretty cool.

I firmly believed that the way to go is isolated processes to exploit
every core at its best, but to see it tested in practice is a bold
confirmation that will drive us in the future. Thanks for your work.

Cheers,
Salvatore


--
Salvatore 'antirez' Sanfilippo
http://invece.org

"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay

Jak Sprats

Jun 1, 2010, 10:34:23 PM
to Redis DB
Hi Jeremy, (just under 2 hours)

I wrote two scripts: (this is a simple way to do parallel processing)
1.) http://allinram.info/redis/parallel_tests/batch_parallel_client_test.sh
2.) http://allinram.info/redis/parallel_tests/launch_parallel.sh

USAGE:
0.) copy both scripts to both machines (CLIENT and SERVER), into your
    redis-dir
1.) on MACHINE CLIENT, edit REMOTE_IP in launch_parallel.sh to be your
    redis SERVER's IP
2.) on MACHINE SERVER (in your redis-dir) run:
    "./launch_parallel.sh SERVER 8 8"
    -> this will launch 8 redis-servers, each on its own core, each
       on a separate port (starting at port 6380)
3.) on MACHINE CLIENT (in your redis-dir) run:
    "./batch_parallel_client_test.sh 8 8"
    -> this will launch one client and test, then 2 clients and test,
       etc. up to 8 clients, each on a dedicated core
    -> results are in LOG/CLIENT/* dirs
4.) wait ... should take ? 10-30 minutes ?
5.) on MACHINE CLIENT, run: grep -r "requests per second" LOG/* |more
    -> this will give you the results

This should give us some numbers on whether HT helps (my personal
guess: not much), but it needs to be tested explicitly.

EXPLANATION:
The script launch_parallel.sh has all the brains, it either launches
clients (redis-benchmark) or servers (redis-server) and assigns them a
dedicated core and port.
When launching clients, if the number of servers (actually means
clients) is less than half of the number of cores, then clients will
be launched on even core numbers (which should be physical cores, not
virtual cores) - this allows us to correlate stats on HT.

NOTE: run "top" and type "f" then "j" then ENTER and there will be a
new column "P" that tells you what core a process is running on ...

- Jak
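
The core/port assignment described above can be sketched roughly as follows. This is a guess at the logic, not the contents of launch_parallel.sh: `plan_cores` and `pin_to_core` are hypothetical names, the "every other core while they still fit" rule is an assumption based on the description, and the port numbering simply counts up from a base port of 6380.

```python
import os


def plan_cores(n_procs, n_cores, offset=0, base_port=6380):
    """Return a list of (core, port) pairs, one per process.

    Assumption: spread processes over every other core while they still
    fit (on a typical HT layout that avoids sharing a physical core),
    otherwise pack them onto consecutive cores.
    """
    stride = 2 if offset + 2 * (n_procs - 1) < n_cores else 1
    if offset + stride * (n_procs - 1) >= n_cores:
        raise ValueError("not enough cores for the requested processes")
    return [(offset + i * stride, base_port + i) for i in range(n_procs)]


def pin_to_core(core):
    """Pin the calling process to one core (Linux only), similar to `taskset -c`."""
    os.sched_setaffinity(0, {core})
```

With `offset=1` on an 8 core box this yields cores 1,3,5,7 for four processes and cores 1 through 5 for five processes.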

Jak Sprats

Jun 2, 2010, 12:38:48 AM
to Redis DB
Hi Salvatore,

glad you're following this thread and brilliant architecture, it takes
loads of smarts/experience to not-implement, avoiding threads and
locks -> brilliant, boil down redis and you would get redis :)
The scripts are VERY simple. They cover the basics of testing redis on
multiple cores. If you have any comments, I am all ears, if you are
thinking of making such benchmarks then I would be happy to help out.
It would be cool to get people including multi-core implementations in
the benchmark section.

A very nice extension to these would be a ./redis-benchmark that
writes to multiple ports, so all 8 clients would be writing to all 8
servers (simulating consistent hashing), even put in a simple CRC32()
%node in it (which can be then ref'ed in the FAQ).
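
The CRC32() % node idea is about one line of code. A minimal sketch, where `node_for_key` and the port list are illustrative. Strictly speaking this is plain modulo hashing rather than consistent hashing: adding or removing a node remaps most keys.

```python
import zlib


def node_for_key(key, nodes):
    """Pick a node for a key with CRC32(key) % number_of_nodes."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


# Eight hypothetical redis-server ports on one box, as in the test scripts.
PORTS = list(range(6380, 6388))
```

Every client that uses the same function and the same node list sends a given key to the same server, which is all the benchmark extension would need.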

I have the feeling that being able to quantify scalability across
cores, then assess why it isn't 100% linear and get loads of minds
thinking on it, is good future thinking.

And if you know someone w/ 2 16-core machines, ask them to try the
scripts :) Think 1G TPS on "commodity" hardware.

- Jak


Tim Lossen

Jun 2, 2010, 3:32:37 AM
to redi...@googlegroups.com
jak, i think there has been a misunderstanding. most servers
have more than one cpu socket. the newservers one has two
quadcore xeons, hence 8 physical cores. jeremy's machines seem
to have two sockets as well.

cheers
tim

On 2010-06-02, at 02:45 , Jak Sprats wrote:
> OK, so you have 4 cores per box w/ HT, so logically 8 cores, and
> physically 4 cores per box ... and 2 of these boxes connected via
> 1GbE, correct? How many GHz?

--
http://tim.lossen.de

Jak Sprats

Jun 2, 2010, 4:04:55 AM
to Redis DB
Hi Tim,

ok, my bad and great :) (its funny as the west coast goes to bed, the
europeans get up)

I put up some simple tests in my reply to Jeremy.

if only SET and GET are run (i.e. "#if 0" around the other tests in
benchmark.c (lines 553-630)), these tests run in about 2 minutes on my
4 core @ 3.0GHz.
available here: http://allinram.info/redis/parallel_tests/benchmark.c

So the 8 core case @ 2.0GHz should take maybe 10-15 mins max w/ 2
machines.

Do you mind footing the $1 USD and doing this (plus effort) or should
I direct email you, do it myself, happy to do that.

Hopefully my instructions in my email to Jeremy are clear on how to do
it (but I am no pro documenter :)

- Jak

Tim Lossen

Jun 2, 2010, 4:23:11 AM
to redi...@googlegroups.com
On 2010-06-02, at 10:04 , Jak Sprats wrote:
> Do you mind footing the $1 USD and doing this (plus effort) or should
> I direct email you, do it myself, happy to do that.

oh, just send me an email -- i'll boot up a server and send you
the root login back, ok?

cheers
tim


--
http://tim.lossen.de

Jak Sprats

Jun 2, 2010, 6:32:58 AM
to Redis DB

Many Thanks to Tim

I will write up more results tomorrow ... it is 3:30 PST, too late

8 core @ 2GHz (2 CPUs w/ 4 cores each)

full results here: http://allinram.info/redis/parallel_tests/8CoreRedisTest.tar.gz
3 tests, the last simply avoiding core0 because it was running at 100%
for software interrupts (ksoftirqd)

Brief summary of 7 core test (TEST THREE - avoiding core0)
1 core  -> 1 * 81K   ->  81K
2 cores -> 2 * 75K   -> 150K
3 cores -> 3 * 45K   -> 135K
4 cores -> 4 * 32K   -> 128K
5 cores -> 5 * 30K   -> 150K
6 cores -> 6 * 23K   -> 138K
7 cores -> 7 * 18.5K -> 130K

so sweetspot is again at 2 cores ....

ksoftirqd bottlenecked a lot .... I had to avoid core0 altogether ....

There are a lot of variables at work here: this is a 2 socket machine,
so possibly socket to socket communication bottlenecks, and I'm not
sure what version of linux or ksoftirqd this is. Further, SMP affinity
is a joke (it was "ff"): ALL software interrupts went to core0
(perhaps this is by design), but if it were just possible to get port
6382 to go to core2 .. that would make things scale a lot better ....

better write-up 2moro

thanks again to Tim

Jak Sprats

Jun 2, 2010, 8:29:28 AM
to Redis DB
Tried to sleep and couldnt.

Jeremy, does your 8 core machine have 1 or 2 network cards?

I have this feeling that 4 cores per network card is the current
perfect ratio.

Also Physical Network Cards can be controlled thru IRQ affinity.

Jak Sprats

Jun 2, 2010, 9:31:38 AM
to Redis DB

receive side packet steering (rps) will be in 2.6.35, which is 10
weeks off
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0a9627f2649a02bea165cfd529d7bcb625c2fcad

This hashes on the TCP 4-tuple to a core.

This should allow redis to map packets to dedicated redis-server
cores ...

time will tell, i think i am understanding this correctly...
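
Both the "ff" SMP affinity value seen earlier in this thread and the CPU set that RPS takes are hexadecimal bitmasks with one bit per core. In the kernel documentation the RPS mask is written to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus, and the IRQ mask lives in /proc/irq/<irq>/smp_affinity. A small helper for converting between core lists and masks; the function names are made up for this sketch.

```python
def cpu_mask(cores):
    """Turn an iterable of core numbers into a hex bitmask string."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")


def cores_in_mask(hex_mask):
    """Turn a hex bitmask string (commas allowed) back into a list of core numbers."""
    value = int(hex_mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]
```

For example, a mask covering cores 1-7 but not core 0 is "fe", and "ff" decodes to all eight cores.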

Jak Sprats

Jun 2, 2010, 6:16:38 PM
to Redis DB

So here is a short test on running redis from one 8 core machine to
another.

The machines were all 2 Socket Quad Core Machines @ 2.0GHz -> total 8
cores
(2 x Intel E5504 Quad Core 2.00 GHz)
http://newservers.com/ - Jumbo

So I ran 3 tests, found in LOG_CLIENT.tgz in the tarball
http://allinram.info/redis/parallel_tests/8CoreRedisTest.tar.gz

First, here is the network traffic from the 3 tests:
http://allinram.info/redis/parallel_tests/Screen%20shot%202010-06-02%20at%2012.17%20.png
(thanks to Tim, again this coordination worked really good, we even
synced on rebooting a node after I "iptables -F" killed a node :)
(and these tests took maybe an hour on 2 machines, so I owe Tim $1.20,
which cant even get you a Doener in Berlin :)

The highest peak is @ 175Kb, so its safe to say we are not saturating
1GbE lines (so network is cool)
NOTE: the 3rd hump looks the smoothest, this is most likely because
the 3rd test ran much better, so maybe less tcp rexmits or something

The first and second test I ran 1,2,3,4,5,6,7,8 clients against the
same amount of servers.

Both of these tests sucked on core0 because ALL tcp traffic (including
soft-interrupt distribution) was done on core0, and the redis-server
running on core0 was severely starved (OS has higher prio than
userland ./redis-server).

So I did a 3rd test, where I tested 1-7 clients and I avoided core0 on
both machines
HOWTO: on Machine CLIENT: I changed BASE_PORT to 6381 and changed
OFFSET to 1 in launch_parallel.sh, then ran
"./batch_parallel_client_test.sh 7 8"

FORENOTE:
IRQ affinity means which cores a specific hardware interrupt should be
processed by.
For eth0 traffic the SMP affinity is usually "ff" which means ALL
cores.
The reality is it runs on one core, until that core is maxed out, and
then it runs on the next.
In practice this means either your first or last core handles ALL TCP
interrupts (and their distribution).
To figure out which core your eth0 interrupts are being handled by,
run the following command:
# while true; do grep eth0 /proc/interrupts; sleep 1; done
Then download a big file; whichever counter starts changing belongs to
the core that is handling eth0 traffic.
DO NOT PUT redis-server or redis-benchmark on this core, avoid this
core, leave it be.
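A rough Python sketch of the same check, assuming the usual /proc/interrupts layout (one header row of CPU names, then one row per IRQ with a counter per CPU). The function names and the two snapshots are made up for illustration:

```python
def irq_counts(interrupts_text, device="eth0"):
    """Per-CPU interrupt counts summed over every IRQ row that names `device`."""
    lines = interrupts_text.splitlines()
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        fields = line.split()
        # fields[0] is "16:", the next ncpus fields are counters, the rest is text
        labels = fields[1 + ncpus:]
        if not any(label.startswith(device) for label in labels):
            continue
        for cpu in range(ncpus):
            totals[cpu] += int(fields[1 + cpu])
    return totals


def busiest_core(before, after):
    """Index of the CPU whose counter grew the most between two snapshots."""
    deltas = [b - a for a, b in zip(before, after)]
    return deltas.index(max(deltas))


# Made-up snapshots taken a second apart while a download is running.
snap1 = """           CPU0       CPU1       CPU2       CPU3
 16:       1200          0          0          0   IO-APIC-fasteoi   eth0
"""
snap2 = """           CPU0       CPU1       CPU2       CPU3
 16:       9800          0          0          0   IO-APIC-fasteoi   eth0
"""
print("keep redis off core", busiest_core(irq_counts(snap1), irq_counts(snap2)))
```

On a real box the two snapshots would come from reading /proc/interrupts twice with a sleep in between; comparing deltas instead of raw totals avoids being fooled by interrupts counted before the test started.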

Back to the 3rd Test, and keep in mind this is running on a 2 Socket
Machine, meaning IRQs between physical CPUs cross the inter-socket
link (QPI on these E5504s, not a frontside bus), i.e. not on-die.
cores  per core  total SET/GET  on cores
1      81K       81K            1
2      75K       150K           1,3
3      45K       135K           1,3,5
4      32K       128K           1,3,5,7
5      30K       150K           1,2,3,4,5
6      23K       138K           1-6
7      18.5K     130K           1-7
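To put a number on how far these results are from linear scaling, here is a small Python sketch that recomputes the totals and compares each one to "N times the single-core result". The per-core figures are the ones listed above; the "efficiency" measure is just one illustrative way to look at them:

```python
# (number of redis-server cores, K SET/GET per second per core) from the 3rd test
RESULTS = [(1, 81), (2, 75), (3, 45), (4, 32), (5, 30), (6, 23), (7, 18.5)]


def scaling(results):
    """Return (cores, total K ops/sec, fraction of perfect linear scaling)."""
    base = results[0][1]                      # single-core throughput
    rows = []
    for n, per_core in results:
        total = n * per_core
        efficiency = total / (n * base)       # 1.0 would be perfectly linear
        rows.append((n, total, round(efficiency, 2)))
    return rows


for n, total, eff in scaling(RESULTS):
    print("%d cores: %6.1fK total, %3.0f%% of linear" % (n, total, eff * 100))
```

The total never gets past 150K, so every core added after the second mostly divides the same throughput into smaller slices.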

The dramatic drop from 2cores to 3cores is most likely because w/
3cores, core5 on CPU2 was used (this should be repeated using
cores1,2,3 - but this would just be a QuadCore Test)

There is an increase from 4cores to 5cores which is most likely
because 5cores runs cores1-3 on CPU1 and cores4-5 on CPU2 which are
good setups for both CPUs.

Possibly the best configuration for a 2 CPU machine would be to run
redis-server on cores1,2 and cores5,6
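Which core numbers sit on which socket differs between machines (on some boxes the numbering alternates between sockets), so it is worth checking before picking cores. A minimal Python sketch, assuming the Linux sysfs file cpuN/topology/physical_package_id is available; the helper names are made up:

```python
import glob
import os
import re
from collections import defaultdict


def cores_by_socket(cpu_dir="/sys/devices/system/cpu"):
    """Map physical package (socket) id -> sorted list of logical core numbers."""
    sockets = defaultdict(list)
    for path in glob.glob(os.path.join(cpu_dir, "cpu[0-9]*")):
        pkg_file = os.path.join(path, "topology", "physical_package_id")
        if not os.path.isfile(pkg_file):
            continue                          # e.g. an offline core
        core = int(re.search(r"cpu(\d+)$", path).group(1))
        with open(pkg_file) as f:
            sockets[int(f.read())].append(core)
    return {pkg: sorted(cores) for pkg, cores in sockets.items()}


def pick_cores(sockets, per_socket=2, avoid=(0,)):
    """First `per_socket` cores of each socket, skipping cores listed in `avoid`."""
    return {pkg: [c for c in cores if c not in avoid][:per_socket]
            for pkg, cores in sockets.items()}


print(pick_cores(cores_by_socket()))
```

With two sockets numbered 0-3 and 4-7 and core 0 left to the interrupts, pick_cores() returns cores 1,2 on the first socket and 4,5 on the second.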

These test results are basically the exact same as for a QuadCore, the
2 core setup was the best.

Running:
grep ksoftirqd THREE/top_CLIENT
shows how ksoftirqd seems to be the bottleneck.

My best analysis is all eth0 interrupts are being sent to core0, which
then propagates software-interrupts to the other cores via ksoftirqd.
(all the makings of a bottleneck).

There are a few ways of dealing w/ this:
1.) linux 2.6.35 (out in 10 weeks) brings 2 technologies, receive
side packet steering (rps) and receive flow steering (rfs), which will
most likely lessen the ksoftirqd bottleneck
2.) get a NIC w/ multiple tx and rx queues and then do some magic to
pin a specific queue to a specific core (I don't know how to do this,
or if it even works)
3.) buy another NIC and set its IRQ affinity to a certain core (this
effectively turns a single machine into a cluster, as a core has its
own NIC, but it will work, and a PCIe x1 NIC costs $10)

So for now redis does not scale linearly across cores because linux
soft interrupt handling does not scale linearly across cores

It also raises the argument about how many cores should run redis, and
I think the answer is half. The data definitely points to this on
QuadCore and you need cores for the fork() backups and for the moment
for ksoftirqd.

So my recommendation until linux improves soft interrupt distribution
across cores is run 2 redis-server instances per quad core. If you
have two Sockets, have two NICs.

2 Socket CPU w/ 3.0 GHz Quad Cores and 2 NICs should be able to do
440K TPS.

If anyone has a true 8-core CPU or a (2 Socket, QuadCore, 2 NIC)
setup, please use my tests (which worked straight out of the box:)

Another note: I should have clients running on 2 machines against 1
machine running a server to make sure the client is not the bottleneck.

ok comments and especially knowledge on linux distribution of software
interrupts and possible kernel patches are REALLY welcome.



On Jun 2, 6:31 am, Jak Sprats <jakspr...@gmail.com> wrote:
> receive side packet steering (rps) will be in 2.6.35, which is 10
> weeks off http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=co...

Jak Sprats

Jun 3, 2010, 1:05:28 AM
to Redis DB

So I went and installed 2.6.35 on Ubuntu, and it broke compiz, but
other than that no big problems

The results were ALL new maxes using AMD Phenom X4 QuadCore 3.0 GHz
client and server
1 core Max: 130K (RPS settings OFF)
2 core Max: 235K (RPS settings OFF)
3 core Max: 265K (RPS settings ON)
4 core test no longer done, cause one cpu always takes the brunt of
software-interrupts

So this is cool, I was at 110K Max a few days ago :)

The real win here is the 265K 3 core max: watching the
software-interrupt CPU time in "top" spread over all 4 cores (still
heavy on one core, but much improved) made me think the 8-core game is
back on.

So, I doubt the 8-core will scale linearly (even 2 and 3 cores don't
scale linearly), but I would bet money on something like 350K TPS on
an 8-core 3.0GHz CPU.

Installing 2.6.35-rc1 took about 90 minutes (used this
http://www.cyberciti.biz/tips/compiling-linux-kernel-26.html, replaced
"25" w/ "35-rc1", used update-initramfs instead of initrd, and you
dont need to edit menu.lst)

-----------------------------------------------------------------
SETTINGS OFF
root@TWO-8GB:/sys/class/net/eth0/queues/rx-0# sysctl -a 2>/dev/null | grep rps
net.core.rps_sock_flow_entries = 0
root@TWO-8GB:/sys/class/net/eth0/queues/rx-0# cat /sys/class/net/eth0/queues/rx-0/*
00000000,00000000
0

RESULTS FOR SETTINGS OFF
NS1_NC4/0:127065.06 requests per second
NS1_NC4/0:134553.55 requests per second

NS2_NC4/0:119220.09 requests per second
NS2_NC4/0:122347.45 requests per second
NS2_NC4/2:116772.65 requests per second
NS2_NC4/2:120468.02 requests per second

NS3_NC4/0:75803.66 requests per second
NS3_NC4/0:83397.05 requests per second
NS3_NC4/1:79333.68 requests per second
NS3_NC4/1:88715.66 requests per second
NS3_NC4/2:69218.52 requests per second
NS3_NC4/2:87329.23 requests per second



SETTINGS ON
root@TWO-8GB:/sys/class/net/eth0/queues/rx-0# sysctl -a 2>/dev/null | grep rps
net.core.rps_sock_flow_entries = 134217728
root@TWO-8GB:/sys/class/net/eth0/queues/rx-0# cat /sys/class/net/eth0/queues/rx-0/*
00000000,00000000
134217728

RESULTS for SETTINGS ON
NS1_NC4/0:101102.82 requests per second
NS1_NC4/0:104647.55 requests per second

NS2_NC4/0:92319.15 requests per second
NS2_NC4/0:96265.12 requests per second
NS2_NC4/2:93423.02 requests per second
NS2_NC4/2:98232.02 requests per second

NS3_NC4/0:87161.16 requests per second
NS3_NC4/0:90342.48 requests per second
NS3_NC4/1:86843.59 requests per second
NS3_NC4/1:92039.02 requests per second
NS3_NC4/2:88276.84 requests per second
NS3_NC4/2:90859.72 requests per second

NOTE: someone w/ more knowledge on RPS may be able to make sense of
/sys/class/net/eth0/queues/rx-0/rps_cpus, I put in about 8 different
values and nothing changed
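For what it's worth, the kernel's network scaling documentation describes rps_cpus as a hex bitmap of the CPUs allowed to process packets from that rx queue, with 0 meaning RPS is off for the queue (which would apply to the all-zero value in both listings above). A small Python sketch of building such a bitmap; the helper name is made up, and the value would be written to the rps_cpus file as root:

```python
def cpu_mask(cores):
    """Hex bitmap with one bit set per core number, as used by rps_cpus."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")


print(cpu_mask([1, 2, 3]))    # cores 1-3 of a QuadCore, leaving core 0 alone -> e
print(cpu_mask(range(8)))     # all 8 cores -> ff
```

Whether a given mask actually changes the throughput here is untested; this only shows how the value is put together.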


Jeremy Zawodny

Jun 3, 2010, 1:13:48 AM
to redi...@googlegroups.com
Correct.  These are dual socket, quad core, with Hyperthreading.

Looking at running the scripts now...

Jeremy

Jak Sprats

Jun 3, 2010, 3:03:11 AM
to Redis DB
so you have 2 cpus, each w/ 4 cores and HyperThreading .... that is 16
logical cores? you dont have 2 NICs do you?

make sure you run this command
#while true; do grep eth0 /proc/interrupts; sleep 1; done
and download some file and then avoid the core that is handling the
eth0 interrupts.

Tell me if that makes sense? its been a long day

Tim Lossen

Jun 3, 2010, 4:23:52 AM
to redi...@googlegroups.com
On 2010-06-03, at 00:16 , Jak Sprats wrote:
> (thanks to Tim, again this coordination worked really good, we even
> synced on rebooting a node after I "iptables -F" killed a node :)
> (and these tests took maybe an hour on 2 machines, so I owe Tim $1.20,
> which cant even get you a Doener in Berlin :)

you're welcome, jak. maybe you can buy me a beer sometime ...
we can also do another testing run if you have any new ideas.

> The highest peak is @ 175Kb, so its safe to say we are not saturating
> 1GbE lines (so network is cool)

hmmmm .... actually it is only 22KB (= 175k*bit*) per second -- which
makes sense, as you cannot send more than about 100MB/s across a
gigabit link, i think. but we are definitely not saturating the link.

> ok comments and especially knowledge on linux distribution of software
> interrupts and possible kernel patches are REALLY welcome.

+1

i will be going into production with these exact newservers boxes in a
few weeks -- so any tuning hints would be much appreciated.

Jak Sprats

Jun 3, 2010, 7:46:50 AM
to Redis DB

yeah we can test again, the only real test is if running redis-server
on cores (1,2),(4,5) is significantly faster than running on just
(1,2)

if the 4 core variety is significantly faster go w/ it, otherwise go
with the 2 core solution

core 0 is going to be doing all the eth0 interrupt handling, so i
would try to get any process that uses significant cpu to AVOID this
core
(anyone except ksoftirqd, just let linux schedule that).

If you are running another server on this system that will use up
significant cpu cycles, use "taskset" to make sure it runs ONLY on
cores where redis-server is NOT running AND also not on core0.

This is true process isolation.
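As a sketch of that rule in Python: work out which cores are left once the redis-server cores and the interrupt core are reserved, then build the taskset command line for the other server. The core numbers and "./some-other-server" are placeholders; "taskset -c" takes a comma-separated core list:

```python
def free_cores(total_cores, redis_cores, irq_core=0):
    """Cores left for other CPU-heavy processes."""
    reserved = set(redis_cores) | {irq_core}
    return [c for c in range(total_cores) if c not in reserved]


def taskset_command(cores, command):
    """Argument vector for `taskset -c <core list> <command...>`."""
    return ["taskset", "-c", ",".join(str(c) for c in cores)] + list(command)


# 8 cores, redis-server on (1,2),(4,5), eth0 interrupts on core 0
cores = free_cores(8, [1, 2, 4, 5])
print(" ".join(taskset_command(cores, ["./some-other-server"])))
```

The resulting list can be handed to subprocess.Popen or pasted into a startup script.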

Other than that, if you could update your kernel to 2.6.35-rc1 it may
be interesting (but this is crazy on a production system).

Make sure you decide finally if you are gonna put redis-server on 2 OR
4 cores, because once you put CRC32()%2 or CRC32()%4 in your client
code, you are stuck with that num_nodes: each server holds its own
share of the data (25% each w/ 4 servers) and until redis-cluster
comes around, you can't add or remove redis-servers.

At the very minimum use 2 redis-server's, this is a huge performance
increase and a one line client change.
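A minimal Python sketch of that client change, assuming plain CRC32-modulo sharding as described; the server list is hypothetical. The second helper estimates how many keys would land on a different instance if num_nodes changed later, which is why the number has to be chosen up front:

```python
import zlib

SERVERS = [("127.0.0.1", 6379), ("127.0.0.1", 6380)]   # hypothetical instances


def pick_server(key, servers=SERVERS):
    """Choose the instance for `key` as CRC32(key) % num_nodes."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def moved_fraction(keys, old_n, new_n):
    """Share of keys whose instance index changes when num_nodes changes."""
    moved = sum(1 for k in keys
                if zlib.crc32(k.encode("utf-8")) % old_n
                != zlib.crc32(k.encode("utf-8")) % new_n)
    return moved / len(keys)


keys = ["user:%d" % i for i in range(10000)]
print(pick_server("user:42"))
print("moved when going from 2 to 4 nodes:", moved_fraction(keys, 2, 4))
```

Every client has to use the same hash and the same server order, otherwise reads go to an instance that never saw the write.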

Joubin Houshyar

Jun 3, 2010, 10:27:21 AM
to Redis DB
Hi Salvatore,

Any possibility that you would modularize Redis and allow swapping out
the interface layer (now TCP)? (Being able to embed Redis would open
up a lot of possibilities. :)

/R

Jak Sprats

Jun 3, 2010, 10:13:20 PM
to Redis DB
Hi Tim,

on newservers.com, for testing, can I upgrade the kernel to
2.6.35-rc1, does newservers.com give that sort of access?

Upgrading the kernel involves
1.) downloading kernel
2.) making kernel
3.) updating grub - may not be allowed by provider
4.) rebooting

Step 2 takes 60 minutes.

Possible?

- Jak

Tim Lossen

Jun 3, 2010, 11:57:43 PM
to redi...@googlegroups.com
hmmmmm ...... i have no idea. i'll ask them and get back to you.

tim


--
http://tim.lossen.de

Jak Sprats

Jun 4, 2010, 12:31:56 AM
to Redis DB
reason i ask is, this would test if receive side packet steering (RPS)
scales well over 5-8 cores.
in my 3 core test, RPS was about a 10-15% boost over w/o RPS
in general 2.6.35 was 10% faster than 2.6.31

and I think those servers are running 2.6.18, there have been LOADS of
network updates since then.

Pieter Noordhuis

Jun 4, 2010, 3:24:38 AM
to redi...@googlegroups.com
Nice thread to read! Keep up the good work!

If I remember correctly from the parallel computing courses I've
followed, you always need to consider the memory lanes going
in/out of the different sockets. I thought of this when you mentioned
that (1,2) + (4,5) was significantly faster, because I would think
that on an 8-core machine, (1,2) are on a different socket than (4,5).
Redis is very memory heavy, with lots of really random reads from RAM
(so you don't have good cache locality). From your results, I would
say that stressing Redis on a single core per socket, might saturate
the bandwidth on the memory lanes. This could be a reason why you
don't see near-linear scaling in terms of the number of cores (apart
from all the soft-IRQ business of course). I'm far from an expert in
this area, but maybe this makes a little sense?

Cheers,
Pieter

Jak Sprats

Jun 4, 2010, 5:30:48 AM
to Redis DB
Hi Tim,

I asked newservers.com about an upgrade to 2.6.35 and here was their
response:
"Yes, we can. The kernel 2.6.35 has been released in 1st of June.
We'll upgrade it and we'll add those servers into your account. Please
let us know when can do it. Thanks."

Where did you find such a kick-ass service? I just told a bunch of
people I know who need to test 16-node clusters and this is an example
of the cloud really saving money, thanks from them.

If you have time to do some more tests tell me, we can do them.

I just emailed newservers.com again, asked them if they have any Jumbos
w/ 2 NICs.

- Jak


Jak Sprats

Jun 4, 2010, 6:24:05 AM
to Redis DB
Hi Pieter,

you're talking about Cache Coherency and yeah this is an issue.

It needs to be said, that if you are using a chip w/ 8MB L2 Cache,
then any good test must far exceed 8MB of data in and out and use a
lot of random lookups/sets to simulate cache misses, which are the
norm in a production environment (my parallel tests don't do this).

./redis-benchmark's "-r" setting is good for this, also using "-d" and
writing big objects will trigger lots of L1/L2 Cache misses, where you
hit the memory wall (ca. 10X slower), and redis' performance drops. I
measured this using a single-process and was amazed at seeing a
maximum 50% performance loss even when using 4K sized objects. Using
multiple processes per CPU would no doubt exacerbate this problem, by
how much I don't know. My understanding of the rather small 50% drop in
performance in light of the 1000X larger data set was: redis-server is
mostly network wall bottlenecked, so the network wall dominates the
memory wall in terms of bottlenecking. (Network wall is the wall from
the NIC thru RAM to the CPU)
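A small Python sketch of sizing such a run: pick a keyspace so that keyspace times value size is some multiple of the cache, and build the redis-benchmark arguments with "-r" (random keyspace length), "-d" (value size in bytes) and "-n" (number of requests). The factor of 100 and the request count are arbitrary assumptions, and key and per-entry overhead are ignored:

```python
def benchmark_args(cache_bytes, value_bytes, factor=100, requests=1000000):
    """redis-benchmark arguments for a data set ~`factor` times the cache size."""
    keyspace = (cache_bytes * factor) // value_bytes   # distinct random keys needed
    return ["./redis-benchmark",
            "-r", str(keyspace),
            "-d", str(value_bytes),
            "-n", str(requests)]


# 8MB of cache, 4K values
print(" ".join(benchmark_args(8 * 1024 * 1024, 4096)))
```

The request count should be well above the keyspace size, otherwise most of the random keys are never touched and the data set stays smaller than intended.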

What can be done about Cache Coherency issues in redis-server? If the
data is in RAM and is much bigger than L2 Cache size, there will be
cache misses. Redis already has low memory overhead per row. There are
Cache-conscious data-structures, and I have no idea how Redis is in
terms of Cache Coherency, which is a very important question. Its a
complicated issue, the best general solution I know is: low memory-
overhead.

If you want a good read on modern hardware (at the level we are now
discussing), I highly recommend https://lwn.net/Articles/250967/ - 9
parts in total, about 4 semesters worth of info.

For the moment my tests ignore Cache Coherency, and this is planned,
because I am trying to isolate Software-interrupt propagation across
multi-cores.
So once I get my SET/GET numbers as high as I can, I will start
whittling them down by introducing Cache-Coherency issues :)

Hopefully all this fiddling I am doing will lead in the direction of
some tests that simulate a real world parallel work load (and Cache
Coherency is a MUST in such a test).
An open dialogue on all of the things that need to be tested to test
the parallel use of redis in the real world is something that would
benefit from having lots of people comment. There are just tons of
variables/bandwidths/hardware-paths/etc...

Good news is, this stuff is so complicated, that the only approach I
have seen work is one based on coding-discipline (minimal code, no
unneeded functionality, minimalistic data-structures, etc...) and
redis is the poster-child for this :)

On the software interrupt topic:
What I still don't know is, how does core0 communicate soft-interrupts
to core5 (which IS on a different cpu)? via the FSB? that would be
very slow compared to on-die communication.
So many variables.

Jak Sprats

Jun 4, 2010, 6:50:32 AM
to Redis DB
On Cache Conscious HashTables, people talk about compact/array
chaining (for collisions):
http://tinyurl.com/2v6cfnv (MIT thesis for free from google books)

I am not sure what Redis uses, I think it's NOT compact (i.e. lists).
Worth a read and a think for those interested, this IS core
functionality.

Jak Sprats

Jun 4, 2010, 7:24:37 PM
to Redis DB

actually Cache Coherency is the wrong term. Redis w/ multiple
processes handles Cache Coherency by not needing it, aka process
isolation.

Cache lines is more the correct term, or multiple cores sharing L2/L3
Cache efficiently.


Jak Sprats

Jun 5, 2010, 12:05:49 PM
to Redis DB

??? Anyone know anything about TNAPI ???
http://www.ntop.org/TNAPI.html

it seems like w/ TNAPI combined w/ the right NIC, you could steer Tx/Rx
flows directly to core-dedicated redis-servers

this is a good solution to the software-interrupt bottleneck


Tim Lossen

Jun 5, 2010, 2:43:27 PM
to redi...@googlegroups.com
hmmm ... looks interesting, but i found this in their FAQ:

"Q. Is TNAPI useful for general-purpose networking? -- A. No. TNAPI is
NOT designed for general purpose networking but ONLY for passive
packet capture."

so i don't think it is suited for redis.

--
http://tim.lossen.de

Jak Sprats

Jun 6, 2010, 12:20:09 AM
to Redis DB
I read that too, it's very vague ... I emailed TNAPI and asked them
what this means.
If not TNAPI then something like this ... in this direction.

if this problem can be solved in hardware, that is where it should be
solved. The decision as to which core a packet should go to should be
made at the hardware/driver level, otherwise you need to ask CPUA, who
will decide the packet needs to go to CPUX (extra CPU->CPU hop and
maybe a RAM->CPU->CPU<-RAM hop)

more info
http://download.intel.com/network/connectivity/products/whitepapers/Network_for_Multicore_wp_10_07.pdf
1.) These technologies work in concert to create independent packet
queues, direct network packets to the correct queue, map the queue to
a processor core or virtual machine, and facilitate the interaction
between system, queues, and cores.
....
2.) Multiple transmit and receive queues in the controllers allow
network traffic streams to be distributed into queues. These queues can
be associated with specific processor cores, allowing distribution
of the workload and preventing data traffic processing from
overwhelming a single core.
.....
3.) The specific flow for a given packet is determined by the
calculation of a hash value derived from fields in the packet header

Number 3 means hashing on the TCP 4-tuple and is just what parallel
redis needs.
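To illustrate what hashing on the 4-tuple buys, here is a toy Python sketch. Real RSS hardware uses a Toeplitz hash with a key and an indirection table; CRC32 is only a stand-in here, and the addresses and ports are made up:

```python
import zlib


def queue_for_flow(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Stand-in flow hash: CRC32 of the 4-tuple, reduced to a queue index."""
    flow = "%s:%d>%s:%d" % (src_ip, src_port, dst_ip, dst_port)
    return zlib.crc32(flow.encode("ascii")) % num_queues


# Every packet of one connection lands on the same queue (and so the same
# core), while different client ports spread over the queues.
for port in (50001, 50002, 50003):
    print(port, "->", queue_for_flow("10.0.0.2", port, "10.0.0.1", 6379, 8))
```

Because the queue depends only on the packet header, a connection keeps hitting one core, which avoids bouncing its packets between cores.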

Linux supports this:
http://kernelnewbies.org/Linux_2_6_30
13.3 Network
netxen: Add receive side scaling (rss) support

This is the perfect solution for the software interrupt propagation
over multiple core issue. Now I just need somebody w/ one of these
NICs or from a NIC that supports this type of thing.

This type of hardware functionality is very necessary when we start
talking about 8 and 12 cores.

Jak Sprats

Jun 7, 2010, 5:37:55 PM
to Redis DB
I've been emailing w/ the TNAPI main guy Luca Deri, he is a smart
cookie

He pointed me to a paper he wrote: http://luca.ntop.org/MulticorePacketCapture.pdf

The hardware/driver functionality that would really benefit multiple
redis-servers is just around the corner:
http://www.ntop.org/blog/?p=86

So until that hardware functionality becomes cheaper, it will be
something to remember.

Pinning a redis-server to a core, and then using this hardware
functionality to map all incoming and outgoing tcp traffic of this
redis-server's port (in hardware) to the core the redis-server is
running on, would free up a lot of CPU and will allow linear
scalability (which is the goal for the multi-core future).

I am gonna continue to experiment w/ Receive Flow Steering and also
see if multiplexing the IRQs created w/ MSI-X across all cores speeds
stuff up.


Jak Sprats

Jun 8, 2010, 10:48:30 AM
to Redis DB

something that comes up when running 2 redis-servers on a QuadCore
machine and using "taskset" to pin them to cores 0 and 1 is that CPU
affinity also applies to all forked processes, so the background save
job can really impact performance (as it would run on the same core as
its redis-server).

My trick is to do an exec() call in the child, right after the
fork(), that "taskset"s the child to run on core 2. It's not elegant,
but it works.
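The same idea can be sketched in Python. Note this is a variation, not the trick described above: instead of exec()ing taskset, the child calls sched_setaffinity on itself right after the fork (Linux only). The function name and the `work` callback are made up for illustration:

```python
import os


def fork_pinned(core, work):
    """Fork; the child re-pins itself to `core`, runs work(), then exits.

    Returns the child's pid to the parent, whose own affinity is untouched.
    """
    pid = os.fork()
    if pid == 0:                              # child: starts with the parent's affinity
        status = 1
        try:
            os.sched_setaffinity(0, {core})   # 0 = the calling process, i.e. this child
            work()
            status = 0
        finally:
            os._exit(status)                  # never fall back into the parent's code
    return pid


# Usage sketch: run a stand-in "background save" on core 2, then reap it.
#   pid = fork_pinned(2, lambda: None)
#   os.waitpid(pid, 0)
```

The parent keeps serving requests on its own core while the child does its work elsewhere.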

This leaves core 3 free to handle all the software interrupts.

This is a pretty good usage of a QuadCore machine. 235K SET/GETs per
second and real stable

Jak Sprats

Jun 16, 2010, 8:01:12 PM
to Redis DB
what a difference a NIC makes, 428K GET/SET per second!

Using the Jumbo servers from newservers.com (thanks tim) which have
the following specs
2 x Intel E5504 Quad Core 2.00 GHz (NICS have Rx/TX Multiqueue)
http://www.dell.com/downloads/global/products/pwcnt/en/iscsi-hba-product-brief.pdf

So I ran the following very basic script: (250 chars)
http://allinram.info/redis/parallel_tests/muck_irqs.sh
which maps a single RX/TX NIC queue to a single core
On these boxes that means each RX/TX queue gets its own core (8to8).
This does nothing except spread out the software interrupts.
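This is not the script linked above, just a Python sketch of the general idea: give the i-th queue IRQ a one-bit CPU mask for core i and write it to /proc/irq/<irq>/smp_affinity (needs root). The IRQ numbers 48-55 are hypothetical; the real ones come from /proc/interrupts:

```python
import os


def irq_affinity_plan(irq_numbers):
    """Pair the i-th IRQ with a hex mask selecting only core i."""
    return [(irq, format(1 << core, "x")) for core, irq in enumerate(irq_numbers)]


def apply_plan(plan, proc_irq="/proc/irq"):
    """Write each mask to <proc_irq>/<irq>/smp_affinity."""
    for irq, mask in plan:
        with open(os.path.join(proc_irq, str(irq), "smp_affinity"), "w") as f:
            f.write(mask)


# Hypothetical IRQ numbers for 8 RX/TX queues
for irq, mask in irq_affinity_plan(range(48, 56)):
    print("IRQ %d -> smp_affinity %s" % (irq, mask))
```

A daemon such as irqbalance may rewrite these masks, so it would need to be stopped for the mapping to stick.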

No process to RX/TX queue mapping is possible until hardware supports
it, so packets are still going to the wrong core 7/8 times :(.

The results were damn fast. 8 clients on one machine against 8 servers
on another machine, everyone w/ dedicated cores: 428K SET/GET per
second.
So pretty much a 2X speedup due to the 250 char script :)

I then upgraded to 2.6.35 and turned on RPS and there was no total
speed increase, but the spread across the cores was more even (before
core0 was 40% faster), so I recommend the upgrade.

W/ 2.6.31 (core0 heavy)
LOG/CLIENT/NS8_NC8/5:43699.96 requests per second - SET
LOG/CLIENT/NS8_NC8/5:49414.69 requests per second - GET
LOG/CLIENT/NS8_NC8/4:48016.32 requests per second
LOG/CLIENT/NS8_NC8/4:50813.31 requests per second
LOG/CLIENT/NS8_NC8/1:43041.10 requests per second
LOG/CLIENT/NS8_NC8/1:48248.58 requests per second
LOG/CLIENT/NS8_NC8/6:47856.82 requests per second
LOG/CLIENT/NS8_NC8/6:49152.13 requests per second
LOG/CLIENT/NS8_NC8/0:79568.43 requests per second - 80K not 45K
LOG/CLIENT/NS8_NC8/0:76706.99 requests per second - 80K not 45K
LOG/CLIENT/NS8_NC8/2:44107.62 requests per second
LOG/CLIENT/NS8_NC8/2:49195.45 requests per second
LOG/CLIENT/NS8_NC8/3:41574.91 requests per second
LOG/CLIENT/NS8_NC8/3:51067.71 requests per second
LOG/CLIENT/NS8_NC8/7:40570.79 requests per second
LOG/CLIENT/NS8_NC8/7:50249.54 requests per second
428K SET/GET per second

w/ 2.6.35 and RPS turned on (each rx-queue)
LOG/CLIENT/NS8_NC8/5:51641.11 requests per second - SET
LOG/CLIENT/NS8_NC8/5:50733.25 requests per second - GET
LOG/CLIENT/NS8_NC8/4:52571.66 requests per second
LOG/CLIENT/NS8_NC8/4:52271.39 requests per second
LOG/CLIENT/NS8_NC8/1:50789.79 requests per second
LOG/CLIENT/NS8_NC8/1:50872.87 requests per second
LOG/CLIENT/NS8_NC8/6:51492.41 requests per second
LOG/CLIENT/NS8_NC8/6:51840.95 requests per second
LOG/CLIENT/NS8_NC8/0:63021.43 requests per second - 62K not 51K
LOG/CLIENT/NS8_NC8/0:61406.27 requests per second - 62K not 51K
LOG/CLIENT/NS8_NC8/2:51570.80 requests per second
LOG/CLIENT/NS8_NC8/2:52118.78 requests per second
LOG/CLIENT/NS8_NC8/3:51132.79 requests per second
LOG/CLIENT/NS8_NC8/3:51414.39 requests per second
LOG/CLIENT/NS8_NC8/7:51605.53 requests per second
LOG/CLIENT/NS8_NC8/7:51285.40 requests per second
423K SET/GET per second
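Adding up listings like these by hand gets tedious, so here is a rough Python sketch that sums them. It assumes the "path:number requests per second" format shown above and that the first line of each log is the SET result and the second the GET result, as the annotations on the first two lines suggest:

```python
import re
from collections import defaultdict

LINE = re.compile(r"^(?P<log>\S+?):(?P<rps>[\d.]+) requests per second")


def totals_by_position(grep_output):
    """Sum the 1st, 2nd, ... result line of every log file separately."""
    seen = defaultdict(int)        # log path -> result lines seen so far
    totals = defaultdict(float)    # position within the log -> summed req/sec
    for line in grep_output.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue               # e.g. the hand-written summary line
        totals[seen[m["log"]]] += float(m["rps"])
        seen[m["log"]] += 1
    return dict(totals)


sample = """LOG/CLIENT/NS8_NC8/5:51641.11 requests per second - SET
LOG/CLIENT/NS8_NC8/5:50733.25 requests per second - GET
LOG/CLIENT/NS8_NC8/4:52571.66 requests per second
LOG/CLIENT/NS8_NC8/4:52271.39 requests per second
"""
for position, total in sorted(totals_by_position(sample).items()):
    print("result line %d: %.2f requests per second in total" % (position, total))
```

Keeping SET and GET apart matters because the two can differ by several thousand requests per second per instance.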

full results are here (totally unorganised)
http://allinram.info/redis/parallel_tests/Jun16_2_server_test.txt

I would imagine 2*Quad...@3.0Ghz could reach 550K-600K depending on
FSB speed.

Pieter Noordhuis

Jun 17, 2010, 4:25:05 AM
to redi...@googlegroups.com
Hi Jak,

Great results! Very interesting to see that a dedicated RX/TX queue per core improves performance this much. On to half a million kops! ;-)

Cheers,
Pieter


Jak Sprats

Jun 17, 2010, 8:06:29 PM
to Redis DB
I created a repository with these scripts:
http://github.com/JakSprats/Redis-1.2.6-cluster

Minimal documentation can be found here:
http://github.com/JakSprats/Redis-1.2.6-cluster/blob/master/CLUSTER_README
