This is an issue with how your network interrupts are being routed, not
with how memcached is being threaded.
Wish I had some good links offhand for this, because it's a little obscure
to deal with. In short: you'll want to balance your network interrupts
across cores. Google for blog posts about smp_affinity for network cards
and irqbalance (which tries, poorly, to do this automatically).
Depending on how many NICs you have and whether they're multiqueue or not,
you'll have to tune things differently. Linux 2.6.35 has some features for
improving the throughput of single-queue NICs (find the pages discussing it
on kernelnewbies.org).
- cat /proc/interrupts - look for "eth0", "eth1", etc. If you have one
interrupt assigned to eth0, you have a single-queue NIC.
If you have many interrupts that look like "eth0-0", "eth0-1", etc., you
have a multi-queue NIC. These can have their interrupts spread across
more cores.
Use either irqbalance (probably a bad idea) or echo values into
/proc/irq/nn/smp_affinity (google for help with this) to spread out the
interrupts. You can then experiment with using `taskset` to bind memcached
to the same CPUs as the interrupts, or to different CPUs, and see if the
throughput changes.
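As a sketch of the above (the IRQ numbers and the eth0 name here are
assumptions; check /proc/interrupts on your own box first):

```shell
# See which IRQs your NIC uses, and which CPUs have been servicing them:
grep eth0 /proc/interrupts

# Suppose eth0-0 turned out to be IRQ 40 and eth0-1 IRQ 41. Pin them to
# CPU0 and CPU1 respectively (the value is a hex bitmask of allowed CPUs):
echo 1 > /proc/irq/40/smp_affinity   # CPU0 (mask 0x1)
echo 2 > /proc/irq/41/smp_affinity   # CPU1 (mask 0x2)

# Then try binding memcached to the same CPUs as the interrupts...
taskset -c 0,1 memcached -u nobody -m 1024 -t 2

# ...or to different ones, and compare throughput:
taskset -c 2,3 memcached -u nobody -m 1024 -t 2
```

Writing smp_affinity requires root, and irqbalance will overwrite your
masks if it's still running, so stop it first while testing.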
- look up "linux sysctl network tuning"
This tends to give you crap like this:
net.ipv4.ip_local_port_range = 9500 65536
net.core.rmem_max = 1048576
net.core.wmem_max = 1048576
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 43690 4194304
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_max_syn_backlog = 16384
#net.ipv4.tcp_synack_retries = 2
net.core.netdev_budget = 1000
net.ipv4.tcp_max_tw_buckets = 1512000
Which values make sense will vary depending on whether you're using
persistent connections or not (i.e., how high the connection turnover is).
Don't blindly copy/paste this stuff.
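If you do experiment with sysctls like the ones above, it's easier to try
them live one at a time than to edit config files (the specific key here
is just one from the list above, not a recommendation):

```shell
# Set a value for the running kernel only -- it's gone after reboot:
sysctl -w net.core.rmem_max=1048576

# Check the current value:
sysctl net.core.rmem_max

# Once a setting has actually been benchmarked, persist it in
# /etc/sysctl.conf and reload:
sysctl -p
```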
- Read up on all the "ethtool" options available for your NIC. Ensure the
defaults actually work for you.
NICs are all configured for a balance between packet latency and
throughput. The more interrupts they coalesce within the driver, the
higher the potential latency of packet return. This can be really tedious
to work through, and the settings will vary with every NIC you fiddle
with. I've managed to get differences of 40k-60k packets per second by
tuning these values. Often the latency doesn't get much worse.
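The coalescing knobs live under ethtool's -c/-C options. Which ones your
driver actually honors varies by NIC, and the numbers below are
illustrative only, not recommendations:

```shell
# Show current interrupt coalescing settings for this NIC:
ethtool -c eth0

# Lower rx-usecs to favor latency; raise it to coalesce more interrupts
# and favor throughput:
ethtool -C eth0 rx-usecs 50

# Ring buffer sizes are also worth a look while you're in there:
ethtool -g eth0
ethtool -G eth0 rx 4096
```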
- Use a recent kernel.
On a particular piece of recent hardware I doubled packet throughput by
moving from 2.6.27 to 2.6.32. I was not able to push 2.6.18 hard without
having it drop networking.
- Use a more recent kernel
http://kernelnewbies.org/Linux_2_6_35#head-94daf753b96280181e79a71ca4bb7f7a423e302a
I haven't played with this much yet, but it looks like a BFD, especially
if you're stuck with single-queue NICs.
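The 2.6.35 feature linked above (Receive Packet Steering) is toggled
through sysfs; the interface and queue names here are assumptions for
a single-queue eth0:

```shell
# The value is a hex bitmask of CPUs allowed to process packets from
# this receive queue -- 'f' spreads rx-0's work across CPUs 0-3:
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

This lets a single-queue NIC fan packet processing out to multiple
cores in software, which is why it matters most when you can't spread
the hardware interrupts themselves.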
- Get a better NIC.
10GbE NICs have awesomesauce features for shoveling more packets around.
Straight-gbit NICs vary in how awesome they are as well. I like the
high-end Intels more than most of the Broadcoms, for instance.
- Don't think running multiple instances of memcached will make much of a
difference.
Maybe run more threads instead, or try pinning them to a set of CPUs or a
particular CPU.
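Concretely, that looks like bumping memcached's -t flag and optionally
wrapping it in taskset (the thread counts and CPU lists here are just
examples to experiment with, not tuned values):

```shell
# -t sets memcached's worker thread count (the default is 4):
memcached -u nobody -m 2048 -t 8

# Or pin the whole process to a CPU set at startup:
taskset -c 0-3 memcached -u nobody -m 2048 -t 4

# You can also repin an already-running instance:
taskset -cp 0-3 $(pidof memcached)
```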