Re: Memcached doesn't seem to be load balanced


Brian Moon

Mar 20, 2009, 5:43:55 PM
to memc...@googlegroups.com
On 3/20/09 4:17 PM, meppum wrote:
> Recently, I started to look closer at how our cluster is performing. I
> have noticed that it does not look like Memcached is as naturally
> balanced as it should be.

It is not balanced. It is pseudo-random.

> We run ten Memcached instances across ten physical machines. They are
> all Windows boxes (we are eventually converting to Linux) and at the
> default settings (-c 1024 -m 1024, etc). There are no load balancers
> between the Memcached servers.

I am not really sure what you would do with load balancers and memcached.

> I have looked at many days of data and it seems that one server is
> always processing the most req/sec and another server is always
> processing the least. In fact, the req/sec that the cluster processes
> seems to cascade across each server, and the order is always the same.
> I have provided a link to this graph (real world data) and would like
> someone's input as to why this might be happening. Could the load
> balancing at the webserver tier be the culprit?

Do you perhaps have one very popular piece of cache that is requested
much more than others? We have this at dealnews. Our front page cache
is requested much more than any other single piece of cache. In our
case we have enough other stuff that you can't really see it on the
graphs, but that one cache is always coming from the same server.

> I have also noticed some odd behavior when one server goes "down" and
> is no longer available to the cluster.
>
> In this case it seems that one server takes the brunt of the load
> instead of it being distributed across the other servers evenly. Does
> anyone know more about the failover algorithm as to how the load to a
> "down" server is redistributed across the cluster? I have provided
> another graph (real world data).
>
> http://meppum.com/random/spike.png

What client are you using? I have never seen such a thing. And in that
graph, which server went down? All the nodes keep a consistent line
there. I don't see a node dropping out of the graph.

Brian.

meppum

Mar 20, 2009, 10:02:08 PM
to memcached
Brian,

Thanks for the reply. My responses are below.

On Mar 20, 5:43 pm, Brian Moon <br...@moonspot.net> wrote:
> On 3/20/09 4:17 PM, meppum wrote:
>
> > Recently, I started to look closer at how our cluster is performing. I
> > have noticed that it does not look like Memcached is as naturally
> > balanced as it should be.
>
> It is not balanced.  It is pseudo-random.
Pseudo-random... now everything is starting to make sense, except for
the fact that the patterns I am noticing are not at all random.
>
> > We run ten Memcached instances across ten physical machines. They are
> > all Windows boxes (we are eventually converting to Linux) and at the
> > default settings (-c 1024 -m 1024, etc). There are no load balancers
> > between the Memcached servers.
>
> I am not really sure what you would do with load balancers and memcached.
>
I was just trying to be clear. But I agree.

> > I have looked at many days of data and it seems that one server is
> > always processing the most req/sec and another server is always
> > processing the least. In fact, the req/sec that the cluster processes
> > seems to cascade across each server, and the order is always the same.
> > I have provided a link to this graph (real world data) and would like
> > someone's input as to why this might be happening. Could the load
> > balancing at the webserver tier be the culprit?
>
> Do you perhaps have one very popular piece of cache that is requested
> much more than others?  We have this at dealnews.  Our front page cache
> is requested much more than any other single piece of cache.  In our
> case we have enough other stuff that you can't really see it on the
> graphs, but that one cache is always coming from the same server.

This makes sense, and might be the reason for the single server spike
in the spike.png graph. What surprised me is that the balance.png
graph shows the req/sec for each server, and the order the servers are
in that graph is always the same, even if the servers are restarted. I
would expect that this order would change and appear more random. Any
insight as to why this might be happening?
>
> > I have also noticed some odd behavior when one server goes "down" and
> > is no longer available to the cluster.
>
> > In this case it seems that one server takes the brunt of the load
> > instead of it being distributed across the other servers evenly. Does
> > anyone know more about the failover algorithm as to how the load to a
> > "down" server is redistributed across the cluster? I have provided
> > another graph (real world data).
>
> >http://meppum.com/random/spike.png
>
> What client are you using?  I have never seen such a thing.  And in that
> graph, which server went down?  All the nodes keep a consistent line
> there.  I don't see a node dropping out of the graph.

Spike.png is the one where the server went down. It is actually at
zero the entire time for that dataset. I'm using the latest Windows
client 1.2.1. I'm thinking that the spike might be due to a piece of
common data being accessed. The one thing that makes me second guess
that idea is that the same server always experiences significantly
more load than the others.

-meppum
>
> Brian.

Jose Celestino

Mar 20, 2009, 10:09:11 PM
to memc...@googlegroups.com
Words by meppum [Fri, Mar 20, 2009 at 07:02:08PM -0700]:

>
> Brian,
>
> Thanks for the reply. My responses are below.
>
> On Mar 20, 5:43 pm, Brian Moon <br...@moonspot.net> wrote:
> > On 3/20/09 4:17 PM, meppum wrote:
> >
> > > Recently, I started to look closer at how our cluster is performing. I
> > > have noticed that it does not look like Memcached is as naturally
> > > balanced as it should be.
> >
> > It is not balanced.  It is pseudo-random.
> Pseudo-random... now everything is starting to make sense, except for
> the fact that the patterns I am noticing are not at all random.

Do you mean you get more requests on one of the servers? More used
memory? More keys?

Of all of these, only the last indicates a problem in memcached. Are
there any hot keys in your application? Are all the entries of similar
size? What client library are you using?

--
Jose Celestino | http://japc.uncovering.org/files/japc-pgpkey.asc
----------------------------------------------------------------
"One man’s theology is another man’s belly laugh." -- Robert A. Heinlein

Brian Moon

Mar 20, 2009, 11:46:39 PM
to memc...@googlegroups.com
On 3/20/09 9:33 PM, meppum wrote:
> More requests, more memory, and more keys (I believe this is
> "curr_items" in stats). I am seeing a difference of 7-9% in the number
> of keys between the server with the most and the server with the
> least.
>
> I'm sure there are hot keys, though I haven't really thought about
> what they might be specifically. The entries vary in size. I'm using
> the windows 1.2.1 library.

Windows is not a language. What _client_ library are you using?

Brian.

Peter J. Holzer

Mar 21, 2009, 5:00:10 AM
to memc...@googlegroups.com
On 2009-03-20 19:02:08 -0700, meppum wrote:
> This makes sense, and might be the reason for the single server spike
> in the spike.png graph. What surprised me is that the balance.png
> graph shows the req/sec for each server, and the order the servers are
> in that graph is always the same, even if the servers are restarted. I

The mapping from key to server happens in the client, so as long as all
the servers are up this is to be expected. The mapping doesn't change
when a server is restarted, so the order stays the same.
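
For illustration, a minimal sketch in PHP of that idea -- standard,
non-consistent hashing with made-up host names. The PECL client's
actual hash function differs, but the point is the same: the key alone
picks the server, so the mapping survives restarts as long as the
server list doesn't change.

    <?php
    // Sketch of client-side server selection with plain modulo hashing.
    // Hypothetical pool; a real client keeps an equivalent list internally.
    $servers = array('mc01:11211', 'mc02:11211', 'mc03:11211');

    function server_for_key($key, array $servers) {
        $hash = abs(crc32($key));                  // deterministic hash of the key
        return $servers[$hash % count($servers)];  // same key -> same server, every time
    }

    echo server_for_key('session:abc123', $servers), "\n";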

When a server is down, the load from this server will be redistributed
to the other servers, so the order on the remaining servers may change.
But when the server is up again, the original order should be restored.
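
One plausible failover strategy (not necessarily what any given client
does internally) is to drop the dead host from the pool and hash again
against what is left, sketched below with placeholder host names. Note
that with plain modulo hashing this can also move keys that never
lived on the dead server, which is one reason clients offer consistent
hashing.

    <?php
    // Sketch: if the selected server is down, remove it from the pool and
    // re-hash the key against the remaining servers.
    $servers = array('mc01:11211', 'mc02:11211', 'mc03:11211');
    $down    = array('mc02:11211');

    function server_with_failover($key, array $servers, array $down) {
        $alive = array_values(array_diff($servers, $down));  // surviving servers only
        return $alive[abs(crc32($key)) % count($alive)];
    }

    echo server_with_failover('session:abc123', $servers, $down), "\n";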

> Spike.png is the one where the server went down. It is actually at
> zero the entire time for that dataset. I'm using the latest Windows
> client 1.2.1. I'm thinking that the spike might be due to a piece of
> common data being accessed. The one thing that makes me second guess
> that idea is that the same server always experiences significantly
> more load than the others.

In http://meppum.com/random/balance.png the server with the highest load
is server5, not server1. I was wondering about that, but now I see that
you probably removed server1 from that graph so that the order of the
other servers is visible - both graphs show the same time period, right?

I also notice that all the servers except server1 have a very smooth
access pattern. But server1 has a jaggy access pattern - there are lots
of little spikes and troughs. Is it possible that these patterns come
from a single hot item? Or maybe some of your clients use only server1,
while others use all servers?

hp

--
Peter J. Holzer | Sysadmin WSR | h...@hjp.at | http://www.hjp.at/
Openmoko has already embedded voting system. Named "If you want it -- write it"
-- Ilja O. on comm...@lists.openmoko.org

Henrik Schröder

Mar 21, 2009, 5:26:10 AM
to memc...@googlegroups.com
On Sat, Mar 21, 2009 at 03:02, meppum <mme...@gmail.com> wrote:

This makes sense, and might be the reason for the single server spike
in the spike.png graph. What surprised me is that the balance.png
graph shows the req/sec for each server, and the order the servers are
in that graph is always the same, even if the servers are restarted. I
would expect that this order would change and appear more random. Any
insight as to why this might be happening?


No, if you use consistent hashing, then given the same servers and the same keys, each key will always be mapped to the exact same server. Restarting the servers or the application won't change that, and it would be fatal if it did. And if the keys that your application is using don't change, then the distribution of items won't change either.
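
For anyone unfamiliar with the term, a very rough sketch of the
consistent-hashing idea in PHP. Real clients place many points per
server on the ring and use a better hash function, and the host names
below are made up, but the stability described above falls out of this
structure.

    <?php
    // Sketch of a consistent-hashing ring: servers are hashed onto a ring,
    // and a key goes to the first server point at or after the key's hash.
    function ring_server($key, array $servers) {
        $ring = array();
        foreach ($servers as $s) {
            $ring[abs(crc32($s))] = $s;   // one point per server (real clients use many)
        }
        ksort($ring);
        $h = abs(crc32($key));
        foreach ($ring as $point => $s) {
            if ($h <= $point) {
                return $s;                // first point clockwise from the key
            }
        }
        return reset($ring);              // wrapped past the last point: back to the start
    }

    $servers = array('mc01:11211', 'mc02:11211', 'mc03:11211');
    echo ring_server('session:abc123', $servers), "\n";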

The simple explanation for you seeing different load on your servers is that some keys are more requested than others, and the servers with a higher load are simply the ones that those keys were mapped to.

A more interesting stat to look at is the number of items in each memcached server. If you have an imbalance there, it might indicate that the hash function your client is using isn't scattering keys that well, or, if you're using custom hashes in your application, that you need to do something about their distribution.
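
One way to pull that number with the PECL memcache extension is
getExtendedStats(), which returns a stats array per server; curr_items
is the item count. A minimal sketch, with placeholder host names:

    <?php
    // Sketch: compare curr_items across the pool to spot key-distribution skew.
    $mc = new Memcache();
    foreach (array('mc01', 'mc02', 'mc03') as $host) {
        $mc->addServer($host, 11211);
    }
    foreach ($mc->getExtendedStats() as $server => $stats) {
        if (is_array($stats)) {           // false means the server was unreachable
            echo $server, ': ', $stats['curr_items'], " items\n";
        }
    }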


/Henrik

Dustin

Mar 21, 2009, 2:16:58 PM
to memcached

On Mar 21, 9:53 am, meppum <mmep...@gmail.com> wrote:

> item. However, 90% of the items we cache are session objects.

I suspect that in Brian Moon's configuration, significantly more
than 90% of his cached content is not his front page, but I'm sure
that doesn't affect its status as the most requested item.

Brian Moon

Mar 21, 2009, 3:35:39 PM
to memc...@googlegroups.com

That is correct. We also store session in memcached. But, as for the
single most requested key, it is the proxied cache of our front page.

Brian.

Xaxo

Mar 22, 2009, 8:45:26 AM
to memcached
On Mar 21, 5:53 pm, meppum <mmep...@gmail.com> wrote:
> I agree with what you are saying, however we aren't using consistent
> hashing. As far as I know the default PECL PHP client doesn't use
> consistent hashing. Taking into account that fact and that 90% of the
> data we cache are sessions with a 20 minute timeout I would think that
> the distribution wouldn't be so static. Also, I have noticed that the
> distribution of keys on our servers differs by up to 9% between the
> server with the most keys and the one with the least.

If you are using http://pecl.php.net/package/memcache , 3.x comes with
consistent hashing as the default. There might be two causes for your
imbalance:
* several keys being requested more often than all the others (as
said before)
* the hashing algorithm + the keys that you use
and of course the combination of both. A shot in the dark would be to
put more weight (just increase that number) on all servers; this will
influence both the standard and the consistent hashing algorithms and
will redistribute your keys in a different way.
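
For reference, in that extension the weight is the fourth argument to
Memcache::addServer(). A minimal sketch of bumping it on every server,
with placeholder host names:

    <?php
    // Sketch: raise the per-server weight, as suggested above.
    // Arguments: host, port, persistent, weight.
    $mc = new Memcache();
    $mc->addServer('mc01', 11211, true, 2);
    $mc->addServer('mc02', 11211, true, 2);
    $mc->addServer('mc03', 11211, true, 2);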

What you should do in order to find out what is exactly happening is
log all the keys that you are requesting. You will then see if some
keys are being requested more than the others. Simulating a server
selection on this data set will also give you the key -> server
mappings, showing you what exactly happens. After that you can try
(simulating again) whether the other hashing algorithm that this module
provides gives you a better request distribution. Next try would be
changing the most requested keys, so that they get mapped to different
servers. Last try would be implementing an algorithm for fair request
distribution. This would require storing keys on multiple servers and
a lot of creativity :)
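
A sketch of that replay idea: feed the logged keys through whatever
selection function you want to test and count how many requests each
server would get. The key log file name, host names, and the hash used
here are all placeholders; swap in the strategy you actually want to
evaluate.

    <?php
    // Sketch: replay logged keys through a candidate selection function
    // and count how many requests each server would see.
    $servers = array('mc01:11211', 'mc02:11211', 'mc03:11211');

    function pick_server($key, array $servers) {
        // Stand-in for whichever hashing strategy you are evaluating.
        return $servers[abs(crc32($key)) % count($servers)];
    }

    $counts = array();
    foreach (file('keys.log', FILE_IGNORE_NEW_LINES) as $key) {
        $s = pick_server($key, $servers);
        $counts[$s] = isset($counts[$s]) ? $counts[$s] + 1 : 1;
    }
    arsort($counts);
    print_r($counts);   // the busiest servers (and hence the hot keys' homes) come first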
