A jstack log file might show over 200 of them at a given time (we have
300 worker threads per Tomcat).
At first I thought this could be a networking issue: more latency or
less bandwidth means longer response times, which could mean more time
blocking on I/O. But we ran some tests in our production environment
and the network seems to be fine. We were able to test from our client
Tomcat to all the target memcached servers and everything looked OK.
Now we are having a hard time identifying what might be causing this
issue.
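Since we can't see inside spymemcached, one rough way to confirm the pile-up theory is to tally, per monitor, how many threads a jstack dump shows blocked on it; if most workers are queued on a single lock, that points at the shared client. This is only a sketch (the class name and parsing are mine, nothing official):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JstackLockTally {
    // HotSpot jstack prints lines like: "- waiting to lock <0x00000000ab12cd> ..."
    private static final Pattern WAITING =
        Pattern.compile("waiting to lock <(0x[0-9a-f]+)>");

    // Returns monitor id -> number of threads waiting on that monitor.
    public static Map<String, Integer> tally(String jstackDump) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = WAITING.matcher(jstackDump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java JstackLockTally dump.txt
        String dump = new String(java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(args[0])));
        tally(dump).forEach((lock, n) ->
                System.out.println(lock + " : " + n + " threads waiting"));
    }
}
```

If one monitor id accounts for most of the ~200 blocked threads, that's the contended lock.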
Our memcached box holds 4 memcached servers and has a peak eth0
transfer of 30Mbit/s (the link has proven to handle a lot more, more
than 100Mbit of real data transfer).
Our cache servers (all holding different data) on that box handle the
following load:
{{{
Server # | Hits/s | GET/s | SET/s | Misses/s
       1 |    478 |    505 |   454 |      41
       2 |    207 |    333 |   128 |     128
       3 |   1350 |   1350 |  1480 |       0
       4 |   2870 |   3210 |   836 |     339
}}}
Our CPU is about 95% idle all day, with around 0.6% user and 0.35%
system. Load is 0.28 max (4-core machine) and 0.12 on average, with a
max of 7.7L eth0 interrupts/second.
Additionally, we are measuring an aggregated 1-hour view of the time
we spend going to memcached (measured "around" the memcachedClient
call). These values are taken from one Tomcat, of course, and include
ALL get calls to memcached (all servers, all nodes).
{{{
Units=ms.: (Hits=3058393.0, Avg=43.88392073876706, Total=1.34214276E8,
Min=0.0, Max=3808.0)
}}}
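For context, the aggregation behind a line like that can be as simple as the following sketch (the class and method names are mine, not spymemcached API):

```java
// Accumulates per-call latencies and reports hits, average, total,
// min, and max, matching the shape of the stats line above.
public class LatencyStats {
    private long hits = 0;
    private double total = 0, min = Double.MAX_VALUE, max = 0;

    public synchronized void record(double ms) {
        hits++;
        total += ms;
        if (ms < min) min = ms;
        if (ms > max) max = ms;
    }

    public synchronized String summary() {
        double avg = hits == 0 ? 0 : total / hits;
        return String.format(java.util.Locale.ROOT,
                "Hits=%d, Avg=%.2f, Total=%.1f, Min=%.1f, Max=%.1f",
                hits, avg, total, min, max);
    }
}
```

Wrapping each get in System.nanoTime() before/after and calling record((t1 - t0) / 1e6) produces the numbers we reported.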
That 43 ms AVERAGE time is killing us... not to mention the periods
when we are above it. I believe that value is high because most of the
time is spent waiting for the lock to free. Sadly, I have no data on
how much time the I/O part actually took, as that happens inside
spymemcached.
On the last website where we had this issue, we solved it by creating
more memcached clients for each server (around 5 to 10) and the
problem stopped. But here that does not seem to be solving our issues
(or we haven't hit the sweet spot for the number of clients we should
use).
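For reference, the "more clients per server" workaround boils down to a tiny round-robin pool, so concurrent requests spread across N independent connections instead of contending on one. A minimal sketch (the generic C would stand in for spymemcached's MemcachedClient; the class name is mine):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Round-robins requests over several independent client instances
// so threads don't all serialize on one client's internal lock.
public class ClientPool<C> {
    private final List<C> clients;
    private final AtomicLong counter = new AtomicLong();

    public ClientPool(List<C> clients) {
        this.clients = clients;
    }

    // Pick the next client in round-robin order; lock-free.
    public C next() {
        int i = (int) Math.abs(counter.getAndIncrement() % clients.size());
        return clients.get(i);
    }
}
```

Each call to next() returns the following client in the list, wrapping around, so get traffic is split roughly evenly.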
Any insight or tips on how to find our bottleneck would be greatly
appreciated. I haven't found a forum for spymemcached, so I'm sorry if
this is not the correct place to post this.
I've read several posts on this elsewhere
(http://groups.google.com/group/spymemcached/browse_thread/thread/93e100893c7ac778/54163aec33c43e97?lnk=raot&pli=1 ,
http://code.google.com/p/spymemcached/issues/detail?id=104 and others,
but last time I posted in the wrong place and didn't receive a proper
answer).
Regards
Andres B.
PS. All our servers run inside a cloud, on high-I/O instances with 8
CPUs (equivalent).
Thanks for all the info. We'll be taking a deeper look into some of
the contention shortly as we're in the process of getting some more
test infrastructure built out.