> But yeah, hopefully you fixed it there.
I suppose we finally found it.
I guess you might add those points to the timeout wiki.
The problem seemed to boil down to the following:
vm.swappiness=60 (the default) is a very bad idea when combined
with deadline as the io scheduler.
symptoms:
- after memcached has been up for some days, timeouts start.
- after a server reboot, timeouts are gone.
- after a memcached restart, timeouts are gone.
- after memcached has been up for some days again, timeouts are back.
- compiling with the newest libevent/glibc improved
things, but did not remove the problems.
environment:
- server 8 GB RAM, dedicated for mysql+memcached
- memcached configured for 2 GB
- mysqld running, showing ~2000 queries/second (24h average)
- mysql needs about 2-3 GB RAM.
- we're caching objects for up to one week.
so normally one would expect: 8 GB RAM - 3 GB mysql - 2 GB memcached = 3 GB
should be sufficient for the OS, which does nothing else (apart from
running an amanda backup client during the backup at night).
the above machine started to swap out (only some MB, it was
less than 40 MB, almost nothing, since with vm.swappiness=60
linux swaps out normal OS processes, too).
And if there is one rarely used page of the memcache
swapped out, and an item from that page is requested, when
at the same time one mysql query creates some io, then
the corresponding memcache-page will be swapped in AFTER
the mysql process finishes (deadline io scheduler).
and that could be too late for the one second timeout.
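
One way to check whether memcached itself has pages in swap, as a minimal
sketch: on reasonably recent Linux kernels, /proc/<pid>/status carries a
VmSwap line. The function names here are made up for illustration, and the
pid has to be looked up separately.

```python
import re


def vm_swap_kb(status_text):
    """Return the VmSwap value in kB from the text of /proc/<pid>/status.

    Returns None if the kernel does not report a VmSwap line.
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vm_swap_kb(pid):
    """Read VmSwap for a running process (hypothetical helper, Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return vm_swap_kb(status_file.read())
```

Any value above zero for the memcached pid means a request can hit a
swapped-out page and has to wait for disk io.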
I never would have expected a machine to start swapping when
free tells me that there are some GB free for caches.
I never ever would have thought that 40 MB in swap, when
3+ GB are caches/free, would cause such a problem.
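
The same picture that free gives can be read from /proc/meminfo. A small
sketch, assuming the usual "Key:   value kB" layout of that file (the
function names are only illustrative):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Key: <number> ..." lines.
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use, in kB: total swap minus free swap."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as meminfo_file:
        return parse_meminfo(meminfo_file.read())
```

With this, swap_used_kb(read_meminfo()) can be non-zero even while MemFree
and Cached together show several GB, which is exactly the situation above.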
conclusion:
if you use memcached and need large amounts of memory with
many objects, keep an eye on your swap, and if there is
something in it (even 1 kB) - it might be too much.
after setting vm.swappiness to zero and paging in all swap,
the effects were gone.
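
A rough sketch of those two steps. The commands are an assumption about how
one would typically do it, not a record of what was run here: sysctl -w sets
the value until the next reboot (a vm.swappiness line in /etc/sysctl.conf
makes it permanent), and swapoff -a followed by swapon -a forces everything
in swap back into RAM, which needs root and enough free memory. The function
name is made up for illustration; it only builds the command list and does
not run anything.

```python
def remediation_commands(current_swappiness, swap_used_kb):
    """Return the commands (as argument lists) needed to reach
    vm.swappiness=0 with nothing left in swap."""
    commands = []
    if current_swappiness != 0:
        commands.append(["sysctl", "-w", "vm.swappiness=0"])
    if swap_used_kb > 0:
        # Disabling swap pages everything back in; re-enable it afterwards.
        commands.append(["swapoff", "-a"])
        commands.append(["swapon", "-a"])
    return commands
```

The current value can be read from /proc/sys/vm/swappiness, and each
returned list can be passed to subprocess.run(..., check=True) as root.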
It also explains why we encountered the problem only
on this machine whereas on about 40 other servers we
did not - this machine had the oldest hardware,
the slowest disks (even if raid10 on 4 disks) and the
least memory (our newer mysql/memcached servers
have 16..32 GB RAM), so they did not start to swap out
even with vm.swappiness=60 ...
regards.
Werner.