caching too much data?

Maher Kassem

Nov 30, 2010, 5:27:24 PM
to lusca-users
There is a constant theme that we seem to be facing with our caches.
Whether it's squid 2.7 (whatever STABLE release) or LUSCA, it seems
that once our cached content reaches approximately 300+ GB on our hard
drives, squid begins to _mess up_. It stops serving requests properly,
and then you can see a parabola in the bandwidth: it jitters and
twitches, brings the bandwidth down (by stopping requests for a
limited period of time), then brings it back up again, and then drops
it once more. We are using 1 TB hard drives and we are nowhere near
their capacity. Is anyone else seeing this problem with extremely
large caches? If so, do you think a workaround would be to divide a
1 TB drive (on one partition) into 4 smaller cache_dirs instead of
just one large cache_dir? Any suggestions are much appreciated.

Sincerely,
Maher.

Adrian Chadd

Nov 30, 2010, 6:38:31 PM
to lusca...@googlegroups.com
(Ah, why do I give away free advice without a support plan. :-)

One of the biggest mistakes people make is whacking a single enormous
disk in with a single cache_dir.

Sometimes people remember that there's a maximum number of seeks a
disk can do per second, and they connect that to "objects I can HIT
per second." Sometimes they also remember that the same limit caps how
many objects can be written to the disk.

But the thing people forget most often is that it's the sum of ALL
disk operations - creation, reading AND deleting. Unlinking files is a
huge deal, mainly because the unlink rate isn't one object at a time;
it's "however many objects need to be deleted to make room for this
request" at a time.

If you have a single aufs cache_dir of 1TB and it's storing objects
between, say, 64k and infinity (ie, < 64k goes on COSS!), then
whenever you need to make space for, say, a 1 gig file, you have to
either:

* delete one 1 gig file to make room; or
* delete (1024 * 1024 / 64) = 16,384 64-kbyte files to make room.

Guess what next happens? :-)

One of the big reasons COSS wins on small-object workloads is that it
doesn't -have- a delete pass - deletion is implied by the way the COSS
cache_dir stores objects: old objects simply get overwritten as the
stripe wraps around.

It's possible I'm on the wrong track, but what you describe is what
I've seen happen time and time again. What I suggest is that you take
steps to limit how big objects can get in each cache_dir. What I do is
this (a rough config sketch follows the list):

* COSS goes on separate disks, NEVER shared with anything
* each 1TB AUFS disk has two (maybe three now, if needed) AUFS cache_dirs:
  + 1 x large (128k -> 2mb)
  + 1 x huge (2mb -> 1gb)
  + (maybe 1 x enormous) (1gb -> whatever) - if you want to cache such
    large files
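
For concreteness, here's a rough sketch of what that might look like
as cache_dir lines. The paths, sizes and L1/L2 values are invented,
and it assumes your 2.7/Lusca build accepts both max-size= (standard)
and min-size= on cache_dir - if yours doesn't have min-size, drop it
and each tier is only bounded from above:

# paths and sizes are invented -- adjust to your disks
# small objects: COSS on its own disk, never shared with AUFS
cache_dir coss /coss/stripe1 7000 max-size=131072 block-size=512

# one 1TB AUFS disk, split by object size
cache_dir aufs /cache1/large 300000 64 256 min-size=131072 max-size=2097152
cache_dir aufs /cache1/huge  450000 64 256 min-size=2097152 max-size=1073741824

# only if you really want to cache objects bigger than 1gb
# cache_dir aufs /cache1/enormous 100000 16 256 min-size=1073741824

If I remember right, a single COSS cache_dir also has a size ceiling
tied to its block-size, so serious COSS capacity usually means several
stripes rather than one big one.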

The bulk of the average request rate is for objects where
0 < n < 128kb; that's where COSS gets you the win. Each "large" object
will at most have to delete (2048kb / 128kb) = 16 objects. Each "huge"
object will at most have to delete (1024mb / 2mb) = 512 objects; but
since you're (a) doing almost no requests for objects that size, and
(b) those objects involve enough disk IO to make the deletion overhead
small in comparison, you still come out ahead.
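
Putting the worst cases side by side (same numbers as above, plus the
single-big-cache_dir example from earlier):

  tier                   object range      worst-case deletes per new object
  COSS                   up to ~128k       0     (no delete pass)
  "large" AUFS           128k -> 2mb       2mb / 128k = 16
  "huge"  AUFS           2mb -> 1gb        1gb / 2mb  = 512
  single 1TB AUFS        64k -> anything   1gb / 64k  = 16,384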

Step 0 for identifying whether this is actually your problem is to
write some code to profile your request/hit rate. Then you can see
which object sizes account for the bulk of your requests and hits.
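
Something like the following (untested) sketch would do as a first
pass. It assumes the default "squid" access.log format, where the
fourth whitespace-separated field is the result code (TCP_HIT/200 and
friends) and the fifth is the reply size in bytes; the bucket
boundaries just mirror the cache_dir split above, so adjust to taste:

#!/usr/bin/env python
# profile-obj-sizes.py: rough histogram of requests and hits by object size.
# Field positions assume the default "squid" access.log format:
#   time elapsed client code/status bytes method URL ...
# i.e. fields[3] is the result code and fields[4] is the reply size in bytes.
import sys

# Bucket boundaries roughly matching the cache_dir split discussed above.
BUCKETS = [
    (64 * 1024,          "<= 64k   (COSS)"),
    (128 * 1024,         "<= 128k  (COSS)"),
    (2 * 1024 * 1024,    "<= 2mb   (large)"),
    (1024 * 1024 * 1024, "<= 1gb   (huge)"),
    (float("inf"),       ">  1gb   (enormous)"),
]

requests = [0] * len(BUCKETS)
hits = [0] * len(BUCKETS)

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 5:
        continue
    try:
        size = int(fields[4])
    except ValueError:
        continue
    code = fields[3]
    for i, (limit, _label) in enumerate(BUCKETS):
        if size <= limit:
            requests[i] += 1
            if "HIT" in code:   # TCP_HIT, TCP_MEM_HIT, TCP_IMS_HIT, ...
                hits[i] += 1
            break

for (_limit, label), req, hit in zip(BUCKETS, requests, hits):
    pct = (100.0 * hit / req) if req else 0.0
    print("%-20s %10d reqs %10d hits %6.1f%% hit" % (label, req, hit, pct))

Feed it your logs with something like (path is just an example):

  zcat /var/log/squid/access.log.*.gz | python profile-obj-sizes.py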

adrian

