Data Size Issue


Donghua Xu

Apr 12, 2010, 4:03:03 AM
to memcachedb
Hi there,

We are running memcachedb 1.2.1 on a 4GB-memory machine with the "-m 1836"
option. It had accumulated over 11GB of data by the time it started to get
slower and slower; at that point we had about 12 million data items. We
didn't want to shut down the site to repartition, so we ran some scripts to
delete about a third of the data, but the data file sizes didn't change.

-rw-r--r-- 1 dxu dxu 11796185088 Apr 12 07:55 data.db
-rw-r----- 1 dxu dxu 24576 Apr 11 18:02 __db.001
-rw-r----- 1 dxu dxu 76775424 Apr 11 18:02 __db.002
-rw-r----- 1 dxu dxu 1925185536 Apr 11 18:02 __db.003
-rw-r----- 1 dxu dxu 4259840 Apr 11 18:02 __db.004
-rw-r----- 1 dxu dxu 16252928 Apr 11 18:02 __db.005
-rw-r----- 1 dxu dxu 3891200 Apr 11 18:02 __db.006

We restarted memcachedb, and it is still the same: memcachedb is still
quite slow. It seems that deleting data items does not affect the data
file sizes or the performance. Is there an easy way to force memcachedb
to clean up the fragmentation in the backend B-tree and reduce the data
file size? Or what should we do to improve the performance in this case?

Many thanks in advance.
-Donghua

Steve Chu

Apr 12, 2010, 6:59:00 AM
to memca...@googlegroups.com, Greg Burd
Hi, Donghua,

The btree does not actually remove data from the file when you delete an
item. This is a limitation of the current BerkeleyDB implementation (and
of btrees in general). There are several ways to reduce the db size, but
I am afraid they are either offline or a bit complicated if you want to
stay online.

Offline: run a checkpoint, then db_dump your database and db_load it into
a fresh file. Put the newly loaded db into a new directory and start
memcachedb with the new db (be careful with the db name: -f <new db name>).
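Roughly, the commands would look like this (untested; the paths and the
"data.db" file name are only placeholders, use whatever your instance
actually runs with):

    # checkpoint the old environment first
    db_checkpoint -1 -h /path/to/old/env
    # dump the database to a flat file
    db_dump -h /path/to/old/env -f dump.txt data.db
    # load it into a fresh environment/directory
    mkdir /path/to/new/env
    db_load -h /path/to/new/env -f dump.txt data.db
    # then start memcachedb against the new dir, with -f matching the new db name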

Online: relay your write operations to a queue (such as memcacheq) while
continuing to read from your original db. Copy the database to another
place and run db_dump and db_load on the copy as above; there will be an
inconsistency window between reads and writes. When the new db is ready,
load the queued updates from the queue and apply them to the new db.

Hi, greg,

Are there any changes in bdb 5.0 for btree disk-space reclamation? As far
as I know, the btree compact API (DB->compact()) did not work well; most
of the time it fails and blocks the db. What do you think?

Regards,

Steve

--
Best Regards,

Steve Chu
http://stvchu.org

Greg Burd

Apr 12, 2010, 10:48:26 AM
to Steve Chu, memca...@googlegroups.com
Hey Steve,

A lot of work went into db->compact() between 4.8 and 5.0. It works for hash as well as btree databases, and if you have multiple databases in a single file we can now move root pages around to compact those too (I forget, is this something you do in MemcacheDB?). Also, to the best of my knowledge it *worked* in 4.8.x (do you have a bug SR# or URL to reference for the issues you mention?). :)

We don't reclaim allocated disk space for performance reasons (really!). It is expensive to shrink a file, especially when it's about to grow again. So I wouldn't call this a problem; it's really an optimization. That's why we added compact(): for those cases where you explicitly want to incur the overhead of a) reclaiming the space and b) potentially re-allocating that space later. That said, there can be cases where performance degrades - this may be one of those.
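For reference, the call itself is just something like this (a rough, untested sketch; the handle name "dbp" is mine and assumes an already-open btree database):

    DB_COMPACT c_data;
    memset(&c_data, 0, sizeof(c_data));
    /* Optimize the btree; with DB_FREE_SPACE, also return emptied pages
     * at the end of the file to the filesystem. */
    int ret = dbp->compact(dbp, NULL, NULL, NULL, &c_data, DB_FREE_SPACE, NULL);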

To understand the performance problem mentioned by Donghua we really need to know more about the state of the system. Gather the statistics and post them somewhere. You might also try using DTrace (or SystemTap) to find out what's really happening. We can help with all of this; start a thread on the OTN Forums for BDB and engineering/support will pick it up.

Thanks for asking, we're happy to help. :)

-greg
Product Manager, Oracle Berkeley DB

Steve Chu

Apr 12, 2010, 11:22:48 AM
to memca...@googlegroups.com, Greg Burd
Hi, Greg,

I guess Donghua's thinking is that the large, never-reclaimed database file
makes the cache ineffective (which is why he wants to shrink the db). When
an item is deleted, it still occupies space on a page where valid items may
be colocated. If one of the valid items is read, the whole page is swapped
into the memory pool, so the deleted items come along with the valid ones
(because they are on the same page) and take up memory-pool space, right?

Besides, does DB->compact() block the whole database? I tried the compact
API when I was on bdb 4.7, but after waiting for a long time the call
finally failed because it hit the lock-count limit, and it also used a lot
of memory. Can the compact API work with very large db files (GB size)?
What we want is for compaction to run slowly no matter how large the db
file is; we can accept some service degradation (for example, we can run
compaction at night), but it should not block the whole db, or at least
not block it for long.

Regards,

Steve

Greg Burd

Apr 12, 2010, 11:52:31 AM
to Steve Chu, memca...@googlegroups.com
Hey Steve,

The cache and the size of the on-disk database file are unrelated. To BDB the disk file is just a set of pages; the cache holds the most active pages, so newly emptied pages are quickly evicted from the cache as more relevant pages are accessed. The DB->compact() API can operate on small chunks of the key space, which reduces and bounds the locks required to finish the task and prevents locking up the whole database. It accomplishes two things separately: 1) returning free pages in the disk file to the OS and 2) optimizing the btree itself.
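To make the "small chunks" point concrete, a rough, untested sketch (the handle name "dbp" and the page budget are mine, not anything memcachedb defines):

    DBT end;
    DB_COMPACT c_data;
    int ret;

    memset(&end, 0, sizeof(end));
    memset(&c_data, 0, sizeof(c_data));
    c_data.compact_pages = 100;   /* bound how much work this one call does */

    /* NULL start/stop keys mean "begin at the first key"; "end" is filled
     * in with the key where this pass stopped, so a driver loop could feed
     * it back in as the next pass's start key. */
    ret = dbp->compact(dbp, NULL, NULL, NULL, &c_data, 0, &end);
    if (ret == 0)
        printf("pages freed: %u, pages truncated: %u\n",
               (unsigned)c_data.compact_pages_free,
               (unsigned)c_data.compact_pages_truncated);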

Why don't you ask these questions on the OTN Forum? Be specific, provide example code, you know the drill. ;-) Maybe the easiest thing would be to build an example, 'ex_compact.c' or something, that everyone can benefit from. Start from a copy of ex_access.c and add code to illustrate the problem. Post that to the OTN BDB Forum. :)


-greg

Donghua Xu

Apr 12, 2010, 5:29:58 PM
to memcachedb
Hi Greg, sorry, I am not sure I understand your first sentence. I think
the cache and the size of the on-disk database file are actually related.
As Steve mentioned, the more unreclaimed space there is in the database
file, the more likely a swapped-in page is to contain useless data that
has already been marked as deleted. Eventually we hope to fit all useful
data into memory so that no swapping is needed. We will probably run a
nightly job to delete data that has become too old to be useful, to make
room for the new data coming in the next day. But right now it seems the
prerequisite is that the database file itself has to fit into memory.

Greg Burd

Apr 16, 2010, 11:03:28 AM
to memca...@googlegroups.com
Donghua,

You're right, that was a vague statement for me to make. :) I think what you're concerned with is page fill factor: how much space per page does not contain active data. If you're caching very "unfull" pages then your cache is less effective - that's what I read as your primary concern; correct me if I'm wrong.

Allocating (or deallocating) new disk space is an expensive operation, so we try to avoid it. When we grow the database file size we don't shrink it (ask the OS to truncate/shrink/return space in the file to the filesystem) when pages empty because we might need that space again (and disk is cheap in terms of $, but allocating it isn't in terms of time). The db->compact() call is your way to tell us, "review the database file and a) optimize the btree, b) move data items around to increase the page fill factor if possible, and c) truncate the database file." You can optionally do any combination of those three things. There is no difference in database performance if you do AB or ABC, truncating the database file doesn't speed things up - improving the page fill factor does. This is where the confusion was in my last email. Of course if you're always truncating and then later growing the database file you'll take the hit as you change that file size, but other than that there should be no difference in performance.

Are you monitoring the database statistics? Sounds like you want to watch fill factor, cache misses and a few other things.
http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/C/dbstat.html
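If it helps, reading those numbers from the C API looks roughly like this (a sketch, assuming already-open handles "dbp" and "env"; the fill-factor arithmetic is just my back-of-the-envelope version from the btree stat fields):

    DB_BTREE_STAT *bsp = NULL;
    DB_MPOOL_STAT *msp = NULL;

    /* Full btree statistics (this walks the database, so it isn't free). */
    if (dbp->stat(dbp, NULL, &bsp, 0) == 0) {
        double total = (double)bsp->bt_leaf_pg * bsp->bt_pagesize;
        double used  = total - (double)bsp->bt_leaf_pgfree;
        printf("approx leaf fill factor: %.1f%%\n",
               total > 0 ? 100.0 * used / total : 0.0);
        free(bsp);
    }

    /* Buffer-pool (cache) hit/miss counters. */
    if (env->memp_stat(env, &msp, NULL, 0) == 0) {
        printf("cache hits: %lu, misses: %lu\n",
               (unsigned long)msp->st_cache_hit,
               (unsigned long)msp->st_cache_miss);
        free(msp);
    }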

So, given your description of your situation I'd run db->compact() but without the "DB_FREE_SPACE" flag because you expect to reuse the free space in your next day's processing.
"""
Return pages to the filesystem when possible. If this flag is not specified, pages emptied as a result of compaction will be placed on the free list for re-use, but never returned to the filesystem.
"""
http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/C/dbcompact.html

Note that "compression" is an entirely difference feature.
http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/C/dbset_bt_compress.html

Did that clear things up?

-greg

Donghua Xu

Apr 19, 2010, 3:49:21 AM
to memcachedb
Thanks Greg. These explanations are very helpful. Our concern is
indeed the fill factor.

Steve, is there an easy way to call db->compact() through the memcachedb
interface? Or do we have to do this offline, i.e., shut down memcachedb,
run a standalone tool to do db->compact(), then restart memcachedb?

Steve Chu

Apr 19, 2010, 3:53:35 AM
to memca...@googlegroups.com
I am afraid you have to shut down the daemon and use the standalone
db_dump and db_load tools; there is no standalone utility for compact yet.
See my first reply.
--
Best Regards,

Steve Chu
http://stvchu.org

Donghua Xu

Apr 19, 2010, 4:03:11 AM
to memcachedb
I see. Thanks a lot for your help, Steve. This is a lesson we learned the
hard way; we didn't expect the data to grow so fast. We will plan more
carefully and put a data-partitioning and migration mechanism in place to
accommodate the data growth.

Gregory Burd

Apr 19, 2010, 9:44:41 AM
to memca...@googlegroups.com
Donghua,

It shouldn't be too hard to write a command-line utility, like db_archive
or db_hotbackup, that calls db->compact() on the DB_ENV used by memcachedb.
Run that from cron or some other script. That should solve your problem
and not require Steve to modify the memcachedb code at all.
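
Something along these lines, just as an untested sketch; the environment
path and the "data.db" file name are placeholders for whatever your
memcachedb instance was actually started with:

    #include <stdio.h>
    #include <string.h>
    #include <db.h>

    int main(void) {
        DB_ENV *env = NULL;
        DB *dbp = NULL;
        DB_COMPACT c_data;
        int ret;

        if ((ret = db_env_create(&env, 0)) != 0)
            goto err;
        /* Join the environment memcachedb already created. */
        if ((ret = env->open(env, "/path/to/memcachedb/env", DB_JOINENV, 0)) != 0)
            goto err;

        if ((ret = db_create(&dbp, env, 0)) != 0)
            goto err;
        if ((ret = dbp->open(dbp, NULL, "data.db", NULL,
                             DB_BTREE, DB_AUTO_COMMIT, 0)) != 0)
            goto err;

        memset(&c_data, 0, sizeof(c_data));
        /* Optimize the btree and return emptied pages to the filesystem. */
        ret = dbp->compact(dbp, NULL, NULL, NULL, &c_data, DB_FREE_SPACE, NULL);
        if (ret == 0)
            printf("compact: %u pages freed, %u pages truncated\n",
                   (unsigned)c_data.compact_pages_free,
                   (unsigned)c_data.compact_pages_truncated);

    err:
        if (ret != 0)
            fprintf(stderr, "compact utility: %s\n", db_strerror(ret));
        if (dbp != NULL)
            dbp->close(dbp, 0);
        if (env != NULL)
            env->close(env, 0);
        return ret == 0 ? 0 : 1;
    }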

-greg