Membase Cluster Backups


Mike

Mar 27, 2012, 2:10:00 AM
to membase
Currently, I have 3 membase nodes in our production environment. When
doing an mbbackup on the default bucket, it takes about 4 or 5 hours to
complete, with the CPU load at 100% for the entire time.

What is the recommended way to back up a large production membase
cluster, using either the mbbackup tool or sqlite3 directly?

Any thoughts or ideas are greatly appreciated.



Chad Kouse

Mar 27, 2012, 3:35:46 AM
to mem...@googlegroups.com
The part that is probably taking the longest is the vacuum step, but I
can't say that for sure. This sounds very similar to what we were
seeing when our cluster was undersized and we were having to do a lot
of disk fetches on cache misses.

Do you see a lot of disk fetches on your cluster? Those cause disk
contention, because the backup process itself requires a lot of disk
I/O. If so, your cluster is undersized and you should consider adding
more nodes. Also, what version are you running?

What we ended up having to do was to run the sqlite .backup commands
manually and then vacuum the files later on another machine.
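Roughly, the split looks like this (a minimal sketch in Python using the
standard sqlite3 module, not the actual mbbackup code; the paths are
placeholders for wherever your bucket's data files live):

    import sqlite3

    def backup_db(src_path, dest_path):
        # Copy the live database with SQLite's online backup API (Python 3.7+);
        # the source file stays usable while this runs.
        src = sqlite3.connect(src_path)
        dest = sqlite3.connect(dest_path)
        src.backup(dest)              # the same thing sqlite3's ".backup" does
        dest.close()
        src.close()

    def vacuum_db(path):
        # Run this later, on a machine that isn't serving traffic, so the
        # CPU/disk cost of rewriting the file stays off the cluster.
        conn = sqlite3.connect(path)
        conn.execute("VACUUM")
        conn.close()

You'd call backup_db on the production node for each data file, copy the
results off, and then run vacuum_db on the copies elsewhere.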

Another thing to do (hopefully after you get a successful backup) is to
rotate nodes in and out of your cluster with rebalance. Removing a
node, rebalancing it out of the cluster, and then rebalancing it back
in has the same effect as shutting down a node, vacuuming its data
files, and starting it back up, but without any downtime. However, if
your cluster is undersized, you will want to bring more nodes in
during the rebalance so you don't make matters worse.
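If you'd rather script the rotation than click through the web console,
something along these lines should work against the REST API (just a
sketch; it assumes the standard /pools/default and /controller/rebalance
endpoints plus placeholder admin credentials, so check it against your
own cluster first):

    import base64
    import json
    import urllib.parse
    import urllib.request

    BASE = "http://127.0.0.1:8091"    # any node in the cluster (placeholder)
    AUTH = base64.b64encode(b"Administrator:password").decode()  # placeholder creds

    def api(path, data=None):
        # GET when data is None, POST (form-encoded) otherwise.
        body = urllib.parse.urlencode(data).encode() if data else None
        req = urllib.request.Request(BASE + path, data=body)
        req.add_header("Authorization", "Basic " + AUTH)
        return urllib.request.urlopen(req).read()

    def rebalance_out(hostname):
        # Look up the cluster's internal (otpNode) names, then start a
        # rebalance that ejects the matching node; re-add it and rebalance
        # again to bring it back in.
        nodes = json.loads(api("/pools/default"))["nodes"]
        known = ",".join(n["otpNode"] for n in nodes)
        eject = ",".join(n["otpNode"] for n in nodes if hostname in n["hostname"])
        api("/controller/rebalance", {"knownNodes": known, "ejectedNodes": eject})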

--chad

Mike

Mar 27, 2012, 6:24:23 AM
to membase, Marc Trudel
Chad.
Thank you for your help.

On Mar 27, 4:35 pm, Chad Kouse <chad.ko...@gmail.com> wrote:
> Do you see a lot of disk fetches on your cluster?

Disk fetches are pretty flat. In terms of normal operation while the
mbbackup is running, we're getting roughly 2000 gets per second and
between 1500 and 2000 ops per second, with about 30 million items so far.

> Also, what version are you running?
membase-server 1.7.1.1 on Debian 6

> What we ended up having to do was to run the sqlite .backup commands
> manually and then vacuum the files later on another machine.

When running the sqlite3 commands directly to perform the backup,
does that have any adverse impact on the cluster?

> Another thing to do (hopefully after you get a successful backup) is to
> rotate nodes in and out of your cluster with rebalance. Removing a
> node, rebalancing it out of the cluster, and then rebalancing it back
> in has the same effect as shutting down a node, vacuuming its data
> files, and starting it back up, but without any downtime. However, if
> your cluster is undersized, you will want to bring more nodes in
> during the rebalance so you don't make matters worse.

That sounds like a good idea. I think I'm going to try that next.

Thanks

Chad Kouse

Mar 27, 2012, 9:12:19 AM
to mem...@googlegroups.com, Marc Trudel
When you say the disk fetches are flat, do you mean near 0?

If you look at the mbbackup Python code, you'll see it just runs a
sqlite .backup followed by a vacuum, so running them manually should be
"as safe" as running mbbackup.

--chad

Perry Krug

Mar 29, 2012, 2:11:45 PM
to mem...@googlegroups.com
Hey Mike, we've actually become aware of some recent issues with backups taking longer than we/you would like, and in some cases not completing at all. It all has to do with sqlite under the hood and some limitations of its backup capabilities.

While I can't promise anything immediately, we have identified this as a major priority to improve upon.  We also know that 2.0 will be dramatically better due to the way backups work with CouchDB/etc.  Just gotta make sure our current customers/users are able to continue using the product appropriately until then.

As long as your backups are eventually completing, I think you'll just need to let them run for now. If they're not completing at all (or you're still concerned about the time), the best approach we've seen is to decrease the overall size of the on-disk files, which will likely mean adding more nodes so that there is less data on each.

Hope that helps.

Perry