Do you see a lot of disk fetches on your cluster? If so, they will
cause disk contention, because the backup process itself requires a
lot of disk I/O. It would also mean your cluster is undersized and you
should consider adding more nodes. Also, what version are you running?
What we ended up having to do was to run the sqlite .backup commands
manually and then vacuum the files later on another machine.
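
For reference, here is a rough sketch of those two steps using Python's
built-in sqlite3 module instead of the sqlite3 command-line shell. This
is not the mbbackup code; the function names are made up for
illustration, and Connection.backup() needs Python 3.7 or later. The
steps are kept separate so the vacuum can be run later, on another
machine, against the copied file.

```python
import sqlite3


def backup_file(src_path, dest_path):
    """Copy a SQLite file using the online backup API (like `.backup`)."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies every page of the source database into the destination.
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def vacuum_file(path):
    """Rebuild a SQLite file in place to reclaim free pages."""
    # isolation_level=None keeps the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
```

You would call backup_file() once per data file on the node, copy the
results off, and then call vacuum_file() on each copy wherever you have
spare disk I/O.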
Another thing to do (hopefully after you get a successful backup) is
rotate nodes in and out of your cluster with rebalance. Removing a
node, rebalancing it out of the cluster, and then rebalancing it back
in has the same effect as shutting down a node, vacuuming its data
files, and starting it back up, but without any downtime. However, if
your cluster is undersized, you will want to bring more nodes in
during the rebalance so that you do not make matters worse.
--chad
If you look at the mbbackup Python code, you'll see it just runs a
sqlite .backup followed by a vacuum, so running them manually should be
"as safe" as running mbbackup.
--chad