Slow rebalance - how to debug

Lasse Schou

Nov 11, 2013, 5:17:13 AM
to couc...@googlegroups.com
Hi,

I have a cluster of 7 machines (32GB RAM, 3TB HDD, 1gbit) running Couchbase Server 2.1 CE, and I'm experiencing extremely slow rebalance. 

I have around 5 TB of disk usage (including views). A full rebalance has been running for more than a week, and it's only around 30% done.

Does anyone know how I can debug this behavior? Or does this sound like a reasonable rebalance time?

Thanks,
Lasse

Aliaksey Kandratsenka

Nov 11, 2013, 2:20:14 PM
to couc...@googlegroups.com
Views can be very slow to rebalance. As for roughly 3 weeks to reach 100% rebalance progress, I cannot say whether that's "reasonable" or not.
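For reference, the ~3 week figure is just a linear extrapolation of the numbers in the question (about 30% done after about a week). A minimal Python sketch of that arithmetic, assuming progress keeps its average pace so far, which a real rebalance may well not do; the function name is made up for illustration:

```python
def estimate_total_days(elapsed_days, fraction_done):
    """Linear extrapolation: total time if progress continues at the average rate so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_days / fraction_done


# Figures from the question: roughly 7 days elapsed, roughly 30% done.
elapsed = 7
total = estimate_total_days(elapsed, 0.30)
print(f"estimated total: {total:.1f} days, remaining: {total - elapsed:.1f} days")
```

With those inputs the estimate comes out a little over 23 days in total, so a bit more than two weeks still to go.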

There are ways to "debug" this, even with a somewhat nice visualization, but none of it is meant for end users. If you're interested and willing to hack a bit, see this: https://github.com/couchbase/ns_server/blob/master/scripts/visualize-rebalance-2.rb and this: https://github.com/couchbase/ns_server/blob/master/doc/master-events.txt. Be aware that the script above (there's also visualize-rebalance.rb, the first version) is not built to deal with incomplete rebalances. It's likely to fail, but at least you have a starting point to hack on further.

Depending on your use case, there are several tunables that might help speed up rebalance. They're all available through a hidden (and not user-friendly) internal settings dialog. It is revealed if you add ?enableInternalSettings=1 after index.html in the URL:

* rebalance moves before compaction. Raising this is known to speed up rebalance somewhat. I think in 2.2 we actually made it larger compared to 2.1 (or maybe it was earlier, I haven't checked). Basically, after a certain number of vbucket moves we forcefully compact views, to make sure that view compaction is not concurrent with massive view updates and to keep view sizes from growing too big. Making this setting larger makes this happen less frequently, trading higher disk usage for a shorter rebalance time.

* disable index aware rebalance. Setting this to true turns off coordination of index updates and vbucket moves. It'll make your views behave weirdly; in particular, "views may jump back in time" during rebalance. But it's known to speed up rebalance in at least some cases. If you can afford indexes temporarily losing data, that may help too.

* rebalance moves per node (or something like that). It's set to 1, and it's the limit on concurrent "backfills" for any node at any time during rebalance. I don't think raising it will help (except if you have almost no data, where it is known to help). A backfill usually means reading vbucket data from disk and sending it to another node, so it's clearly I/O, memory and network intensive. More of that concurrently might help (e.g. perhaps on super-fast SSDs), but it'll likely do harm, especially on HDDs.
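To get a feel for the first tunable, here is a rough Python sketch of how the number of forced view compactions shrinks as "rebalance moves before compaction" grows. It is a simplification, not how ns_server actually schedules compaction: it assumes one compaction after every full batch of moves, and the move count and threshold values are hypothetical.

```python
def forced_view_compactions(total_moves, moves_before_compaction):
    """Count forced compactions, assuming one after every full batch of vbucket moves."""
    if moves_before_compaction < 1:
        raise ValueError("moves_before_compaction must be at least 1")
    return total_moves // moves_before_compaction


# Hypothetical numbers: 1024 vbucket moves, three candidate settings.
for threshold in (16, 64, 256):
    count = forced_view_compactions(1024, threshold)
    print(f"threshold {threshold:>3}: {count} forced compactions")
```

Fewer compactions means less time spent compacting during rebalance, at the cost of view files growing larger between compactions.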

It's possible that views are not the main reason for your slow rebalance. Consider also checking basic KV stats, like the rate of bgfetches, updates, doc residency level, etc.
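One way to look at those KV stats is to take two snapshots of "name: value" stats text some seconds apart and compute a rate from a cumulative counter. A minimal Python sketch follows. The stat names ep_bg_fetched and vb_active_perc_mem_resident and the text layout are assumptions about what a stats tool such as cbstats prints, and the sample numbers are made up, so check them against your own output.

```python
def parse_stats(text):
    """Parse 'name: value' lines into a dict of strings (splits on the first colon)."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def per_second_rate(before, after, key, interval_seconds):
    """Rate of change of a cumulative counter between two snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (int(after[key]) - int(before[key])) / interval_seconds


# Made-up snapshots taken 60 seconds apart, for illustration only.
snapshot_1 = """
 ep_bg_fetched:                900
 vb_active_perc_mem_resident:  42
"""
snapshot_2 = """
 ep_bg_fetched:                1500
 vb_active_perc_mem_resident:  41
"""

before = parse_stats(snapshot_1)
after = parse_stats(snapshot_2)
print("bg fetches/sec:", per_second_rate(before, after, "ep_bg_fetched", 60))
print("active resident %:", after["vb_active_perc_mem_resident"])
```

A high background fetch rate together with a low resident percentage would suggest the nodes are disk-bound on KV reads, independent of views.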

Lasse Schou

Nov 11, 2013, 2:26:14 PM
to couc...@googlegroups.com
Thanks so much for this cool stuff.