Pre-Split HBase Regions: best method and recommendations?

Chris Christensen

Apr 4, 2016, 6:57:11 PM
to OpenTSDB
According to the docs and common usage, pre-splitting HBase regions is a solid strategy for better performance and initialization; however, the docs still say "TODO - include scripts for pre-splitting."

Looking around the mailing list, the most-referenced script [1,2] appears to be: https://gist.github.com/johann8384/5544290

Is this the best one, and could it be committed to the source tree alongside src/create_table.sh?
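For reference, the general shape of such a script is just a create with explicit split points; a minimal sketch in the HBase shell (the split points here are illustrative placeholders, not what the gist actually uses):

# Pre-split on the first row-key byte; double quotes so JRuby
# interprets the \x escapes as raw bytes.
create 'tsdb', {NAME => 't', COMPRESSION => 'LZ4', DATA_BLOCK_ENCODING => 'FAST_DIFF'}, SPLITS => ["\x20", "\x40", "\x60", "\x80", "\xA0", "\xC0", "\xE0"]

(That yields 8 regions split uniformly on the first key byte; a real script would generate 255 boundary bytes for 256 regions.)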

Also, are there "rule of thumb" recommendations for the region size and count to run the script(s) with? From the docs: "256 regions may be a good place to start depending on how many time series share each metric" (we're looking at ~4000 metrics). But there's more to the story: what's the decision process in the face of other variables that influence the tuning, such as how long the metrics are stored (TTL), total data size, and rate of ingest?
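To make that last part concrete, here's the kind of back-of-envelope reasoning I have in mind (all numbers are made-up placeholders, not measurements from our cluster):

# Hypothetical inputs - substitute real measurements.
DPS=5000           # datapoints ingested per second
BYTES_PER_DP=20    # rough on-disk bytes per datapoint after compaction + LZ4
TTL_DAYS=730       # 2-year retention
REGION_GB=10       # target size per region

TOTAL_GB=$(( DPS * BYTES_PER_DP * 86400 * TTL_DAYS / 1024**3 ))
echo "steady-state: ~${TOTAL_GB} GB -> ~$(( TOTAL_GB / REGION_GB )) regions"

With those numbers the table levels off around 6 TB, suggesting several hundred regions rather than 256 exactly. Is that the right way to think about it, or are there other dominant factors?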

Thanks!


Chris Christensen

Apr 12, 2016, 12:10:56 PM
to OpenTSDB
Just to jot down some notes on how to accomplish the "256 regions may be a good place to start" suggestion (this also includes LZ4 compression and a TTL of 2 years; both can be tuned accordingly in the create and alter statements below):


# Create the data table pre-split into 256 uniform regions on
# column family 't', then drop into the HBase shell to tune it.
sudo -u hbase hbase org.apache.hadoop.hbase.util.RegionSplitter tsdb UniformSplit -c 256 -f t
sudo -u hbase hbase shell

# In the shell: FAST_DIFF block encoding, a 2-year TTL (in seconds),
# and LZ4 on the data table, then create the remaining tables.
alter 'tsdb', NAME => 't', DATA_BLOCK_ENCODING => 'FAST_DIFF'
alter 'tsdb', NAME => 't', TTL => 63072000
alter 'tsdb', NAME => 't', COMPRESSION => 'LZ4'
create 'tsdb-uid', {NAME => 'id', COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}, {NAME => 'name', COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
create 'tsdb-tree', {NAME => 't', VERSIONS => 1, COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
create 'tsdb-meta', {NAME => 'name', COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
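To sanity-check that the split actually happened, counting the tsdb rows in hbase:meta should come back as 256 (a quick hack, assuming the usual one-line-per-region scan output format):

echo "scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}" | sudo -u hbase hbase shell | grep -c "^ tsdb,"

(The table's page in the HBase master web UI shows the same region list.)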

Since these are newly created tables, salting across the 256 regions helps with write distribution immensely:

tsd.storage.salt.width = 1
tsd.storage.salt.buckets = 256
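For clarity, those two are OpenTSDB (2.2+) properties rather than HBase ones; they belong in opentsdb.conf and must be identical on every TSD, readers and writers alike, or data written under one bucket scheme is invisible to the other. A minimal sketch (the quorum host and port are placeholders):

# opentsdb.conf
tsd.network.port = 4242
tsd.storage.hbase.zk_quorum = zk1.example.com
tsd.storage.hbase.data_table = tsdb
tsd.storage.hbase.uid_table = tsdb-uid
tsd.storage.salt.width = 1
tsd.storage.salt.buckets = 256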



A concern: this configuration works well when initializing, but the only way to change tsd.storage.salt.buckets later is to either abandon the old data (or keep a read-only instance configured for the old bucket count) or re-import everything under the new bucket configuration. Given that, is it crazy to set width=2, buckets=65535 from the start? I'm guessing the trade-off would be coordinating scans across that many buckets.