Pre-Split HBase Regions: best method and recommendations?

Chris Christensen

Apr 4, 2016, 6:57:11 PM
to OpenTSDB
According to the docs and common usage, pre-splitting HBase regions is a solid strategy for better performance and initialization; however, the docs still say "TODO - include scripts for pre-splitting."

Looking around the mailing list, the most-referenced script [1,2] appears to be: https://gist.github.com/johann8384/5544290

Is this the best one, and could it be committed to the source tree alongside src/create_table.sh?
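For reference, the general shape of such a script is just a create with explicit split points; a minimal sketch in the HBase shell (the split points here are illustrative placeholders, not what the gist actually uses):

# Pre-split on the first row-key byte; double quotes so JRuby
# interprets the \x escapes as raw bytes.
create 'tsdb', {NAME => 't', COMPRESSION => 'LZ4', DATA_BLOCK_ENCODING => 'FAST_DIFF'}, SPLITS => ["\x20", "\x40", "\x60", "\x80", "\xA0", "\xC0", "\xE0"]

(That yields 8 regions split uniformly on the first key byte; a real script would generate 255 boundary bytes for 256 regions.)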

Also, are there "rule of thumb" recommendations for the region size and count to run the script(s) with? From the docs: "256 regions may be a good place to start depending on how many time series share each metric" (we're looking at ~4000 metrics). But there's more to the story: what's the decision process in the face of other variables that influence the tuning, such as how long the metrics are stored (TTL), total data size, and rate of ingest?
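To make that last part concrete, here's the kind of back-of-envelope reasoning I have in mind (all numbers are made-up placeholders, not measurements from our cluster):

# Hypothetical inputs - substitute real measurements.
DPS=5000           # datapoints ingested per second
BYTES_PER_DP=20    # rough on-disk bytes per datapoint after compaction + LZ4
TTL_DAYS=730       # 2-year retention
REGION_GB=10       # target size per region

TOTAL_GB=$(( DPS * BYTES_PER_DP * 86400 * TTL_DAYS / 1024**3 ))
echo "steady-state: ~${TOTAL_GB} GB -> ~$(( TOTAL_GB / REGION_GB )) regions"

With those numbers the table levels off around 6 TB, suggesting several hundred regions rather than 256 exactly. Is that the right way to think about it, or are there other dominant factors?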

Thanks!


Chris Christensen

Apr 12, 2016, 12:10:56 PM
to OpenTSDB
Just to jot down some notes on how to accomplish the "256 regions may be a good place to start" suggestion (this also includes LZ4 compression and a TTL of 2 years; both can be tuned accordingly in the create and alter statements below):


# Create the data table pre-split into 256 uniform regions on
# column family 't', then drop into the HBase shell to tune it.
sudo -u hbase hbase org.apache.hadoop.hbase.util.RegionSplitter tsdb UniformSplit -c 256 -f t
sudo -u hbase hbase shell

# In the shell: FAST_DIFF block encoding, a 2-year TTL (in seconds),
# and LZ4 on the data table, then create the remaining tables.
alter 'tsdb', NAME => 't', DATA_BLOCK_ENCODING => 'FAST_DIFF'
alter 'tsdb', NAME => 't', TTL => 63072000
alter 'tsdb', NAME => 't', COMPRESSION => 'LZ4'
create 'tsdb-uid', {NAME => 'id', COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}, {NAME => 'name', COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
create 'tsdb-tree', {NAME => 't', VERSIONS => 1, COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
create 'tsdb-meta', {NAME => 'name', COMPRESSION => 'LZ4', BLOOMFILTER => 'ROW', TTL => 63072000, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
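To sanity-check that the split actually happened, counting the tsdb rows in hbase:meta should come back as 256 (a quick hack, assuming the usual one-line-per-region scan output format):

echo "scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}" | sudo -u hbase hbase shell | grep -c "^ tsdb,"

(The table's page in the HBase master web UI shows the same region list.)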

Since these are newly created tables, salting across the 256 regions helps with write distribution immensely:

tsd.storage.salt.width = 1
tsd.storage.salt.buckets = 256
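For clarity, those two are OpenTSDB (2.2+) properties rather than HBase ones; they belong in opentsdb.conf and must be identical on every TSD, readers and writers alike, or data written under one bucket scheme is invisible to the other. A minimal sketch (the quorum host and port are placeholders):

# opentsdb.conf
tsd.network.port = 4242
tsd.storage.hbase.zk_quorum = zk1.example.com
tsd.storage.hbase.data_table = tsdb
tsd.storage.hbase.uid_table = tsdb-uid
tsd.storage.salt.width = 1
tsd.storage.salt.buckets = 256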



A concern: this configuration works well when initializing, but the only way to change tsd.storage.salt.buckets later is to either abandon the old data (or keep a read-only instance configured for the old bucket count) or re-import everything under the new bucket configuration. Given that, is it crazy to set width=2, buckets=65535 from the start? I'm guessing the trade-off would be coordinating scans across that many buckets.