Dataset not evenly distributed on HBase


Fredrik

Jun 2, 2016, 11:39:31 PM
to CDAP User
Hello,

I'm using an app to read stream events and store them in a dataset. The row key of the stream events is in this format: <UUID>+<DateTime>. I also used importTSV to import the lines of the data source file into HBase directly with the same keys; that destination table's regions are spread evenly over all of the region server nodes, while with CDAP there are only 2 regions, both on the same node and of very different sizes (e.g., for 4M rows: a 600M region + a 1.1G region). I've tested with 1M rows created in one table, and with 4M rows using the same approach; the results are the same: only 2 regions, on the same node.

My questions are:
1. Why does this happen? It will have a performance impact when the app accesses a large dataset.
2. Beyond the pre-split configuration of the Table API, is there any other approach?

Best Regards,
Fredrik

rus...@cask.co

Jun 3, 2016, 8:38:16 PM
to CDAP User
Hi Fredrik,

In CDAP, if you create a table without defining splits, you will only get a single region. As the data in the table grows, CDAP will split the table into two regions, but will not continue to split after that. This is why you never get more than two regions, no matter how much data you put in.

As you mentioned, the best way to handle this in CDAP is to use the pre-split API of the Table dataset:

http://docs.cask.co/cdap/current/en/developers-manual/building-blocks/datasets/table.html
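
For illustration, here is a minimal sketch of defining the splits in an application's configure() method. The dataset name ("events"), the 15 split points over the leading hex digit of the UUID-prefixed row keys, and the "hbase.splits" property key with a JSON-encoded byte[][] value are assumptions drawn from the Table documentation linked above; please verify the exact property key and encoding against your CDAP version.

import co.cask.cdap.api.app.AbstractApplication;
import co.cask.cdap.api.dataset.DatasetProperties;
import co.cask.cdap.api.dataset.table.Table;
import com.google.gson.Gson;

public class StreamReaderApp extends AbstractApplication {
  @Override
  public void configure() {
    // 15 split points on the leading hex digit of the <UUID>+<DateTime>
    // row keys, giving 16 evenly spread regions from the start.
    String hexDigits = "123456789abcdef";
    byte[][] splits = new byte[hexDigits.length()][];
    for (int i = 0; i < splits.length; i++) {
      splits[i] = new byte[] { (byte) hexDigits.charAt(i) };
    }
    // "hbase.splits" and its JSON encoding are an assumption here; check
    // the Table docs for the property supported by your CDAP version.
    createDataset("events", Table.class, DatasetProperties.builder()
        .add("hbase.splits", new Gson().toJson(splits))
        .build());
  }
}

With the splits defined at creation time, the randomly hashed keys should land evenly across all regions instead of piling into a single initial region.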

Thanks,
Russ

Andreas Neumann

Jun 3, 2016, 9:26:40 PM
to rus...@cask.co, CDAP User
Correction: it will split again after it reaches the region size limit. But apparently you have not reached that limit.
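
The limit is HBase's hbase.hregion.max.filesize setting, which defaults to 10 GB in HBase 1.0. It can be changed cluster-wide in hbase-site.xml or lowered per table. A minimal sketch with the HBase 1.0 client API (the table name here is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class LowerSplitThreshold {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("my_table"); // hypothetical name
      HTableDescriptor desc = admin.getTableDescriptor(table);
      // Ask HBase to split a region once it grows past ~1 GB instead of
      // the cluster default (hbase.hregion.max.filesize, 10 GB in 1.0).
      desc.setMaxFileSize(1024L * 1024L * 1024L);
      admin.modifyTable(table, desc);
    }
  }
}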
-Andreas

Sent from my iPhone

Fredrik

Jun 3, 2016, 9:59:51 PM
to CDAP User
Hello Russ,

Thank you for the detailed information; it will definitely help me design my dataset more properly.
Since the row keys of my dataset are randomly generated and hashed (verified by the region distribution statistics of the importTSV-generated HBase table with the same rows), which process or underlying consideration suppresses HBase's natural even distribution?
Will new features or enhancements be released to automatically split the rows into evenly distributed regions for better performance?

Thanks.
Best Regards,
Fredrik

Fredrik

Jun 3, 2016, 10:06:37 PM
to CDAP User
Hello Andreas,

Thank you for the information :) I'm curious why the 2 regions (600M, 1.1G) are on the same node, since the performance would be better if they were on different nodes. Also, what is the size limit at which a region splits again, and can I configure it?

As the row keys of my dataset are randomly generated and hashed (verified by the region distribution statistics of the importTSV-generated HBase table with the same rows), which process or underlying consideration suppresses HBase's natural even distribution?

Thanks.
Best Regards,
Fredrik

Sagar Kapare

Jun 8, 2016, 10:19:29 PM
to Fredrik, CDAP User
Hi Fredrik,

Can you please provide more information:

1. What is the complete importTSV command used for the data upload?
2. When using importTSV to load the data, did the HBase table already exist, or was it created by the importTSV utility during the upload?
3. What version of HBase are you using?

Thanks and Regards,
Sagar


Fredrik

Jun 9, 2016, 12:06:29 PM
to CDAP User, frederic...@gmail.com
Hello Sagar,

Please see my answers inline below. Thanks.

1. What is the complete importTSV command used for the data upload?

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns="HBASE_ROW_KEY,<COLUMN_FAMILY>:<COLUMN_NAME>..." <TABLE_NAME> /user/sqoop2/tmp_02_txt

2. When using importTSV to load the data, did the HBase table already exist, or was it created by the importTSV utility during the upload?

The HBase table was created in the HBase shell (create '<TABLE_NAME>', '<COLUMN_FAMILY_NAME>') before importTSV loaded the data; a pre-split equivalent is sketched after this message.

3. What version of HBase are you using?

hbase(main):001:0> version
1.0.0-cdh5.5.2, rUnknown, Mon Jan 25 16:26:48 PST 2016
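
For comparison, that table could have been created pre-split, either in the shell via the SPLITS option of create, or programmatically. A minimal sketch with the HBase 1.0 client API (table and column family names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc =
          new HTableDescriptor(TableName.valueOf("my_table"));
      desc.addFamily(new HColumnDescriptor("cf"));
      // One split point per leading hex digit of the UUID row keys,
      // so 16 regions exist before the first row is imported.
      String hexDigits = "123456789abcdef";
      byte[][] splits = new byte[hexDigits.length()][];
      for (int i = 0; i < splits.length; i++) {
        splits[i] = Bytes.toBytes(hexDigits.substring(i, i + 1));
      }
      admin.createTable(desc, splits);
    }
  }
}

Creating the regions up front lets the randomly hashed keys spread across all region servers from the first write, rather than waiting for size-based splits.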

Sagar Kapare

Jun 17, 2016, 10:26:34 PM
to Fredrik, CDAP User
Hi Fredrik,

Sorry for the delayed response!

I tried reproducing the issue on my end and observed the same behavior. I have filed an improvement JIRA for this: https://issues.cask.co/browse/CDAP-6228

Please watch that JIRA for further updates.

Thanks and Regards,
Sagar

Fredrik

Jun 19, 2016, 11:07:04 AM
to CDAP User, frederic...@gmail.com
Hello Sagar,

Thank you for the information.

Best Regards,
Fredrik