max, Sep 12, 2011, 6:45:28 AM
to HBaseWD - Distribute Sequential HBase Writes
Hi,
here is a short summary of what I learned from my conversation with
Alex, the author of HBaseWD, and from using his project.
My original problem: I have an HBase table, small in data size, whose
rows contain almost no values, just keys (currently ~17 million rows),
used as input for several MapReduce jobs. The jobs run on this input
table produce a much larger amount of data. Because of the 'small'
size of the input table, only a few map tasks were started when using
the default TableInputFormat, which creates one map task per region
(only one region -> only one map task).
Needing smaller splits of the input table, I found HBaseWD, which now
lets me use all my cluster nodes when processing it.
In my case, using RowKeyDistributorByOneBytePrefix, every key in the
input table is prefixed with a single byte in the range 0 to 31.
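To illustrate, here is roughly how the write path looks with the
distributor, as far as I understand the API; a minimal sketch where
the table name 'input-table', the column family 'cf' and the example
key are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.sematext.hbasewd.AbstractRowKeyDistributor;
    import com.sematext.hbasewd.RowKeyDistributorByOneBytePrefix;

    public class DistributedWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // 'input-table' and the column family 'cf' are placeholders
        HTable table = new HTable(conf, "input-table");

        // 32 buckets: each original key gets one prefix byte in 0..31
        AbstractRowKeyDistributor keyDistributor =
            new RowKeyDistributorByOneBytePrefix((byte) 32);

        // example only: a geo-location-style key like mine
        byte[] originalKey = Bytes.toBytes("u4pruydqqvj");
        Put put = new Put(keyDistributor.getDistributedKey(originalKey));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(""));
        table.put(put);
        table.close();
      }
    }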
WdTableInputFormat can then provide a split for each of these 32
ranges of distributed keys, so at minimum 32 map tasks process the
rows of my 'small' input table.
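Setting up the job is mostly the standard TableMapReduceUtil
boilerplate plus two HBaseWD-specific lines: swapping in
WdTableInputFormat and handing the distributor's settings to the job
configuration. A sketch of how I wire it up; MyMapper, the job name
and the table name are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    import com.sematext.hbasewd.AbstractRowKeyDistributor;
    import com.sematext.hbasewd.RowKeyDistributorByOneBytePrefix;
    import com.sematext.hbasewd.WdTableInputFormat;

    public class DistributedScanJob {
      // placeholder mapper: real per-row processing goes here
      static class MyMapper
          extends TableMapper<ImmutableBytesWritable, Result> {
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "process-input-table");
        job.setJarByClass(DistributedScanJob.class);

        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(
            "input-table", scan, MyMapper.class,
            ImmutableBytesWritable.class, Result.class, job);

        // use WdTableInputFormat instead of the default
        // TableInputFormat: each prefix range becomes its own split
        job.setInputFormatClass(WdTableInputFormat.class);

        // tell the job which distributor was used for the keys
        AbstractRowKeyDistributor keyDistributor =
            new RowKeyDistributorByOneBytePrefix((byte) 32);
        keyDistributor.addInfo(job.getConfiguration());

        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }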
A very good side effect is that I avoid hotspots on individual
cluster nodes when processing just a subset of the input table.
Because each key (in my case a geo location address) gets a prefix,
the rows are mixed up across the key space. This spreads the
computation evenly even for small geo areas, i.e. a subset of all
locations in the input table: with the original (unprefixed) keys,
nearby locations would sort next to each other and could end up in
the same split and thus the same map task.
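One thing to keep in mind: inside the map task you see the prefixed
keys, so you have to strip the prefix to get the original key back.
A sketch of how I would do that; GeoMapper and its output types are
just for illustration:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    import com.sematext.hbasewd.AbstractRowKeyDistributor;
    import com.sematext.hbasewd.RowKeyDistributorByOneBytePrefix;

    public class GeoMapper extends TableMapper<Text, NullWritable> {
      // must match the distributor used when writing the keys
      private final AbstractRowKeyDistributor keyDistributor =
          new RowKeyDistributorByOneBytePrefix((byte) 32);

      @Override
      protected void map(ImmutableBytesWritable row, Result value,
          Context context) throws IOException, InterruptedException {
        // the scan hands us the distributed (prefixed) key;
        // recover the original geo location key
        byte[] originalKey = keyDistributor.getOriginalKey(row.get());
        context.write(new Text(Bytes.toString(originalKey)),
            NullWritable.get());
      }
    }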
Hopefully this helps some of you as it helped me.
Best,
Max