max, Sep 12, 2011, 6:45:28 AM
to HBaseWD - Distribute Sequential HBase Writes
Hi,
here is a short summary of what I learned from my conversation with
Alex, the author of HBaseWD, and from using his project.
My original problem: I have an HBase table, small in data size, whose
rows contain almost no values, just keys (currently ~17 million rows),
used as input for several MapReduce jobs. The jobs run on this input
table produce a much larger amount of data. Because of the 'small'
size of the input table, only a few map tasks were started when using
the default TableInputFormat, which creates one map task per region
(only one region -> only one map task).
Needing smaller splits of the input table, I found HBaseWD, which now
lets me use all my cluster nodes when processing it.
In my case, using RowKeyDistributorByOneBytePrefix, every key in the
input table is prefixed with a single byte in the range 0 to 31.
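To illustrate, here is roughly how the write path looks with the
distributor, as far as I understand the API; a minimal sketch where
the table name 'input-table', the column family 'cf' and the example
key are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.sematext.hbasewd.AbstractRowKeyDistributor;
    import com.sematext.hbasewd.RowKeyDistributorByOneBytePrefix;

    public class DistributedWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // 'input-table' and the column family 'cf' are placeholders
        HTable table = new HTable(conf, "input-table");

        // 32 buckets: each original key gets one prefix byte in 0..31
        AbstractRowKeyDistributor keyDistributor =
            new RowKeyDistributorByOneBytePrefix((byte) 32);

        // example only: a geo-location-style key like mine
        byte[] originalKey = Bytes.toBytes("u4pruydqqvj");
        Put put = new Put(keyDistributor.getDistributedKey(originalKey));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(""));
        table.put(put);
        table.close();
      }
    }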
WdTableInputFormat can then provide a split for each of these 32
ranges of distributed keys, so at minimum 32 map tasks process the
rows of my 'small' input table.
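Setting up the job is mostly the standard TableMapReduceUtil
boilerplate plus two HBaseWD-specific lines: swapping in
WdTableInputFormat and handing the distributor's settings to the job
configuration. A sketch of how I wire it up; MyMapper, the job name
and the table name are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    import com.sematext.hbasewd.AbstractRowKeyDistributor;
    import com.sematext.hbasewd.RowKeyDistributorByOneBytePrefix;
    import com.sematext.hbasewd.WdTableInputFormat;

    public class DistributedScanJob {
      // placeholder mapper: real per-row processing goes here
      static class MyMapper
          extends TableMapper<ImmutableBytesWritable, Result> {
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "process-input-table");
        job.setJarByClass(DistributedScanJob.class);

        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(
            "input-table", scan, MyMapper.class,
            ImmutableBytesWritable.class, Result.class, job);

        // use WdTableInputFormat instead of the default
        // TableInputFormat: each prefix range becomes its own split
        job.setInputFormatClass(WdTableInputFormat.class);

        // tell the job which distributor was used for the keys
        AbstractRowKeyDistributor keyDistributor =
            new RowKeyDistributorByOneBytePrefix((byte) 32);
        keyDistributor.addInfo(job.getConfiguration());

        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }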
A very good side effect is that I avoid hotspots on individual
cluster nodes when processing just a subset of the input table.
Because each key (in my case a geo location address) gets a prefix,
the rows are mixed up across the key space. This spreads the
computation evenly even for small geo areas, i.e. a subset of all
locations in the input table: with the original (unprefixed) keys,
nearby locations would sort next to each other and could end up in
the same split and thus the same map task.
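One thing to keep in mind: inside the map task you see the prefixed
keys, so you have to strip the prefix to get the original key back.
A sketch of how I would do that; GeoMapper and its output types are
just for illustration:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    import com.sematext.hbasewd.AbstractRowKeyDistributor;
    import com.sematext.hbasewd.RowKeyDistributorByOneBytePrefix;

    public class GeoMapper extends TableMapper<Text, NullWritable> {
      // must match the distributor used when writing the keys
      private final AbstractRowKeyDistributor keyDistributor =
          new RowKeyDistributorByOneBytePrefix((byte) 32);

      @Override
      protected void map(ImmutableBytesWritable row, Result value,
          Context context) throws IOException, InterruptedException {
        // the scan hands us the distributed (prefixed) key;
        // recover the original geo location key
        byte[] originalKey = keyDistributor.getOriginalKey(row.get());
        context.write(new Text(Bytes.toString(originalKey)),
            NullWritable.get());
      }
    }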
Hopefully this helps some of you as it helped me.
Best,
Max