start & stop keys in a mapreduce job


Andre

Jul 16, 2012, 5:13:25 AM
to hba...@googlegroups.com
Hi,

this is an example of how to use HBaseWD in a mapreduce job:

    // HBaseWD key distributor; here assumed to be a one-byte-prefix
    // distributor with 16 buckets -- it must match how the table's keys
    // were written
    AbstractRowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix((byte) 16);

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "testMapreduceJob");

    Scan scan = new Scan(startKey, stopKey);

    TableMapReduceUtil.initTableMapperJob("table", scan,
      RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);

    // Substitute the standard TableInputFormat that was set in
    // TableMapReduceUtil.initTableMapperJob(...)
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());

now my simple question is:
are startKey and stopKey the original keys, or the distributed keys?
cheers
andre

Andre

Jul 16, 2012, 5:20:46 AM
to hba...@googlegroups.com
ok, i guess they are original keys :-)
pls correct me if i'm wrong

Alex Baranau

Jul 16, 2012, 10:14:36 AM
to hba...@googlegroups.com
Yes, they are original keys. That way you can configure and run a single MR job; otherwise you'd have to run multiple jobs to cover each distributed-key "bucket".

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr


Lily Hendra

Sep 26, 2013, 11:41:05 AM
to hba...@googlegroups.com
Hi Alex, 

I have a follow-up question on the start and stop keys.
Can I provide partial start and stop keys of the original keys?
E.g. original keys: ABCDE and ABCEE. My scan's startkey is ABC, and stopkey is ABD
Would the key distributor distribute my start and stop keys correctly such that ABCDE and ABCEE are found in one MR job?
Thanks.

Lily

Ionut Ignatescu

Sep 26, 2013, 11:51:19 AM
to hba...@googlegroups.com
Hi Lily,

Here's my answer:
In brief: YES.
In detail: suppose you have original keys ABCDE and ABCEE, and the table's keys are distributed across 16 buckets, with prefixes from 0 to 15. It doesn't matter whether you run the scan in an MR job or in a plain main program; what matters is having the correct setup, according to the docs. When you run a scan with HBaseWD in this context, you'll get 16 distinct scans with the following start and stop keys:
Scan 0: 0ABC ... 0ABD
Scan 1: 1ABC ... 1ABD
...
Scan 15: 15ABC ... 15ABD
After that, all returned results are ordered by the original key.

I really hope this answer helps.
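The fan-out above can be sketched in plain Java. This is illustrative only: the names below are made up for the sketch, and the real HBaseWD distributors work on byte[] keys with single-byte prefixes rather than the string prefixes shown in the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how one logical scan [ABC, ABD) is fanned out into one scan
// per bucket by prepending the bucket prefix to both the start and stop
// keys. Not the actual HBaseWD API.
public class BucketFanOut {
    static List<String[]> fanOut(String start, String stop, int buckets) {
        List<String[]> scans = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            // e.g. bucket 0 scans [0ABC, 0ABD), bucket 15 scans [15ABC, 15ABD)
            scans.add(new String[] { i + start, i + stop });
        }
        return scans;
    }

    public static void main(String[] args) {
        for (String[] s : fanOut("ABC", "ABD", 16)) {
            System.out.println(s[0] + " ... " + s[1]);
        }
        // Rows 0ABCDE and 5ABCEE each fall inside their own bucket's
        // range, so both are found by the combined scan.
    }
}
```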

Regards,
Ionut I.






Lily Hendra

Sep 26, 2013, 12:02:12 PM
to hba...@googlegroups.com
Ionut, 
Thanks so much for your quick response. That's exactly what I needed to hear before starting the effort of using the library in our project. We've had so many problems with our own key-distribution solution, which doesn't allow what I described above.
Lily



Alex Baranau

Oct 1, 2013, 7:55:35 PM
to hba...@googlegroups.com
Ionut,

Thank you for taking the time to answer! I'd also add a note: in the mapreduce case you will have one map task per partition, so in Ionut's example 16 times more map tasks, each reading its own partition. This should not break the logic of any well-written mapreduce job, and depending on the use case it may even be a benefit (I believe this is mentioned in the docs or in other discussions on the mailing list).
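As a quick sanity check of the numbers: one map task per partition means the task count multiplies by the bucket count. A trivial sketch (the helper name is hypothetical, not part of HBaseWD):

```java
// Back-of-the-envelope: with HBaseWD, the single logical scan becomes
// one scan per bucket, so a job that would otherwise run M map tasks
// runs roughly M * buckets map tasks, each reading its own partition.
// (Hypothetical helper for illustration, not part of HBaseWD.)
public class MapTaskEstimate {
    static int mapTasks(int tasksWithoutWd, int buckets) {
        return tasksWithoutWd * buckets;
    }

    public static void main(String[] args) {
        // Ionut's example: 16 buckets => 16x more map tasks
        System.out.println(mapTasks(1, 16)); // prints 16
    }
}
```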

Alex Baranau

Lily Hendra

Oct 2, 2013, 10:18:13 AM
to hba...@googlegroups.com
Alex, thanks for the clarification. That confirms what I thought it would do.
Lily