Atomic increment: Distribute Sequential HBase Writes


Syed Abdul Kather

Jul 20, 2012, 8:55:00 AM
to HBaseWD - Distribute Sequential HBase Writes
Team,
In my use case I have an HBase table that contains more than 20
million records, and my row key is unique (generated with
incrementColumnValue, like an auto-generated number); it is later used
by Solr for permission checks. I call that incrementColumnValue
function from .NET via the Thrift API; the processing is handled on the
Java (Hadoop) side. When I tried to run a MapReduce program, I suffered
from a hotspotting problem.


I plan to change this incrementColumnValue function so that it
returns a salted row key (as per HBaseWD). Is this the right way to
approach it, or is there another method that gives better performance?

Thanks in advance.

Syed Abdul Kather


Alex Baranau

Jul 20, 2012, 10:23:47 AM
to hba...@googlegroups.com
Hi Syed,

Before I can judge whether HBaseWD is the right thing for you to use, let me ask you several questions:

What is the access pattern of the data in HBase? I guess you process the data with a MapReduce job (to create an index in Solr?). Is this the primary access pattern?

I guess that, apart from the uniqueness of row keys, the reason you use a continuously incrementing row key is to be able to fetch the "newly arrived (not yet processed) delta" (e.g. to feed into an MR job). Is that so? Or do you use it solely to achieve uniqueness? Do you write to HBase from different machines?

Perhaps, to better understand your use-case, I should ask: have you considered using a random UUID as the row key? If yes, why is it not appropriate in your case?

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

P.S. Thanks for your interest in the HBaseWD project.

Syed Abdul Kather

Jul 20, 2012, 11:30:37 AM
to HBaseWD - Distribute Sequential HBase Writes
Thanks, Alex, for your quick response.

> What is the access pattern of the data in HBase?
We maintain the row key solely to achieve uniqueness. But at the same
time, the data should be kept in sequence, in incrementing order.

> I guess you process the data with a MapReduce job (to create an index
> in Solr?)
No, I have custom logic that generates a map file per user, which is
then accessed by Solr for authorization purposes.

> Or do you use it solely to achieve uniqueness?
Yes. Say there are 10 documents (in real time I have 10 billion) and
I have 10 users.
USER1 = [0,0,1,0,1,0,0,0,0,1] means user 1 has access to doc 3, doc 5,
and doc 10. At the Lucene level I decide, for each row key (the unique
key in Solr), whether the user has access or not.

> Perhaps, to better understand your use-case, I should ask: have you
> considered using a random UUID as the row key? If yes, why is it not
> appropriate in your case?

No. In our logic, each row key maps to a bit position in my (custom)
map file. If there are 8 million documents, then the size of the
(custom) map file will be 1 MB.


My MapReduce program generates this custom map file.
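
A rough sketch of that check (java.util.BitSet stands in here for the custom map file; the names and values are illustrative only):

import java.util.BitSet;

public class PermissionMap {
    public static void main(String[] args) {
        // USER1 = [0,0,1,0,1,0,0,0,0,1]: access to doc 3, doc 5, doc 10
        BitSet user1 = new BitSet(10);
        user1.set(2);  // doc 3 (bit positions are 0-based)
        user1.set(4);  // doc 5
        user1.set(9);  // doc 10

        // At the Lucene level: may the user see the document whose
        // row key (the unique key in Solr) is 5?
        long rowKey = 5;
        System.out.println(user1.get((int) (rowKey - 1)));  // prints true
    }
}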




> Do you write to HBase from different machines?
No, it will be from one machine only.


Alex Baranau

Jul 20, 2012, 12:30:45 PM
to hba...@googlegroups.com
> the data should be kept in sequence, in incrementing order

OK, so this is a requirement in your system. And do you need to scan (or feed an MR job) based on a range in this sequence of ids?

If not, i.e. if you don't scan (or feed an MR job) based on a row-key range (which represents a sequence), then it might be OK to simply prefix the keys with a hash. In your MR job you can just strip that prefix, and you will also be able to access an individual document by id, using rowKey = hash(originalId) + originalId. If you don't need to fetch a *range* of records by start/stop documentId, this should be enough.
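
A minimal sketch of that keying scheme (helper names are illustrative; since the ids from incrementColumnValue are non-negative longs, a simple mod can play the role of the hash):

import org.apache.hadoop.hbase.util.Bytes;

public class HashPrefixedKeys {
    private static final int BUCKETS = 32;

    // rowKey = hash(originalId) + originalId
    public static byte[] toRowKey(long originalId) {
        byte prefix = (byte) (originalId % BUCKETS);  // ids are non-negative
        return Bytes.add(new byte[] { prefix }, Bytes.toBytes(originalId));
    }

    // Strip the one-byte prefix to recover the original id, e.g. in a mapper.
    public static long toOriginalId(byte[] rowKey) {
        return Bytes.toLong(rowKey, 1);
    }
}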

If you do have to fetch a *range* of records by start/stop documentId, then HBaseWD will be of help: it provides the ability to scan, and to feed a range of records to an MR job, based on start/stop keys even after the row keys have been prefixed.
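
For reference, a sketch of such a range scan (class and method names as in the HBaseWD README; the table name and key range are placeholders, and ids are assumed to be 8-byte longs). For the MR case, the project similarly provides WdTableInputFormat as a drop-in replacement for TableInputFormat:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

import com.sematext.hbase.wd.AbstractRowKeyDistributor;
import com.sematext.hbase.wd.DistributedScanner;
import com.sematext.hbase.wd.RowKeyDistributorByOneBytePrefix;

public class RangeScanExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "ObjectSequence4");
        AbstractRowKeyDistributor keyDistributor =
                new RowKeyDistributorByOneBytePrefix((byte) 32);

        // start/stop are *original* (unprefixed) document ids; the scanner
        // fans out one scan per salt bucket and merges the results.
        Scan scan = new Scan(Bytes.toBytes(1000L), Bytes.toBytes(2000L));
        ResultScanner rs = DistributedScanner.create(table, scan, keyDistributor);
        for (Result r : rs) {
            byte[] originalKey = keyDistributor.getOriginalKey(r.getRow());
            System.out.println(Bytes.toLong(originalKey));
        }
        rs.close();
        table.close();
    }
}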


Hope this helps,

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

syed kather

Jul 20, 2012, 1:14:34 PM
to hba...@googlegroups.com

Thank you so much for your valuable reply.
In my case I do need to do range queries with start/stop keys. HBaseWD has that feature, right?

syed kather

Aug 3, 2012, 11:21:46 AM
to hba...@googlegroups.com
Alex,
   I added this code:

byte bucketsCount = 32;

public byte[] incrementColumnValue(final byte[] row, final byte[] family,
        final byte[] qualifier, final long amount, boolean writeToWAL,
        byte bucketsCount) throws IOException {
    RowKeyDistributorByOneBytePrefix keyDistributor =
            new RowKeyDistributorByOneBytePrefix(bucketsCount);
    return keyDistributor.getDistributedKey(Bytes.toBytes(
            incrementColumnValue(keyDistributor.getOriginalKey(row),
                    family, qualifier, amount, true)));
}

but all the keys are going into one region.

What may be the reason? Please help me.





Syed Abdul Kather

Aug 3, 2012, 11:54:48 AM
to hba...@googlegroups.com
Alex, sorry, that was my mistake. I understood the problem: the constructor keeps getting called. I am trying to fix it now.

syed kather

Aug 3, 2012, 12:44:05 PM
to hba...@googlegroups.com
Team,
   How do I do pre-splitting for an HBase table? As I mentioned in the earlier mail, my row key follows an incremental sequence.


When I created a table as given below, I noticed that the start keys and end keys look like this:

bin/hbase org.apache.hadoop.hbase.util.RegionSplitter ObjectSequence4 -c 3 -f SequenceFamily:ObjectArray:UserArray

Name | Region Server | Start Key | End Key | Requests
ObjectSequence4,,1344011081570.941ae89b0c729c924b80d8cf8b8fa4b8. | slave1:60030 | (none) | 2aaaaaaa | 31670
ObjectSequence4,2aaaaaaa,1344011081570.f4c5647be69af33116fc214f913bb312. | slave2:60030 | 2aaaaaaa | 55555554 | 0
ObjectSequence4,55555554,1344011081570.e5bc298b6915f2b9c42bb28e31c8e640. | master:60030 | 55555554 | (none) | 0
Sorry if the question is too simple. I searched the net but couldn't find the right answer.
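
(For reference: the boundaries above appear to come from RegionSplitter's default hex-string split algorithm, which won't line up with one-byte-prefixed binary keys. With a one-byte salt, one way to pre-split is to use the bucket prefixes themselves as split keys, so each salt bucket gets its own region. A sketch against the 0.92-era client API, reusing the table and family names from the command above:)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        byte bucketsCount = 32;

        // One split point per salt bucket: {0x01}, {0x02}, ..., {0x1f}
        byte[][] splitKeys = new byte[bucketsCount - 1][];
        for (byte i = 1; i < bucketsCount; i++) {
            splitKeys[i - 1] = new byte[] { i };
        }

        HTableDescriptor desc = new HTableDescriptor("ObjectSequence4");
        desc.addFamily(new HColumnDescriptor("SequenceFamily"));

        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}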

Other than that, I have another doubt.
  I am calling incrementColumnValue from the Thrift API. How do I call this custom incrementColumnValue function from the Thrift API? And as I understand,

RowKeyDistributorByOneBytePrefix keyDistributor = new RowKeyDistributorByOneBytePrefix(bucketsCount);

needs to be initialized only once, at the beginning. I have no idea how to do that.

Now my incrementColumnValue function looks like:

public byte[] incrementColumnValue(final byte[] row,
        RowKeyDistributorByOneBytePrefix keyDistributor, final byte[] family,
        final byte[] qualifier, final long amount, boolean writeToWAL,
        byte bucketsCount) throws IOException {
    byte[] key = keyDistributor.getDistributedKey(Bytes.toBytes(
            incrementColumnValue(keyDistributor.getOriginalKey(row),
                    family, qualifier, amount, true)));
    return key;
}
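
(One way to do the one-time initialization, as a sketch: hold the distributor in a field of the long-lived server-side handler object, so that, as I understand the one-byte-prefix distributor, its internal round-robin prefix counter survives across calls. The class name here is hypothetical, and it assumes the counter row itself is not salted:)

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

import com.sematext.hbase.wd.RowKeyDistributorByOneBytePrefix;

public class SequenceHandler {
    private final HTable table;

    // Created once when the server starts; a fresh instance per call would
    // always start at the same prefix, so every key lands in one region.
    private final RowKeyDistributorByOneBytePrefix keyDistributor;

    public SequenceHandler(HTable table, byte bucketsCount) {
        this.table = table;
        this.keyDistributor = new RowKeyDistributorByOneBytePrefix(bucketsCount);
    }

    // Increments the counter cell and returns the salted row key
    // for the newly generated sequence value.
    public byte[] incrementColumnValue(byte[] row, byte[] family,
            byte[] qualifier, long amount) throws IOException {
        long next = table.incrementColumnValue(row, family, qualifier, amount);
        return keyDistributor.getDistributedKey(Bytes.toBytes(next));
    }
}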






Thanks and Regards,
S Syed Abdul Kather