scan.setTimeRange performance
From: Eugeny Morozov <emoro...@griddynamics.com>
Date: Fri, 21 Sep 2012 16:03:21 +0400
Local: Fri, Sep 21 2012 8:03 am
Subject: scan.setTimeRange performance

Hello!

As far as I know, and as I saw in the code, the time range set by
scan.setTimeRange is used to filter out HFiles before the scan.
That means the following scanner.next call should take almost no time
if I set a time range far in the future. I am sure that I do not have
any HFiles that fall into the set time range.

But, and here is the question, scanning with a time range set surprisingly
takes far longer than scanning without it.

My results are as follows (average time per scanner.next call, in milliseconds):
Use range [false]. Time spent (avg): [0]
Use range [true]. Time spent (avg): [525]

KeyValues are returned when the time range is not used.

The code is as follows:
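    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // Assumed to be defined elsewhere in the enclosing class: family (byte[]), N (int),
    // sum (double) and randomStartRow(), which returns the random start row as byte[].
    public static void run(boolean useRange, HTable table) throws Exception {
        Scan scan = new Scan();
        scan.addFamily(family);
        scan.setCaching(-1);
        scan.setCacheBlocks(false);
        scan.setStartRow(randomStartRow());
        if (useRange) scan.setTimeRange(1348114401600L, 1348114401700L);

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (int i = 0; i < N; i++) { // several runs were measured, with N from 10 to 50
                long time = System.currentTimeMillis();
                Result result = scanner.next();
                // Divide in floating point; long / int would truncate each sample toward 0.
                sum += (System.currentTimeMillis() - time) / (double) N;
            }
        } finally {
            scanner.close(); // release the server-side scanner
        }
    }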

Of course, such measurements include all sorts of noise, such as network
overhead, but I'm using a virtual machine on my own box, and while I
measure there is no other activity on either my own box or the
virtual machine, so the noise is minimal.
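
When individual calls take only a few milliseconds, the way the average is computed can matter as much as external noise: dividing each long sample by N before summing truncates every sample toward zero. The following is a minimal sketch of that effect, not the poster's code; the class name, method names and sample values are made up for illustration.

```java
public class AverageTimingSketch {

    /** Divides every sample by the sample count before summing (integer division). */
    public static long averagePerSample(long[] elapsedMillis) {
        long sum = 0;
        for (long elapsed : elapsedMillis) {
            sum += elapsed / elapsedMillis.length; // truncates toward zero for small samples
        }
        return sum;
    }

    /** Sums all samples first and divides once, in floating point. */
    public static double averageOnce(long[] elapsedMillis) {
        long total = 0;
        for (long elapsed : elapsedMillis) {
            total += elapsed;
        }
        return (double) total / elapsedMillis.length;
    }

    public static void main(String[] args) {
        // Hypothetical per-call timings in milliseconds, all smaller than the sample count.
        long[] samples = {3, 4, 5, 2, 1, 3, 4, 2, 3, 3};
        System.out.println("per-sample division: " + averagePerSample(samples));
        System.out.println("single division:     " + averageOnce(samples));
    }
}
```

Under these assumptions, an average reported as 0 may only mean that each call took fewer milliseconds than N, not that the calls were free.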

I've also used YourKit to trace and sample the running
HRegionServer, but didn't find anything suspicious, though I didn't look
at heap and GC performance. The tracing results are attached.

So, the question is: why is it so slow when the time range is set and so fast
without it?
--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emoro...@griddynamics.com


 
From: Jean-Daniel Cryans <jdcry...@apache.org>
Date: Mon, 24 Sep 2012 12:15:08 -0700
Local: Mon, Sep 24 2012 3:15 pm
Subject: Re: scan.setTimeRange performance
Hi Eugeny,

The mailing list stripped your attachment (as it often does), so you
might want to put it on a public web server.

I don't have much to contribute other than to point to a recent
conversation that you can find here:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28722

Hope this helps,

J-D



 
From: Eugeny Morozov <emoro...@griddynamics.com>
Date: Tue, 25 Sep 2012 10:52:34 +0400
Local: Tues, Sep 25 2012 2:52 am
Subject: Re: scan.setTimeRange performance

Hi, Jean-Daniel, thanks for the reply.

I've found the reason, and it's quite simple to understand. I don't know why
I missed it.
The slow processing was caused by the specified time range being
too narrow.

First, the Region Server filters the HFiles and keeps only those it will scan.
Then it reads them (or just one HFile, as in my case), trying to find the
first 10 to 50 values that fall into the given time range. If the time range is
narrow, the Region Server must read the HFile almost completely. In
contrast, when there is no time range, it just returns the first 10 to 50
values.

That's the difference.
It is also the reason I'm now sure that the time range is working correctly =)
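
The explanation above can be illustrated with a small simulation. This is a simplified sketch, not HBase code: the class, the Cell type and all method names are hypothetical, a file is reduced to a list of cell timestamps in storage order, and the range is treated as [minTs, maxTs) the way HBase's TimeRange is. It shows that a file whose overall timestamp span overlaps the requested range cannot be skipped, and that the reader may then have to examine many cells before it finds the few that match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class TimeRangeScanSketch {

    /** A stored cell, reduced to the only attribute that matters here. */
    public static final class Cell {
        public final long timestamp;
        public Cell(long timestamp) { this.timestamp = timestamp; }
    }

    /** File-level check: can a file spanning [fileMin, fileMax] hold cells in [minTs, maxTs)? */
    public static boolean mayContain(long fileMin, long fileMax, long minTs, long maxTs) {
        return fileMax >= minTs && fileMin < maxTs;
    }

    /**
     * Walks the cells in storage order and returns how many had to be examined
     * before `wanted` cells inside [minTs, maxTs) were found (or the file ended).
     */
    public static int cellsExamined(List<Cell> file, long minTs, long maxTs, int wanted) {
        int examined = 0;
        int matched = 0;
        for (Cell cell : file) {
            if (matched == wanted) {
                break;
            }
            examined++;
            if (cell.timestamp >= minTs && cell.timestamp < maxTs) {
                matched++;
            }
        }
        return examined;
    }

    /** Builds a file whose cell timestamps are spread randomly over [minTs, maxTs). */
    public static List<Cell> buildFile(int size, long minTs, long maxTs, long seed) {
        Random random = new Random(seed);
        List<Cell> file = new ArrayList<>(size);
        long span = maxTs - minTs;
        for (int i = 0; i < size; i++) {
            file.add(new Cell(minTs + (long) (random.nextDouble() * span)));
        }
        return file;
    }

    public static void main(String[] args) {
        List<Cell> file = buildFile(1_000_000, 0L, 1_000_000L, 42L);
        int wanted = 10;
        System.out.println("file skipped for narrow range: "
                + !mayContain(0L, 999_999L, 500_000L, 500_100L));
        System.out.println("cells examined, no range:     "
                + cellsExamined(file, Long.MIN_VALUE, Long.MAX_VALUE, wanted));
        System.out.println("cells examined, narrow range: "
                + cellsExamined(file, 500_000L, 500_100L, wanted));
    }
}
```

Without a range, the first 10 cells satisfy the request immediately; with a narrow range, most of the cells examined are rejected, which matches the behaviour described above.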


 