Hbase lookup and tap
From: Yuri Finkelstein <yurif2...@gmail.com>
Date: Mon, 20 Aug 2012 23:18:48 -0700 (PDT)
Local: Tues, Aug 21 2012 2:18 am
Subject: Hbase lookup and tap

Hello,
I'm trying to determine whether Cascading can simplify the task of building the
following Hadoop flow. The steps in this flow are:

1. Fetch a set of documents from MongoDB using a Mongo query (it would be
great to use 10gen's Mongo Hadoop adaptor, since it handles input splits
better than the Mongo tap for Cascading).
2. Locate the "picture URL" field in these documents (it's a top-level field).
3. Look up the picture in HBase on Hadoop, where the row key is a simple
function of the picture URL (assume the row key IS the picture URL, for
simplicity).
4. If the picture URL is not found in HBase, fetch the picture using an
HTTP GET and store its bytes in HBase.

A highly desirable optimization is to use async HTTP reads so that N GET
calls are invoked concurrently (to reduce job execution time).
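
For illustration, the "N concurrent GET calls" idea could be sketched in plain Java roughly as follows. This is only a sketch under assumptions: `ConcurrentFetcher`, `fetchAll`, and the `fetcher` function are hypothetical names, the actual HTTP call is left to whatever `fetcher` wraps, and nothing here is Cascading or Bixo API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: fetches a batch of URLs with at most maxConcurrent calls in flight. */
class ConcurrentFetcher {

    static Map<String, byte[]> fetchAll(List<String> urls,
                                        Function<String, byte[]> fetcher,
                                        int maxConcurrent) {
        // The pool size caps how many fetches run at the same time (the "N" above).
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            // Submit every URL up front; tasks beyond the cap wait in the pool's queue.
            List<Future<byte[]>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<byte[]> task = () -> fetcher.apply(url);
                pending.add(pool.submit(task));
            }
            // Collect results in input order; get() blocks until that fetch is done.
            Map<String, byte[]> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), pending.get(i).get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while fetching", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("fetch failed for " + urls.get(i), e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing the fetch logic in as a function keeps the threading separate from the HTTP client, so the same helper can be exercised with a fake fetcher before wiring in real network calls.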

I can program that as a native map/red client but the code will be fairly
complicated. Can Cascading offer a better story?

Thanks!
Yuri


 
From: Ken Krugler <kkrugler_li...@transpac.com>
Date: Tue, 21 Aug 2012 06:18:15 -0700
Local: Tues, Aug 21 2012 9:18 am
Subject: Re: Hbase lookup and tap

Hi Yuri,

On Aug 20, 2012, at 11:18pm, Yuri Finkelstein wrote:

> I can program that as a native map/red client but the code will be fairly complicated. Can Cascading offer a better story?

Yes, it should be relatively simple, other than integration issues (e.g. using the Mongo Hadoop adapter with Cascading, and I'm not sure of the status of the HBase Tap).

As you pointed out, fetching URLs efficiently means threading, to avoid blocking on external I/O. You also want to ensure that you obey robots.txt, fetch politely with a crawl delay, etc. Typically I pipe URLs into a FetchPipe subassembly (see the Bixo project).
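
As a rough illustration of the politeness point, one common first step is to group URLs by host so that a per-host crawl delay can be applied to each group. The sketch below uses only the JDK; `HostGrouper` and `groupByHost` are hypothetical names, and this is not how Bixo itself is implemented.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: buckets URLs by host so each host can be fetched with its own delay. */
class HostGrouper {

    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            // URI.create throws IllegalArgumentException for malformed URLs.
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no host component, nothing to fetch politely
            }
            // Host names are case-insensitive, so normalize before grouping.
            String key = host.toLowerCase(Locale.ROOT);
            byHost.computeIfAbsent(key, h -> new ArrayList<>()).add(url);
        }
        return byHost;
    }
}
```

Each group can then be fetched sequentially with a pause between requests, while different hosts are fetched in parallel.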

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


 
From: Yuri Finkelstein <yurif2...@gmail.com>
Date: Tue, 21 Aug 2012 09:31:14 -0700
Local: Tues, Aug 21 2012 12:31 pm
Subject: Re: Hbase lookup and tap

I was hoping somebody would sketch a prototype assembly for this
flow. I'm new to Cascading and don't see any indication that Cascading
supports threading explicitly. Also, for the lookup phase of my flow, should I
use a custom Operation to invoke the HBase client explicitly, or instead try to
reuse the HBase tap as a source? In the latter case I'm having difficulty
visualizing the shape of my flow.
Thanks.

From: Ken Krugler <kkrugler_li...@transpac.com>
Date: Tue, 21 Aug 2012 10:12:23 -0700
Local: Tues, Aug 21 2012 1:12 pm
Subject: Re: Hbase lookup and tap

On Aug 21, 2012, at 9:31am, Yuri Finkelstein wrote:

> I was hoping somebody would sketch a prototype assembly for this flow.

Something like the below…

Pipe p = new Pipe("picture url pipe");
p = new Each(p, new Fields("picture-url"), new Identity());                // keep only the picture-url field
p = new Each(p, new Fields("picture-url"), new MyCustomHBaseOperation());  // assume this only emits entries where the picture isn't in HBase
FetchPipe fetchPipe = new FetchPipe(p, xxx, yyy);                          // from Bixo; fed by the URL pipe above
p = fetchPipe.getContentTailPipe();                                        // the sub-assembly has two tail pipes: one has status, one has fetched content
p = new Each(p, new MyCustomContentPreparer());                            // get content into the form/fields required for HBase

Tap source = new MongoDBTap(xxx);
Tap sink = new HBaseDBTap(xxx);

Flow f = new FlowConnector().connect(source, sink, p);

> I'm new to Cascading and don't see any indication that Cascading supports threading explicitly.

It doesn't.

But the Bixo open source web mining toolkit has a Cascading Subassembly (called FetchPipe) that you can feed URLs, and it has threading support.

> Also, for the lookup phase of my flow, should I use a custom Operation to invoke the HBase client explicitly, or instead try to reuse the HBase tap as a source? In the latter case I'm having difficulty visualizing the shape of my flow.

You'd want a custom operation.
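
To make the lookup step concrete, the per-URL logic that such an operation might wrap can be sketched in plain Java as below. The names `MissingPictureFilter`, `missingUrls`, and `existsInStore` are hypothetical; in a real flow the predicate would be backed by an HBase existence check, and the loop would live inside a Cascading operation rather than a static method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for the lookup step: keeps only URLs with no stored picture. */
class MissingPictureFilter {

    /**
     * Returns, in input order, the URLs for which existsInStore reports false.
     * existsInStore is an assumption: it stands in for a row-key lookup in HBase.
     */
    static List<String> missingUrls(List<String> pictureUrls, Predicate<String> existsInStore) {
        List<String> missing = new ArrayList<>();
        for (String url : pictureUrls) {
            if (!existsInStore.test(url)) {
                missing.add(url); // only these need to go on to the fetch stage
            }
        }
        return missing;
    }
}
```

For a quick check without HBase, a `Set<String>` of already stored URLs can serve as the store, e.g. `missingUrls(urls, stored::contains)`.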

-- Ken


 
From: Yuri Finkelstein <yurif2...@gmail.com>
Date: Tue, 21 Aug 2012 18:22:26 -0700
Local: Tues, Aug 21 2012 9:22 pm
Subject: Re: Hbase lookup and tap

Ok, thanks. That's a good starting point! I might come back with some
questions :)
