HBase lookup and tap

Yuri Finkelstein

Aug 21, 2012, 2:18:48 AM

to cascadi...@googlegroups.com

Hello,

I'm trying to determine whether Cascading can simplify building the following Hadoop flow. The steps are:

1. Fetch a set of documents from MongoDB using a Mongo query (it would be great to use 10gen's Mongo Hadoop adaptor, since it handles input splits better than the Mongo tap for Cascading).

2. Locate the "picture URL" field in these documents (it's a top-level field).

3. Look up the picture in HBase, where the row key is a simple function of the picture URL (assume the row key IS the picture URL, for simplicity).

4. If the picture URL is not found in HBase, fetch the picture with an HTTP GET and store its bytes in HBase.

A highly desirable optimization is to use async HTTP reads so that N GET calls run concurrently (to reduce job execution time).

I can program this as a native MapReduce job, but the code would be fairly complicated. Can Cascading offer a better story?

Thanks!

Yuri

Ken Krugler

Aug 21, 2012, 9:18:15 AM

Hi Yuri,

Yes, it should be relatively simple, other than integration issues (e.g. using the Mongo Hadoop adapter with Cascading; I'm also not sure of the status of the HBase Tap).

As you pointed out, fetching URLs efficiently means threading, to avoid blocking on external I/O. You also want to ensure that you obey robots.txt, fetch politely with a crawl delay, etc. Typically I pipe URLs into a FetchPipe subassembly (see the Bixo project).

-- Ken

--------------------------

Ken Krugler

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Mahout & Solr

Yuri Finkelstein

Aug 21, 2012, 12:31:14 PM

I was hoping somebody would sketch a prototype assembly for this flow. I'm new to Cascading and don't see any indication that Cascading supports threading explicitly. Also, for the lookup phase of my flow, should I use a custom Operation that invokes the HBase client explicitly, or instead try to reuse the HBase tap as a source? In the latter case I'm having difficulty visualizing the shape of my flow.

Thanks.

Ken Krugler

Aug 21, 2012, 1:12:23 PM

On Aug 21, 2012, at 9:31am, Yuri Finkelstein wrote:

> I was hoping somebody would sketch a prototype assembly for this flow.

Something like the below…

Pipe p = new Pipe("picture url pipe");

p = new Each(p, new Fields("picture-url"), new Identity()); // keep only the picture-url field

p = new Each(p, new Fields("picture-url"), new MyCustomHBaseOperation()); // Assume this only emits entries where picture isn't in HBase

FetchPipe fetchPipe = new FetchPipe(xxx, yyy); // from Bixo; p would be passed in as the URL source

p = fetchPipe.getContentTailPipe(); // two pipes from sub-assembly - one has status, one has fetched content

p = new Each(p, new MyCustomContentPreparer()); // get content into form/fields required for HBase

Tap source = new MongoDBTap(xxx);

Tap sink = new HBaseDBTap(xxx);

Flow f = new FlowConnector().connect(source, sink, p);

> I'm new to Cascading and don't see any indication that Cascading supports threading explicitly.

It doesn't.

But the Bixo open source web mining toolkit has a Cascading Subassembly (called FetchPipe) that you can feed URLs, and it has threading support.
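To give a feel for what the threading involves if you rolled it yourself rather than using FetchPipe, here is a minimal plain-Java sketch. It is not Bixo's implementation. `ConcurrentFetchSketch`, `fetchAll`, and the `fetcher` function are hypothetical names; `fetcher` stands in for a real HTTP GET, and politeness handling (robots.txt, crawl delay) is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; not part of Cascading or Bixo.
class ConcurrentFetchSketch {

    // Fetches every URL with at most maxConcurrent requests in flight.
    // 'fetcher' is an assumed stand-in for the real HTTP GET.
    // URLs whose fetch throws are left out of the result.
    static Map<String, byte[]> fetchAll(List<String> urls, int maxConcurrent,
                                        Function<String, byte[]> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Callable<byte[]>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            // invokeAll blocks until all tasks finish and returns the
            // futures in the same order as the submitted tasks.
            List<Future<byte[]>> futures = pool.invokeAll(tasks);
            Map<String, byte[]> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // Fetch failed; skip it so a later run can retry.
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a.jpg",
                                          "http://example.com/b.jpg");
        // Fake fetcher: returns the URL's own bytes instead of doing I/O.
        Map<String, byte[]> fetched =
            fetchAll(urls, 2, url -> url.getBytes(StandardCharsets.UTF_8));
        System.out.println(fetched.keySet());
    }
}
```

Inside a Cascading operation you would call something like this on a batch of buffered URLs, which is roughly the bookkeeping FetchPipe saves you from writing.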

> Also, for the lookup phase of my flow, should I use a custom Operation that invokes the HBase client explicitly, or instead try to reuse the HBase tap as a source? In the latter case I'm having difficulty visualizing the shape of my flow.

You'd want a custom operation.
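As a rough illustration of the logic such an operation would carry, here is a plain-Java sketch with no Cascading or HBase dependencies. `PictureStore` and `MissingPictureFilterSketch` are hypothetical names, and `PictureStore` is an assumed stand-in for the HBase client lookup. The `isRemove` method mirrors the semantics of Cascading's `Filter` interface, where returning true drops the tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an HBase table keyed by picture URL.
interface PictureStore {
    boolean contains(String rowKey);
}

// Hypothetical sketch of the filtering logic; not a real Cascading class.
class MissingPictureFilterSketch {
    private final PictureStore store;

    MissingPictureFilterSketch(PictureStore store) {
        this.store = store;
    }

    // The thread assumes the row key IS the picture URL.
    static String rowKeyFor(String pictureUrl) {
        return pictureUrl;
    }

    // True means "drop this entry": either there is no URL to fetch,
    // or the picture is already stored.
    boolean isRemove(String pictureUrl) {
        return pictureUrl == null || store.contains(rowKeyFor(pictureUrl));
    }

    // Convenience for trying the filter outside a flow: returns the
    // URLs that still need fetching, in their original order.
    List<String> missing(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            if (!isRemove(url)) {
                out.add(url);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>(Arrays.asList("http://example.com/a.jpg"));
        MissingPictureFilterSketch filter = new MissingPictureFilterSketch(stored::contains);
        System.out.println(filter.missing(Arrays.asList(
            "http://example.com/a.jpg", "http://example.com/b.jpg")));
    }
}
```

In a real flow the same check would sit in a class that extends `BaseOperation` and implements `Filter`, opening the HBase connection in `prepare()` and closing it in `cleanup()` so it is not re-created for every tuple.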

-- Ken

Yuri Finkelstein

Aug 21, 2012, 9:22:26 PM

Ok, thanks. That's a good starting point! I might come back with some questions :)
