Hello, I'm trying to determine whether Cascading can simplify the task of building the following Hadoop flow. The steps are:
1. Fetch a set of documents from MongoDB using a Mongo query (it would be great to use 10gen's Mongo Hadoop adaptor, since it handles input splits better than the Mongo tap for Cascading).
2. Locate the "picture URL" field in these documents (it's a top-level field).
3. Look up the picture in HBase, where the row key is a simple function of the picture URL (assume the row key IS the picture URL, for simplicity).
4. If the picture URL is not found in HBase, fetch the picture using HTTP GET and store its bytes in HBase.
A highly desirable optimization is to use async http reads such that N get calls are invoked concurrently (to reduce job execution time).
I can program that as a native map/red client but the code will be fairly complicated. Can Cascading offer a better story?
On Aug 20, 2012, at 11:18pm, Yuri Finkelstein wrote:
> I can program that as a native map/red client but the code will be fairly complicated. Can Cascading offer a better story?
Yes, it should be relatively simple, other than integration issues (e.g. using the Mongo Hadoop adapter with Cascading, and I'm not sure of the status of the HBase Tap).
As you pointed out, fetching URLs efficiently means threading, to avoid blocking on external I/O. You also want to ensure that you obey robots.txt, fetch politely with a crawl delay, etc. Typically I pipe URLs into a FetchPipe subassembly (see the Bixo project).
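To make the crawl-delay point concrete, here is a minimal sketch that spaces out requests to the same host. It is not Bixo's implementation; the class name PoliteSchedule, the Slot type and the fixed per-host delay are all assumptions made for illustration, and robots.txt handling is left out entirely.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: assigns each URL an earliest fetch time so that
// two requests to the same host are at least crawlDelayMillis apart.
final class PoliteSchedule {

    /** A URL paired with the earliest offset (ms from job start) at which it may be fetched. */
    static final class Slot {
        final String url;
        final long notBeforeMillis;

        Slot(String url, long notBeforeMillis) {
            this.url = url;
            this.notBeforeMillis = notBeforeMillis;
        }
    }

    static List<Slot> schedule(List<String> urls, long crawlDelayMillis) {
        Map<String, Long> nextFreeByHost = new HashMap<>();
        List<Slot> slots = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no host component (e.g. a relative URL), nothing to fetch
            }
            host = host.toLowerCase(Locale.ROOT);
            // First URL for a host may go immediately; later ones wait their turn.
            long start = nextFreeByHost.getOrDefault(host, 0L);
            slots.add(new Slot(url, start));
            nextFreeByHost.put(host, start + crawlDelayMillis);
        }
        return slots;
    }
}
```

With a 1000 ms delay, two pictures on the same host get offsets 0 and 1000, while a picture on a different host can also start at 0. A real fetcher would additionally honor any crawl delay declared in the site's robots.txt.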
-- Ken
--------------------------
Ken Krugler
http://www.scaleunlimited.com custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
I was hoping somebody would sketch a prototype assembly for this
flow. I'm new to Cascading and don't see any indication that Cascading
supports threading explicitly. Also, for the lookup phase of my flow, should I
use a custom Operation to invoke the HBase client explicitly, or instead try to
reuse the HBase tap as a source? In the latter case I'm having difficulty
visualizing the shape of my flow.
Thanks.
On Tue, Aug 21, 2012 at 6:18 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Yes, it should be relatively simple, other than integration issues (e.g.
> using the Mongo Hadoop adapter w/Cascading, and I'm not sure of the status
> of the HBase Tap)
> As you pointed out, fetching URLs efficiently means threading, to avoid
> blocking on external I/O. You also want to ensure that you obey robots.txt,
> fetch politely with a crawl delay, etc. Typically I pipe URLs into a
> FetchPipe subassembly (see the Bixo project).
> --
> You received this message because you are subscribed to the Google Groups
> "cascading-user" group.
> To post to this group, send email to cascading-user@googlegroups.com.
> To unsubscribe from this group, send email to
> cascading-user+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/cascading-user?hl=en.
On Aug 21, 2012, at 9:31am, Yuri Finkelstein wrote:
> I was hoping somebody would sketch up a prototype of assembly for this flow.
Something like the below…
Pipe p = new Pipe("picture url pipe");
p = new Each(p, new Fields("picture-url"), new Identity()); // keep only the picture-url field
p = new Each(p, new Fields("picture-url"), new MyCustomHBaseOperation()); // Assume this only emits entries where picture isn't in HBase
FetchPipe fetchPipe = new FetchPipe(xxx, yyy); // from Bixo
p = fetchPipe.getContentTailPipe(); // two pipes from sub-assembly - one has status, one has fetched content
p = new Each(p, new MyCustomContentPreparer()); // get content into form/fields required for HBase
Tap source = new MongoDBTap(xxx);
Tap sink = new HBaseDBTap(xxx);
Flow f = new FlowConnector().connect(source, sink, p);
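The MyCustomHBaseOperation step in the sketch above is assumed to pass through only URLs whose picture is not yet stored. The core of that check could look roughly like the following. Everything here is hypothetical: PictureStore stands in for a real HBase client lookup, and MissingPictureFilter is plain Java rather than an actual Cascading operation. The isRemove name just mirrors the sense of a Cascading Filter, where returning true drops the tuple.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over the HBase lookup; a real version would
// issue a Get against the picture table and check whether a row came back.
interface PictureStore {
    boolean contains(byte[] rowKey);
}

final class MissingPictureFilter {
    private final PictureStore store;

    MissingPictureFilter(PictureStore store) {
        this.store = store;
    }

    // Row key is the URL itself (as UTF-8 bytes), per the simplifying
    // assumption stated in the original question.
    static byte[] rowKey(String pictureUrl) {
        return pictureUrl.getBytes(StandardCharsets.UTF_8);
    }

    // True means "drop this URL": it is unusable or already stored.
    boolean isRemove(String pictureUrl) {
        return pictureUrl == null
            || pictureUrl.isEmpty()
            || store.contains(rowKey(pictureUrl));
    }

    // Convenience for trying the logic outside a flow: keeps only URLs still to be fetched.
    List<String> missing(List<String> pictureUrls) {
        List<String> result = new ArrayList<>();
        for (String url : pictureUrls) {
            if (!isRemove(url)) {
                result.add(url);
            }
        }
        return result;
    }
}
```

In a real flow the isRemove logic would sit inside a Cascading Filter, with the HBase connection opened once per task rather than once per tuple.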
> I'm new to Cascading and don't see any indication that Cascading supports threading explicitly.
It doesn't.
But the Bixo open source web mining toolkit has a Cascading Subassembly (called FetchPipe) that you can feed URLs, and it has threading support.
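For a sense of what that threading involves if done by hand, here is a minimal sketch of a bounded concurrent fetch using a plain Java thread pool. It is not how FetchPipe is implemented. ConcurrentFetcher and its fetcher parameter are hypothetical names, and the function passed in stands for whatever HTTP GET call you would actually use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs at most maxConcurrent fetches at a time.
final class ConcurrentFetcher {

    // Returns URL -> bytes for every fetch that succeeded; failed fetches are omitted.
    static Map<String, byte[]> fetchAll(List<String> urls, int maxConcurrent,
                                        Function<String, byte[]> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<byte[]> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task)); // queued until a worker thread is free
            }
            Map<String, byte[]> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // This fetch threw; skip it and keep collecting the others.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;                              // stop waiting, return what we have
                }
            }
            return results;
        } finally {
            pool.shutdownNow(); // release worker threads even on early exit
        }
    }
}
```

Inside a Hadoop task, a pool like this would have to be created and drained within a single operation, which is part of why reusing an existing subassembly is attractive.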
> Also, for the lookup phase of my flow, should I use a custom Operation to invoke the HBase client explicitly, or instead try to reuse the HBase tap as a source? In the latter case I'm having difficulty visualizing the shape of my flow.
You'd want a custom operation.
-- Ken