Efficient split of output data

Ken Krugler

Nov 5, 2009, 5:39:04 PM
to cascadi...@googlegroups.com
Hi all,

Currently in Bixo I have a Cascading sub-assembly called ParsePipe,
that consumes FetchedDatum tuples (output from good web page fetches),
runs an injected parser, and generates ParsedDatum tuples.

The ParsedDatum tuple has the result of parsing the downloaded
content, and also an array of discovered outlinks from web pages.

Currently I take this output, split it by defining a new pipe, and
process the outlinks to generate new UrlDatum tuples that can be
injected back into the crawl process.

But I was wondering if there was a more efficient way, where the
parse operation avoids carrying the outlinks in the ParsedDatum and
instead immediately splits them out into a separate tail pipe of the
ParsePipe sub-assembly. That tail pipe would emit new OutlinkDatum
tuples (source page URL, outlink URL, attributes on link, anchor text,
etc.), and I'd remove the outlink array from the ParsedDatum tuple.

Based on what I had to do for a similar situation in the FetchPipe,
there's no really simple way to do this. I'll need to append the
outlinks to the parsed result tuple being generated by my Each
operator, then split that pipe and filter two ways - the new pipe
would just take the outlinks, and the original pipe would filter them
out to just leave the parse data.
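
As a rough illustration of the workaround described above, here is a
plain-Java sketch (no Cascading or Bixo classes). It assumes one
operation emits both parse rows and outlink rows, each tagged with a
kind marker, and that two downstream branches each keep one kind. The
names Kind, Row, parse and keepOnly are hypothetical and exist only
for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; no Cascading or Bixo classes are used.
class TaggedSplitSketch {

    // Hypothetical marker for which kind of record a row carries.
    enum Kind { PARSED, OUTLINK }

    // Hypothetical stand-in for a tuple: a kind marker plus two fields.
    static final class Row {
        final Kind kind;
        final String sourceUrl;
        final String payload; // parsed text, or the outlink URL

        Row(Kind kind, String sourceUrl, String payload) {
            this.kind = kind;
            this.sourceUrl = sourceUrl;
            this.payload = payload;
        }
    }

    // Stand-in for the parse operation: one row of parse data per page,
    // followed by one row per discovered outlink.
    static List<Row> parse(String sourceUrl, String text, List<String> outlinks) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(Kind.PARSED, sourceUrl, text));
        for (String link : outlinks) {
            rows.add(new Row(Kind.OUTLINK, sourceUrl, link));
        }
        return rows;
    }

    // Stand-in for one branch of the split: keep only rows of one kind.
    static List<Row> keepOnly(List<Row> rows, Kind kind) {
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            if (row.kind == kind) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> rows = parse("http://example.com/", "page text",
                Arrays.asList("http://example.com/a", "http://example.com/b"));
        System.out.println("parse rows: " + keepOnly(rows, Kind.PARSED).size());
        System.out.println("outlink rows: " + keepOnly(rows, Kind.OUTLINK).size());
    }
}
```

Note that both branches still scan every row, which is the cost the
question is about.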

Or is there some cleaner way?

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g


Chris K Wensel

Nov 6, 2009, 11:46:36 AM
to cascadi...@googlegroups.com
Sorry, must be too early, but I can't parse out what you're asking.

Might be easier to put this on paper over coffee.

ckw
--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

Ken Krugler

Nov 10, 2009, 1:07:42 PM
to cascadi...@googlegroups.com
Hi Chris,

Sorry, I should have written a generic example that didn't include so
much Bixo stuff :)

I have a pipe with tuples, and a binary function that classifies each
tuple as "left" or "right".

I want to wind up with two pipes, one that just has tuples classified
as "left", and the other with only those tuples that were classified
as "right".

Currently I do this in a Splitter sub-assembly that splits the input
pipe to create two tail pipes, and applies the function (via a Filter)
to each of the tail pipes (with one having reversed sense).

But this means the same data is getting processed twice.
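
To make the cost concrete, here is a minimal plain-Java sketch (not
Cascading code) that counts classifier calls for the two-filter split
described above versus a single pass that routes each tuple as it is
classified. Split, splitWithTwoFilters and splitInOnePass are
hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Plain-Java illustration only; no Cascading classes are used.
class SplitCostSketch {

    // Holds the two result "pipes" (hypothetical helper type).
    static final class Split<T> {
        final List<T> left = new ArrayList<>();
        final List<T> right = new ArrayList<>();
    }

    // Mirrors the Splitter: two independent filter passes over the same
    // input, the second with the sense of the classifier reversed.
    static <T> Split<T> splitWithTwoFilters(List<T> input, Predicate<T> isLeft) {
        Split<T> out = new Split<>();
        for (T t : input) {
            if (isLeft.test(t)) {
                out.left.add(t);
            }
        }
        for (T t : input) {
            if (!isLeft.test(t)) {
                out.right.add(t);
            }
        }
        return out;
    }

    // Classifies each element once and routes it to one of the two outputs.
    static <T> Split<T> splitInOnePass(List<T> input, Predicate<T> isLeft) {
        Split<T> out = new Split<>();
        for (T t : input) {
            if (isLeft.test(t)) {
                out.left.add(t);
            } else {
                out.right.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Example classifier: even numbers go "left". It counts its own calls.
        Predicate<Integer> isLeft = n -> {
            calls.incrementAndGet();
            return n % 2 == 0;
        };
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

        splitWithTwoFilters(input, isLeft);
        System.out.println("two filters: " + calls.get() + " classifier calls");

        calls.set(0);
        splitInOnePass(input, isLeft);
        System.out.println("one pass: " + calls.get() + " classifier calls");
    }
}
```

With five input elements, the two-filter version calls the classifier
ten times and the single-pass version five times.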

If the upstream operation that's feeding my Splitter input pipe is a
reducer, then under the hood Cascading is writing those tuples to a
temp HDFS location anyway, and using that as input for the next map,
right?

So I could use a modified version of MultiSinkTap to create two
temporary output files from the input pipe (via applying the
classifier function), and use those as separate input sources for a
subsequent flow, but that's kind of complicated and creates two flows.
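
The two-temp-file idea can be sketched in plain Java as follows. This
is only an illustration of classifying each record once while writing
it to one of two files; it does not use MultiSinkTap or any other
Cascading API, and local temp files stand in for HDFS. The names
splitToTempFiles and readLines are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java illustration only; local temp files stand in for HDFS.
class TwoFileSplitSketch {

    // Writes each line to exactly one of two temp files, classifying it once.
    // Returns an array holding {leftFile, rightFile}.
    static Path[] splitToTempFiles(List<String> lines, Predicate<String> isLeft) {
        try {
            Path left = Files.createTempFile("left-", ".txt");
            Path right = Files.createTempFile("right-", ".txt");
            try (BufferedWriter l = Files.newBufferedWriter(left, StandardCharsets.UTF_8);
                 BufferedWriter r = Files.newBufferedWriter(right, StandardCharsets.UTF_8)) {
                for (String line : lines) {
                    BufferedWriter target = isLeft.test(line) ? l : r;
                    target.write(line);
                    target.newLine();
                }
            }
            return new Path[] { left, right };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a file back, as a later step would when using it as a source.
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path[] files = splitToTempFiles(
                Arrays.asList("a1", "b2", "a3"), s -> s.startsWith("a"));
        System.out.println("left: " + readLines(files[0]));
        System.out.println("right: " + readLines(files[1]));
    }
}
```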

Which made me wonder if I was over-complicating things, and there was
an easier way to efficiently do this split.

Thanks,

-- Ken
> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google
> Groups "cascading-user" group.
> To post to this group, send email to cascadi...@googlegroups.com
> To unsubscribe from this group, send email to cascading-use...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en
> -~----------~----~----~----~------~----~------~--~---

Chris K Wensel

Nov 12, 2009, 2:32:07 PM
to cascadi...@googlegroups.com
This really depends on whether your cluster is IO or CPU bound. It
would be nice to have a tool to benchmark clusters to decide which
method makes more sense for a given cluster.

ckw