Currently in Bixo I have a Cascading subassembly called ParsePipe
that consumes FetchedDatum tuples (the output of successful web page
fetches), runs an injected parser, and generates ParsedDatum tuples.
Each ParsedDatum tuple has the result of parsing the downloaded
content, plus an array of the outlinks discovered on that page.

Currently I take this output, split it by defining a new pipe, and
process the outlinks to generate new UrlDatum tuples that can be
injected back into the crawl process.

But I was wondering if there is a more efficient way, where the
parse operation avoids carrying the outlinks as payload in the
ParsedDatum and instead immediately emits them to a separate tail
pipe of the ParsePipe subassembly as new OutlinkDatum tuples
(source page URL, outlink URL, attributes on the link, anchor text,
etc). I could then remove the outlink array from the ParsedDatum tuple.
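
To make the shape I'm after more concrete, here is a rough sketch in plain Java. It is not Cascading code and not what Bixo does today. The class names SlimParsedDatum and ParseOutput, the field names, and the emit() helper are all made up for illustration, and the two lists just stand in for the two tails of the subassembly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed OutlinkDatum tuple (field names assumed).
final class OutlinkDatum {
    final String sourceUrl;
    final String outlinkUrl;
    final String anchorText;

    OutlinkDatum(String sourceUrl, String outlinkUrl, String anchorText) {
        this.sourceUrl = sourceUrl;
        this.outlinkUrl = outlinkUrl;
        this.anchorText = anchorText;
    }
}

// Hypothetical ParsedDatum without the outlink array: parse result only.
final class SlimParsedDatum {
    final String url;
    final String parsedText;

    SlimParsedDatum(String url, String parsedText) {
        this.url = url;
        this.parsedText = parsedText;
    }
}

// The two lists model the two tail pipes the parse step would feed.
final class ParseOutput {
    final List<SlimParsedDatum> parsed = new ArrayList<SlimParsedDatum>();
    final List<OutlinkDatum> outlinks = new ArrayList<OutlinkDatum>();

    // One parsed page yields one parse tuple plus one outlink tuple per link.
    // Each link is given as {outlinkUrl, anchorText}.
    void emit(String url, String parsedText, String[][] links) {
        parsed.add(new SlimParsedDatum(url, parsedText));
        for (String[] link : links) {
            outlinks.add(new OutlinkDatum(url, link[0], link[1]));
        }
    }
}
```

The point is only that the parse tuple never carries the links: a page with N outlinks would produce one tuple on the parse tail and N tuples on the outlink tail.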

Based on what I had to do for a similar situation in the FetchPipe,
there's no really simple way to do this. I'll need to append the
outlinks to the parsed result tuple being generated by my Each
operator, then split that pipe and filter each branch: the new pipe
would keep just the outlinks, and the original pipe would filter them
out, leaving just the parse data.
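
Roughly, the workaround amounts to the following, again modelled in plain Java rather than with real Cascading classes. TaggedTuple, its "kind" field, and SplitSketch.keep() are hypothetical names, and tagging each tuple with a kind is just one way I could imagine telling the two sorts of data apart in a single stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tuple in the single combined stream coming out of the parse step.
final class TaggedTuple {
    static final String PARSE = "parse";
    static final String OUTLINK = "outlink";

    final String kind;      // PARSE or OUTLINK
    final String sourceUrl; // page the tuple came from
    final String payload;   // parsed text, or the outlink URL

    TaggedTuple(String kind, String sourceUrl, String payload) {
        this.kind = kind;
        this.sourceUrl = sourceUrl;
        this.payload = payload;
    }
}

final class SplitSketch {
    // Each branch of the split applies this filter with a different kind,
    // so both branches read the whole stream and discard what they don't want.
    static List<TaggedTuple> keep(List<TaggedTuple> stream, String kind) {
        List<TaggedTuple> kept = new ArrayList<TaggedTuple>();
        for (TaggedTuple t : stream) {
            if (t.kind.equals(kind)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

So keep(stream, TaggedTuple.OUTLINK) plays the role of the new pipe and keep(stream, TaggedTuple.PARSE) the original one, which is the part that feels wasteful to me since every tuple gets examined on both branches.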

Or is there some cleaner way?

Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g