Crawling using the scalding framework


Jyotirmoy Sundi

May 25, 2012, 9:36:16 PM
to cascading-user
Hi,
I am looking to crawl a site.
The contents of a page can include:
1. information to extract, and
2. more URLs to crawl (pagination).
I tried to follow the Bixo tutorial
https://github.com/javasoze/bixo/blob/master/src/main/java/bixo/examples/SimpleCrawlWorkflow.java,
but I am wondering if there is anything similar in the Scalding (by
Twitter) framework.

Regards
Sundi

Oscar Boykin

May 26, 2012, 4:22:58 PM
to cascadi...@googlegroups.com
I'm not aware of an example of crawling with scalding. We (Twitter)
use it for log processing. That said, it shouldn't be too hard to
represent the crawl function they have as a map operation:

map('url -> 'crawlData) { url : String => getPageData(url) }

then process the return of getPageData.

Scalding has built-in support for Kryo serialization, so you don't
need to worry about mapping the return of getPageData onto a Tuple if
you don't want to.
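
As a rough illustration of the shape getPageData could take, here is a minimal sketch in plain Scala with no Scalding dependency. PageData, CrawlSketch, fakeSite and the item=/next= page format are all hypothetical names invented for this example; a real crawler would fetch over HTTP and parse HTML instead. The result carries both things the original question asks for: the extracted information and the further URLs to crawl.

```scala
// Hypothetical result type: none of these names come from Scalding or Bixo.
final case class PageData(url: String, extracted: Option[String], nextUrls: List[String])

object CrawlSketch {
  // Assumption: an in-memory stand-in for real HTTP fetches.
  val fakeSite: Map[String, String] = Map(
    "http://example.com/page1" -> "item=A next=http://example.com/page2",
    "http://example.com/page2" -> "item=B"
  )

  private val ItemPattern = """item=(\S+)""".r
  private val NextPattern = """next=(\S+)""".r

  // Plays the role of getPageData(url) from the reply above:
  // one URL in, extracted data plus pagination links out.
  def getPageData(url: String): PageData = {
    val body = fakeSite.getOrElse(url, "")
    val extracted = ItemPattern.findFirstMatchIn(body).map(_.group(1))
    val nextUrls = NextPattern.findAllMatchIn(body).map(_.group(1)).toList
    PageData(url, extracted, nextUrls)
  }

  // One crawl round over a batch of URLs, mirroring a map over a pipe.
  def crawlRound(urls: List[String]): List[PageData] = urls.map(getPageData)
}
```

In a Scalding job, a function like this would be the body of the map shown above, and the nextUrls of one round would feed the next round's input. How the rounds are chained is left open here.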

Hope this helps somewhat.
> --
> You received this message because you are subscribed to the Google Groups "cascading-user" group.
> To post to this group, send email to cascadi...@googlegroups.com.
> To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.
>



--
Oscar Boykin :: @posco :: https://twitter.com/intent/user?screen_name=posco

Jyotirmoy Sundi

May 27, 2012, 5:02:08 AM
to cascading-user
Thanks Oscar, I will try this out and get back to you.


Jyotirmoy Sundi

Jun 1, 2012, 2:03:37 PM
to cascading-user
Hi Oscar,
  May I know when you are planning to update the documentation at
https://github.com/twitter/scalding/wiki/Reading-Writing-Source-and-Sink-Taps
  Regarding my question: when I crawl, I want to make
sure that I don't make any repeated calls. Can I ensure that by reading
the sink tap's file after a crash occurs and before crawling
resumes?
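
The thread does not settle whether this is the right approach within Scalding, but the idea itself can be sketched in plain Scala: collect the URLs already present in the previous run's output, then drop them from the list of URLs still to crawl. ResumeSketch, crawledFrom, pendingUrls and the tab-separated "url, then data" line layout are assumptions made for illustration only.

```scala
object ResumeSketch {
  // Assumption: each line of the previous output is "url<TAB>data".
  // Blank lines are ignored.
  def crawledFrom(sinkLines: Iterator[String]): Set[String] =
    sinkLines.map(_.takeWhile(_ != '\t')).filter(_.nonEmpty).toSet

  // Keep only URLs that were not crawled before, without duplicates,
  // preserving the order in which they first appear.
  def pendingUrls(frontier: Seq[String], alreadyCrawled: Set[String]): Seq[String] =
    frontier.distinct.filterNot(alreadyCrawled.contains)
}
```

This only helps if the output written before the crash is complete and readable. A partially written file could list URLs whose data was never fully saved, so that case would need separate handling.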


Regards
Sundi

Oscar Boykin

Jul 16, 2012, 3:04:33 PM
to cascadi...@googlegroups.com
I actually don't know the answer to this question.

We have a little extension to scalding internally that we haven't pushed that does something close to what you want: we added a new Operation that allows us to access the prepare/cleanup methods of Cascading Each.  We did this with an implicit conversion to a RichPipeExtension class.

The code is attached.  We'll try to add some form of this to the main version of scalding soon.

You could use this to prepare and cleanup your crawler.  For now, this code only does foreach, and just mutates something remote, but you could just as easily have map/flatMap/filter.
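
The attached SideEffect.scala is not reproduced in this thread, so the following is not that code. It is only a plain Scala sketch of the prepare/foreach/cleanup pattern described above, with hypothetical names (SideEffectSketch, foreachWithResource): a resource such as a crawler client is prepared once, used for every item, and cleaned up afterwards even if processing fails.

```scala
object SideEffectSketch {
  // Hypothetical helper, NOT the attached SideEffect.scala.
  // Runs `prepare` once, applies `fn` to each item together with the
  // prepared resource, and always runs `cleanup` at the end.
  def foreachWithResource[R, A](items: Iterator[A])(prepare: () => R)(cleanup: R => Unit)(fn: (R, A) => Unit): Unit = {
    val resource = prepare()
    try items.foreach(item => fn(resource, item))
    finally cleanup(resource)
  }
}
```

A map, flatMap or filter variant would follow the same outline, returning the transformed items instead of Unit.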


SideEffect.scala