Crawling pre-defined list of URLs with non-blocking parallel UserAgent

hammondos

Jul 22, 2012, 5:05:03 AM
to mojol...@googlegroups.com
Hi,

I'm trying to use the non-blocking parallel features of Mojo::UserAgent and Mojo::IOLoop to crawl a pre-defined list of several thousand URLs from a database and place their content into XML files for indexing (i.e. with Perl's Lucy).

I'm completely new to event loops, so I may be making a fundamental error here, but I have tried all of the examples I can find in the documentation and cookbook. The cookbook example comes close to what I want to do, but it always seems to hang once the URLs have been crawled (the example was adjusted to only crawl links from the initial array).

The closest I can come up with is the code below, but for some reason only ~78 URLs (counted from the $tx->success branch) out of an array of 1,300 are actually fetched.

Any help at all would be greatly appreciated!

#!/usr/bin/env perl
use v5.10;
use Mojo::UserAgent;
use Mojo::IOLoop;
use Mojo::URL;
use Mojo::DOM;
use File::Spec::Functions 'catfile';

## fetch URLs from database and load into AoH @urls

my $ua = Mojo::UserAgent->new(
    max_redirects => 8,
    name          => 'xyz',
);
my $delay = Mojo::IOLoop->delay(sub {
    my ($delay) = @_;
});

for my $url_ref (@urls) {
    $delay->begin;
    my ($url_id, $url) = each(%$url_ref);
   
    $ua->get($url => sub {
        my ($ua, $tx) = @_;
        if ($tx->success) {
            # store destination url
            my $uri = $tx->req->url;
               
            # Store title or URL if title not found
            my $title = $tx->res->dom->at('head > title') ? $tx->res->dom->at('head > title')->text : $uri;
       
            # Clean out script and other extraneous tags we don't want turned into text
            my $content = $tx->res->body;
            $content =~ s!<(script|style|iframe)[^>]*>.*?</\1>!!gis;
               
            # Turn back into DOM object to retrieve text
            my $dom = Mojo::DOM->new($content);
            my $clean_content = $dom->all_text;
            $clean_content =~ s![<>"',.&*\!$()^]! !g;
       
            # fetch headings
            my $headings = '';
            $tx->res->dom('h1, h2')->each(sub {
                $headings .= shift->all_text . " ";
            });
               
            ## url crawled successfully, update db
           
            # print content to xml file for indexing
            # ($corpus_source is assumed to be set up along with the database code above)
            my $ts = time;
            open(XML, '>:utf8', catfile($corpus_source, "$ts-$url_id.xml")) or die $!;
            print XML "<uri>$uri</uri><content>$clean_content</content><headings>$headings</headings><title>$title</title>";
            close XML;
           
        } else {
            ## error crawling url, update database
        }
        $delay->end;
    });
}
$delay->wait unless Mojo::IOLoop->is_running;

Gabriel Vieira

Jul 22, 2012, 9:09:39 AM
to mojol...@googlegroups.com
No error?

--
Gabriel Vieira

hammondos

Jul 22, 2012, 12:58:45 PM
to mojol...@googlegroups.com
Not that I can see. It appears as if the script has executed successfully, but when you look at the URLs actually processed it's only a small subset.

Gabriel Vieira

Jul 22, 2012, 2:31:05 PM
to mojol...@googlegroups.com
Try debugging by printing the URL that is being processed... Maybe some of them have no body and the execution stops.
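
For example, a single warn at the top of the existing get callback will show every transaction as it completes (just a sketch, using the variables already in scope in your script):

$ua->get($url => sub {
    my ($ua, $tx) = @_;

    # debug: show which URL this callback fired for and whether it succeeded
    warn sprintf "callback for %s (%s)\n",
        $tx->req->url, $tx->success ? 'success' : 'no success';

    # ... rest of the callback unchanged ...
});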

On Sun, Jul 22, 2012 at 1:58 PM, hammondos <robbie...@gmail.com> wrote:
> Not that I can see. It appears as if the script has executed successfully, but when you look at the URLs actually processed it's only a small subset.

hammondos

Jul 23, 2012, 2:43:30 AM
to mojol...@googlegroups.com
Thanks, I have tried that; the first thousand or so URLs enter the "get" loop, but only the very last 70-80 in the queue actually get processed, which leads me to think perhaps it's something to do with the Mojo::IOLoop function rather than the URLs themselves?


On Sunday, July 22, 2012 7:31:05 PM UTC+1, Gabriel Vieira wrote:
Try debugging by printing the URL that is being processed... Maybe some of them have no body and the execution stops.

Skye Shaw!@#$

Jul 23, 2012, 7:49:57 PM
to mojol...@googlegroups.com


On Sunday, July 22, 2012 11:43:30 PM UTC-7, hammondos wrote:
Thanks, I have tried that; the first thousand or so URLs enter the "get" loop, but only the very last 70-80 in the queue actually get processed, which leads me to think perhaps it's something to do with the Mojo::IOLoop function rather than the URLs themselves?

You can set MOJO_USERAGENT_DEBUG to 1, though be warned: it outputs the HTML too.

Maybe you're hitting an open file descriptor limit opening 1300 URLs like that.      
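
For example (assuming a bash-like shell and that the script is saved as crawler.pl):

MOJO_USERAGENT_DEBUG=1 perl crawler.pl 2> debug.log    # dump the raw client traffic to a log
ulimit -n                                              # show the per-process open file descriptor limit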

hammondos

Jul 24, 2012, 4:18:02 AM
to mojol...@googlegroups.com
The MOJO_USERAGENT_DEBUG option showed connection errors. However, I've gone back to the cookbook example and modified it some more, and it seems to be running and exiting OK now:

my $ua = Mojo::UserAgent->new(
    max_redirects      => 8,
    name               => 'xyz',
    inactivity_timeout => 15,
);
my $delay = Mojo::IOLoop->delay(sub {
    my ($delay) = @_;
});

# Crawler
my $crawl;
$crawl = sub {
    my $id = shift;

    # If there are no more URLs in the queue, this crawler simply stops
    # (begin is only called once a URL has actually been dequeued, so the
    # delay can still finish cleanly)
    return unless my $url_ref = shift @urls;
    $delay->begin;


    my ($url_id, $url) = each(%$url_ref);

    $ua->get($url => sub {
        my ($ua, $tx) = @_;

        if ($tx->success) {
            # store destination url
            my $uri = $tx->req->url;

            # Store title or URL if title not found
            my $title = $tx->res->dom->at('head > title') ? $tx->res->dom->at('head > title')->text : $uri;
       
            # Clean out script and other extraneous tags we don't want turned into text
            my $content = $tx->res->body;
            $content =~ s!<(script|style|iframe)[^>]*>.*?</\1>!!gis;
               
            # Turn back into DOM object to retrieve text
            my $dom = Mojo::DOM->new($content);
            my $clean_content = $dom->all_text;
            $clean_content =~ s![<>"',.&*\!$()^]! !g;
       
            # fetch headings
            my $headings = '';
            $tx->res->dom('h1, h2')->each(sub {
                $headings .= shift->all_text . " ";
            });
               
            ## Update DB
           
            my $ts = time;

            open(XML,">:utf8", catfile( $corpus_source, "$ts-$url_id.xml") ) or die $!;
            print XML "<uri>$uri</uri><url_id>$url_id</url_id><title>$title</title><headings>$headings</headings><content>$clean_content</content>";
            close XML;
         
        } else {
            my ($message, $code) = $tx->error;
            say $code ? "$code response: $message" : "Connection error: $message";
            ## Update DB
        }
        $delay->end;
        # Next - *only* if there's more to crawl
        if (@urls) {
            $crawl->($id);
        }
    });
};

# Start a bunch of parallel crawlers sharing the same user agent
$crawl->($_) for 1 .. 10;

# Start reactor if necessary

$delay->wait unless Mojo::IOLoop->is_running;
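
To double-check that every URL really gets handled, the otherwise-empty delay callback is a handy place for a summary. A minimal sketch, assuming a $done counter that is incremented with $done++ right before each $delay->end, and $total captured before the workers start shifting from @urls:

my $done  = 0;               # assumed: bumped just before every $delay->end
my $total = scalar @urls;    # capture the queue size before any worker shifts from it

my $delay = Mojo::IOLoop->delay(sub {
    my ($delay) = @_;
    say "Finished: processed $done of $total URLs";
});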