Crawling pre-defined list of URLs with non-blocking parallel UserAgent

hammondos

Jul 22, 2012, 5:05:03 AM
to mojol...@googlegroups.com
Hi,

I'm trying to use the non-blocking parallel features of Mojo::UserAgent and Mojo::IOLoop to crawl a pre-defined list of several thousand URLs from a database and place their content into XML files for indexing (i.e. with Perl's Lucy).

I'm completely new to event loops, so I may be making a fundamental error here, but I have tried all of the examples I can find in the documentation and cookbook. The cookbook example comes close to what I want to do, but it always seems to hang once the URLs have been crawled (the example was adjusted to only crawl links from the initial array).

The closest I can come up with is the code below, but for some reason only ~78 URLs (counted from the $tx->success branch) out of an array of 1,300 are actually fetched.

Any help at all would be greatly appreciated!

#!/usr/bin/env perl
use v5.10;
use Mojo::UserAgent;
use Mojo::IOLoop;
use Mojo::URL;
use Mojo::DOM;
use File::Spec::Functions 'catfile';

## fetch URLs from database and load into AoH @urls

my $ua = Mojo::UserAgent->new(
    max_redirects => 8,
    name          => 'xyz',
);
my $delay = Mojo::IOLoop->delay(sub {
    my ($delay) = @_;
});

for my $url_ref (@urls) {
    $delay->begin;
    my ($url_id, $url) = each(%$url_ref);
   
    $ua->get($url => sub {
        my ($ua, $tx) = @_;
        if ($tx->success) {
            # store destination url
            my $uri = $tx->req->url;
               
            # Store title or URL if title not found
            my $title = $tx->res->dom->at('head > title') ? $tx->res->dom->at('head > title')->text : $uri;
       
            # Clean out script and other extraneous tags we don't want turned into text
            my $content = $tx->res->body;
            $content =~ s!<(script|style|iframe)[^>]*>.*?</\1>!!gis;
               
            # Turn back into DOM object to retrieve text
            my $dom = Mojo::DOM->new($content);
            my $clean_content = $dom->all_text;
            $clean_content =~ s![<>"',.&*\!$()^]! !g;
       
            # fetch headings
            my $headings = '';
            $tx->res->dom('h1, h2')->each(sub {
                $headings .= shift->all_text . " ";
            });
               
            ## url crawled successfully, update db
           
            # print content to xml file for indexing
            # ($corpus_source is assumed to be set up along with the database code above)
            my $ts = time;
            open(XML, '>:utf8', catfile($corpus_source, "$ts-$url_id.xml")) or die $!;
            print XML "<uri>$uri</uri><content>$clean_content</content><headings>$headings</headings><title>$title</title>";
            close XML;
           
        } else {
            ## error crawling url, update database
        }
        $delay->end;
    });
}
$delay->wait unless Mojo::IOLoop->is_running;

Gabriel Vieira

Jul 22, 2012, 9:09:39 AM
to mojol...@googlegroups.com
No error?

--
Gabriel Vieira

hammondos

Jul 22, 2012, 12:58:45 PM
to mojol...@googlegroups.com
Not that I can see. It appears as if the script has executed successfully, but when you look at the URLs actually processed it's only a small subset.

Gabriel Vieira

Jul 22, 2012, 2:31:05 PM
to mojol...@googlegroups.com
Try debugging by printing the URL that is being processed... Maybe some of them have no body and the execution stops.
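
For example, a single warn at the top of the existing get callback will show every transaction as it completes (just a sketch, using the variables already in scope in your script):

$ua->get($url => sub {
    my ($ua, $tx) = @_;

    # debug: show which URL this callback fired for and whether it succeeded
    warn sprintf "callback for %s (%s)\n",
        $tx->req->url, $tx->success ? 'success' : 'no success';

    # ... rest of the callback unchanged ...
});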

On Sun, Jul 22, 2012 at 1:58 PM, hammondos <robbie...@gmail.com> wrote:
> Not that I can see. It appears as if the script has executed successfully, but when you look at the URLs actually processed it's only a small subset.

hammondos

Jul 23, 2012, 2:43:30 AM
to mojol...@googlegroups.com
Thanks, I have tried that; the first thousand or so URLs enter the "get" loop, but only the very last 70-80 in the queue actually get processed, which leads me to think perhaps it's something to do with the Mojo::IOLoop function rather than the URLs themselves?


On Sunday, July 22, 2012 7:31:05 PM UTC+1, Gabriel Vieira wrote:
Try debugging by printing the URL that is being processed... Maybe some of them have no body and the execution stops.

Skye Shaw!@#$

Jul 23, 2012, 7:49:57 PM
to mojol...@googlegroups.com


On Sunday, July 22, 2012 11:43:30 PM UTC-7, hammondos wrote:
Thanks, I have tried that; the first thousand or so URLs enter the "get" loop, but only the very last 70-80 in the queue actually get processed, which leads me to think perhaps it's something to do with the Mojo::IOLoop function rather than the URLs themselves?

You can set MOJO_USERAGENT_DEBUG to 1, though be warned: it outputs the HTML too.

Maybe you're hitting an open file descriptor limit opening 1300 URLs like that.      
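
For example (assuming a bash-like shell and that the script is saved as crawler.pl):

MOJO_USERAGENT_DEBUG=1 perl crawler.pl 2> debug.log    # dump the raw client traffic to a log
ulimit -n                                              # show the per-process open file descriptor limit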

hammondos

Jul 24, 2012, 4:18:02 AM
to mojol...@googlegroups.com
The MOJO_USERAGENT_DEBUG option showed connection errors. However, I've gone back to the cookbook example and modified it some more, and it seems to be running and exiting OK now:

my $ua = Mojo::UserAgent->new(
    max_redirects      => 8,
    name               => 'xyz',
    inactivity_timeout => 15,
);
my $delay = Mojo::IOLoop->delay(sub {
    my ($delay) = @_;
});

# Crawler
my $crawl;
$crawl = sub {
    my $id = shift;

    # If there are no more URLs in the queue, this crawler simply stops
    # (begin is only called once a URL has actually been dequeued, so the
    # delay can still finish cleanly)
    return unless my $url_ref = shift @urls;
    $delay->begin;


    my ($url_id, $url) = each(%$url_ref);

    $ua->get($url => sub {
        my ($ua, $tx) = @_;

        if ($tx->success) {
            # store destination url
            my $uri = $tx->req->url;

            # Store title or URL if title not found
            my $title = $tx->res->dom->at('head > title') ? $tx->res->dom->at('head > title')->text : $uri;
       
            # Clean out script and other extraneous tags we don't want turned into text
            my $content = $tx->res->body;
            $content =~ s!<(script|style|iframe)[^>]*>.*?</\1>!!gis;
               
            # Turn back into DOM object to retrieve text
            my $dom = Mojo::DOM->new($content);
            my $clean_content = $dom->all_text;
            $clean_content =~ s![<>"',.&*\!$()^]! !g;
       
            # fetch headings
            my $headings = '';
            $tx->res->dom('h1, h2')->each(sub {
                $headings .= shift->all_text . " ";
            });
               
            ## Update DB
           
            my $ts = time;

            open(XML,">:utf8", catfile( $corpus_source, "$ts-$url_id.xml") ) or die $!;
            print XML "<uri>$uri</uri><url_id>$url_id</url_id><title>$title</title><headings>$headings</headings><content>$clean_content</content>";
            close XML;
         
        } else {
            my ($message, $code) = $tx->error;
            say $code ? "$code response: $message" : "Connection error: $message";
            ## Update DB
        }
        $delay->end;
        # Next - *only* if there's more to crawl
        if (@urls) {
            $crawl->($id);
        }
    });
};

# Start a bunch of parallel crawlers sharing the same user agent
$crawl->($_) for 1 .. 10;

# Start reactor if necessary

$delay->wait unless Mojo::IOLoop->is_running;
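
To double-check that every URL really gets handled, the otherwise-empty delay callback is a handy place for a summary. A minimal sketch, assuming a $done counter that is incremented with $done++ right before each $delay->end, and $total captured before the workers start shifting from @urls:

my $done  = 0;               # assumed: bumped just before every $delay->end
my $total = scalar @urls;    # capture the queue size before any worker shifts from it

my $delay = Mojo::IOLoop->delay(sub {
    my ($delay) = @_;
    say "Finished: processed $done of $total URLs";
});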