Mozrepl timeouts in a web-thumbnail scraper with WWW::Mechanize::Firefox


matze

Feb 24, 2012, 2:25:09 AM2/24/12
to WWW::Mechanize users
Good day, dear list,


First of all, I apologize for asking a question that might have been
asked a million times before. I have some problems with mozrepl
timeouts in a web-thumbnail scraper that runs on an openSUSE Linux box.


I am trying to find a better solution, either in Ruby, Python, or PHP,
but if you have ideas to rework the Perl script I would be glad too.
The question: is there a way to specify the Net::Telnet timeout with
WWW::Mechanize::Firefox?
At the moment my internet connection [quite a fast DSL one] is very
slow, and sometimes I get this error

with $mech->get():
command timed-out at /usr/local/share/perl/5.12.3/MozRepl/Client.pm
line 186

I tried this one: $mech->repl->repl->timeout(100000);

Unfortunately it does not work: Can't locate object method "timeout"
via package "MozRepl"

The documentation says this should work:

$mech->repl->repl->setup_client( { extra_client_args => { timeout => 180 } } );
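
Here is a minimal sketch of how I would combine that call with my
script; my assumption is that setup_client has to run before the
first get() so the underlying Net::Telnet connection is actually
created with the longer timeout:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

# assumption: raise the Net::Telnet timeout to 180 seconds
# before the first page is fetched
$mech->repl->repl->setup_client(
    { extra_client_args => { timeout => 180 } }
);

$mech->get('http://www.google.com');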

The problem: I have a list of 2500 websites and need to grab a
thumbnail screenshot (!) of each of them. How do I do that?
I could try to parse the sites with Perl; Mechanize would be a good
thing for that.
Note: I only need the results as thumbnails that are a maximum of 240
pixels in the long dimension.
At the moment I have a solution which is slow and does not give back
thumbnails. How can I make the script run faster, with less overhead,
and spit out the thumbnails?


My prerequisites: the MozRepl Firefox add-on,
the module WWW::Mechanize::Firefox,
the module Imager.


This is my source; below is a snippet [example] of the sites I have in
the URL list.

urls.txt [the list of sources in a file]:

www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com ... and so on and so forth.


What I have tried already:



#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open( my $input, '<', 'urls.txt' ) or die $!;

while (<$input>) {
    chomp;
    print "$_\n";
    $mech->get($_);
    my $png  = $mech->content_as_png();
    my $name = $_;
    $name =~ s/^www\.//;
    $name .= '.png';
    open( my $output, '>', $name ) or die "can't write $name: $!";
    binmode $output;              # PNG data is binary
    print $output $png;
    close($output);
    sleep(5);
}

close($input);

Well, this does not take care of the size:

See the output commandline:

linux-vi17:/home/martin/perl # perl mecha_test_1.pl
www.google.com
www.cnn.com
www.msnbc.com
command timed-out at /usr/lib/perl5/site_perl/5.12.3/MozRepl/Client.pm
line 186
linux-vi17:/home/martin/perl #


Question: how can I extend the solution to make sure that it does not
stop on a timeout?
Note again: I only need the results as thumbnails that are a maximum
of 240 pixels in the long dimension.
As a prerequisite, I have already installed the module Imager.
How can I make the script run faster, with less overhead, and spit out
the thumbnails?
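
For the scaling part, here is a sketch of how I would expect Imager to
shrink each screenshot so that the long dimension is at most 240
pixels; $png and $name are the variables from the loop above, and
type => 'min' should pick the scale factor that fits the image inside
the 240x240 box. I have not tested this at scale:

use Imager;

# $png holds the raw PNG data from $mech->content_as_png()
my $img = Imager->new();
$img->read( data => $png, type => 'png' )
    or die $img->errstr;

# scale so the longer side becomes at most 240 pixels
my $thumb = $img->scale(
    xpixels => 240,
    ypixels => 240,
    type    => 'min',
);

$thumb->write( file => $name, type => 'png' )
    or die $thumb->errstr;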

Update: in addition to the above there is a PerlMonks thread:
perlmonks.org/?node_id=901572


I also tried out this one:

$mech->repl->repl->setup_client( { extra_client_args => { timeout => 5*60 } } );

putting the links into @list and using eval:

while ( scalar(@list) ) {
    my $link = pop(@list);
    print "trying $link\n";
    eval {
        $mech->get($link);
        sleep(5);
        my $png  = $mech->content_as_png();
        my $name = $link;             # $_ is not set here, use $link
        $name =~ s/^www\.//;
        $name .= '.png';
        open( my $output, '>', $name ) or die "can't write $name: $!";
        binmode $output;              # PNG data is binary
        print $output $png;
        close($output);
    };                                # eval block needs a trailing semicolon
    if ($@) {
        print "link: $link failed\n";
        unshift( @list, $link );      # put it back at the end of the queue
        next;
    }
    print "$link is done!\n";
}
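
One thing I am not sure about in that loop: a permanently dead link
would be retried forever. Here is a sketch of how I would cap the
retries, using a hypothetical %attempts hash to count the failures per
link (the eval body would save the thumbnail exactly as above):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
my @list = qw( www.google.com www.cnn.com );

my %attempts;    # hypothetical: counts the failures per link

while ( scalar(@list) ) {
    my $link = pop(@list);
    print "trying $link\n";
    eval {
        $mech->get($link);
        # ... save the thumbnail exactly as in the loop above ...
    };
    if ($@) {
        print "link: $link failed\n";
        # give up after three failures instead of retrying forever
        unshift( @list, $link ) if ++$attempts{$link} < 3;
        next;
    }
    print "$link is done!\n";
}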


Question: is there a Ruby / Python / PHP solution that runs more
efficiently, or can you suggest a Perl solution that is more stable?


I look forward to hearing from you.

greetings
martin
