Scraper, simple script


Tin Woodman

Jan 19, 2017, 3:58:42 AM
to Mojolicious
Hi, I'm a beginner in Mojo. I have some experience with PHP, so sorry if I use it to explain what I mean.
I'm developing a simple scraper script for myself.
Part 1: Scrape root elements
I have a page with URLs and titles in an HTML table.
my $res=
  Mojo::UserAgent->new->get('http://example.com')->res->dom;

I need to grab all of these elements from the page,
but the only example I found on the internet is this one, which gets only the text of each element:
my $texts =
  $res->find('.tdcont td a')->map(sub { $_->text });

I need to build an array or something similar, or maybe a CSV file.
How can I write this when I need to save two or more pieces of data from each element? For example:
url;title;
url2;title2;
or in PHP (sorry):
array(array('url','title'),array('url2','title2'))

Part 2: Scrape child elements
Once that array (or other data structure) exists, I need to run another scrape in a loop.
For example (PHP):
foreach ($data as $item) {
  $url = $item[0];
  $title = $item[1];
  // here I need to parse the elements:
  // go to the url
  doParseChild();
  // here I need an example of how to check whether an element exists on the page
  if (pagination exists) {
    foreach ($pages as $page) {
      doParseChild();
    }
  }
}

When the first iteration of the loop has finished, it goes on to the second iteration, and so on.

Please help me, at least with a general understanding. Sorry for my bad English and for the PHP.

mimosinnet

Jan 19, 2017, 5:18:34 PM
to Mojolicious
Your question inspired this code, based on a post by Joel Berger, which finds the a tags that have an href attribute in a page (I am also a beginner ;-) ).

#!/usr/bin/env perl
use Modern::Perl;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new;
my $page = $ua->get('http://mojolicious.org/')->res->dom;
my $dom = Mojo::DOM->new($page);

foreach my $a_href ( $dom->find('a[href]')->each ) {
  say $a_href;
}

I am sorry it does not answer your question, but I hope it helps. 

Cheers! 

On Thursday, 19 January 2017 at 9:58:42 UTC+1, Tin Woodman wrote:

Joel Berger

Jan 20, 2017, 10:23:04 AM
to Mojolicious
You don't need to do $dom = Mojo::DOM->new($page); at all. ->res->dom returns an instance of Mojo::DOM, so your $page was already a Mojo::DOM object. What you did next was re-serialize the DOM object back to an HTML string and then re-parse it! Of course it works, but it's ... let's say ... not very efficient :-P
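
A quick way to see this, as a minimal sketch: it parses an inline HTML string instead of fetching a page, and the markup is made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Stand-in for what ->res->dom returns: an already parsed Mojo::DOM object.
my $page = Mojo::DOM->new('<p><a href="/docs">Docs</a></p>');

# Passing $page to new() stringifies it back to HTML and parses that again.
my $dom = Mojo::DOM->new($page);

# Class of the original object, before any second parse.
print ref($page), "\n";

# Both objects serialize to the same markup, so the second parse added nothing.
print "$page" eq "$dom" ? "same HTML\n" : "different HTML\n";

# find() can be called on $page directly.
print $page->find('a[href]')->map(attr => 'href')->join("\n"), "\n";
```

So anything you would call on $dom you can call on $page straight away.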

mimosinnet

Jan 20, 2017, 4:03:24 PM
to Mojolicious
Oops! You're right! Thanks! This would be the right script:

#!/usr/bin/env perl
use Modern::Perl;
use Mojo::UserAgent;
use Mojo::DOM;


my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('http://mojolicious.org/')->res->dom;

foreach my $a_href ( $dom->find('a[href]')->each ) {
  say $a_href;
}
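
To get a bit closer to the original question (url and title pairs, and checking whether an element exists), here is a rough sketch along the same lines. I am assuming the '.tdcont td a' selector from the first post; the HTML, the .pagination class and the variable names are made up for illustration, and the inline markup stands in for a page fetched with $ua->get.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up markup standing in for $ua->get($url)->res->dom.
my $dom = Mojo::DOM->new(<<'HTML');
<table class="tdcont">
  <tr><td><a href="/one">First title</a></td></tr>
  <tr><td><a href="/two">Second title</a></td></tr>
</table>
<div class="pagination"><a href="/page/2">2</a></div>
HTML

# Part 1: one [url, title] pair per link, collected into a plain Perl array.
my @rows = $dom->find('.tdcont td a')
               ->map(sub { [ $_->attr('href'), $_->text ] })
               ->each;

# Print "url;title;" lines, like the format asked for.
print join(';', @$_), ";\n" for @rows;

# Part 2: at() returns undef when nothing matches,
# so it can serve as an "does this element exist?" check.
if (my $pager = $dom->at('.pagination')) {
    my @pages = $pager->find('a[href]')->map(attr => 'href')->each;
    print "next page: $_\n" for @pages;
}
```

Each entry of @rows could then be used in a loop to fetch its own page with the same user agent. The join(';', ...) output is only a naive stand-in for CSV: values that contain a semicolon or quotes would need proper escaping.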

Cheers!

On Friday, 20 January 2017 at 16:23:04 UTC+1, Joel Berger wrote: