Scraper, simple script

Tin Woodman

Jan 19, 2017, 3:58:42 AM

to Mojolicious

Hi, I'm a beginner in Mojo. I have some experience with PHP, so sorry if I use it to explain myself.

I'm developing a simple scraper script for myself.

Part 1: Scrape root elements

I have a page with URLs and titles in an HTML table.

my $res=
  Mojo::UserAgent->new->get('http://example.com')->res->dom;

I need to grab all the elements from this page,

but I could only find this example on the internet, and it gets only the text of each element:

my $texts =
  $res->find('.tdcont td a')->map(sub { $_->text });

I need to create an array or something similar, maybe a CSV file.

How can I write the statement when I need to save two or more pieces of data from one element? For example:

url;title;
url2;title2;

or in PHP (sorry):

array(array('url','title'),array('url2','title2'))
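
One possible way to build such a list of pairs with Mojo::DOM is to return an array reference from the map callback instead of a plain string. The sketch below is only an illustration: it parses an inline HTML snippet instead of fetching a page, and the markup (a table with class tdcont) is an assumption based on the selector used above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Inline stand-in for the real page; this markup is an assumption.
my $dom = Mojo::DOM->new(<<'HTML');
<table class="tdcont">
  <tr><td><a href="/first">First title</a></td></tr>
  <tr><td><a href="/second">Second title</a></td></tr>
</table>
HTML

# Build one array reference [url, title] per matching link.
my @rows = $dom->find('.tdcont td a')
               ->map(sub { [ $_->attr('href'), $_->text ] })
               ->each;

# Print "url;title;" lines, in the shape asked for in the question.
say join(';', @$_), ';' for @rows;
```

With a live page, $dom would come from Mojo::UserAgent->new->get($url)->res->dom instead of the inline string. For real CSV output with quoting, a module such as Text::CSV is probably a safer choice than join.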

Part 2: Scrape child elements

Once I have an array or some other data structure, I need to run another scrape in a loop.

For example (PHP):

foreach ($data as $item) {
  $url = $item[0];
  $title = $item[1];
  // here I need to parse the elements:
  // go to the url
  doParseChild();
  // here I need an example of how to check whether an element exists on the page
  if (pagination exists) {
    foreach ($pages as $page) {
      doParseChild();
    }
  }
}

When the first iteration of the loop ends, it goes on to the second iteration, and so on.

Please help me, at least with a general understanding. Sorry for the bad English and the PHP.
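
A rough Perl sketch of the loop described in Part 2 might look like the following. It is not a definitive implementation: %pages, fetch_dom, parse_child and the .pagination selector are all hypothetical names chosen for illustration, and the pages are inline strings so the example runs without a network. The one Mojo::DOM behaviour it relies on is that at() returns undef when nothing matches, which gives a simple existence check.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical offline pages keyed by URL, so the sketch needs no network.
my %pages = (
  '/first'        => '<h1>First</h1><div class="pagination"><a href="/first?page=2">2</a></div>',
  '/first?page=2' => '<h1>First, page 2</h1>',
  '/second'       => '<h1>Second</h1>',
);

# Stand-in for $ua->get($url)->res->dom in a real script.
sub fetch_dom { Mojo::DOM->new($pages{ $_[0] } // '') }

# Hypothetical child parser: text of the first <h1>, or undef if there is none.
sub parse_child {
  my ($dom) = @_;
  my $h1 = $dom->at('h1');    # at() returns undef when nothing matches
  return $h1 ? $h1->text : undef;
}

my @data = ( [ '/first', 'First title' ], [ '/second', 'Second title' ] );
my @found;

for my $item (@data) {
  my ($url, $title) = @$item;
  my $dom = fetch_dom($url);
  push @found, [ $title, parse_child($dom) ];

  # Existence check: follow page links only if a pagination block is present.
  if ($dom->at('.pagination')) {
    for my $link ($dom->find('.pagination a[href]')->each) {
      push @found, [ $title, parse_child(fetch_dom($link->attr('href'))) ];
    }
  }
}

say join(' => ', @$_) for @found;
```

In a real scraper the hrefs found on a page are often relative, so they would likely need to be resolved against the page URL before being fetched.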

mimosinnet

Jan 19, 2017, 5:18:34 PM

to Mojolicious

Your question inspired me to write this code, based on a post by Joel Berger, which finds the a tags that have an href attribute in a page (I am also a beginner ;-) ).

#!/usr/bin/env perl
use Modern::Perl;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new;
my $page = $ua->get('http://mojolicious.org/')->res->dom;
my $dom = Mojo::DOM->new($page);

foreach my $a_href ( $dom->find('a[href]')->each ) {
 say $a_href;
}

I am sorry it does not answer your question, but I hope it helps.

Cheers!


Joel Berger

Jan 20, 2017, 10:23:04 AM

to Mojolicious

You don't need to do $dom = Mojo::DOM->new($page); at all. ->res->dom returns an instance of Mojo::DOM, so your $page was already a Mojo::DOM. What you did next was re-serialize the DOM object back to an HTML string and then REparse it! Of course it works, but it's, let's say, not very efficient :-P
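
The point can be seen in a small offline sketch. The inline HTML string here is only a stand-in for the object that ->res->dom returns; everything else uses plain Mojo::DOM calls.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Stand-in for $ua->get(...)->res->dom, which already returns a Mojo::DOM object.
my $page = Mojo::DOM->new('<p><a href="/docs">Docs</a></p>');
say ref $page;    # Mojo::DOM

# Passing the object to new() stringifies it to HTML and parses that HTML again.
my $dom = Mojo::DOM->new($page);
say "$dom" eq "$page" ? 'same markup' : 'different markup';

# So the original object can simply be queried directly.
my $href = $page->at('a')->attr('href');
say $href;
```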

mimosinnet

Jan 20, 2017, 4:03:24 PM

to Mojolicious

Oops! You're right! Thanks! This would be the right script:

#!/usr/bin/env perl
use Modern::Perl;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('http://mojolicious.org/')->res->dom;

foreach my $a_href ( $dom->find('a[href]')->each ) {
 say $a_href;
}

Cheers!

