fetching webpage and extracting contents

alfonsobaldaserra

Oct 4, 2010, 6:33:25 AM
hello

i am trying to write a script which will go to bbc's top 40 pages and
show only intended contents.

i have written a script

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, ">", "bbc.txt" or die "$!\n";
    print $bbc $res->decoded_content;
    close $bbc;
} else {
    die "could not fetch bbc.co.uk\n";
}

open my $bbc, "<", "bbc.txt";
while (<$bbc>) {
    print if m!<span class="artist">(.*)</span>!;
    print if m!<span class="track">(.*)</span>!;
    #next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
    #my ($foo) =~ m!<span class="artist">(.*)</span>!;
    #my ($bar) =~ m!<span class="track">(.*)</span>!;
    # print "$foo -> $bar\n";
}

__RESULT__
<span class="artist">Tinie Tempah</span>
<span class="track">Written In The Stars</span>
<span class="artist">Bruno Mars</span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">Labrinth</span>
<span class="track">Let The Sun Shine</span>
<span class="artist">Adele</span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>

but i can't figure out

#1 how to parse $res->decoded_content without writing it to a file
because apparently the whole page is a single string

#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars

#3 how to make this work
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n"

appreciate your time gents.

salute :)

alfonsobaldaserra

Oct 5, 2010, 3:24:27 AM
> #1 how to parse $res->decoded_content without writing it to a file
> because apparently the whole page is a single string

got it fixed by opening a fh to $res->decoded_content

> #2 how to show data in artist - track format, like
> Tinie Tempah - Written In The Stars


so the new code is

#!/usr/bin/perl

use strict;
#use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, "<", \$res->decoded_content or die "$!\n";
    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        my ($artist) = $con =~ m!<span class="artist">(.*?)</span>!;
        my ($track) = $con =~ m!<span class="track">(.*?)</span>!;
        print "$artist - $track\n";
    }
} else {
    die "could not fetch bbc.co.uk\n";
}


but the output is coming as

Tinie Tempah -


- Written In The Stars

Bruno Mars -
- Just The Way You Are (Amazing)
Labrinth -
- Let The Sun Shine
Adele -
- Make You Feel My Love

while it should have been

Tinie Tempah - Written In The Stars

Bruno Mars - Just The Way You Are (Amazing)
Labrinth - Let The Sun Shine
Adele - Make You Feel My Love

i cant figure out why this is happening.

any help guys?

thanku :)

alfonsobaldaserra

Oct 5, 2010, 4:13:03 AM
i got a real bad code working :)

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
    open my $bbc, "<", \$res->decoded_content or die "$!\n";
    while (defined (my $con = <$bbc>)) {
        chomp $con;
        next if $con =~ /^\s*$/;
        next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
        $con =~ s/^\s*|\s*$//g;
        if ($con =~ m!<span class="artist">(.*)</span>!) {
            print $1, " - ";
        } elsif ($con =~ m!<span class="track">(.*)</span>!) {
            print $1, "\n";
        }
    }
}


thank you gents for giving me a chance to do it myself.

though i am still looking for any improvements that you could
suggest :-)

Peter Makholm

Oct 5, 2010, 4:39:33 AM
alfonsobaldaserra <alfonso.b...@gmail.com> writes:

> i got a real bad code working :)
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use LWP::UserAgent;
>
> my $ua = LWP::UserAgent->new;
> $ua->timeout(10);
> $ua->env_proxy;
>
> my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
>
> if ($res->is_success) {
> open my $bbc, "<", \$res->decoded_content or die "$!\n";

Don't do this. While possible, it is kind of obscure and should in my
opinion only be used when existing interfaces require a perl file
handle.

Just split the content on newlines if you want to iterate over the
lines.
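
A minimal sketch of that, assuming the page body is already in a scalar.
The sample string and the content_lines helper are made up for
illustration and are not from the thread:

```perl
use strict;
use warnings;

# Hypothetical stand-in for $res->decoded_content.
my $content = "first line\nsecond line\r\n\nfourth line\n";

# Split on LF or CRLF line endings; no file handle is needed.
sub content_lines {
    my ($text) = @_;
    return split /\r?\n/, $text;
}

for my $line ( content_lines($content) ) {
    next if $line =~ /^\s*$/;    # skip blank lines
    print "$line\n";
}
```

The loop body can then run the same per-line checks as before.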

> while (defined (my $con = <$bbc>)) {
> chomp $con;
> next if $con =~ /^\s*$/;
> next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
> $con =~ s/^\s*|\s*$//g;
> if ($con =~ m!<span class="artist">(.*)</span>!) {
> print $1, " - ";
> } elsif ($con =~ m!<span class="track">(.*)</span>!) {
> print $1, "\n";
> }

Don't parse HTML by throwing naive regexps at the problem. This would
fail horribly if BBC decided to remove unneeded newlines from their
content.

> }
> }

I would rather use one of the existing HTML parsing modules. One
option could be HTML::TreeBuilder. Based on a quick read of the
documentation it would look something like this:

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content( $res->decoded_content );
for my $tag ( $html->find('span') ) {
    # attr() returns undef for spans that have no class attribute
    my $class = $tag->attr('class') // '';

    if ( $class eq 'artist' ) {
        ...;
    } elsif ( $class eq 'track' ) {
        ...;
    }
}

This would be a much more robust solution. (But I don't parse HTML in
my day to day work, so I might not be up to date on the current set of
HTML parsers.)

//Makholm

s...@netherlands.com

Oct 5, 2010, 12:01:13 PM

Along the lines of what you are doing, something like below.
-sln
-----------
use strict;
use warnings;

my $string =<<EOHTML;
<html>


<span class="artist">
Tinie Tempah
</span>
<span class="track">
Written In The Stars
</span>
<span class="artist"> Bruno Mars </span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">
Labrinth</span>
<span class="track">Let The Sun Shine
</span>

<span class="track">A song by Labrinth</span>


<span class="artist">Adele </span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>

</html>
EOHTML
my $artist;

while ( $string =~
    / <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
        \s* (.*?) \s*
      <\/span\s*>
    /xsig )
{
    if ($1 eq 'artist') {
        $artist = $2;
    }
    else {
        if (length $artist) {
            print "$artist - $2\n";
        }
        $artist = '';
    }
}
print "\n";

## Alternate -
##

$artist = '';
my %tracks;

while ( $string =~
    / <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
        \s* (.*?) \s*
      <\/span\s*>
    /xsig )
{
    if ($1 eq 'artist') {
        $artist = $2;
    }
    else {
        push @{ $tracks{$artist} }, $2;
    }
}

for $artist (sort keys %tracks) {
    print "\n$artist\n";
    for my $track ( sort @{ $tracks{$artist} } ) {
        print " - $track\n"
    }
}

alfonsobaldaserra

Oct 6, 2010, 1:35:01 AM
thank you for such beautiful codes sln.

though i am inclined towards peter's advice to use html parsers.
unfortunately, i couldn't get your code to work due to lack of usage
examples of html::treebuilder online.

does anybody happen to know a good html parser with some good examples
online?

Peter Makholm

Oct 6, 2010, 3:31:03 AM
alfonsobaldaserra <alfonso.b...@gmail.com> writes:

> though i am inclined towards peter's advice to use html parsers.
> unfortunately, i couldn't get your code to work due to lack of usage
> examples of html::treebuilder online.

Huh?

http://www.perlmonks.org/?node_id=280461
http://search.cpan.org/perldoc?HTML::TreeBuilder
http://groups.google.com/group/comp.lang.perl.misc/msg/372b363f0e9be360

//Makholm

alfonsobaldaserra

Oct 21, 2010, 3:25:29 AM
> Huh?
>
> http://www.perlmonks.org/?node_id=280461
> http://search.cpan.org/perldoc?HTML::TreeBuilder
> http://groups.google.com/group/comp.lang.perl.misc/msg/372b363f0e9be360
>
> //Makholm

thank you guys :)

i finally utilised the perlmonks link, read a little at cpan, and here i am

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;

my $uri = "http://www.bbc.co.uk/radio1/chart/singles";

my $html = get($uri);
my $tree = HTML::Tree->new();
$tree->parse($html);

my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}


again i am wondering if there is a better way to group these two
arrays together instead of the way i did

foreach my $i (0..$#artist) {
    print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

thank you

Peter Makholm

Oct 21, 2010, 3:52:23 AM
alfonsobaldaserra <alfonso.b...@gmail.com> writes:

> my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
> my @track = $tree->look_down('_tag' , 'span', 'class', 'track');
>
> foreach my $i (0..$#artist) {
> print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
> }
>
> again i am wondering if there is a better way to group these two
> arrays together instead of the way i did

It all depends on the HTML. But looking at the URL you posted, it looks
like you're looking for a structure like this:

<a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">

<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>

</a>

What you could do is iterate over all the <a class="artist-link">
nodes and then look for the artist and track below each
node. Untested, but something like this:

for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    my $artist = $link->look_down(class => 'artist')->as_text;
    my $track  = $link->look_down(class => 'track' )->as_text;

    print "$artist - $track\n";
}
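
One caveat with the loop above: in scalar context look_down returns
undef when nothing matches, so calling as_text on the result dies if a
link lacks one of the spans. A hedged sketch that skips incomplete
entries instead. The sample markup and the chart_entries helper are
hypothetical, and the real page may be structured differently:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample markup; the second link deliberately has no track span.
my $html = <<'EOHTML';
<a class="artist-link" href="/music/artists/1"><span class="artist">Taio Cruz</span> <span class="track">Dynamite</span></a>
<a class="artist-link" href="/music/artists/2"><span class="artist">Adele</span></a>
EOHTML

# chart_entries() is a hypothetical helper, not part of HTML::TreeBuilder.
sub chart_entries {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @entries;
    for my $link ( $tree->look_down( _tag => 'a', class => 'artist-link' ) ) {
        my $artist = $link->look_down( _tag => 'span', class => 'artist' );
        my $track  = $link->look_down( _tag => 'span', class => 'track' );
        # Skip links where either span is missing.
        next unless defined $artist && defined $track;
        push @entries, $artist->as_text . ' - ' . $track->as_text;
    }
    $tree->delete;    # release the parse tree explicitly
    return @entries;
}

print "$_\n" for chart_entries($html);
```

Keeping the pairing inside each link also avoids the two-array approach, where one missing span would shift every later artist against the wrong track.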

//Makholm

alfonsobaldaserra

Oct 21, 2010, 5:10:48 AM
> for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
>     my $artist = $link->look_down(class => 'artist')->as_text;
>     my $track  = $link->look_down(class => 'track' )->as_text;
>
>     print "$artist - $track\n";
>
> }
>
> //Makholm

thank you again makholm, your code worked sexily without any
modification :)
