i am trying to write a script which will go to bbc's top 40 page and
show only the contents i want (artist and track names).
i have written this script
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
if ($res->is_success) {
open my $bbc, ">", "bbc.txt" or die "$!\n";
print $bbc $res->decoded_content;
close $bbc;
} else {
die "could not fetch bbc.co.uk\n";
}
open my $bbc, "<", "bbc.txt" or die "$!\n";
while (<$bbc>) {
print if m!<span class="artist">(.*)</span>!;
print if m!<span class="track">(.*)</span>!;
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n";
}
__RESULT__
<span class="artist">Tinie Tempah</span>
<span class="track">Written In The Stars</span>
<span class="artist">Bruno Mars</span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">Labrinth</span>
<span class="track">Let The Sun Shine</span>
<span class="artist">Adele</span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
but i can't figure out
#1 how to parse $res->decoded_content without writing it to a file
because apparently the whole page is a single string
#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars
#3 how to make this work
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n"
appreciate your time gents.
salute :)
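On #3, a minimal sketch, assuming each span sits on its own line as in the output above. The commented-out lines use `my ($foo) =~ ...`, which binds the match to a fresh, empty variable; assigning the result of a match in list context is what returns the capture. The name `span_text` is a hypothetical helper, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first span with the given
# class on a line, or undef if the line has no such span.
sub span_text {
    my ($line, $class) = @_;
    # A match in list context returns its captures, so the parenthesised
    # list assignment receives the captured text (or nothing on failure).
    my ($text) = $line =~ m!<span class="\Q$class\E">(.*?)</span>!;
    return $text;
}

my $line   = '<span class="artist">Tinie Tempah</span>';
my $artist = span_text($line, 'artist');
print "$artist\n" if defined $artist;
```

Checking `defined` before printing avoids "uninitialized value" warnings on lines that hold the other kind of span.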
got it fixed by opening a fh to $res->decoded_content
> #2 how to show data in artist - track format, like
> Tinie Tempah - Written In The Stars
so the new code is
#!/usr/bin/perl
use strict;
#use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";
while (defined (my $con = <$bbc>)) {
chomp $con;
next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
my ($artist) = $con =~ m!<span class="artist">(.*?)</span>!;
my ($track) = $con =~ m!<span class="track">(.*?)</span>!;
print "$artist - $track\n";
}
} else {
die "could not fetch bbc.co.uk\n";
}
but the output is coming as
Tinie Tempah -
- Written In The Stars
Bruno Mars -
- Just The Way You Are (Amazing)
Labrinth -
- Let The Sun Shine
Adele -
- Make You Feel My Love
while it should have been
Tinie Tempah - Written In The Stars
Bruno Mars - Just The Way You Are (Amazing)
Labrinth - Let The Sun Shine
Adele - Make You Feel My Love
i can't figure out why this is happening.
any help guys?
thanku :)
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";
while (defined (my $con = <$bbc>)) {
chomp $con;
next if $con =~ /^\s*$/;
next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
$con =~ s/^\s*|\s*$//g;
if ($con =~ m!<span class="artist">(.*)</span>!) {
print $1, " - ";
} elsif ($con =~ m!<span class="track">(.*)</span>!) {
print $1, "\n";
}
}
}
thank you gents for giving me a chance to do it myself.
though i am still looking for any improvements that you could
suggest :-)
> i got a real bad code working :)
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use LWP::UserAgent;
>
> my $ua = LWP::UserAgent->new;
> $ua->timeout(10);
> $ua->env_proxy;
>
> my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");
>
> if ($res->is_success) {
> open my $bbc, "<", \$res->decoded_content or die "$!\n";
Don't do this. While possible, it is kind of obscure and should in my
opinion only be used when an existing interface requires a Perl file
handle.
Just split the content on newlines if you want to iterate over the
lines.
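A minimal sketch of the split approach, using a literal string in place of $res->decoded_content so that it runs without a network connection. The name `content_lines` is a hypothetical helper introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-blank lines of a string,
# with leading and trailing whitespace removed.
sub content_lines {
    my ($content) = @_;
    my @lines;
    for my $line (split /\r?\n/, $content) {
        $line =~ s/^\s+|\s+$//g;     # trim both ends
        next unless length $line;    # skip blank lines
        push @lines, $line;
    }
    return @lines;
}

# Stand-in for $res->decoded_content (an assumption for this sketch).
my $content = <<'EOHTML';
  <span class="artist">Taio Cruz</span>

  <span class="track">Dynamite</span>
EOHTML

print "$_\n" for content_lines($content);
```

This keeps the line-by-line loop but drops the in-memory file handle.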
> while (defined (my $con = <$bbc>)) {
> chomp $con;
> next if $con =~ /^\s*$/;
> next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
> $con =~ s/^\s*|\s*$//g;
> if ($con =~ m!<span class="artist">(.*)</span>!) {
> print $1, " - ";
> } elsif ($con =~ m!<span class="track">(.*)</span>!) {
> print $1, "\n";
> }
Don't parse HTML by throwing naive regexes at the problem. This would
fail horribly if the BBC decided to remove unneeded newlines from their
content.
> }
> }
I would rather use one of the existing HTML parsing modules. One
option could be HTML::TreeBuilder. Based on a quick read of the
documentation it would look something like this:
my $html = HTML::TreeBuilder->new_from_content( $res->decoded_content );
for my $tag ($html->find('span')) {
my $class = $tag->attr('class') || '';   # spans without a class give undef
if ( $class eq 'artist' ) {
...;
} elsif ( $class eq 'track' ) {
...;
}
}
This would be a much more robust solution. (But I don't parse HTML in
my day to day work, so I might not be up to date on the current set of
HTML parsers.)
//Makholm
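The loop above can be filled in roughly as follows. This is a sketch under assumptions: it requires HTML::TreeBuilder from CPAN, it assumes each artist span is followed by its track span in document order, and `chart_entries` is a hypothetical name. A literal string stands in for the fetched page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: pair each artist span with the track span after it.
sub chart_entries {
    my ($content) = @_;
    my $html = HTML::TreeBuilder->new_from_content($content);
    my (@entries, $artist);
    for my $tag ($html->find('span')) {
        my $class = $tag->attr('class') || '';
        my $text  = $tag->as_trimmed_text;
        if ($class eq 'artist') {
            $artist = $text;                  # remember until the track arrives
        }
        elsif ($class eq 'track' && defined $artist) {
            push @entries, "$artist - $text";
            undef $artist;
        }
    }
    $html->delete;    # release the parse tree
    return @entries;
}

# Stand-in for $res->decoded_content (an assumption for this sketch).
my $page = '<p><span class="artist"> Adele </span>'
         . '<span class="track">Make You Feel My Love</span></p>';
print "$_\n" for chart_entries($page);
```

Because the parser works on elements rather than lines, it does not matter whether the spans share a line or are spread over several.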
Along the lines of what you are doing, something like below.
-sln
-----------
use strict;
use warnings;
my $string =<<EOHTML;
<html>
<span class="artist">
Tinie Tempah
</span>
<span class="track">
Written In The Stars
</span>
<span class="artist"> Bruno Mars </span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">
Labrinth</span>
<span class="track">Let The Sun Shine
</span>
<span class="track">A song by Labrinth</span>
<span class="artist">Adele </span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</html>
EOHTML
my $artist = '';
while ( $string =~
/ <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
\s* (.*?) \s*
<\/span\s*>
/xsig )
{
if ($1 eq 'artist') {
$artist = $2;
}
else {
if (length $artist) {
print "$artist - $2\n";
}
$artist = '';
}
}
print "\n";
## Alternate -
##
$artist = '';
my %tracks;
while ( $string =~
/ <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
\s* (.*?) \s*
<\/span\s*>
/xsig )
{
if ($1 eq 'artist') {
$artist = $2;
}
else {
push @{ $tracks{$artist} }, $2;
}
}
for $artist (sort keys %tracks) {
print "\n$artist\n";
for my $track ( sort @{ $tracks{$artist} } ) {
print " - $track\n"
}
}
though i am inclined towards peter's advice to use html parsers.
unfortunately, i couldn't get your code to work due to lack of usage
examples of html::treebuilder online.
does anybody happen to know a good html parser with some good examples
online?
> though i am inclined towards peter's advice to use html parsers.
> unfortunately, i couldn't get your code to work due to lack of usage
> examples of html::treebuilder online.
Huh?
http://www.perlmonks.org/?node_id=280461
http://search.cpan.org/perldoc?HTML::TreeBuilder
http://groups.google.com/group/comp.lang.perl.misc/msg/372b363f0e9be360
//Makholm
thank you guys :)
i finally utilised the perlmonks link, read a little at cpan and here i am
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;
my $uri = "http://www.bbc.co.uk/radio1/chart/singles";
my $html = get($uri) or die "could not fetch $uri\n";
my $tree = HTML::Tree->new();
$tree->parse($html);
$tree->eof;    # tell the parser the document is complete
my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');
foreach my $i (0..$#artist) {
print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}
again i am wondering if there is a better way to group these two
arrays together instead of the way i did
foreach my $i (0..$#artist) {
print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}
thank you
> my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
> my @track = $tree->look_down('_tag' , 'span', 'class', 'track');
>
> foreach my $i (0..$#artist) {
> print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
> }
>
> again i am wondering if there is a better way to group these two
> arrays together instead of the way i did
It all depends on the HTML. But looking at the URL you posted, it looks
like you're looking for a structure like this:
<a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</a>
What you could do is iterate over all the <a class="artist-link">
nodes and then look for the artist and track below each
node. Untested, but something like this:
for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
my $artist = $link->look_down(class => 'artist')->as_text;
my $track = $link->look_down(class => 'track' )->as_text;
print "$artist - $track\n";
}
//Makholm
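A slightly more defensive variant of the same idea, as a sketch: it requires HTML::TreeBuilder from CPAN, skips any link that lacks one of the two spans, and returns one hash per chart entry instead of two parallel arrays. The name `chart_rows` and the sample markup are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: one { artist => ..., track => ... } hash per link.
sub chart_rows {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @rows;
    for my $link ($tree->look_down(_tag => 'a', class => 'artist-link')) {
        # In scalar context look_down returns the first match, or undef.
        my $artist = $link->look_down(_tag => 'span', class => 'artist');
        my $track  = $link->look_down(_tag => 'span', class => 'track');
        next unless defined $artist && defined $track;
        push @rows, {
            artist => $artist->as_trimmed_text,
            track  => $track->as_trimmed_text,
        };
    }
    $tree->delete;    # release the parse tree
    return @rows;
}

# Stand-in for the fetched page (an assumption for this sketch).
my $page = '<a class="artist-link" href="#"><span class="artist">Taio Cruz</span>'
         . '<span class="track">Dynamite</span></a>'
         . '<a class="artist-link" href="#"><span class="artist">No Track</span></a>';
print "$_->{artist} - $_->{track}\n" for chart_rows($page);
```

Keeping artist and track together in one record means the two values cannot drift out of step if the page ever contains an unpaired span.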
thank you again makholm, your code worked sexily without any
modification :)