I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link. It doesn't seem to be
working and I'm wondering if it's because the pattern match is within the
while loop. If anybody can shed some light I'd be eternally grateful!
Cheers,
Francis
# --------------------------
use LWP::Simple;
use HTML::TokeParser;
my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";
my $parser = HTML::TokeParser->new(\$document);
while ($token = $parser->get_tag("a")) {
    if ($token->[1]->{"href"} =~ /$mymatch/) {
        # print $server.$token->[1]->{href}."\n";
        $document =~ /$searchstring(.+?)someidentifier/;
        print "$1";
    }
}
> I'm a Perl newbie and am having a nightmare trying to get the code
> below working. I'm trying to fetch a webpage and if a link within the
> page matches the search criterion - return the text after the link. It
> doesn't seem to be working and I'm wondering
As it is, we have no idea what "doesn't seem to be working" means. Please
read the posting guidelines to find out how you can help yourself, and,
in the process, help others help you.
use strict;
use warnings;
are missing from your code.
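For illustration, here is a rough sketch of what those two lines would have caught. The posted code declares $mymatch but interpolates $searchstring, which is never declared. The one-line $document below is a made-up stand-in for the fetched page, and the variables $captured, $source, $compiled and $strict_error exist only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page (not the real site).
my $document = 'intro <a href="/searchstring">link</a> wanted text someidentifier';

# 1. Without strict 'vars', the never-declared $searchstring is an undefined
#    package variable. It interpolates as an empty string, so the pattern is
#    effectively /(.+?)someidentifier/ and captures from the start of the page.
my $captured;
{
    no strict 'vars';
    no warnings;
    ($captured) = $document =~ /$searchstring(.+?)someidentifier/;
}
print "without strict: [$captured]\n";

# 2. With strict enabled, the same statement is rejected at compile time.
#    A string eval is used here only so the failure can be shown at run time.
my $source = 'use strict; my $d = "x"; $d =~ /$searchstring(.+?)someidentifier/; 1;';
my $compiled = eval $source;
my $strict_error = $@;
print "with strict: $strict_error";
```

If that is what happens in the original script, the search string never takes part in the second match at all.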
> use LWP::Simple;
> use HTML::TokeParser;
>
> my $document = get("http://www.anexamplesite.com");
> my $mymatch = "searchstring";
>
> my $parser = HTML::TokeParser->new(\$document);
>
> while ($token = $parser->get_tag("a")) {
> if ($token->[1]->{"href"} =~ /$mymatch/) {
> # print $server.$token->[1]->{href}."\n";
> $document =~ /$searchstring(.+?)someidentifier/;
The exact contents of $mymatch, $searchstring and whatever
someidentifier is might have something to do with what's actually being
matched, no?
> print "$1";
You are not capturing anything, why do you expect there to be anything
valid in $1?
Sinan
--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)
comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
try:
if ( $token->[1]{href} =~ /$mymatch/o ) {
I fail to see why that would make a difference. Could you please explain
why you think it would?
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Not capturing? I'd say the parens in /$searchstring(.+?)someidentifier/
capture (if the match is successful), or there's a bug in perl.
Abigail
--
perl -MLWP::UserAgent -MHTML::TreeBuilder -MHTML::FormatText -wle'print +(
HTML::FormatText -> new -> format (HTML::TreeBuilder -> new -> parse (
LWP::UserAgent -> new -> request (HTTP::Request -> new ("GET",
"http://work.ucsd.edu:5141/cgi-bin/http_webster?isindex=perl")) -> content))
=~ /(.*\))[-\s]+Addition/s) [0]'
I looked up HTML::TokeParser on CPAN.
The first Example displayed illustrated that the way to get the href
was:
my $url = $token->[1]{href} || "-";
...I noticed that the OP did not use the same syntax. I didn't know if
this was causing his problem. The 'o' at the end of the pattern was
just to optimize the pattern match, since it doesn't seem like the OP
needed to recompile the regex every time...
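As a side note on what /o actually does, here is a small sketch. The subject string and the two loop values are invented for the demonstration. With /o the pattern is interpolated and compiled only the first time the match is executed, so later changes to the variable are ignored. In the posted code $mymatch never changes, so /o should not alter which links match.

```perl
use strict;
use warnings;

my (@with_o, @without_o);

# Made-up values: the pattern variable changes between iterations.
for my $mymatch ('one', 'two') {
    # With /o the pattern stays /one/ on the second pass.
    push @with_o,    ('link-two' =~ /$mymatch/o ? 1 : 0);
    # Without /o the pattern follows the current value of $mymatch.
    push @without_o, ('link-two' =~ /$mymatch/  ? 1 : 0);
}

print "with /o:    @with_o\n";
print "without /o: @without_o\n";
```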
That's a good start, I suppose. :)
> The first Example displayed illustrated that the way to get the href
> was:
>
> my $url = $token->[1]{href} || "-";
>
> ...i noticed that the OP did not use the same syntax. I didn't know if
> this was causing his problem.
The reason why I asked is that I thought that
$token->[1]->{"href"}
is always the same as
$token->[1]{href}
following Perl's syntax for references and data structures.
Ahh, I think you're right. p. 254, Programming Perl, 3rd ed.:
"The arrow is optional between brackets or braces, or between a closing
bracket or brace and a parenthesis for an indirect function call."
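A quick way to convince yourself, as a sketch: the $token below is a hand-built array reference shaped like a start-tag token from get_tag (tag name, attribute hash, attribute order, original text), not one produced by a real parse.

```perl
use strict;
use warnings;

# Hand-built stand-in for a start-tag token: [ $tag, \%attr, \@attrseq, $text ]
my $token = [
    'a',
    { href => 'http://www.anexamplesite.com/searchstring' },
    ['href'],
    '<a href="http://www.anexamplesite.com/searchstring">',
];

my $with_arrow    = $token->[1]->{"href"};
my $without_arrow = $token->[1]{href};

# Taking a reference to each expression shows they name the same hash element.
my $same_slot = \$token->[1]->{"href"} == \$token->[1]{href};

print "same value\n"        if $with_arrow eq $without_arrow;
print "same hash element\n" if $same_slot;
```

So the two spellings differ only in style, and switching between them cannot change which links match.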
> A. Sinan Unur (1u...@llenroc.ude.invalid) wrote on MMMMCDLVIII
> September MCMXCIII in
> <URL:news:Xns970EBA81F5AA...@127.0.0.1>:
>:) "Francis Sylvester" <fra...@nospam.com> wrote in
>:) news:AH8ef.16551$Es4....@fe2.news.blueyonder.co.uk:
...
>:) > $document =~ /$searchstring(.+?)someidentifier/;
>:)
>:) The exact contents of $mymatch, $searchstring and whatever
>:) someidentifier is might have something to do with what's actually
>:) being matched, no?
>:)
>:) > print "$1";
>:)
>:) You are not capturing anything, why do you expect there to be
>:) anything valid in $1?
> Not capturing? I'd say the parens in
> /$searchstring(.+?)someidentifier/ capture (if the match is
> successful), or there's a bug in perl.
Arrgh! Thank you very much for catching that.
Yes, using a module for parsing an HTML document is a good idea.
> my $document = get("http://www.anexamplesite.com");
> my $mymatch = "searchstring";
>
> my $parser = HTML::TokeParser->new(\$document);
>
> while ($token = $parser->get_tag("a")) {
> if ($token->[1]->{"href"} =~ /$mymatch/) {
> # print $server.$token->[1]->{href}."\n";
> $document =~ /$searchstring(.+?)someidentifier/;
What's that? After you have possibly found your search string, you let
the program search the whole document using a simple regex. Doing so
makes no sense to me.
Either stick to a simple regex and skip the parsing module, or (better)
take advantage of the module you are using and do something like:
while ( my $token = $parser->get_tag('a') ) {
    if ($token->[1]{href} =~ /$mymatch/) {
        print $parser->get_text('a')."\n";
    }
}
(I'm not sure if that's what you're looking for, but hopefully you get
the idea.)
Many thanks for all your replies. I'm sorry, I should have been clearer:
the code executes without error messages, but I sometimes get unwanted
results in $1. After closer inspection, I think it's because sometimes it's
returning $1 from the earlier pattern match, if ($token->[1]->{"href"} =~
/$mymatch/), rather than from the pattern match I wanted, $document =~
/$searchstring(.+?)someidentifier/;
Is there a way to reset the value of $1?
Many thanks,
Francis
>> if ($token->[1]{href} =~ /$mymatch/) {
> I sometimes get unwanted
> results in $1. After closer inspection, I think it's because sometimes it's
> returning $1 from the earlier pattern match ( if ($token->[1]->{"href"} =~
> /$mymatch/)
Note that that code ensures that the pattern match *succeeded*.
> rather than the pattern match I wanted ($document =~
> /$searchstring(.+?)someidentifier/;)
We don't really know, since you did not quote that part of the code,
but you should always ensure that the match succeeded before
using the dollar-digit variables, so:
Is _your_ pattern match being tested for success?
> Is there a way to reset the value of $1?
Yes. They are reset on every _successful_ pattern match.
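To make that concrete, here is a minimal sketch with two made-up strings standing in for fetched pages. The literal 'searchstring' and 'someidentifier' are placeholders taken from the posted pattern. A failed match leaves $1 holding whatever the last successful match in that scope put there, which is why the match should be tested before $1 is used.

```perl
use strict;
use warnings;

# Invented sample inputs: the second one lacks the terminating marker.
my $first  = 'searchstring first result someidentifier';
my $second = 'searchstring but no terminator here';

$first =~ /searchstring(.+?)someidentifier/;    # succeeds, sets $1
my $after_first = $1;

$second =~ /searchstring(.+?)someidentifier/;   # fails, $1 is left alone
my $after_second = $1;                          # stale value from $first

# Safer: use the captures only when the match itself reports success.
# In list context a failed match returns an empty list, so $captured is undef.
my ($captured) = $second =~ /searchstring(.+?)someidentifier/;

print "after first match:  [$after_first]\n";
print "after failed match: [$after_second]\n";
print defined $captured ? "captured [$captured]\n" : "no match, nothing captured\n";
```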
--
Tad McClellan SGML consulting
ta...@augustmail.com Perl programming
Fort Worth, Texas
And that may well be a result of the fact that you don't actually make
use of the module you are using for parsing HTML...
Didn't you understand my objection to your code?
http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1
Thanks Gunnar. I did understand your objection but thought I needed to
resort to pattern matching for a specific section of the text I'm retrieving
after the link. Having read your message and looking at the module docs
again now - I think I might be able to achieve the desired result without
the pattern match. I'm very grateful to you for the responses - you've
probably saved me hours!
Thanks again,
Francis