Pattern Matching problem!

Francis Sylvester

Nov 14, 2005, 6:01:52 PM

Hi,

I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link. It doesn't seem to be
working and I'm wondering if it's because the pattern match is within the
while loop. If anybody can shed some light I'd be eternally grateful!

Cheers,
Francis

# --------------------------
use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
    if ($token->[1]->{"href"} =~ /$mymatch/) {
        # print $server.$token->[1]->{href}."\n";
        $document =~ /$searchstring(.+?)someidentifier/;
        print "$1";
    }
}

A. Sinan Unur

Nov 14, 2005, 6:20:00 PM

"Francis Sylvester" <fra...@nospam.com> wrote in
news:AH8ef.16551$Es4....@fe2.news.blueyonder.co.uk:

> I'm a Perl newbie and am having a nightmare trying to get the code
> below working. I'm trying to fetch a webpage and if a link within the
> page matches the search criterion - return the text after the link. It
> doesn't seem to be working and I'm wondering

As it is, we have no idea what "doesn't seem to be working" means. Please
read the posting guidelines to find out how you can help yourself and,
in the process, help others help you.

use strict;
use warnings;

are missing.
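
A minimal sketch of what strict would have reported for the posted code. The snippet is compiled through a string eval only so that the error can be printed; the sample document text is made up for illustration. Without strict, the undeclared $searchstring simply interpolates as an empty string and the pattern quietly becomes /(.+?)someidentifier/.

```perl
use strict;
use warnings;

# The posted match, compiled via string eval so this script keeps running
# and can show the compile-time error that strict produces.
my $snippet = <<'END_SNIPPET';
use strict;
my $mymatch  = "searchstring";
my $document = "searchstring some text someidentifier";   # made-up sample text
$document =~ /$searchstring(.+?)someidentifier/;          # $searchstring was never declared
1;
END_SNIPPET

my $ok    = eval $snippet;
my $error = $ok ? '' : $@;

print "compiled: ", ( $ok ? "yes" : "no" ), "\n";
print "error: $error" if $error;
```

The error names the offending variable, which points straight at the $mymatch / $searchstring mix-up.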

> use LWP::Simple;
> use HTML::TokeParser;
>
> my $document = get("http://www.anexamplesite.com");
> my $mymatch = "searchstring";
>
> my $parser = HTML::TokeParser->new(\$document);
>
> while ($token = $parser->get_tag("a")) {
> if ($token->[1]->{"href"} =~ /$mymatch/) {
> # print $server.$token->[1]->{href}."\n";
> $document =~ /$searchstring(.+?)someidentifier/;

The exact contents of $mymatch, $searchstring and whatever
someidentifier might have something to do with what's actually being
matched, no?

> print "$1";

You are not capturing anything, why do you expect there to be anything
valid in $1?

Sinan
--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

it_says_BALLS_on_your forehead

Nov 14, 2005, 6:53:40 PM

Francis Sylvester wrote:
> Hi,
>
> I'm a Perl newbie and am having a nightmare trying to get the code below
> working. I'm trying to fetch a webpage and if a link within the page matches
> the search criterion - return the text after the link. It doesn't seem to be
> working and I'm wondering if it's because the pattern match is within the
> while loop. If anybody can shed some light I'd be eternally grateful!
>
> Cheers,
> Francis
>
> # --------------------------
> use LWP::Simple;
> use HTML::TokeParser;
>
> my $document = get("http://www.anexamplesite.com");
> my $mymatch = "searchstring";
>
> my $parser = HTML::TokeParser->new(\$document);
>
> while ($token = $parser->get_tag("a")) {
> if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

Gunnar Hjalmarsson

Nov 14, 2005, 7:03:44 PM

it_says_BALLS_on_your forehead wrote:

> Francis Sylvester wrote:
>>
>>while ($token = $parser->get_tag("a")) {
>> if ($token->[1]->{"href"} =~ /$mymatch/) {
>
> try:
> if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Abigail

Nov 14, 2005, 7:10:07 PM

A. Sinan Unur (1u...@llenroc.ude.invalid) wrote on MMMMCDLVIII September
MCMXCIII in <URL:news:Xns970EBA81F5AA...@127.0.0.1>:
:) "Francis Sylvester" <fra...@nospam.com> wrote in
:) news:AH8ef.16551$Es4....@fe2.news.blueyonder.co.uk:
:)
:) > I'm a Perl newbie and am having a nightmare trying to get the code
:) > below working. I'm trying to fetch a webpage and if a link within the
:) > page matches the search criterion - return the text after the link. It
:) > doesn't seem to be working and I'm wondering
:)
:) As it is, we have no idea what "doesn't seem to be working" means. Please
:) read the posting guidelines to find out how you can help yourself and,
:) in the process, help others help you.
:)
:) use strict;
:) use warnings;
:)
:) missing.
:)
:) > use LWP::Simple;
:) > use HTML::TokeParser;
:) >
:) > my $document = get("http://www.anexamplesite.com");
:) > my $mymatch = "searchstring";
:) >
:) > my $parser = HTML::TokeParser->new(\$document);
:) >
:) > while ($token = $parser->get_tag("a")) {
:) > if ($token->[1]->{"href"} =~ /$mymatch/) {
:) > # print $server.$token->[1]->{href}."\n";
:) > $document =~ /$searchstring(.+?)someidentifier/;
:)
:) The exact contents of $mymatch, $searchstring and whatever
:) someidentifier might have something to do with what's actually being
:) matched, no?
:)
:) > print "$1";
:)
:) You are not capturing anything, why do you expect there to be anything
:) valid in $1?

Not capturing? I'd say the parens in /$searchstring(.+?)someidentifier/
capture (if the match is successful), or there's a bug in perl.
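
A small sketch of that capture, using made-up sample text and the hypothetical delimiters LINK and MARKER in place of $searchstring and someidentifier. It also shows that the non-greedy (.+?) stops at the first MARKER, while a greedy (.+) runs to the last one.

```perl
use strict;
use warnings;

# Made-up sample text; LINK and MARKER are hypothetical delimiters.
my $document = 'intro LINK first MARKER second MARKER trailing';
my $start    = 'LINK';
my $end      = 'MARKER';

my ( $lazy, $greedy );

# \Q...\E quotes any regex metacharacters in the interpolated strings.
if ( $document =~ /\Q$start\E(.+?)\Q$end\E/ ) {
    $lazy = $1;      # stops at the first MARKER
}
if ( $document =~ /\Q$start\E(.+)\Q$end\E/ ) {
    $greedy = $1;    # runs to the last MARKER
}

print "non-greedy: '$lazy'\n";
print "greedy:     '$greedy'\n";
```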

Abigail
--
perl -MLWP::UserAgent -MHTML::TreeBuilder -MHTML::FormatText -wle'print +(
HTML::FormatText -> new -> format (HTML::TreeBuilder -> new -> parse (
LWP::UserAgent -> new -> request (HTTP::Request -> new ("GET",
"http://work.ucsd.edu:5141/cgi-bin/http_webster?isindex=perl")) -> content))
=~ /(.*\))[-\s]+Addition/s) [0]'

it_says_BALLS_on_your forehead

Nov 14, 2005, 7:11:31 PM

Gunnar Hjalmarsson wrote:
> it_says_BALLS_on_your forehead wrote:
> > Francis Sylvester wrote:
> >>
> >>while ($token = $parser->get_tag("a")) {
> >> if ($token->[1]->{"href"} =~ /$mymatch/) {
> >
> > try:
> > if ( $token->[1]{href} =~ /$mymatch/o ) {
>
> I fail to see why that would make a difference. Could you please explain
> why you think it would?
>

I looked up HTML::TokeParser on CPAN.

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...I noticed that the OP did not use the same syntax, and I didn't know
whether this was causing his problem. The 'o' at the end of the pattern
was just meant to optimize the pattern match, since it doesn't seem like
the OP needs to recompile the regex every time...
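
A hedged sketch of what /o actually does, with made-up patterns. The modifier compiles the pattern once, on first use, and then ignores later changes to the interpolated variable, so it can change results rather than just speed. As far as I know, perl already skips recompiling when the interpolated string has not changed, and qr// is the usual way to precompile a pattern explicitly.

```perl
use strict;
use warnings;

my ( @with_o, @without_o );

for my $pattern ( 'cat', 'dog' ) {
    # With /o the regex is compiled on the first pass (as /cat/) and reused.
    push @with_o, ( 'dog' =~ /$pattern/o ? 1 : 0 );

    # Without /o the regex follows the current value of $pattern.
    push @without_o, ( 'dog' =~ /$pattern/ ? 1 : 0 );
}

# qr// precompiles a pattern once and keeps it in an ordinary variable.
my $precompiled = qr/dog/;
my $qr_matches  = 'hot dog' =~ $precompiled ? 1 : 0;

print "with /o:    @with_o\n";
print "without /o: @without_o\n";
print "qr//:       $qr_matches\n";
```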

Gunnar Hjalmarsson

Nov 14, 2005, 7:23:09 PM

it_says_BALLS_on_your forehead wrote:
> Gunnar Hjalmarsson wrote:
>>it_says_BALLS_on_your forehead wrote:
>>>Francis Sylvester wrote:
>>>>
>>>>while ($token = $parser->get_tag("a")) {
>>>> if ($token->[1]->{"href"} =~ /$mymatch/) {
>>>
>>>try:
>>>if ( $token->[1]{href} =~ /$mymatch/o ) {
>>
>>I fail to see why that would make a difference. Could you please explain
>>why you think it would?
>
> I looked up HTML::TokeParse in CPAN.

That's a good start, I suppose. :)

> The first Example displayed illustrated that the way to get the href
> was:
>
> my $url = $token->[1]{href} || "-";
>
> ...i noticed that the OP did not use the same syntax. I didn't know if
> this was causing his problem.

The reason why I asked is that I thought that

$token->[1]->{"href"}

is always the same as

$token->[1]{href}

following Perl's syntax for references and data structures.

it_says_BALLS_on_your forehead

Nov 14, 2005, 7:46:34 PM

Gunnar Hjalmarsson wrote:
> it_says_BALLS_on_your forehead wrote:
> > Gunnar Hjalmarsson wrote:
> >>it_says_BALLS_on_your forehead wrote:
> >>>Francis Sylvester wrote:
> >>>>
> >>>>while ($token = $parser->get_tag("a")) {
> >>>> if ($token->[1]->{"href"} =~ /$mymatch/) {
> >>>
> >>>try:
> >>>if ( $token->[1]{href} =~ /$mymatch/o ) {
> >>
> >>I fail to see why that would make a difference. Could you please explain
> >>why you think it would?
> >
> > I looked up HTML::TokeParse in CPAN.
>
> That's a good start, I suppose. :)
>
> > The first Example displayed illustrated that the way to get the href
> > was:
> >
> > my $url = $token->[1]{href} || "-";
> >
> > ...i noticed that the OP did not use the same syntax. I didn't know if
> > this was causing his problem.
>
> The reason why I asked is that I thought that
>
> $token->[1]->{"href"}
>
> is always the same as
>
> $token->[1]{href}
>
> following Perl's syntax for references and data structures.

Ahh, I think you're right. From p. 254 of Programming Perl, 3rd ed.:

"The arrow is optional between brackets or braces, or between a closing
bracket or brace and a parenthesis for an indirect function call."
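
A quick check of that rule, using a hand-built array reference as a stand-in for the token that get_tag returns (the URL is made up). All three spellings reach the same hash element.

```perl
use strict;
use warnings;

# Hand-built stand-in for a start-tag token: [ $tag, \%attr, \@attrseq, $text ].
my $token = [
    'a',
    { href => 'http://www.example.com/' },
    ['href'],
    '<a href="http://www.example.com/">',
];

my $explicit = $token->[1]->{"href"};   # arrow between subscripts, quoted key
my $implicit = $token->[1]{href};       # arrow omitted, bareword key
my $sigils   = $$token[1]{href};        # dereference written with a leading sigil

print "explicit: $explicit\n";
print "implicit: $implicit\n";
print "sigils:   $sigils\n";
```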

A. Sinan Unur

Nov 14, 2005, 9:28:56 PM

Abigail <abi...@abigail.nl> wrote in
news:slrndni9qv....@alexandra.abigail.nl:

> A. Sinan Unur (1u...@llenroc.ude.invalid) wrote on MMMMCDLVIII
> September MCMXCIII in
> <URL:news:Xns970EBA81F5AA...@127.0.0.1>:
>:) "Francis Sylvester" <fra...@nospam.com> wrote in
>:) news:AH8ef.16551$Es4....@fe2.news.blueyonder.co.uk:

...

>:) > $document =~ /$searchstring(.+?)someidentifier/;
>:)
>:) The exact contents of $mymatch, $searchstring and whatever
>:) someidentifier might have something to do with what's actually

>:) being matched, no?

>:)
>:) > print "$1";
>:)
>:) You are not capturing anything, why do you expect there to be

>:) anything valid in $1?

> Not capturing? I'd say the parens in
> /$searchstring(.+?)someidentifier/ capture (if the match is
> successful), or there's a bug in perl.

Arrgh! Thank you very much for catching that.

Gunnar Hjalmarsson

Nov 14, 2005, 10:30:18 PM

Francis Sylvester wrote:
> I'm a Perl newbie and am having a nightmare trying to get the code below
> working. I'm trying to fetch a webpage and if a link within the page matches
> the search criterion - return the text after the link.
>

> use LWP::Simple;
> use HTML::TokeParser;

Yes, using a module for parsing an HTML document is a good idea.

> my $document = get("http://www.anexamplesite.com");
> my $mymatch = "searchstring";
>
> my $parser = HTML::TokeParser->new(\$document);
>
> while ($token = $parser->get_tag("a")) {
> if ($token->[1]->{"href"} =~ /$mymatch/) {
> # print $server.$token->[1]->{href}."\n";
> $document =~ /$searchstring(.+?)someidentifier/;

What's that? After you have possibly found your search string, you let
the program search the whole document using a simple regex. Doing so
makes no sense to me.

Either stick to a simple regex and skip the parsing module, or (better)
take advantage of the module you are using and do something like:

    while ( my $token = $parser->get_tag('a') ) {
        if ($token->[1]{href} =~ /$mymatch/) {
            print $parser->get_text('a')."\n";
        }
    }

(I'm not sure if that's what you're looking for, but hopefully you get
the idea.)
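
One way this idea could be sketched end to end, assuming "the text after the link" means everything between the closing </a> and the next <a> tag. The sample page and the '/docs/' search string are made up, and the HTML is inlined so that no network access is needed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample page, inlined instead of fetched with LWP::Simple.
my $document = join ' ',
    '<a href="/docs/intro.html">Intro</a> First steps.',
    '<a href="/blog/news.html">News</a> Latest posts.',
    '<a href="/docs/api.html">API</a> Reference material.';
my $mymatch = '/docs/';    # hypothetical search string

my $parser = HTML::TokeParser->new( \$document )
    or die "Cannot create parser";

my @after_link;
while ( my $token = $parser->get_tag('a') ) {
    my $href = $token->[1]{href};

    # Skip anchors without an href and links that do not match.
    next unless defined $href && $href =~ /\Q$mymatch\E/;

    $parser->get_tag('/a');                              # skip past the link text
    push @after_link, $parser->get_trimmed_text('a');    # text up to the next <a>
}

print "$_\n" for @after_link;
```

Because the parser tracks its position in the document, there is no need to run a second regex over the whole page or to rely on $1.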

Francis Sylvester

Nov 15, 2005, 9:56:46 AM

> Either you'd better stick to a simple regex, and skip the parsing module,
> or (better) taking advantage of the module you are using, and doing
> something like:
>
> while ( my $token = $parser->get_tag('a') ) {
> if ($token->[1]{href} =~ /$mymatch/) {
> print $parser->get_text('a')."\n";
> }
> }
>
> (I'm not sure if that's what you're looking for, but hopefully you get the
> idea.)
>

Many thanks for all your replies. I'm sorry, I should have been clearer:
the code executes without error messages, but I sometimes get unwanted
results in $1. After closer inspection, I think it's because it sometimes
returns $1 from the earlier pattern match (if ($token->[1]->{"href"} =~
/$mymatch/)) rather than from the pattern match I wanted ($document =~
/$searchstring(.+?)someidentifier/;).

Is there a way to reset the value of $1?

Many thanks,
Francis

Tad McClellan

Nov 15, 2005, 10:52:05 AM

Francis Sylvester <fra...@nospam.com> wrote:

>> if ($token->[1]{href} =~ /$mymatch/) {

> I sometimes get unwanted
> results in $1. After closer inspection, I think it's because sometimes it's
> returning $1 from the earlier pattern match ( if ($token->[1]->{"href"} =~
> /$mymatch/)

Note that that code ensures that the pattern match *succeeded*.

> rather than the pattern match I wanted ($document =~
> /$searchstring(.+?)someidentifier/;)

We don't really know, since you did not quote that part of the code,
but you should always ensure that the match succeeded before
using the dollar-digit variables, so:

Is _your_ pattern match being tested for success?

> Is there a way to reset the value of $1?

Yes. They are reset on every _successful_ pattern match.
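
A small sketch of that behaviour, with hypothetical helper names and made-up input strings. When a match fails, $1 keeps the value from the last successful match, so the safe patterns are to test the match first or to capture in list context.

```perl
use strict;
use warnings;

# Hypothetical helper: uses $1 without checking whether the match succeeded.
sub id_unchecked {
    my ($text) = @_;
    'page=7' =~ /page=(\d+)/;    # earlier successful match sets $1 to '7'
    $text =~ /id=(\d+)/;         # on failure, $1 is left untouched
    return $1;
}

# Hypothetical helper: only trusts $1 after a successful match.
sub id_checked {
    my ($text) = @_;
    'page=7' =~ /page=(\d+)/;
    if ( $text =~ /id=(\d+)/ ) {
        return $1;
    }
    return undef;
}

# Hypothetical helper: list-context capture, no $1 involved at all.
sub id_list_context {
    my ($text) = @_;
    my ($id) = $text =~ /id=(\d+)/;    # $id stays undef when the match fails
    return $id;
}

for my $text ( 'id=42', 'no id here' ) {
    my $checked = id_checked($text);
    printf "%-12s unchecked=%s checked=%s\n",
        $text, id_unchecked($text), defined $checked ? $checked : 'undef';
}
```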

--
Tad McClellan SGML consulting
ta...@augustmail.com Perl programming
Fort Worth, Texas

Gunnar Hjalmarsson

Nov 15, 2005, 12:23:20 PM

Francis Sylvester wrote:
>>Either you'd better stick to a simple regex, and skip the parsing module,
>>or (better) taking advantage of the module you are using, and doing
>>something like:
>>
>> while ( my $token = $parser->get_tag('a') ) {
>> if ($token->[1]{href} =~ /$mymatch/) {
>> print $parser->get_text('a')."\n";
>> }
>> }
>>
>>(I'm not sure if that's what you're looking for, but hopefully you get the
>>idea.)
>

> the code executes without error messages but I sometimes get unwanted
> results in $1.

And that may well be a result of the fact that you don't actually make
use of the module you are using for parsing HTML...

Didn't you understand my objection to your code?
http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1

Francis Sylvester

Nov 15, 2005, 3:04:09 PM

>>>(I'm not sure if that's what you're looking for, but hopefully you get
>>>the idea.)
>>
>> the code executes without error messages but I sometimes get unwanted
>> results in $1.
>
> And that may well be a result of the fact that you don't actually make use
> of the module you are using for parsing HTML...
>
> Didn't you understand my objection to your code?
> http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1

Thanks Gunnar. I did understand your objection but thought I needed to
resort to pattern matching for a specific section of the text I'm retrieving
after the link. Having read your message and looking at the module docs
again now - I think I might be able to achieve the desired result without
the pattern match. I'm very grateful to you for the responses - you've
probably saved me hours!

Thanks again,
Francis