Can't find a syntax error, hoping a second set of eyes will help

Jason C

Sep 24, 2012, 12:09:43 AM

Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :-)

while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
if ($2 =~ /^http/i) {
$text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
}
}

The error is on the while() line (at least, I remove it and no more error). The error just says:

syntax error at blah.cgi line 239, near "if"
syntax error at blah.cgi line 246, near "}"

The purpose of the function is to remove the <a href=...></a> code in submitted text, but only if the linked text begins with http.

TIA,

Jason

Ben Morrow

Sep 24, 2012, 12:52:55 AM

Quoth Jason C <jwca...@gmail.com>:

> Can someone look at this and tell me what I'm messing up? I've been
> coding all night, and my eyes have gone fuzzy :-)
>
> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

^^ m

(I would suggest finding a highlighting editor. It makes this sort of
syntactic mistake much easier to spot.)

Ben

Uri Guttman

Sep 24, 2012, 1:22:39 AM

>>>>> "JC" == Jason C <jwca...@gmail.com> writes:

JC> Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :-)
JC> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

why do you think the # marks the start of a regex? only if you use m//
can you change the regex delim from /.
and ^ will not invert a char class for \1 as \1 isn't a char class
element. so even if you fix the regex delim, that will fail. finally,
why are you parsing out urls with a regex when there are modules that do
it correctly?
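
To make the point about `[^\1]` concrete, here is a minimal sketch; the sample string and variable names are made up for illustration. Inside a bracketed class `\1` is not a backreference, so the class cannot mean "anything but the captured quote". A negative lookahead is one way to say that instead.

```perl
use strict;
use warnings;

# Hypothetical sample: a double-quoted value followed by more text.
my $s = q{"a"b};

# Inside [...], \1 is not a backreference (Perl reads it as an octal
# escape), so this class runs straight past the closing quote.
my ($q1, $with_class) = $s =~ /^(["'])([^\1]*)/;

# A negative lookahead can refer to the captured quote, so this stops
# at the first character equal to whatever group 1 matched.
my ($q2, $with_lookahead) = $s =~ /^(["'])((?:(?!\1).)*)/;

print "class:     $with_class\n";
print "lookahead: $with_lookahead\n";
```

The first capture comes back as `a"b`, the second as `a`.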

uri

Jason C

Sep 24, 2012, 5:28:11 AM

On Monday, September 24, 2012 1:03:03 AM UTC-4, Ben Morrow wrote:
>
> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
> ^^ m
>
> (I would suggest finding a highlighting editor. It makes this sort of
> syntactic mistake much easier to spot.)

Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.

I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?

Jason C

Sep 24, 2012, 5:35:19 AM

On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:

> why do you think the # marks the start of a regex? only if you use m//
> can you change the regex delim from /.

Thanks to you, too, Uri. Like I replied to Ben a second ago, I thought that since you could replace the delimiter in s/// ad hoc, that you could in m//, too. Learn something new every day! :-)

> and ^ will not invert a char class for \1 as \1 isn't a char class
> element. so even if you fix the regex delim, that will fail.

Oh. Now THAT I did NOT know at all! It does explain a few other errors I've had, though, and couldn't figure out.

> finally,
> why are you parsing out urls with a regex when there are modules that do
> it correctly?

Two reasons:

1. I've been working with regex for a year or two, and while it's by no means a strong point in my vocabulary (yet), I'm at least familiar enough with it to usually figure it out.

2. I briefly looked for a module that would handle this correctly, but wasn't sure what to look for. And, I'm not sure that it warrants the including of a full module if it could potentially be done in a simple regex. If you can recommend a module that would be more stable and/or faster than what I'm doing, though, then I would definitely appreciate the reference!

FWIW, this modification did work:

while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
$pattern = $1$2$3;
$repl = $2;

if ($2 =~ /^http/i) {
$text =~ s/$pattern/$repl/gsi;
}
}

Admittedly, I'm not sure why $2 is stored long enough for the if() statement, but inside of the if() statement it's empty. Storing them to a different variable worked for this purpose, but if there's a better way, I'm very much open to it.

Peter Makholm

Sep 24, 2012, 5:49:31 AM

Jason C <jwca...@gmail.com> writes:

>> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
>> ^^ m
>

> Thanks, Ben. I didn't realize the m//; was required; since you can
> change the delimiter with s/// ad hoc, I thought you could here, too.

You can change the delimiter, but the m is only optional when you use
the // delimiters.
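
A small sketch of that rule, using a made-up sample string: the same match written with the default delimiter (no `m` needed) and with two alternative delimiters (where the `m` is required).

```perl
use strict;
use warnings;

my $text = q{<a href="x">y</a>};   # made-up sample input
my @ok;

push @ok, 'slash' if $text =~ /<\/a>/;   # default delimiter: m is optional
push @ok, 'hash'  if $text =~ m#</a>#;   # any other delimiter needs the m
push @ok, 'brace' if $text =~ m{</a>};   # bracketing delimiters work too

print "@ok\n";
```

All three forms match; only the first needs the slash escaped.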

//Makholm

Marc Girod

Sep 24, 2012, 6:30:37 AM

On Sep 24, 10:28 am, Jason C <jwcarl...@gmail.com> wrote:

> What editor do you recommend?

GNU emacs with cperl-mode

Marc

anotheranne

Sep 24, 2012, 8:19:23 AM

Jason C wrote:

> Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :-)
>
> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
> if ($2 =~ /^http/i) {
> $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
> }
> }

Whatever other errors your regex may have, I would suggest that
you stick with the regular m// and s/// constructs. You should of
course then escape the '/' in </a> . Changing this should make it run.

Don't use # as an eye-easy replacement for / because a) it is the perl
character for a comment, and b) in a regex (at least with the /x
modifier) it is also a metacharacter. Trouble will come your way if
you use this.

If you do want to get away from // and /// then use balanced
delimiters like m{} and s{}{} . See p319 in Friedl MASTERING REGULAR
EXPRESSIONS. O'Reilly.

When you use any alternative to m// the m is then mandatory. Only when
using // can you omit the m. Thus // or m{} are valid constructs.

Also you can remove the ';' after the gsi

hope this helps.

anotheranne

anotheranne

Sep 24, 2012, 8:42:09 AM

Padre is a nice perl IDE.

http://padre.perlide.org/

anotheranne

Ben Morrow

Sep 24, 2012, 10:37:47 AM

Quoth Peter Makholm <pe...@makholm.net>:

Or ??, but that has special semantics.

Ben

Ben Morrow

Sep 24, 2012, 10:53:15 AM

Quoth anotheranne <anoth...@nowhere.com>:

> Jason C wrote:
>
> > Can someone look at this and tell me what I'm messing up? I've been
> coding all night, and my eyes have gone fuzzy :-)
> >
> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
> > if ($2 =~ /^http/i) {
> > $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
> > }
> > }
>
> Whatever other errors your regex may have, I would suggest that
> you stick with the regular m// and s/// constructs. You should of
> course then escape the '/' in </a> . Changing this should make it run.

That's a bad idea. Perl has changeable delimiters for a reason: to avoid
huge unreadable nests of /\/\\/.

> Don't use # as an eye-easy replacement for / because a) it is the perl
> character for a comment, and b) in a regex (at least with the /x
> modifier) it is also a metacharacter. Trouble will come your way if
> you use this.

Nonsense. Perl is perfectly capable of getting this right.

> Also you can remove the ';' after the gsi

...but that would probably also be a bad idea.

Ben

Ben Morrow

Sep 24, 2012, 10:48:28 AM

Quoth Jason C <jwca...@gmail.com>:

> On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:
>
> > finally,
> > why are you parsing out urls with a regex when there are modules that do
> > it correctly?
>
> Two reasons:
>
> 1. I've been working with regex for a year or two, and while it's by no
> means a strong point in my vocabulary (yet), I'm at least familiar
> enough with it to usually figure it out.
>
> 2. I briefly looked for a module that would handle this correctly, but
> wasn't sure what to look for.

HTML::LinkExtor, probably, depending on what you're trying to do.
Perhaps one of the other HTML::Parser-based modules.

> FWIW, this modification did work:
>
> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> $pattern = $1$2$3;

^^ ^^
I think not...

> $repl = $2;
>
> if ($2 =~ /^http/i) {
> $text =~ s/$pattern/$repl/gsi;

This almost certainly doesn't do what you think. If nothing else, you
want to \Q $pattern. What are you trying to do here: strip tags? Why not
just do one s/// (or, you know, use a module)?

> }
> }
>
> Admittedly, I'm not sure why $2 is stored long enough for the if()
> statement, but inside of the if() statement it's empty. Storing them to
> a different variable worked for this purpose, but if there's a better
> way, I'm very much open to it.

The $N variables last until the next successful pattern match. In this
case, the '$2 =~ /^http/i' in the condition of the if clears them all
(even though it doesn't capture anything).

In general I prefer to assign captures to real variables right away:

while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {

(notice also that captures can be nested, and DTRT).
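
A sketch of the copy-captures-first habit, with a made-up two-link input and a simplified tag pattern (both are assumptions, not Jason's real data). It keeps the m//g in scalar context so the loop steps through one match per iteration; in list context m//g hands back every match at once, which does not suit a while loop.

```perl
use strict;
use warnings;

# Made-up input with two links.
my $text = q{<a href="http://a.example/">http://a.example/</a> }
         . q{<a href="http://b.example/">B</a>};

my @seen;

# Scalar-context m//g steps through the matches one at a time.
while ($text =~ m{(<a\b[^>]*>(.*?)</a>)}gsi) {
    my ($tag, $label) = ($1, $2);   # copy before any other match runs

    # This is a new pattern match on $label. When it succeeds, $1 and $2
    # no longer hold the link captures, which is why the copies matter.
    my $is_url = $label =~ /^http/i ? 1 : 0;

    push @seen, [$label, $is_url];
}

print "$_->[0] => $_->[1]\n" for @seen;
```

The first label is flagged as a URL and the second is not, and neither result depends on `$1` or `$2` surviving the inner match.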

Ben

Scott Bryce

Sep 24, 2012, 11:11:20 AM

On 9/24/2012 3:28 AM, Jason C wrote:
> I'm using Notepad++,

I assume that means you are on a Windows box.

> What editor do you recommend?

I like UltraEdit.

Ben Morrow

Sep 24, 2012, 12:54:04 PM

Quoth Jason C <jwca...@gmail.com>:

> >
> > (I would suggest finding a highlighting editor. It makes this sort of
> > syntactic mistake much easier to spot.)
>

> I'm using Notepad++, and while it helps me catch opening and ending
> brackets, it didn't do a lot in recognizing syntax errors (at least, not
> that I know of). What editor do you recommend?

Personally I use Vim, which runs on Unix/Mac/Windows, but it takes a
little getting used to. The GUI version (which is probably what you
would use on Windows) has menus and mouse support as you would expect,
and there is an 'easy' mode which makes it behave more like a
Windows-style point-and-type editor, but I'm not sure I see the point of
using a programmer's editor if you're not going to learn to use it
properly.

Ben

Uri Guttman

Sep 24, 2012, 3:43:42 PM

>>>>> "JC" == Jason C <jwca...@gmail.com> writes:

JC> On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:
>> why do you think the # marks the start of a regex? only if you use m//
>> can you change the regex delim from /.

JC> Thanks to you, too, Uri. Like I replied to Ben a second ago, I
JC> thought that since you could replace the delimiter in s/// ad hoc,
JC> that you could in m//, too. Learn something new every day! :-)

but s/// has the s to mark the next char. =~ ## has no leading marker so it
would just be a comment. also using # for the delimiter is just a bad
idea as it confuses many readers.

>> finally,
>> why are you parsing out urls with a regex when there are modules that do
>> it correctly?

JC> Two reasons:

JC> 1. I've been working with regex for a year or two, and while it's
JC> by no means a strong point in my vocabulary (yet), I'm at least
JC> familiar enough with it to usually figure it out.

good that you are studying them but it still is the wrong tool for
this. learning when regexes aren't a good solution is part of learning
regexes.

JC> 2. I briefly looked for a module that would handle this correctly,
JC> but wasn't sure what to look for. And, I'm not sure that it
JC> warrants the including of a full module if it could potentially be
JC> done in a simple regex. If you can recommend a module that would
JC> be more stable and/or faster than what I'm doing, though, then I
JC> would definitely appreciate the reference!

JC> FWIW, this modification did work:

JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {

it will fail if the opening quote is " and the string has a ' inside
it. perfectly legal html but you can't parse it that way.

JC> Admittedly, I'm not sure why $2 is stored long enough for the if()
JC> statement, but inside of the if() statement it's empty. Storing
JC> them to a different variable worked for this purpose, but if
JC> there's a better way, I'm very much open to it.

you need to read more about regexes and the $1 stuff. they live until
the next regex is run (they are global).

uri

Jason C

Sep 24, 2012, 8:54:33 PM

On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
> > FWIW, this modification did work:
> >
> > while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > $pattern = $1$2$3;
> ^^ ^^
> I think not...

Blah, sorry; that's what I get for trying to type up dummy code at 5am. In practice, I put it in quotes:

$pattern = "$1$2$3";

> > if ($2 =~ /^http/i) {
> > $text =~ s/$pattern/$repl/gsi;
>
> This almost certainly doesn't do what you think. If nothing else, you
> want to \Q $pattern.

Excellent point about \Q. What do you mean, though, that it doesn't do what I think?

> What are you trying to do here: strip tags?

Yes and no. I'm using a contenteditable instead of a textarea, and I've discovered that when someone copy-and-pastes an URL from Chrome or FF, it's automatically making the URL a link. Eg:

<a href="http://www.google.com">http://www.google.com</a>

But of course, if you just type the address, then it doesn't. So on my end, I was using URI::Find to convert addresses to links, and ending up with a mess like:

<a href="<a href="http://www.google.com">http://www.google.com</a>"><a href="http://www.google.com">http://www.google.com</a></a>

So, my goal here is to remove the <a href> tag, but only if the linked text is an URL.

> Why not
> just do one s/// (or, you know, use a module)?

I had originally tried doing it with a simple s///, but couldn't figure out how to make it conditional. Like this:

$text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
if ($3 =~ /^http/i);

This worked correctly if I removed the if() statement. In testing, I changed the replacement to:

1 - $1, 2 - $2, 3 - $3

just to make sure that $3 did begin with http, and it did, so I couldn't figure out why the if() wasn't catching it unless it was dropping the $3 value before reaching the if().

> > Admittedly, I'm not sure why $2 is stored long enough for the if()
> > statement, but inside of the if() statement it's empty. Storing them to
> > a different variable worked for this purpose, but if there's a better
> > way, I'm very much open to it.
>
> The $N variables last until the next successful pattern match. In this
> case, the '$2 =~ /^http/i' in the condition of the if clears them all
> (even though it doesn't capture anything).

Ahh, that makes sense. I mistakenly thought that, since I wasn't assigning $N, then they would retain the previous value.

> In general I prefer to assign captures to real variables right away:
>
> while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
>
> (notice also that captures can be nested, and DTRT).

Great to know! Thanks.

Jason C

Sep 24, 2012, 8:56:33 PM

On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:

> while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {

In this, how does it know that we're testing $text? Or, did you mean to type something like:

while (my ($tag, $url) = $text =~ m#(<a...>(.*?)</a>)#gsi)

Jason C

Sep 24, 2012, 9:04:17 PM

On Monday, September 24, 2012 3:44:44 PM UTC-4, Uri Guttman wrote:

> JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
>
> it will fail if the opening quote is " and the string has a ' inside
> it. perfectly legal html but you can't parse it that way.

I'll probably discard this idea and pursue a module, like you guys suggested. But for the sake of learning...

I recognized this issue, too, which is why I was originally using [^\1], like so:

(["'])*([^\1>]*)\1

I think it was you that pointed out that I can't negate a backreference like that, though.

What would be the correct way to do this, if I can't negate a backreference as a character class?

Jim Gibson

Sep 24, 2012, 9:26:32 PM

In article <6d53b708-9e94-4bc9...@googlegroups.com>,

Capture the leading delimiter and use a backreference that is not in a
character class:

while ($text =~ m{(<a[^>]* href=(["']).*?\2.*?>)(.*?)(</a>)}gsi) {
^^

--
Jim Gibson

Ben Morrow

Sep 25, 2012, 4:40:09 AM

Quoth Jason C <jwca...@gmail.com>:

Just so :). Sorry...

Ben

Ben Morrow

Sep 25, 2012, 5:28:35 AM

Quoth Jason C <jwca...@gmail.com>:

> On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
> > > FWIW, this modification did work:
> > >
> > > while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > > $pattern = $1$2$3;

<snip>

> > > if ($2 =~ /^http/i) {
> > > $text =~ s/$pattern/$repl/gsi;
> >
> > This almost certainly doesn't do what you think. If nothing else, you
> > want to \Q $pattern.
>
> Excellent point about \Q. What do you mean, though, that it doesn't do
> what I think?

Well, for one thing, this link

<a href="http://html5.org">HTML5</a>

will be stripped. I don't think that's what you meant.

> > What are you trying to do here: strip tags?
>
> Yes and no. I'm using a contenteditable instead of a textarea, and I've
> discovered that when someone copy-and-pastes an URL from Chrome or FF,
> it's automatically making the URL a link. Eg:
>
> <a href="http://www.google.com">http://www.google.com</a>
>
> But of course, if you just type the address, then it doesn't. So on my
> end, I was using URI::Find to convert addresses to links, and ending up
> with a mess like:
>
> <a href="<a href="http://www.google.com">http://www.google.com</a>"><a
> href="http://www.google.com">http://www.google.com</a></a>
>
> So, my goal here is to remove the <a href> tag, but only if the linked
> text is an URL.

You're doing this backwards. You want to use HTML::Parser (or perhaps
HTML::TokeParser) to separate tags from text, and then just apply
URI::Find to 'text' sections which aren't already inside an <a> element.

> > Why not
> > just do one s/// (or, you know, use a module)?
>
> I had originally tried doing it with a simple s///, but couldn't figure
> out how to make it conditional. Like this:
>
> $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
> if ($3 =~ /^http/i);
>
> This worked correctly if I removed the if() statement. In testing, I
> changed the replacement to:
>
> 1 - $1, 2 - $2, 3 - $3
>
> just to make sure that $3 did begin with http, and it did, so I couldn't
> figure out why the if() wasn't catching it unless it was dropping the $3
> value before reaching the if().

...No. Maybe it would be clearer if you wrote it like this:

if ($3 =~ /^http/i) {
$text =~ s#...#...#gsi;
}

(which is *exactly* equivalent)? The 'if' condition executes first, so
$3 is something completely random from the previous pattern match; and
in any case, the if covers the *whole* s///, not just one iteration.
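
The evaluation order can be checked with a tiny sketch; `check` and `action` are hypothetical helpers that only record when they run.

```perl
use strict;
use warnings;

my @order;
sub check  { push @order, 'condition'; return 1 }
sub action { push @order, 'statement'; return 1 }

action() if check();    # same as: if (check()) { action() }

print join(' -> ', @order), "\n";
```

The condition is recorded before the statement, so a trailing `if` cannot see captures that the statement itself would set.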

You need to push the condition inside the s///. The obvious way of doing
that is

s#<a ...>http:.*?</a>#$2#gsi;

though in more difficult cases you can use s///ge and put a ?: or
equivalent in the RHS.
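
A minimal sketch of the s///ge variant, assuming a simplified tag pattern and a made-up input (neither is Jason's real code). The replacement side is evaluated as Perl for each match, so the test on the link text runs per link. The captures are copied first, because the inner test is itself a pattern match and would otherwise overwrite them.

```perl
use strict;
use warnings;

# Made-up input: one link labelled with its URL, one with ordinary text.
my $text = q{See <a href="http://www.google.com">http://www.google.com</a> }
         . q{and <a href="http://html5.org">HTML5</a>.};

# For each link: keep only the label if it starts with http,
# otherwise put the whole tag back unchanged.
$text =~ s{(<a\b[^>]*>(.*?)</a>)}{
    my ($tag, $label) = ($1, $2);
    $label =~ /^http/i ? $label : $tag;
}gsie;

print "$text\n";
```

Only the first link is unwrapped; the HTML5 link is left alone. This is still regex-on-HTML, so the caveats elsewhere in the thread apply.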

Ben

Ben Morrow

Sep 25, 2012, 5:53:32 AM

Quoth Jim Gibson <jimsg...@gmail.com>:

That's not the same in general: .*? doesn't *want* to match a quote, but
it will if necessary to make the whole match succeed. In this particular
case it doesn't change anything because there is nothing between the \2
and the next .*?, but for instance these two

m{<a href="[^"]*">}
m{<a href=".*?">}

don't match the same thing. The second will match q{<a href="foo"">},
because the .*? will match a quote if forced, but the first will not.

The correct way to match 'everything until $rx' is (?:(?!$rx).)*, so in
this case

m{... href=(["'])(?:(?!\2).)*\2 ...}

(which would certainly benefit from /x).
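
Both points can be tried out in a short sketch. The first input is the odd string from above; the second is a made-up tag with an apostrophe inside a double-quoted value. The quote is group 1 here rather than group 2, because this simplified pattern has no outer capture.

```perl
use strict;
use warnings;

# The odd input from above: a stray extra quote before the closing bracket.
my $odd = q{<a href="foo"">};
my $negated_class = $odd =~ m{<a href="[^"]*">} ? 'match' : 'no match';
my $lazy_dot      = $odd =~ m{<a href=".*?">}   ? 'match' : 'no match';

# Made-up input: an apostrophe inside a double-quoted value.
my $html = q{<a href="it's here" class='x'>};
my ($quote, $value) = $html =~ m{
    href= (["'])          # opening quote, captured as group 1
    ( (?: (?!\1) . )* )   # any characters that are not that same quote
    \1                    # the matching closing quote
}xs;

print "negated class: $negated_class\n";
print "lazy dot:      $lazy_dot\n";
print "value:         $value\n";
```

The negated class refuses the odd input, the lazy dot accepts it, and the lookahead version recovers `it's here` intact.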

Ben

Eli the Bearded

Sep 26, 2012, 5:09:34 PM

In comp.lang.perl.misc, Jason C <jwca...@gmail.com> wrote:
> 2. I briefly looked for a module that would handle this correctly, but
> wasn't sure what to look for. And, I'm not sure that it warrants the
> including of a full module if it could potentially be done in a simple
> regex. If you can recommend a module that would be more stable and/or
> faster than what I'm doing, though, then I would definitely appreciate
> the reference!

Do you want to deal with human generated HTML? You'll find that a
"simple" regex will fail you.

http://www.panix.com/~eli/some.links.html

:r! cat $PHTML/some.links.html
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>linky</title>
</head>
<body>
<h1>linky</h1>
<ul>
<li><a href = http://www.google.com/ > Space no quotes </a>
(this link gives validation errors)</li>
<li><a href = 'http://www.google.com/'> Space single quotes </a></li>
<li><a href='http://www.google.com/'> End space single quotes </a ></li>
<li><a
href
=
'http://www.google.com/'
> No spaces (newlines) single quotes </a
></li>
<li><a href="http://www.google.com/"> No spaces (tabs) double quotes </a ></li>
</ul>
</body>
</html>

That's not even trying to be an exhaustive way to break it.
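
As a rough check, here is a sketch that runs the revised pattern from earlier in the thread against four variants modelled on the page above. The link labels are shortened and the tab-separated case is left out, so treat the samples as approximations.

```perl
use strict;
use warnings;

# Jason's revised pattern from earlier in the thread.
my $rx = qr{(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)}si;

# Variants modelled on the sample page (labels shortened).
my @samples = (
    q{<a href = http://www.google.com/ > no quotes </a>},
    q{<a href = 'http://www.google.com/'> spaces around equals </a>},
    q{<a href='http://www.google.com/'> space in end tag </a >},
    qq{<a\nhref\n=\n'http://www.google.com/'\n> newlines </a\n>},
);

my @matched = grep { $_ =~ $rx } @samples;
printf "%d of %d variants matched\n", scalar @matched, scalar @samples;
```

None of the four variants match. The pattern needs a literal ` href=` followed directly by a quote, plus a literal `</a>`, and each sample breaks one of those.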

Elijah
------
no javascript, for example

Kaz Kylheku

Sep 26, 2012, 6:54:15 PM

On 2012-09-26, Eli the Bearded <*@eli.users.panix.com> wrote:
>:r! cat $PHTML/some.links.html

UUOC infects the vi command line!

:r!cat <file> -> :r <file>

Eli the Bearded

Sep 26, 2012, 7:38:09 PM

You got me. I tend to use :r! a lot in posts, and didn't optimize
it down to :r here.

Elijah
------
map * "yyy@y