>>>>> "JC" == Jason C <jwcarl...@gmail.com> writes:
JC> Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :-)
JC> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
why do you think the # marks the start of a regex? only if you use m//
can you change the regex delim from /.
and ^ will not invert a char class for \1 as \1 isn't a char class
element. so even if you fix the regex delim, that will fail. finally,
why are you parsing out urls with a regex when there are modules that do
it correctly?
On Monday, September 24, 2012 1:03:03 AM UTC-4, Ben Morrow wrote:
> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
> ^^ m
> (I would suggest finding a highlighting editor. It makes this sort of
> syntactic mistake much easier to spot.)
Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.
I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?
On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:
> why do you think the # marks the start of a regex? only if you use m//
> can you change the regex delim from /.
Thanks to you, too, Uri. Like I replied to Ben a second ago, I thought that since you could replace the delimiter in s/// ad hoc, that you could in m//, too. Learn something new every day! :-)
> and ^ will not invert a char class for \1 as \1 isn't a char class
> element. so even if you fix the regex delim, that will fail.
Oh. Now THAT I did NOT know at all! It does explain a few other errors I've had, though, and couldn't figure out.
> finally,
> why are you parsing out urls with a regex when there are modules that do
> it correctly?
Two reasons:
1. I've been working with regex for a year or two, and while it's by no means a strong point in my vocabulary (yet), I'm at least familiar enough with it to usually figure it out.
2. I briefly looked for a module that would handle this correctly, but wasn't sure what to look for. And, I'm not sure that it warrants the including of a full module if it could potentially be done in a simple regex. If you can recommend a module that would be more stable and/or faster than what I'm doing, though, then I would definitely appreciate the reference!
if ($2 =~ /^http/i) {
$text =~ s/$pattern/$repl/gsi;
}
}
Admittedly, I'm not sure why $2 is stored long enough for the if() statement, but inside of the if() statement it's empty. Storing them to a different variable worked for this purpose, but if there's a better way, I'm very much open to it.
Whatever other errors your regex may have, I would suggest that
you stick with the regular m// and s/// constructs. You should of
course then escape the '/' in </a> . Changing this should make it run.
Don't use # as an eye-easy replacement for / because a) it is the perl
character for a comment, and b) in a regex (at least with the /x
modifier) it is also a metacharacter. Trouble will come your way if
you use this.
If you do want to get away from // and /// then use balanced
delimiters like m{} and s{}{} . See p319 in Friedl MASTERING REGULAR
EXPRESSIONS. O'Reilly.
When you use any alternative to m//, the m is mandatory. Only when
using // can you omit the m. Thus // and m{} are both valid constructs.
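
A minimal sketch of that rule, assuming a made-up $html string (it is not from the original post). The same pattern is written three ways: with the default delimiter, where the m is optional but the / in </a> has to be escaped, and with two alternate delimiters, where the m is required.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $html = '<a href="http://example.com/">example</a>';

# Default delimiter: the leading m is optional, but every / must be escaped.
my $with_slashes = $html =~ /<a[^>]*>(.*?)<\/a>/si ? $1 : undef;

# Alternate delimiters: the leading m is mandatory, and / needs no escaping.
my $with_braces = $html =~ m{<a[^>]*>(.*?)</a>}si ? $1 : undef;
my $with_hashes = $html =~ m#<a[^>]*>(.*?)</a>#si ? $1 : undef;

print "$with_slashes | $with_braces | $with_hashes\n";
```

All three forms capture the same link text; only the quoting of the pattern differs.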
Jason C wrote:
> On Monday, September 24, 2012 1:03:03 AM UTC-4, Ben Morrow wrote:
>> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
>> ^^ m
>> (I would suggest finding a highlighting editor. It makes this sort of
>> syntactic mistake much easier to spot.)
> Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.
> I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?
> Whatever other errors your regex may have, I would suggest that
> you stick with the regular m// and s/// constructs. You should of
> course then escape the '/' in </a> . Changing this should make it run.
That's a bad idea. Perl has changeable delimiters for a reason: to avoid
huge unreadable nests of /\/\\/.
> Don't use # as an eye-easy replacement for / because a) it is the perl
> character for a comment, and b) in a regex (at least with the /x
> modifier) it is also a metacharacter. Trouble will come your way if
> you use this.
Nonsense. Perl is perfectly capable of getting this right.
> On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:
> > finally,
> > why are you parsing out urls with a regex when there are modules that do
> > it correctly?
> Two reasons:
> 1. I've been working with regex for a year or two, and while it's by no
> means a strong point in my vocabulary (yet), I'm at least familiar
> enough with it to usually figure it out.
> 2. I briefly looked for a module that would handle this correctly, but
> wasn't sure what to look for.
HTML::LinkExtor, probably, depending on what you're trying to do.
Perhaps one of the other HTML::Parser-based modules.
> if ($2 =~ /^http/i) {
> $text =~ s/$pattern/$repl/gsi;
This almost certainly doesn't do what you think. If nothing else, you
want to \Q $pattern. What are you trying to do here: strip tags? Why not
just do one s/// (or, you know, use a module)?
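
To illustrate the \Q point only (not the tag-stripping question), here is a small sketch. The helper name replace_literal and the sample text are assumptions made for the example. Without \Q, the "?" and "." in an interpolated URL would be treated as regex metacharacters.

```perl
use strict;
use warnings;

# Replace a literal string inside $text. \Q...\E escapes regex metacharacters
# in the interpolated variable so that they match themselves.
sub replace_literal {
    my ($text, $literal, $replacement) = @_;
    $text =~ s/\Q$literal\E/$replacement/g;
    return $text;
}

# Hypothetical input: the "?" and "." would otherwise act as metacharacters.
my $text = 'see <a href="http://example.com/?q=1">http://example.com/?q=1</a>';
print replace_literal($text, 'http://example.com/?q=1', 'URL'), "\n";
```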
> }
> }
> Admittedly, I'm not sure why $2 is stored long enough for the if()
> statement, but inside of the if() statement it's empty. Storing them to
> a different variable worked for this purpose, but if there's a better
> way, I'm very much open to it.
The $N variables last until the next successful pattern match. In this
case, the '$2 =~ /^http/i' in the condition of the if clears them all
(even though it doesn't capture anything).
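
A small sketch of that behaviour, using a made-up sample string in place of the real HTML. The inner match succeeds and has no groups of its own, so $2 is undefined inside the block, while a copy taken beforehand survives.

```perl
use strict;
use warnings;

# Hypothetical sample string; the two groups stand in for tag and link text.
my $sample = 'tag:http://example.com/';
my @seen;

if ($sample =~ /^(\w+):(.*)$/) {
    my $saved = $2;                      # copy before any other match runs
    if ($2 =~ /^http/i) {                # succeeds, with no groups of its own
        push @seen, defined $2 ? "\$2 is '$2'" : '$2 is undef';
        push @seen, "saved copy is '$saved'";
    }
}
print "$_\n" for @seen;
```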
In general I prefer to assign captures to real variables right away:
while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
(notice also that captures can be nested, and DTRT).
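
One way to sketch the same idea as a complete loop, assuming a hypothetical link_texts helper and a deliberately simple pattern. The captures are copied in the first statement of the loop body, which keeps m//g in scalar context so it steps through one match per pass; a list assignment in the condition itself would evaluate m//g in list context, which returns every match at once.

```perl
use strict;
use warnings;

# Return the text of every <a> element whose link text starts with "http".
sub link_texts {
    my ($text) = @_;
    my @urls;
    # m//g in scalar context (the while condition) advances one match per pass.
    while ($text =~ m{(<a\b[^>]*>(.*?)</a>)}gsi) {
        my ($tag, $inner) = ($1, $2);    # nested groups: $1 is the whole element
        push @urls, $inner if $inner =~ /^http/i;   # safe: captures already copied
    }
    return @urls;
}

# Hypothetical input for illustration.
my $doc = q{<a href="http://a.example/">http://a.example/</a> <a href="/x">about</a>};
print "$_\n" for link_texts($doc);
```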
> > (I would suggest finding a highlighting editor. It makes this sort of
> > syntactic mistake much easier to spot.)
> I'm using Notepad++, and while it helps me catch opening and ending
> brackets, it didn't do a lot in recognizing syntax errors (at least, not
> that I know of). What editor do you recommend?
Personally I use Vim, which runs on Unix/Mac/Windows, but it takes a
little getting used to. The GUI version (which is probably what you
would use on Windows) has menus and mouse support as you would expect,
and there is an 'easy' mode which makes it behave more like a
Windows-style point-and-type editor, but I'm not sure I see the point of
using a programmer's editor if you're not going to learn to use it
properly.
>>>>> "JC" == Jason C <jwcarl...@gmail.com> writes:
JC> On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:
>> why do you think the # marks the start of a regex? only if you use m//
>> can you change the regex delim from /.
JC> Thanks to you, too, Uri. Like I replied to Ben a second ago, I
JC> thought that since you could replace the delimiter in s/// ad hoc,
JC> that you could in m//, too. Learn something new every day! :-)
but s/// has the s to mark the next char. =~ ## has no leading marker so it
would just be a comment. also using # for the delimiter is just a bad
idea as it confuses many readers.
>> finally,
>> why are you parsing out urls with a regex when there are modules that do
>> it correctly?
JC> Two reasons:
JC> 1. I've been working with regex for a year or two, and while it's
JC> by no means a strong point in my vocabulary (yet), I'm at least
JC> familiar enough with it to usually figure it out.
good that you are studying them but it still is the wrong tool for
this. learning when regexes aren't a good solution is part of learning
regexes.
JC> 2. I briefly looked for a module that would handle this correctly,
JC> but wasn't sure what to look for. And, I'm not sure that it
JC> warrants the including of a full module if it could potentially be
JC> done in a simple regex. If you can recommend a module that would
JC> be more stable and/or faster than what I'm doing, though, then I
JC> would definitely appreciate the reference!
JC> FWIW, this modification did work:
JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
it will fail if the opening quote is " and the string has a ' inside
it. perfectly legal html but you can't parse it that way.
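
The effect is easiest to see when the attribute value is captured. The sketch below adds a capture group for illustration, so it is not the pattern from the post, and the helper names href_loose and href_paired are made up. The first stops at whichever quote character comes next; the second insists on the same quote that opened the value.

```perl
use strict;
use warnings;

# Either quote may close the value, so a ' inside "..." ends it early.
sub href_loose {
    my ($tag) = @_;
    return $tag =~ m{href=["'](.*?)["']}si ? $1 : undef;
}

# The closing quote must be the same character as the opening one.
sub href_paired {
    my ($tag) = @_;
    return $tag =~ m{href=(["'])(.*?)\1}si ? $2 : undef;
}

# Hypothetical tag with an apostrophe inside a double-quoted value.
my $tag = q{<a href="/it's-here.html">x</a>};
print href_loose($tag), "\n";
print href_paired($tag), "\n";
```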
JC> Admittedly, I'm not sure why $2 is stored long enough for the if()
JC> statement, but inside of the if() statement it's empty. Storing
JC> them to a different variable worked for this purpose, but if
JC> there's a better way, I'm very much open to it.
you need to read more about regexes and the $1 stuff. they live until
the next regex is run (they are global).
> This almost certainly doesn't do what you think. If nothing else, you
> want to \Q $pattern.
Excellent point about \Q. What do you mean, though, that it doesn't do what I think?
> What are you trying to do here: strip tags?
Yes and no. I'm using a contenteditable instead of a textarea, and I've discovered that when someone copy-and-pastes an URL from Chrome or FF, it's automatically making the URL a link. Eg:
But of course, if you just type the address, then it doesn't. So on my end, I was using URI::Find to convert addresses to links, and ending up with a mess like:
So, my goal here is to remove the <a href> tag, but only if the linked text is an URL.
> Why not
> just do one s/// (or, you know, use a module)?
I had originally tried doing it with a simple s///, but couldn't figure out how to make it conditional. Like this:
$text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
if ($3 =~ /^http/i);
This worked correctly if I removed the if() statement. In testing, I changed the replacement to:
1 - $1, 2 - $2, 3 - $3
just to make sure that $3 did begin with http, and it did, so I couldn't figure out why the if() wasn't catching it unless it was dropping the $3 value before reaching the if().
> > Admittedly, I'm not sure why $2 is stored long enough for the if()
> > statement, but inside of the if() statement it's empty. Storing them to
> > a different variable worked for this purpose, but if there's a better
> > way, I'm very much open to it.
> The $N variables last until the next successful pattern match. In this
> case, the '$2 =~ /^http/i' in the condition of the if clears them all
> (even though it doesn't capture anything).
Ahh, that makes sense. I mistakenly thought that, since I wasn't assigning $N, then they would retain the previous value.
> In general I prefer to assign captures to real variables right away:
> while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
> (notice also that captures can be nested, and DTRT).
> Yes and no. I'm using a contenteditable instead of a textarea, and I've
> discovered that when someone copy-and-pastes an URL from Chrome or FF,
> it's automatically making the URL a link. Eg:
> But of course, if you just type the address, then it doesn't. So on my
> end, I was using URI::Find to convert addresses to links, and ending up
> with a mess like:
> So, my goal here is to remove the <a href> tag, but only if the linked
> text is an URL.
You're doing this backwards. You want to use HTML::Parser (or perhaps
HTML::TokeParser) to separate tags from text, and then just apply
URI::Find to 'text' sections which aren't already inside an <a> element.
> > Why not
> > just do one s/// (or, you know, use a module)?
> I had originally tried doing it with a simple s///, but couldn't figure
> out how to make it conditional. Like this:
> $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
> if ($3 =~ /^http/i);
> This worked correctly if I removed the if() statement. In testing, I
> changed the replacement to:
> 1 - $1, 2 - $2, 3 - $3
> just to make sure that $3 did begin with http, and it did, so I couldn't
> figure out why the if() wasn't catching it unless it was dropping the $3
> value before reaching the if().
...No. Maybe it would be clearer if you wrote it like this:
if ($3 =~ /^http/i) {
$text =~ s#...#...#gsi;
}
(which is *exactly* equivalent)? The 'if' condition executes first, so
$3 is something completely random from the previous pattern match; and
in any case, the if covers the *whole* s///, not just one iteration.
You need to push the condition inside the s///. The obvious way of doing
that is
s#<a ...>http:.*?</a>#$2#gsi;
though in more difficult cases you can use s///ge and put a ?: or
equivalent in the RHS.
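
A sketch of the s///ge form, assuming a hypothetical unlink_urls helper and a deliberately simple tag pattern (it does not try to cope with every kind of markup). The replacement side is Perl code: it keeps only the link text when that text looks like a URL, and otherwise puts the element back unchanged.

```perl
use strict;
use warnings;

# Strip the <a> wrapper only when the link text itself starts with a URL.
sub unlink_urls {
    my ($text) = @_;
    $text =~ s{(<a\b[^>]*>)(.*?)(</a>)}{
        my ($open, $inner, $close) = ($1, $2, $3);   # copy before matching again
        $inner =~ /^\s*https?:/i ? $inner : "$open$inner$close";
    }gsie;
    return $text;
}

# Hypothetical input for illustration.
my $in = '<a href="http://x.example/">http://x.example/</a> and <a href="/a">about</a>';
print unlink_urls($in), "\n";
```

Because the condition runs once per match inside the substitution, each link is decided on its own rather than the whole s/// being guarded by a single if.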
> In article <6d53b708-9e94-4bc9-8707-d9a130b2da2c@googlegroups.com>,
> Jason C <jwcarl...@gmail.com> wrote:
> > On Monday, September 24, 2012 3:44:44 PM UTC-4, Uri Guttman wrote:
> > > JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > > it will fail if the opening quote is " and the string has a ' inside
> > > it. perfectly legal html but you can't parse it that way.
> > I'll probably discard this idea and pursue a module, like you guys suggested.
> > But for the sake of learning...
> > I recognized this issue, too, which is why I was originally using [^\1], like
> > so:
> > (["'])*([^\1>]*)\1
> > I think it was you that pointed out that I can't negate a backreference like
> > that, though.
> > What would be the correct way to do this, if I can't negate a
> > backreference as a character class?
> Capture the leading delimiter and use a backreference that is not in a
> character class:
> while ($text =~ m{(<a[^>]* href=(["']).*?\2.*?>)(.*?)(</a>)}gsi) {
That's not the same in general: .*? doesn't *want* to match a quote, but
it will if necessary to make the whole match succeed. In this particular
case it doesn't change anything because there is nothing between the \2
and the next .*?, but for instance these two
m{<a href="[^"]*">}
m{<a href=".*?">}
don't match the same thing. The second will match q{<a href="foo"">},
because the .*? will match a quote if forced, but the first will not.
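
The two patterns can be checked side by side against that string. The variable names are made up for the example.

```perl
use strict;
use warnings;

# The malformed tag from the text above, with a stray extra quote.
my $odd = q{<a href="foo"">};

# [^"]* can never consume a quote, so the stray one makes the match fail.
my $class_matches = $odd =~ m{<a href="[^"]*">} ? 1 : 0;

# .*? prefers not to consume a quote, but will when nothing else works.
my $lazy_matches = $odd =~ m{<a href=".*?">} ? 1 : 0;

print "class: $class_matches, lazy: $lazy_matches\n";
```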
The correct way to match 'everything until $rx' is (?:(?!$rx).)*, so in
this case
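
As an illustration of that idiom only: the sketch below is not the pattern from the original post, and the helper names href_value and strict_anchor are assumptions. It applies (?:(?!$rx).)* with $rx being the backreference to the opening quote, so the value can contain the other quote character but never the one that opened it.

```perl
use strict;
use warnings;

# (["']) captures the opening quote; (?:(?!\1).)* consumes one character at a
# time, refusing any position where that same quote character begins.
sub href_value {
    my ($tag) = @_;
    return $tag =~ m{href=(["'])((?:(?!\1).)*)\1}si ? $2 : undef;
}

# Same idea applied to a whole tag: a stray repeated quote makes it fail,
# because the value is never allowed to swallow its own delimiter.
sub strict_anchor {
    my ($tag) = @_;
    return $tag =~ m{<a href=(["'])(?:(?!\1).)*\1>}si ? 1 : 0;
}

# Hypothetical tag with an apostrophe inside a double-quoted value.
print href_value(q{<a href="/it's-here.html">}), "\n";
```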
In comp.lang.perl.misc, Jason C <jwcarl...@gmail.com> wrote:
> 2. I briefly looked for a module that would handle this correctly, but
> wasn't sure what to look for. And, I'm not sure that it warrants the
> including of a full module if it could potentially be done in a simple
> regex. If you can recommend a module that would be more stable and/or
> faster than what I'm doing, though, then I would definitely appreciate
> the reference!
Do you want to deal with human generated HTML? You'll find that a
"simple" regex will fail you.
:r! cat $PHTML/some.links.html
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>linky</title>
</head>
<body>
<h1>linky</h1>
<ul>
<li><a href = http://www.google.com/ > Space no quotes </a>
(this link gives validation errors)</li>
<li><a href = 'http://www.google.com/'> Space single quotes </a></li>
<li><a href='http://www.google.com/'> End space single quotes </a ></li>
<li><a
href
=
'http://www.google.com/'