Quoth Jason C <jwca...@gmail.com>:
> I'm struggling with what I thought was a simple thing, and I'm hoping
> you guys can help.
>
> I have a string that may contain a ", ', or neither. So, I wrote this in
> the regex:
>
> ["|']*
You don't use | like that inside a character class. Most of the normal
regex special characters lose their special meanings there, and the class
matches any one of the characters listed, so that class will match any
of ", ', or |.
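A quick way to convince yourself of that, as a minimal sketch (the sub
name in_class is made up for illustration):

```perl
use strict;
use warnings;

# Inside [...] the | is an ordinary character, not alternation,
# so this class accepts ", ' and a literal pipe.
sub in_class {
    my ($char) = @_;
    return $char =~ /^["|']$/ ? 1 : 0;
}

for my $char ('"', "'", '|', 'x') {
    print "$char: ", (in_class($char) ? "matches" : "does not match"), "\n";
}
```

The pipe is reported as a match along with the two quotes, which is
almost certainly not what was intended.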
You also probably don't want that *. AFAICS you want to match exactly
one quote, of either type, so you just want ["'] (or ["']? if the quote
may be missing altogether).
(If you wanted to get fancy you could insist on matching quotes using \1
backreferences, but you may not think it's worth it.)
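For what it's worth, the backreference version might look something
like this. It is only a sketch: quoted_value is a hypothetical helper,
and it handles a single quoted value rather than the whole <img> tag.

```perl
use strict;
use warnings;

# Capture the opening quote in group 1, then insist on the same
# character again with \1. The text in between ends up in $2.
sub quoted_value {
    my ($str) = @_;
    return $str =~ /(["'])(.*?)\1/s ? $2 : undef;
}

print quoted_value(q{href='a.jpg'}), "\n";      # a.jpg
print quoted_value(q{src="it's.jpg"}), "\n";    # it's.jpg
```

A value opened with " and closed with ' (or the other way round) makes
the sub return undef instead of matching.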
> But this doesn't match anything.
>
> Here's the complete code:
>
> # $text comes from a form, so this is just a sample
> $text = <<EOF;
> <img src="<a href='http://www.example.com/whatever.jpg'
> target='_new'>http://www.example.com/whatever.jpg</a>"
> width="300" height="300" border="0">
> EOF
>
> # Regex; line breaks added here for the sake of reading
If you use /x you can do this in your real source too, though you will
need to remember to escape spaces when you do want them to match
literally.
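A small sketch of what /x does to spaces (the sample string and the
hash keys are just for illustration):

```perl
use strict;
use warnings;

my $str = "a href";

my %result = (
    # Under /x the unescaped space is only layout: this is really /ahref/.
    layout_only => ($str =~ /a href/x   ? 1 : 0),
    # A backslashed space is matched literally.
    backslash   => ($str =~ /a\ href/x  ? 1 : 0),
    # So is a space inside a character class.
    char_class  => ($str =~ /a[ ]href/x ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

Only the first pattern fails to match, because its space never makes it
into the regex.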
> $text =~ s/<img(.*?)src=
> ["|']*\s*<a.*? href=
> ["|']*\s*(.*?)
> ["|']*.*?>(.*?)<\/a>
> ["|']*(.*?)>
> /<img src="$2"$1$4>/gsi;
>
> If I change ["|']* to whatever I have hard coded, then it works fine, so
> I know the issue is with that pattern. So how do I correctly match them?
When I try this (after having removed the line breaks) it does match
*something*, just not what you wanted it to match. $text ends up as
<img src=""
width="300" height="300" border="0">
which is happening because the second uncaptured .*? is picking up all
the text you wanted to get in $2. Everything between the 'href=' and the
'>' is quantified with * (or *?), so it can all match nothing if it wants
to. The .*? in $2 wants to match as little as possible, and so does the
one before the >, and when two sections of the pattern are 'fighting'
over what to match the one earlier in the pattern wins. Here that means
$2 gets its way and stays empty, leaving the later .*? to swallow the URL.
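You can see the effect in isolation with a cut-down version of the
pattern. This is just a sketch; lazy_capture is a made-up name and the
regex is only the middle part of the original, with the stray | removed.

```perl
use strict;
use warnings;

# Same shape as the href part of the original pattern: everything
# between href= and > is optional, so the captured .*? settles for
# nothing and the later, uncaptured .*? takes the URL instead.
sub lazy_capture {
    my ($tag) = @_;
    return $tag =~ /href=["']*\s*(.*?)["']*.*?>/ ? $1 : undef;
}

my $tag = q{<a href='http://www.example.com/whatever.jpg' target='_new'>};
print "captured: [", lazy_capture($tag), "]\n";   # captured: []
```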
In general, .*? is not a panacea in situations like this. You would
probably be better off using negated character classes, something like
    $text =~ s{
        <img ([^>]*) src=["'] \s*
        <a [^>]* [ ] href=["'] \s* ([^'"]*) ["'] [^>]* >
        ([^<]*) </a>
        ['"] ([^>]*) >
    }{<img src="$2"$1$4>}gsix;
(I've used /x to format it decently, which means the literal space needs
to be escaped somehow. I usually prefer putting it in a character class
to backslashing it, though either would work.)
Here each negated character class stops the match running off past the
next thing, so for instance $2 can't run past the end of the quotes.
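Put together as something you can run, it might look like this. It is a
sketch under assumptions: fix_img is a hypothetical wrapper, and the
line breaks in the sample are a guess at what the form data looks like.

```perl
use strict;
use warnings;

# Sample input; the exact line breaks are an assumption.
my $text = <<'EOF';
<img src="<a href='http://www.example.com/whatever.jpg'
target='_new'>http://www.example.com/whatever.jpg</a>"
width="300" height="300" border="0">
EOF

# Apply the negated-character-class substitution and return the result.
sub fix_img {
    my ($html) = @_;
    $html =~ s{
        <img ([^>]*) src=["'] \s*
        <a [^>]* [ ] href=["'] \s* ([^'"]*) ["'] [^>]* >
        ([^<]*) </a>
        ['"] ([^>]*) >
    }{<img src="$2"$1$4>}gsix;
    return $html;
}

print fix_img($text);
```

With that input the src attribute comes out as the bare URL, and the
width, height and border attributes are carried over from $4.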
This isn't perfect: it will not match at all if there are other tags
inside the <a>, and it's not terribly easy to modify it so it will.
(While it is possible to correctly match arbitrary HTML with Perl
regexes, it isn't entirely straightforward.)
Ben