On Mon, Mar 07, 2016 at 05:44:08PM -0800, jorr...@gmail.com wrote:
> I'm trying to replace *[URL]www.link.com[/URL]* with HTML with this regexp:
>
> topic.text = re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', topic.text, flags=re.I)
>
> But it's giving me the following problems:
>
> 1. The $2 capture group is only able to be repeated once, so I get
I have my doubts – if you use the standard Python re library, then the
way to refer to captured groups is "\1", "\2", etc., not "$1". When I
try the code you posted above, I get the following result (i.e., not
even the first occurrence of "$2" gets substituted)::
    >>> re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', '[URL]www.link.com[/URL]', flags=re.I)
    '<a href="$2">$2</a>'
To make the substitution work for a single occurrence of
[URL]...[/URL], use "\2" in the replacement string. Also, when writing
regular expressions or other strings that are supposed to contain the
backslash character, it is a good idea to write them as raw string
literals, i.e. prefix them with an "r", as I've done below. That way,
Python won't interpret the backslashes as escape sequences; otherwise,
"\2" would become a single character with an ASCII value of 2::
    >>> re.sub(r"(\[URL\])(.*)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link.com[/URL]', flags=re.I)
    '<a href="www.link.com">www.link.com</a>'
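To illustrate the raw-string point, here is a minimal sketch. The variable names (plain, raw, broken, working) are only for illustration and are not from your code; it compares what Python stores for "\2" with and without the "r" prefix, and what re.sub does with each:

```python
import re

# Without the r prefix, Python reads "\2" as an octal escape and stores
# a single control character (code point 2).
plain = '\2'
# With the r prefix, the backslash and the digit are kept as two characters,
# which is what re.sub needs to see in order to recognise a group reference.
raw = r'\2'

pattern = r"(\[URL\])(.*)(\[/URL\])"
text = '[URL]www.link.com[/URL]'

# The replacement contains the control character, not a reference to group 2.
broken = re.sub(pattern, '<a href="\2">\2</a>', text, flags=re.I)
# The replacement contains a real "\2" backreference.
working = re.sub(pattern, r'<a href="\2">\2</a>', text, flags=re.I)

print(repr(plain))
print(repr(raw))
print(repr(broken))
print(repr(working))
```

The first substitution silently inserts the control character, which is easy to miss when the result is rendered in a browser, so printing repr() of the result is a handy check.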
> 2. Only the first *[URL]* is matched. Everything after the first *[/URL]*
> is simply deleted...
The solution above gets you halfway there. re.sub replaces all
matches by default; the problem here is that the "(.*)" part of your
regex matches everything between the first "[URL]" and the last
"[/URL]"::
    >>> re.sub(r"(\[URL\])(.*)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com[/URL]', flags=re.I)
    '<a href="www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com">www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com</a>'
The reason is that the asterisk operator in a regex is greedy, which
means a ".*" will try to match as much as possible. When you use the
non-greedy version of the operator (which you get by putting a
question mark after the asterisk), you get the result you want::
    >>> re.sub(r"(\[URL\])(.*?)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com[/URL]', flags=re.I)
    '<a href="www.link1.com">www.link1.com</a><a href="www.link2.com">www.link2.com</a><a href="www.link3.com">www.link3.com</a>'
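If you want to see the greedy and non-greedy behaviour in isolation, re.findall makes the difference easy to inspect. This is only a sketch: the sample text and the variable names are made up for illustration, and I have dropped the capture groups around the tags since only the URL is needed here:

```python
import re

text = '[URL]www.link1.com[/URL][URL]www.link2.com[/URL]'

# Greedy: ".*" runs to the end of the string and backtracks to the LAST
# "[/URL]", so there is one match spanning both links.
greedy = re.findall(r"\[URL\](.*)\[/URL\]", text, flags=re.I)

# Non-greedy: ".*?" stops at the FIRST "[/URL]" it can reach, so each
# [URL]...[/URL] pair is matched separately.
lazy = re.findall(r"\[URL\](.*?)\[/URL\]", text, flags=re.I)

print(greedy)
print(lazy)
```

The greedy list has a single element containing the inner "[/URL][URL]" text, while the non-greedy list has one element per link.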
You can read an explanation of the difference between greedy and
non-greedy regular expressions in the Python docs:
https://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
Good luck,
Michal
>
> I hope someone can help me with this. I'm using Python 2.7 if it makes a
> difference.