RegExp problems


jorr...@gmail.com

Mar 7, 2016, 8:44:08 PM
to Django users
I'm trying to replace [URL]www.link.com[/URL] with HTML with this regexp:

topic.text = re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', topic.text, flags=re.I)

But it's giving me the following problems:
  1. The $2 capture group is only substituted once, so I get
    <a href="www.link.com">$2</a>
    instead of
    <a href="www.link.com">www.link.com</a>
  2. Only the first [URL] is matched. Everything after the first [/URL] is simply deleted...
I hope someone can help me with this. I'm using Python 2.7 if it makes a difference.

Michal Petrucha

Mar 8, 2016, 3:41:11 AM
to django...@googlegroups.com
On Mon, Mar 07, 2016 at 05:44:08PM -0800, jorr...@gmail.com wrote:
> I'm trying to replace *[URL]www.link.com[/URL]* with HTML with this regexp:
>
> topic.text = re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', topic.text, flags=re.I)
>
> But it's giving me the following problems:
>
> 1. The $2 capture group is only able to be repeated once, so I get
> <a href="www.link.com">$2</a>
> instead of
> <a href="www.link.com">www.link.com</a>

I have my doubts – if you use the standard Python re library, then the
way to refer to captured groups is "\1", "\2", etc., not "$1". When I
try the code you posted above, I get the following result (i.e., not
even the first occurrence of "$2" gets substituted)::

>>> re.sub("(\[URL\])(.*)(\[\/URL\])", '<a href="$2">$2</a>', '[URL]www.link.com[/URL]', flags=re.I)
'<a href="$2">$2</a>'
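
As a minimal sketch of the group-reference syntax (the sample text is just the link from the question, and the variable names are only for illustration), the standard re module leaves "$2" untouched, while "\2" and the more explicit "\g<2>" both refer to the second group:

```python
import re

text = "[URL]www.link.com[/URL]"
pattern = r"(\[URL\])(.*)(\[/URL\])"

# "$2" has no special meaning in Python's re module, so it is copied literally.
dollar = re.sub(pattern, '<a href="$2">$2</a>', text, flags=re.I)

# "\2" refers to the second capture group.
numbered = re.sub(pattern, r'<a href="\2">\2</a>', text, flags=re.I)

# "\g<2>" is the explicit form of the same reference; it avoids ambiguity
# when the reference is directly followed by another digit.
explicit = re.sub(pattern, r'<a href="\g<2>">\g<2></a>', text, flags=re.I)

print(dollar)
print(numbered)
print(explicit)
```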

In order to make the substitution work for a single occurrence of
[URL]...[/URL], you can use the following, which uses "\2". Also, when
writing regular expressions, or other strings that are supposed to
contain the backslash character, it is a good idea to write them as
raw string literals, i.e. prefix them with an "r", which I've done
below. That way, Python won't try to interpret the backslashes as
escape sequences; otherwise, "\2" would become a single character with
an ASCII value of 2::

>>> re.sub(r"(\[URL\])(.*)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link.com[/URL]', flags=re.I)
'<a href="www.link.com">www.link.com</a>'
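
To see what the raw-string prefix changes, here is a small sketch (variable names are only for illustration) comparing the two spellings of "\2", and what happens when the replacement is written without the "r":

```python
import re

# In a normal string literal, "\2" is an octal escape: one character, code 2.
plain = "\2"
# In a raw string literal the backslash is kept: two characters, "\" and "2".
raw = r"\2"

print(len(plain), ord(plain))
print(len(raw))

text = "[URL]www.link.com[/URL]"
pattern = r"(\[URL\])(.*)(\[/URL\])"

# Raw replacement: re.sub sees a backslash followed by 2, i.e. a group reference.
with_raw = re.sub(pattern, r'<a href="\2">\2</a>', text, flags=re.I)

# Non-raw replacement: re.sub only sees the control character chr(2),
# which it inserts literally.
without_raw = re.sub(pattern, '<a href="\2">\2</a>', text, flags=re.I)

print(repr(with_raw))
print(repr(without_raw))
```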

> 2. Only the first *[URL]* is matched. Everything after the first *[/URL]*
> is simply deleted...

The solution above gets you halfway there. re.sub replaces all
matches by default; the problem here is that the "(.*)" part of your
regex matches everything between the first "[URL]" and the last
"[/URL]"::

>>> re.sub(r"(\[URL\])(.*)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com[/URL]', flags=re.I)
'<a href="www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com">www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com</a>'

The reason is that the asterisk operator in a regex is greedy, which
means a ".*" will try to match as much as possible. When you use the
non-greedy version of the operator (which you get by putting a
question mark after the asterisk), you get the result you want::

>>> re.sub(r"(\[URL\])(.*?)(\[\/URL\])", r'<a href="\2">\2</a>', '[URL]www.link1.com[/URL][URL]www.link2.com[/URL][URL]www.link3.com[/URL]', flags=re.I)
'<a href="www.link1.com">www.link1.com</a><a href="www.link2.com">www.link2.com</a><a href="www.link3.com">www.link3.com</a>'
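
The difference is perhaps easier to see with re.findall, which returns only what the group captured. This is just a sketch with a shortened two-link sample string and a single capture group:

```python
import re

text = "[URL]www.link1.com[/URL][URL]www.link2.com[/URL]"

# Greedy: ".*" runs to the last "[/URL]", so there is a single long match.
greedy = re.findall(r"\[URL\](.*)\[/URL\]", text, flags=re.I)

# Non-greedy: ".*?" stops at the nearest "[/URL]", giving one match per link.
lazy = re.findall(r"\[URL\](.*?)\[/URL\]", text, flags=re.I)

print(greedy)
print(lazy)
```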


You can read an explanation of the difference between greedy and
non-greedy regular expressions in the Python docs:
https://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy

Good luck,

Michal

>
> I hope someone can help me with this. I'm using Python 2.7 if it makes a
> difference.
>


jorr...@gmail.com

Mar 9, 2016, 8:30:07 PM
to Django users
Your improvements work great, thank you. And thank you for the very detailed explanations!