Remove HTML tags (except anchor tag) from a string using regular expressions

Nico Grubert

Feb 1, 2005, 7:03:31 AM
to pytho...@python.org
Hello,

I want to remove all html tags from a string "content" except <a
...>xxx</a>.

My script reads like this:

###
import re
content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
###

It works fine in that it removes all HTML tags from "content".
Unfortunately, it also removes <a ...>xxx</a> occurrences.
Any idea how to modify this so that it removes all HTML tags except <a ...>xxx</a>?

Thanks in advance,
Nico

Anand

Feb 1, 2005, 7:43:11 AM
How about...

import re
content = re.sub('<([^!(a>)]([^(/a>)]|\n)*)>', '', content)
Seems to work for me.

HTH

-Anand

Anand

Feb 1, 2005, 8:27:05 AM
I meant
content = re.sub ('<[^!(a>)]([^>]|\n)*[^!(/a)]>', '', content)

Sorry for the mistake.
However, this also seems to leave tags like <b>, <p>, etc. in the output.

-Anand
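
A likely reason the corrected pattern leaves <b> and <p> behind: inside square brackets, "(" and ")" are literal characters, so [^!(a>)] is a single-character class rather than a group meaning "not a>". The pattern therefore needs at least two characters between "<" and ">", one for each class. The sketch below assumes Python 3; the sample strings and the helper name strip are made up for illustration.

```python
import re

# Inside [...], "(" and ")" are ordinary characters, not grouping.
# This class means: any one character except ! ( a > )
first_char = re.compile(r'[^!(a>)]')
print(bool(first_char.fullmatch('b')))   # True
print(bool(first_char.fullmatch('a')))   # False
print(bool(first_char.fullmatch('(')))   # False

# The corrected pattern from the post above, unchanged.
corrected = re.compile(r'<[^!(a>)]([^>]|\n)*[^!(/a)]>')

def strip(content):
    """Hypothetical helper: apply the corrected pattern to a string."""
    return corrected.sub('', content)

# <b> has only one character between < and >, so it is not matched,
# while </b>, <div> and </div> are removed.
print(strip('<b>bold</b> <div>text</div>'))   # <b>bold text

# Anchor tags survive, but so would any tag whose name starts with "a".
print(strip('<a href="x">link</a>'))          # <a href="x">link</a>
```

The same class also excludes every tag whose name begins with "a" (for example <abbr>), so it keeps more than just anchors.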

Max M

Feb 1, 2005, 8:59:50 AM
to Nico Grubert
Nico Grubert wrote:

If it's not to learn, and you simply want it to work, try out this library:

http://zope.org/Members/chrisw/StripOGram/readme


--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

John Lenton

Feb 1, 2005, 11:03:43 AM
to Nico Grubert, pytho...@python.org
On Tue, Feb 01, 2005 at 01:03:31PM +0100, Nico Grubert wrote:
> Hello,
>
> I want to remove all html tags from a string "content" except <a
> ...>xxx</a>.
>
> My script reads like this:
>
> ###
> import re
> content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
> ###
>
> It works fine. It removes all html tags from "content".
> Unfortunately, this also removes <a ...>xxx</a> occurancies.
> Any idea, how to modify this to remove all html tags except <a ...>xxx</a>?

Not sure what the outer parentheses are there for, i.e. as far as I can see

<([^!>]([^>]|\n)*)>

is the same as

<[^!>](?:[^>]|\n)*>

for doing a re.sub; the grouping parentheses are only needed if you
actually need the groups later on.
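
A quick way to check that equivalence, as a sketch assuming Python 3. The sample string is invented, and the third pattern is an editorial simplification rather than something proposed in the thread: a negated class such as [^>] already matches a newline, so the "|\n" branch is redundant.

```python
import re

# The two spellings compared above, plus a shorter variant (an assumption
# added here): "[^>]" already matches "\n", so "|\n" adds nothing.
CAPTURING = r'<([^!>]([^>]|\n)*)>'
NON_CAPTURING = r'<[^!>](?:[^>]|\n)*>'
SIMPLIFIED = r'<[^!>][^>]*>'

# Invented sample: ordinary tags, a tag split across lines, and a comment.
sample = '<p class="x">Hello,\n<b>world</b></p><img\nsrc="x.png"><!-- note -->'

results = [re.sub(p, '', sample) for p in (CAPTURING, NON_CAPTURING, SIMPLIFIED)]

# All three give the same text; the comment survives because "!" is
# excluded as the first character after "<".
print(results[0] == results[1] == results[2])   # True
print(repr(results[0]))                         # 'Hello,\nworld<!-- note -->'
```

The groups would only matter if the replacement string referred back to them, for example with \1.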

Try this:

<(?!(?:a\s|/a|!))[^>]*>
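
The negative lookahead can be tried out as follows. This is a minimal sketch assuming Python 3; the function name, the sample HTML, and the VARIANT pattern are illustrative assumptions and not part of the suggestion above.

```python
import re

# Pattern as suggested above: match any tag unless what follows "<" is
# "a" plus whitespace, "/a", or "!".
TAG_EXCEPT_ANCHOR = re.compile(r'<(?!(?:a\s|/a|!))[^>]*>')

def strip_tags_except_anchor(content):
    """Hypothetical helper: remove every tag the pattern matches."""
    return TAG_EXCEPT_ANCHOR.sub('', content)

sample = '<p>Visit <a href="http://example.com">this <b>site</b></a> now.</p>'
print(strip_tags_except_anchor(sample))
# Visit <a href="http://example.com">this site</a> now.

# Edge case: a bare <a> (no attributes) is not followed by whitespace,
# so the opening tag is stripped while the closing tag is kept.
print(strip_tags_except_anchor('<a>x</a>'))   # x</a>

# Editorial variant (an assumption, not from the thread): a word boundary
# keeps bare <a>, drops <abbr>, and IGNORECASE also accepts <A HREF=...>.
VARIANT = re.compile(r'<(?!/?a\b|!)[^>]*>', re.IGNORECASE)
print(VARIANT.sub('', '<A HREF="x"><abbr>t</abbr></A>'))   # <A HREF="x">t</A>
```

Note that the original pattern is case-sensitive and treats any closing tag starting with "/a", such as </abbr>, as an anchor; whether that matters depends on the input.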

--
John Lenton (jo...@grulic.org.ar) -- Random fortune:
Slurm, n.:
The slime that accumulates on the underside of a soap bar when
it sits in the dish too long.
-- Rich Hall, "Sniglets"

Gabriel Cooper

Feb 2, 2005, 1:43:52 PM
to Max M, pytho...@python.org

Max M wrote:

> If it's not to learn, and you simply want it to work, try out this
> library:
>
> http://zope.org/Members/chrisw/StripOGram/readme
>
>

>>> stripogram.html2safehtml('''first > last''',valid_tags=('i','a','br'))
'first > last'
>>> stripogram.html2safehtml('''first < last''',valid_tags=('i','a','br'))
'first first '


Keeping in mind that bare ">" and "<" are invalid HTML (they should be &gt;
and &lt;), why did it leave the greater-than sign, and why are there two "first"s?
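
This does not explain what stripogram does internally, but for comparison, the regular expression suggested earlier in the thread can be run on the same inputs. A sketch assuming Python 3; the third input string is invented to show where a regex-based approach becomes ambiguous.

```python
import re

# Lookahead pattern from earlier in the thread, reused unchanged.
TAG_EXCEPT_ANCHOR = re.compile(r'<(?!(?:a\s|/a|!))[^>]*>')

# A lone ">" or a lone "<" never completes a "<...>" match, so both
# strings come back untouched.
print(TAG_EXCEPT_ANCHOR.sub('', 'first > last'))   # first > last
print(TAG_EXCEPT_ANCHOR.sub('', 'first < last'))   # first < last

# With both characters present, everything from "<" to ">" is treated as
# a tag and removed, even though it is plain text.
print(repr(TAG_EXCEPT_ANCHOR.sub('', 'if x < y and y > z')))   # 'if x  z'
```

So unescaped "<" and ">" are a problem for the regex as well; it only sidesteps them when they do not appear as a pair.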
