I want to remove all HTML tags from a string "content" except <a ...>xxx</a>.
My script reads like this:
###
import re
content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
###
It works fine: it removes all HTML tags from "content".
Unfortunately, it also removes the <a ...>xxx</a> occurrences.
Any idea how to modify this so that it removes all HTML tags except <a ...>xxx</a>?
Thanks in advance,
Nico
import re
content = re.sub('<([^!(a>)]([^(/a>)]|\n)*)>', '', content)
Seems to work for me.
HTH
-Anand
Sorry for the mistake.
However, this also seems to leave tags like <b>, <p>, etc. in the output.
-Anand
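A likely reason for the leftover tags: inside square brackets, parentheses do not group anything. `[^!(a>)]` simply means "any one character other than `!`, `(`, `a`, `>` or `)`", and `[^(/a>)]` likewise excludes the single characters `(`, `/`, `a`, `>` and `)`. So the pattern skips every tag that has an `a` or a `/` anywhere after its first character, not just anchor tags. A small sketch of that behaviour (the sample string is made up for illustration):

```python
import re

# Pattern from the reply above, unchanged apart from the raw-string prefix.
ATTEMPT = re.compile(r'<([^!(a>)]([^(/a>)]|\n)*)>')

# Made-up sample: several of these tags contain the letter "a".
sample = '<p class="intro">Hi <span>there</span>, <a href="x.html">link</a></p>'
result = ATTEMPT.sub('', sample)
print(result)

# A tag with no "a" or "/" after its first character is still removed.
plain = ATTEMPT.sub('', '<i>x</i>')
print(plain)
```

Here only the closing `</p>` is stripped from the first string; `<p class="intro">`, `<span>` and `</span>` survive along with the anchor tags, because each contains an `a`.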
If this isn't a learning exercise and you simply want it to work, try out this library:
http://zope.org/Members/chrisw/StripOGram/readme
--
hilsen/regards Max M, Denmark
http://www.mxm.dk/
IT's Mad Science
Not sure what the outer parentheses are there for, i.e. afaics
<([^!>]([^>]|\n)*)>
is the same as
<[^!>](?:[^>]|\n)*>
for doing a re.sub; the grouping parentheses are only needed if you
actually need the groups later on.
Try this:
<(?!(?:a\s|/a|!))[^>]*>
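A minimal sketch of how that pattern behaves with re.sub (the sample strings and the names TAG_EXCEPT_A and TAG_EXCEPT_A_STRICT are made up for illustration). The negative lookahead `(?!...)` refuses to match a tag whose text after `<` starts with `a` plus whitespace, `/a`, or `!`:

```python
import re

# Pattern suggested above: strip every tag except <a ...>, </a> and <!...>.
TAG_EXCEPT_A = re.compile(r'<(?!(?:a\s|/a|!))[^>]*>')

sample = '<p>See <a href="http://example.com">this <b>link</b></a>.</p><!-- note -->'
stripped = TAG_EXCEPT_A.sub('', sample)
print(stripped)

# Edge cases: a bare <a> has no whitespace after the "a", so it is removed,
# while </address> starts with "/a" and is kept.
edge = TAG_EXCEPT_A.sub('', '<a>x</a><address>y</address>')
print(edge)

# Hypothetical tighter variant (an assumption, not from the thread): require
# whitespace or ">" right after the "a", and ignore case.
TAG_EXCEPT_A_STRICT = re.compile(r'<(?!/?a[\s>]|!)[^>]*>', re.IGNORECASE)
strict = TAG_EXCEPT_A_STRICT.sub('', '<a>x</a><address>y</address>')
print(strict)
```

Either way this is still regex-based stripping, so it assumes reasonably well-formed tags; a `>` inside an attribute value or a comment would end the match early.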
--
John Lenton (jo...@grulic.org.ar) -- Random fortune:
Slurm, n.:
The slime that accumulates on the underside of a soap bar when
it sits in the dish too long.
-- Rich Hall, "Sniglets"
> If it's not to learn, and you simply want it to work, try out this
> library:
>
> http://zope.org/Members/chrisw/StripOGram/readme
>
>
>>> stripogram.html2safehtml('''first > last''',valid_tags=('i','a','br'))
'first > last'
>>> stripogram.html2safehtml('''first < last''',valid_tags=('i','a','br'))
'first first '
Keeping in mind that bare ">" and "<" are invalid HTML (they should be &gt;
and &lt;), why did it leave the greater-than sign, and why are there two "first"s?