Remove HTML tags (except anchor tag) from a string using regular expressions

Nico Grubert

Feb 1, 2005, 07:03:31
to pytho...@python.org
Hello,

I want to remove all html tags from a string "content" except <a
...>xxx</a>.

My script reads like this:

###
import re
content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
###

It works fine. It removes all html tags from "content".
Unfortunately, this also removes <a ...>xxx</a> occurrences.
Any idea how to modify this to remove all html tags except <a ...>xxx</a>?

Thanks in advance,
Nico

Anand

Feb 1, 2005, 07:43:11
How about...

import re
content = re.sub('<([^!(a>)]([^(/a>)]|\n)*)>', '', content)
Seems to work for me.

HTH

-Anand

Anand

Feb 1, 2005, 08:27:05
I meant
content = re.sub ('<[^!(a>)]([^>]|\n)*[^!(/a)]>', '', content)

Sorry for the mistake.
However, this also seems to leave tags like <b>, <p>, etc. in the
output.
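
A rough sketch of why the bracketed groups in this pattern do not act as "not the tag a": inside [...], parentheses are ordinary characters, so [^!(a>)] means "one character that is none of !, (, a, >, )" rather than "not the sequence a>". The sample string is made up for illustration and is not from the thread.

```python
import re

# The pattern from the post above, copied unchanged.
pattern = re.compile(r'<[^!(a>)]([^>]|\n)*[^!(/a)]>')

# Hypothetical sample input (an assumption, not from the thread).
sample = ('<p>See <b>this</b> <a href="x.html">link</a> '
          'and <div class="c">box</div></p>')

result = pattern.sub('', sample)
print(result)

# The pattern needs at least two characters between "<" and ">",
# so one-letter opening tags such as <p> and <b> are never matched.
print(pattern.search('<b>') is None)
```

On this sample the anchor survives, but so do the opening <p> and <b>, while their closing tags are removed.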

-Anand

Max M

Feb 1, 2005, 08:59:50
to Nico Grubert
Nico Grubert wrote:

If it's not to learn, and you simply want it to work, try out this library:

http://zope.org/Members/chrisw/StripOGram/readme


--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

John Lenton

Feb 1, 2005, 11:03:43
to Nico Grubert, pytho...@python.org
On Tue, Feb 01, 2005 at 01:03:31PM +0100, Nico Grubert wrote:
> Hello,
>
> I want to remove all html tags from a string "content" except <a
> ...>xxx</a>.
>
> My script reads like this:
>
> ###
> import re
> content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
> ###
>
> It works fine. It removes all html tags from "content".
> Unfortunately, this also removes <a ...>xxx</a> occurrences.
> Any idea how to modify this to remove all html tags except <a ...>xxx</a>?

not sure what the outer parentheses are there for, i.e. afaics

<([^!>]([^>]|\n)*)>

is the same as

<[^!>](?:[^>]|\n)*>

for doing a re.sub; the grouping parentheses are only needed if you
actually need the groups later on.
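
A small check of that equivalence, as a sketch: both patterns should give the same re.sub result, and only the first one creates capture groups. The sample string is an assumption made up for this example.

```python
import re

capturing = r'<([^!>]([^>]|\n)*)>'
non_capturing = r'<[^!>](?:[^>]|\n)*>'

# Hypothetical sample input, including a newline and an HTML comment.
sample = ('<p>Hello <b>world</b>,\n'
          'see <a href="x.html">this</a><!-- note --></p>')

out_capturing = re.sub(capturing, '', sample)
out_non_capturing = re.sub(non_capturing, '', sample)

# Same replacement result either way.
print(out_capturing == out_non_capturing)

# Number of capture groups each compiled pattern defines.
print(re.compile(capturing).groups, re.compile(non_capturing).groups)
```

Both variants skip anything starting with "<!", so the comment stays in the output.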

Try this:

<(?!(?:a\s|/a|!))[^>]*>
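
A minimal sketch of that pattern used with re.sub, assuming a made-up sample string. The negative lookahead (?!...) rejects any "<" that is followed by "a" plus whitespace, by "/a", or by "!", so those tags are left alone. The second pattern, named word_boundary_variant here, is a hypothetical variation and not part of the suggestion above: as written, the original also removes a bare <a> (no attributes) and keeps closing tags such as </abbr>, and a word boundary is one way to tighten that.

```python
import re

# The suggested pattern, copied unchanged.
strip_non_anchor = re.compile(r'<(?!(?:a\s|/a|!))[^>]*>')

# Hypothetical sample input (an assumption, not from the thread).
sample = ('<p>Visit <a href="http://www.python.org/">python.org</a> '
          'for <b>docs</b>.</p>')
cleaned = strip_non_anchor.sub('', sample)
print(cleaned)

# Edge case: "/a" is only a prefix test, so </abbr> is kept as well.
edge = strip_non_anchor.sub('', '<abbr>x</abbr>')
print(edge)

# Hypothetical variation: \b requires the tag name to be exactly "a".
word_boundary_variant = re.compile(r'<(?!/?a\b|!)[^>]*>')
tightened = word_boundary_variant.sub('', '<abbr>x</abbr> <a>y</a>')
print(tightened)
```

Like any regex-based approach, this treats every "<...>" run as a tag, so it is only suitable for input where stray angle brackets are not a concern.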

--
John Lenton (jo...@grulic.org.ar) -- Random fortune:
Slurm, n.:
The slime that accumulates on the underside of a soap bar when
it sits in the dish too long.
-- Rich Hall, "Sniglets"


Gabriel Cooper

Feb 2, 2005, 13:43:52
to Max M, pytho...@python.org

Max M wrote:

> If it's not to learn, and you simply want it to work, try out this
> library:
>
> http://zope.org/Members/chrisw/StripOGram/readme
>
>

>>> stripogram.html2safehtml('''first > last''',valid_tags=('i','a','br'))
'first > last'
>>> stripogram.html2safehtml('''first < last''',valid_tags=('i','a','br'))
'first first '


Keeping in mind that bare ">" and "<" are invalid HTML (they should be &gt;
and &lt;), why did it leave the greater-than sign, and why are there two "first"s?
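
For reference, the escaped forms mentioned above can be produced with the standard library's html.escape. This sketch only illustrates the entities; it does not explain or reproduce stripogram's behaviour.

```python
import html

# Bare angle brackets in text content are replaced by their entities.
escaped_lt = html.escape('first < last')
escaped_gt = html.escape('first > last')

print(escaped_lt)
print(escaped_gt)
```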
