Storing News articles - retaining some HTML tags

treelife

Jan 15, 2006, 4:38:15 PM

to Django users

Hi all,

I am creating a news database where some of the articles are uniquely
formatted, so I figured that the best thing to do is to retain some of
the HTML formatting in the odd article that needs it and generate the
standard tags (<p>, <h1>, etc.) for the rest.

I imagine that the standard tags can be generated with the linebreaks()
filter or even just Python string methods (string.split etc.), but I am
not sure what the best way to go about this is.
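
A minimal sketch of the string-method route, assuming blank lines separate paragraphs and single newlines are soft breaks. The function name plain_text_to_paragraphs and its escape flag are made up for illustration; this is not Django's own filter code.

```python
import html
import re


def plain_text_to_paragraphs(text, escape=True):
    """Wrap blank-line-separated blocks in <p> and turn single newlines into <br>.

    Hypothetical helper for illustration only. With escape=True any markup
    in the text is shown literally; pass escape=False for the odd article
    whose hand-written HTML should be kept (and trusted).
    """
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines end a paragraph.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    paragraphs = []
    for block in blocks:
        if escape:
            block = html.escape(block)
        paragraphs.append("<p>%s</p>" % block.replace("\n", "<br>"))
    return "\n\n".join(paragraphs)


print(plain_text_to_paragraphs("First line\nsecond line\n\nNext paragraph"))
```

With escape=False the text passes through untouched apart from the added <p> and <br> tags, so that mode only suits content whose author is trusted.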

Any input would be appreciated.

Sincerely

tonemcd

Jan 15, 2006, 6:22:31 PM

to Django users

If your articles have HTML in them, you'll need to be careful that no
'dangerous' HTML is included (javascript is the most common). A good
library is stripogram -
http://www.zope.org/Members/chrisw/StripOGram/readme

I think you are right that the only filter you need to worry about
after that is the linebreaks filter (no '()').

Cheers,
Tone

Simon Willison

Jan 19, 2006, 4:40:38 AM

to django...@googlegroups.com

On 15 Jan 2006, at 23:22, tonemcd wrote:

> If your articles have HTML in them, you'll need to be careful that no
> 'dangerous' HTML is included (javascript is the most common). A good
> library is stripogram -
> http://www.zope.org/Members/chrisw/StripOGram/readme

Stripogram is inadequate for protecting against XSS attacks. It
doesn't strip style="" attributes (which can contain executable code)
and has very simplistic code for filtering javascript: style links.
Here's their code for attribute filtering:

    if lower(k[0:2]) != 'on' and lower(v[0:10]) != 'javascript':
        self.result += ' %s="%s"' % (k, v)

And here are three ways off the top of my head to defeat that:

<a href=" javascript:alert('XSS')">Click me</a> (Note the leading space)

<a href="vbscript:alert('XSS')">Click me</a> (IE will run this)

<a href="java
script:alert('XSS')">Click me</a> (IE will run this too; it was part
of the MySpace worm: http://namb.la/popular/tech.html )
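
To make the bypasses concrete, here is a small sketch that re-creates the quoted condition as a standalone function and runs the three payloads through it. The function name naive_attribute_allowed is an assumption for illustration; only the condition itself comes from the quoted code.

```python
def naive_attribute_allowed(name, value):
    """Re-creation of the quoted check (hypothetical standalone form).

    Rejects attributes whose name starts with "on" and values whose first
    ten characters are "javascript"; everything else is let through.
    """
    return name[0:2].lower() != "on" and value[0:10].lower() != "javascript"


payloads = [
    " javascript:alert('XSS')",   # leading space shifts the prefix
    "vbscript:alert('XSS')",      # a different scripting scheme
    "java\nscript:alert('XSS')",  # newline embedded in the scheme
]

# The plain form is caught...
print(naive_attribute_allowed("href", "javascript:alert('XSS')"))

# ...but each variation passes the prefix test.
for payload in payloads:
    print(repr(payload), naive_attribute_allowed("href", payload))
```

The check compares a fixed ten-character prefix against one known-bad string, so any input that changes those ten characters gets through.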

Filtering unsafe HTML is a deceptively hard problem - you need to be
aware not just of the HTML spec but also of the full details of all
of the common implementations and their bugs. Since the most
widespread of these is closed source, good luck!

Definitely don't use stripogram though. It will give you nothing more
than a false sense of security. I'm going to submit these bugs to the
library author.

The best Python stripping code I've seen is in Mark Pilgrim's
feedparser. You might want to try extracting it.
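
For contrast with prefix blacklisting, here is a rough sketch of the allowlist idea: keep only named tags, attributes and URL schemes and drop everything else. It is not an extract of feedparser, every name in it (ALLOWED_TAGS, AllowlistSanitizer, sanitize and so on) is made up for illustration, and it has not been hardened against real browsers, so treat it as a teaching aid rather than something to deploy. It also makes no attempt to balance unclosed tags.

```python
import re
from html import escape
from html.parser import HTMLParser

# Illustrative allowlists; a real deployment would need to review these.
ALLOWED_TAGS = {"p", "br", "h1", "h2", "em", "strong", "a", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
VOID_TAGS = {"br"}


def is_safe_url(value):
    """Allow relative URLs and URLs whose scheme is in ALLOWED_SCHEMES."""
    # Drop whitespace and control characters before looking at the scheme,
    # so " javascript:" and "java\nscript:" are seen as "javascript:".
    cleaned = "".join(ch for ch in value if ch > " " and ch != "\x7f")
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    if ":" not in head:
        return True  # no scheme, treat as a relative URL
    return head.split(":", 1)[0].lower() in ALLOWED_SCHEMES


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # style, on* handlers, etc. are never emitted
            if name == "href" and not is_safe_url(value):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(sanitize('<p style="x" onclick="y">Hi <a href="vbscript:z">there</a></p>'))
```

The design choice is that anything not explicitly listed is discarded, so a new or unusual scheme is rejected by default rather than needing its own rule.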

Cheers,

Simon

Simon Willison

Jan 19, 2006, 5:19:29 AM

to django...@googlegroups.com

On 15 Jan 2006, at 23:22, tonemcd wrote:

> If your articles have HTML in them, you'll need to be careful that no
> 'dangerous' HTML is included (javascript is the most common). A good
> library is stripogram -
> http://www.zope.org/Members/chrisw/StripOGram/readme

While I still strongly advocate not using StripOGram for filtering
potentially hostile code, I should note that for the original
poster's purpose (stripping tags from content that they themselves
own) it is probably a good solution, provided they are confident
that there is no deliberately malicious code somewhere in their own
data.

Cheers,

Simon

tonemcd

Jan 19, 2006, 7:32:16 AM

to Django users

Yikes!

Didn't realise stripogram was so open to those sort of exploits (I've
only ever used it to get rid of the stuff that might mangle layout).
There's obviously more to this than meets the eye.

Thanks for the heads-up Simon.

Cheers,
Tone

Simon Willison

Jan 19, 2006, 8:32:47 AM

to django...@googlegroups.com, django-d...@googlegroups.com

On 1/19/06, tonemcd <tony.m...@gmail.com> wrote:
> Didn't realise stripogram was so open to those sort of exploits (I've
> only ever used it to get rid of the stuff that might mangle layout).
> There's obviously more to this than meets the eye.

Here are some interesting resources on the challenges involved with
escaping dangerous HTML.

Cal Henderson (from Flickr) has developed a filtering library in PHP.
It's documented in two tutorials, and the code is also available (with
unit tests):

http://iamcal.com/publish/articles/php/processing_html/
http://iamcal.com/publish/articles/php/processing_html_part_2/
http://code.iamcal.com/php/lib_filter/

The changelog for LiveJournal's HTML sanitizing code lists dozens of
interesting vulnerabilities. The code is worth looking at too; it has
lots of interesting comments:

http://cvs.livejournal.org/browse.cgi/livejournal/cgi-bin/cleanhtml.pl

Mark Pilgrim's feedparser library has unit tests for the sanitizing component:

http://feedparser.org/tests/wellformed/sanitize/
http://feedparser.org/tests/illformed/sanitize/

Even PHP's strip_tags function (which doesn't attempt to sanitize, it
just removes anything that looks like a tag) has had its fair share of
problems:

http://bugs.php.net/search.php?cmd=display&search_for=strip_tags

Cheers,

Simon
