Simple allowing of HTML elements/attributes?

Leif K-Brooks

Feb 11, 2004, 6:02:17 AM

I'm writing a site with mod_python which will have, among other things,
forums. I want to allow users to use some HTML (, , ,
etc.) on the forums, but I don't want to allow bad elements and
attributes (onclick, <script>, etc.). I would also like to do basic
validation (no overlapping elements like foo,
no missing end tags). I'm not asking anyone to write a script for me,
but does anyone have general ideas about how to do this quickly on an
active forum?

David M. Cooke

Feb 11, 2004, 5:42:53 PM

You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).
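
A minimal sketch of the well-formedness half of this suggestion, assuming Python 3's standard `xml.sax` module. That parser is non-validating, so it catches overlapping elements and missing end tags but does not check a DTD. The function name `is_well_formed` and the `<post>` wrapper element are hypothetical, introduced only for illustration.

```python
import xml.sax


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element."""
    # Hypothetical wrapper: a forum post is a fragment, but XML needs one root.
    wrapped = "<post>%s</post>" % fragment
    try:
        # parseString() runs the parse to completion, so unclosed and
        # overlapping elements both surface as SAXParseException.
        xml.sax.parseString(wrapped.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


print(is_well_formed("<b>foo</b>"))
print(is_well_formed("<b><i>foo</b></i>"))
print(is_well_formed("<b>foo"))
```

The first call should report True and the other two False. This only says whether a post is acceptable; it does not yet remove unwanted elements or attributes.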

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm(at)physics(dot)mcmaster(dot)ca

Graham Fawcett

Feb 11, 2004, 11:58:48 PM

cooked...@physics.mcmaster.ca (David M. Cooke) wrote in message news:<qnklln9...@arbutus.physics.mcmaster.ca>...

> At some point, Leif K-Brooks <eur...@ecritters.biz> wrote:
>
> > I'm writing a site with mod_python which will have, among other
> > things, forums. I want to allow users to use some HTML (,
> > , , etc.) on the forums, but I don't want to allow bad
> > elements and attributes (onclick, <script>, etc.). I would also like
> > to do basic validation (no overlapping elements like
> > foo, no missing end tags). I'm not asking
> > anyone to write a script for me, but does anyone have general ideas
> > about how to do this quickly on an active forum?
>
> You could require valid XML, and use a validating XML parser to
> check conformance. You'd have to make sure the output is correctly
> quoted (for instance, check that HTML tags in a CDATA block get quoted).

You could use Tidy (or tidylib) to convert error-ridden input into
valid HTML or XHTML, and then grab the BODY contents via an XML
parser, as David suggested. I imagine that the library version of tidy
is quick enough to meet your needs.

Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
XHTML. (Not something I'm familiar with, but someone must have done
this before.)

There's a Python wrapper for tidylib at
http://utidylib.sourceforge.net/ .

-- Graham

Alan Kennedy

Feb 12, 2004, 8:00:30 AM

[Leif K-Brooks]

>>> I'm writing a site with mod_python which will have, among other
>>> things, forums. I want to allow users to use some HTML (,
>>> , , etc.) on the forums, but I don't want to allow bad
>>> elements and attributes (onclick, <script>, etc.). I would also like
>>> to do basic validation (no overlapping elements like
>>> foo, no missing end tags). I'm not asking
>>> anyone to write a script for me, but does anyone have general ideas
>>> about how to do this quickly on an active forum?

"Quickly" being an important consideration for you, I'm presuming.

[David M. Cooke]

>> You could require valid XML, and use a validating XML parser to
>> check conformance. You'd have to make sure the output is correctly
>> quoted (for instance, check that HTML tags in a CDATA block get quoted).

Hmmm, I'd imagine that the average forum user isn't going to know what
well-formed XML is. Also, validating-XML support is one of the areas
where python is lacking. Lastly, wrapping HTML tags in a CDATA block
won't deliver much benefit. You still have to send that HTML to the
browser, which will probably render the contents of the CDATA block
anyway.

[Graham Fawcett]

> You could use Tidy (or tidylib) to convert error-ridden input into
> valid HTML or XHTML, and then grab the BODY contents via an XML
> parser, as David suggested. I imagine that the library version of tidy
> is quick enough to meet your needs.

This is a good idea. Tidy is always a good way to get easily
processable XML from badly-formed HTML. There are multiple ways to run
Tidy from python: use MAL's utidy library, use the command line
executable and pipes, or in jython use JTidy.

http://sourceforge.net/projects/jtidy

[Graham Fawcett]

> Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
> XHTML. (Not something I'm familiar with, but someone must have done
> this before.)

However, this is not a good idea. XSLT requires an Object Model of the
document, meaning that you're going to use a lot of cpu-time and
memory. In extreme cases, e.g. where some black-hat attempts to upload
a 20 Mbyte HTML file, you're opening yourself up to a
Denial-Of-Service attack, when your server tries to build up a [D]OM
of that document.
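
One way to blunt the oversized-upload case, sketched here under assumptions: feed the parser in chunks and stop once a size cap is passed, so nothing large is ever held in memory. This assumes Python 3; `MAX_POST_BYTES`, `PostTooLarge` and `parse_with_limit` are hypothetical names, and the 64 KB cap is an arbitrary illustration.

```python
import io
import xml.sax

MAX_POST_BYTES = 64 * 1024  # hypothetical limit; tune for your forum


class PostTooLarge(Exception):
    """Raised when the input exceeds the configured size cap."""


def parse_with_limit(stream, handler, max_bytes=MAX_POST_BYTES, chunk_size=4096):
    """Feed a binary stream to a SAX parser chunk by chunk, enforcing a cap."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    seen = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        seen += len(chunk)
        if seen > max_bytes:
            # Abort before the oversized remainder is even looked at.
            raise PostTooLarge("post exceeds %d bytes" % max_bytes)
        parser.feed(chunk)
    parser.close()  # finish the parse so truncated documents are reported
    return seen


small = io.BytesIO(b"<p>hello</p>")
print(parse_with_limit(small, xml.sax.ContentHandler()))
```

Because SAX handlers see events as they arrive, the cap costs one counter rather than a whole document tree.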

The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            attrstr = ""
            for a in attrs.keys():
                if a in permittedAttrs:
                    attrstr = "%s " % "%s='%s'" % (a, attrs[a])
            self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        self.outbuf.write("%s" % (s,))

testdoc = """
<html>
<body>
This paragraph contains only permitted elements.
This paragraph contains disallowed
attributes.
<img src="http://www.blackhat.com/session_hijack.gif"/>
This paragraph contains
<a href="http://www.jscript-attack.com/">a potential script
attack</a>
</body>
</html>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    mycleaner = cleaner()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Tidying the HTML to XML is left as an exercise to the reader ;-)

HTH,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan

Alan Kennedy

Feb 12, 2004, 2:13:52 PM

[Alan Kennedy]

> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.

Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are problems with it:

1. A bug in making up the attribute string results in loss of
permitted attributes.

2. The failure to escape character data (i.e. map '<' to '&lt;' and
'>' to '&gt;') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave it to y'all to figure out how.

3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.
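
To make points 2 and 3 concrete, here is a small sketch assuming Python 3 and the standard `xml.sax.saxutils` helpers. `TextCollector` and `reported_text` are hypothetical helper names, and the sample strings are made up for illustration.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class TextCollector(xml.sax.ContentHandler):
    """Collects character data exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def reported_text(document):
    collector = TextCollector()
    xml.sax.parseString(document.encode("utf-8"), collector)
    return "".join(collector.chunks)


# The parser decodes entity references, so markup can hide in character data.
decoded = reported_text("<p>&lt;script&gt;evil()&lt;/script&gt;</p>")
print(decoded)          # what an unescaped characters() handler would write
print(escape(decoded))  # what it should write instead

# An attribute value containing a quote breaks out of naive '%s' quoting.
value = "x' onclick='evil()"
print("<p class='%s'>" % value)           # unsafe: the value closes the quote
print("<p class=%s>" % quoteattr(value))  # quoteattr picks safe quoting
```

The first `print` shows a live `<script>` tag that never appeared as an element in the input, which is why filtering on element names alone is not enough.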

Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            attrstr = ""
            for a in attrs.keys():
                if a in permittedAttrs:
                    attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                        quoteattr(attrs[a])))
            self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        self.outbuf.write("%s" % (escape(s),))

testdoc = """
<html>
<body>
This paragraph contains only permitted
elements.
This paragraph contains disallowed
attributes.
<img src="http://www.blackhat.com/session_hijack.gif"/>
This paragraph contains
<script src="blackhat.js"/>a potential script
attack
</body>
</html>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    mycleaner = cleaner()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

regards,

Robert Brewer

Feb 12, 2004, 6:58:09 PM

to Alan Kennedy, pytho...@python.org

Alan Kennedy wrote:
> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.
>
> #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> import xml.sax
> import cStringIO as StringIO
>
> permittedElements = ['html', 'body', 'b', 'i', 'p']
> permittedAttrs = ['class', 'id', ]
>
> class cleaner(xml.sax.handler.ContentHandler):
>
>     def __init__(self):
>         xml.sax.handler.ContentHandler.__init__(self)
>         self.outbuf = StringIO.StringIO()
>
>     def startElement(self, elemname, attrs):
>         if elemname in permittedElements:
>             attrstr = ""
>             for a in attrs.keys():
>                 if a in permittedAttrs:
>                     attrstr = "%s " % "%s='%s'" % (a, attrs[a])
>             self.outbuf.write("<%s%s>" % (elemname, attrstr))

Very interesting, Alan! I rolled my own solution to this the other day,
relying more on regexes. This might be more usable.

One issue: the parser, as written, mangles well-formed xhtml tags like
<br /> into <br></br>. Any recommendations besides brute-force (keeping
a list of allowed empty tags) for dealing with this?

Robert Brewer
MIS
Amor Ministries
fuma...@amor.org

Alan Kennedy

Feb 13, 2004, 10:01:22 AM

[Alan Kennedy]

>> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
>> SAX to filter out the stuff you don't want. Here is some code that
>> does the latter. This should be nice and fast, and use a lot less
>> memory than object-model based approaches.

[Robert Brewer]

> I rolled my own solution to this the other day,
> relying more on regexes. This might be more usable.

Not sure what you're saying here Robert. Certainly, a regex based
solution has the potential to be faster than the XML+SAX technique,
but is likely to be *much* harder to verify correct and secure
operation.
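
As one illustration of why regex filters are hard to verify, here is a sketch assuming Python 3. The pattern is a hypothetical naive filter written for this example (Robert's actual code is not shown in the thread), and the payload is a well-known style of bypass.

```python
import re

# Hypothetical naive filter: delete every <script>...</script> block.
NAIVE_SCRIPT = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip(html):
    """Remove script blocks in a single pass, as a naive filter might."""
    return NAIVE_SCRIPT.sub("", html)


# Removing the inner, harmless blocks splices the outer pieces together.
payload = "<scr<script></script>ipt>evil()</scr<script></script>ipt>"
print(naive_strip(payload))
```

The filter's own deletions assemble a working script element out of fragments that looked harmless, a class of failure a parser-based filter avoids because it never operates on raw text.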

> One issue: the parser, as written, mangles well-formed xhtml tags like
> <br /> into <br></br>. Any recommendations besides brute-force (keeping
> a list of allowed empty tags) for dealing with this?

I'm assuming that when you say "the parser, as written" you mean my
code?

The issue of end-tags and empty elements is not a simple one.

If one wants to emit XML, then a start-tag + end-tag combination is
equivalent to an empty element tag, i.e. '<br></br>' should be
interchangeable with '<br/>'.

http://www.w3.org/TR/2004/REC-xml-20040204/#dt-content

The statement "The representation of an empty element is either a
start-tag immediately followed by an end-tag, or an empty-element tag"
shows that, as far as XML is concerned, either must be acceptable to
XML processors.

Note the statement "For interoperability, the empty-element tag SHOULD
be used, and SHOULD only be used, for elements which are declared
EMPTY". The "SHOULD" implies that this is only recommended behaviour,
not required. Also, "declared empty" implies that this recommended
behaviour SHOULD only be enforced when a DTD is in use, implying the
use of a validating XML parser.

Since XHTML is simply an XML document type, all XHTML compliant
processors should accept both (i.e. '<br></br>' and '<br/>') to
represent the same thing, an element with no children.
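
This equivalence is easy to observe from the SAX side. A small sketch, assuming Python 3 and using a line-break element purely as the example; `EventRecorder` and `sax_events` are hypothetical names.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records the SAX events a document produces, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, sorted(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


def sax_events(document):
    recorder = EventRecorder()
    xml.sax.parseString(document.encode("utf-8"), recorder)
    return recorder.events


# Both spellings of an empty element produce the same start/end event pair.
print(sax_events("<p><br/></p>") == sax_events("<p><br></br></p>"))
```

Since a handler cannot tell the two forms apart, whichever form appears in the output is purely a decision of the code that writes it.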

However, we're talking about HTML here, and there is a lot of
non-X[HT]ML aware software out there. Much of it (e.g. Netscape 4.x)
would not accept *either* option: it only renders a line-break if
presented with '<br>' (or maybe '<br' :-). So in order to generate
HTML that will work in these browsers, we should only output '<br>'
when generating for an older user-agent.

Which means that the filter I presented would need two different code
paths, depending on the user-agent it is rendering for.

The cleanest solution to this problem is to separate out the content
filtering from the output phase. So we do the output in a second
handler, which can be changed depending on the desired output, and our
content-filtering algorithm is then isolated from the output code.

Note that XSLT addresses this problem by allowing the user to declare
the "output method", so that they can indicate whether or not they
want well-formed XML in the output.

http://www.w3.org/TR/xslt#section-HTML-Output-Method

Here is an implementation of an XML output method.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

class xml_output(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        attrstr = ""
        for a in attrs.keys():
            attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                quoteattr(attrs[a])))
        self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        self.outbuf.write("%s" % (escape(s),))

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

And we have to change the cleaning filter so that it now consumes and
generates SAX events, rather than simply outputting the data.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

class cleaning_filter(xml.sax.handler.ContentHandler):

    def __init__(self, output_method):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outmeth = output_method

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            newattrs = {}
            for a in attrs.keys():
                if a in permittedAttrs:
                    newattrs[a] = attrs[a]
            self.outmeth.startElement(elemname, newattrs)

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outmeth.endElement(elemname)

    def characters(self, s):
        self.outmeth.characters(s)

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Now we need an HTML output method. So we need to change the
startElement and endElement events of the SAX handler so that they
output empty elements in the old style. So we simply change the
startElement function to output the whole empty element. But in the
endElement function, we need some kind of flag which indicates whether
or not there have been any children since the corresponding
startElement call.

A simple way to store this kind of state information is in a stack. We
push a flag on the stack when we receive a start element call, which
is false, i.e. indicates that there are no children. We pop that flag
off the stack when in the endElement call, which tells us whether or
not to generate an end-tag. And whenever we receive any startElement
or characters call (or indeed any other type of child node), we modify
the flag on top of the stack, representing the child status of our
immediate parent, to indicate that children have been notified.

The code looks like this

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

HAS_NO_CHILDREN = 0
HAS_CHILDREN = 1

class html_output(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()
        self.child_flags = []

    def startElement(self, elemname, attrs):
        if len(self.child_flags):
            self.child_flags[-1] = HAS_CHILDREN
        self.child_flags.append(HAS_NO_CHILDREN)
        attrstr = ""
        for a in attrs.keys():
            attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                quoteattr(attrs[a])))
        self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if self.child_flags[-1] == HAS_CHILDREN:
            self.outbuf.write("</%s>" % (elemname,))
        self.child_flags.pop()

    def characters(self, s):
        self.child_flags[-1] = HAS_CHILDREN
        self.outbuf.write("%s" % (escape(s),))

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

So combining all of the above, we end up with complete code like so

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

import random

permittedElements = ['html', 'body', 'b', 'i', 'p', 'hr', 'br',]
permittedAttrs = ['class', 'id', 'width',]

class xml_output(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        attrstr = ""
        for a in attrs.keys():
            attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                quoteattr(attrs[a])))
        self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        self.outbuf.write("%s" % (escape(s),))

HAS_NO_CHILDREN = 0
HAS_CHILDREN = 1

class html_output(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()
        self.child_flags = []

    def startElement(self, elemname, attrs):
        if len(self.child_flags):
            self.child_flags[-1] = HAS_CHILDREN
        self.child_flags.append(HAS_NO_CHILDREN)
        attrstr = ""
        for a in attrs.keys():
            attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                quoteattr(attrs[a])))
        self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if self.child_flags.pop() == HAS_CHILDREN:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        self.child_flags[-1] = HAS_CHILDREN
        self.outbuf.write("%s" % (escape(s),))

class cleaning_filter(xml.sax.handler.ContentHandler):

    def __init__(self, output_method):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outmeth = output_method

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            newattrs = {}
            for a in attrs.keys():
                if a in permittedAttrs:
                    newattrs[a] = attrs[a]
            self.outmeth.startElement(elemname, newattrs)

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outmeth.endElement(elemname)

    def characters(self, s):
        self.outmeth.characters(s)

testdoc = """
<html>
<body>
This paragraph contains only permitted
elements.
This paragraph contains disallowed
attributes.
<img src="http://www.blackhat.com/session_hijack.gif"/>
This paragraph contains
<script src="blackhat.js"/>a potential script
attack

<hr width="50%"/>
</body>
</html>
"""

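if __name__ == "__main__":
    # randomly pretend to be an old user-agent, to exercise both outputs
    is_old_user_agent = int(random.random() + 0.05)
    if is_old_user_agent:
        output_method = html_output()
    else:
        output_method = xml_output()
    mycleaner = cleaning_filter(output_method)
    parser = xml.sax.make_parser()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    # close() ends the parse, so truncated input (e.g. a missing
    # end tag) is reported instead of passing silently
    parser.close()
    print output_method.outbuf.getvalue()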
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Note that it gets more complex if we want to output the empty-element
form from the xml_output handler: we can't decide in the startElement
function if we should output '/>' or '>' for the tag. But we can't
simply defer the decision until the endElement function, because if
there were intervening child elements, our output would be all messed
up. So we would have to store "pending output" on the stack, or
"pending nodes", or something similar.
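
One possible shape for that "pending output", sketched under assumptions rather than offered as a definitive design: write the start-tag without its closing bracket, remember that it is still open, and decide between '>' and '/>' when the next event arrives. This assumes Python 3; `EmptyAwareXMLOutput` and `reserialize` are hypothetical names.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class EmptyAwareXMLOutput(xml.sax.ContentHandler):
    """Writes XML, using the empty-element form for childless elements."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()
        self.tag_open = False  # True while a start-tag still lacks its '>'

    def _close_pending_tag(self):
        # A child has arrived, so the pending start-tag gets a plain '>'.
        if self.tag_open:
            self.outbuf.write(">")
            self.tag_open = False

    def startElement(self, name, attrs):
        self._close_pending_tag()
        attrstr = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
        self.outbuf.write("<%s%s" % (name, attrstr))
        self.tag_open = True

    def endElement(self, name):
        if self.tag_open:
            # Nothing arrived since the start-tag: the element is empty.
            self.outbuf.write("/>")
            self.tag_open = False
        else:
            self.outbuf.write("</%s>" % name)

    def characters(self, content):
        if content:
            self._close_pending_tag()
            self.outbuf.write(escape(content))


def reserialize(document):
    out = EmptyAwareXMLOutput()
    xml.sax.parseString(document.encode("utf-8"), out)
    return out.outbuf.getvalue()


print(reserialize('<p>a<br></br>b<hr width="50%"/></p>'))
```

Only one pending flag is needed rather than a stack, because at most one start-tag can be unfinished at any moment. Note that this treats every childless element as empty, including ones like an empty paragraph that a DTD would not declare EMPTY.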

And now we get back to the statement in the XML spec: "the
empty-element tag SHOULD be used, and SHOULD only be used, for
elements which are declared EMPTY". What's basically stated here is
that it is recommended behaviour to output the empty element *when the
brute-force approach is possible*, i.e. when the processor *knows*
from the content model, as declared in the DTD, that the element
is definitely empty. Which mostly only validating parsers do.
Requiring processors without a content model available to always
output the empty-element form would place too much of a burden of code
complexity on event-based-code writers, hence the "permissiveness" of
the XML spec in this regard.

It is also interesting to note that the original principal designer
for SAX, David Megginson, recognised the problem and created a
separate "emptyElement" handler in his "XMLWriter" java code, which
consumes SAX events and writes XML to output files.

http://www.megginson.com/Software/
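
In the same spirit, the writer in Python 3's standard library, `xml.sax.saxutils.XMLGenerator`, accepts a `short_empty_elements` argument that makes it emit the empty-element form. A minimal sketch, assuming Python 3.2 or later; `CleaningFilter`, `clean` and the upper-case whitelist names are hypothetical stand-ins modelled on the code in this thread, not part of any library.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

PERMITTED_ELEMENTS = {"html", "body", "b", "i", "p", "hr", "br"}
PERMITTED_ATTRS = {"class", "id", "width"}


class CleaningFilter(xml.sax.ContentHandler):
    """Forwards only whitelisted elements and attributes downstream."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream

    def startElement(self, name, attrs):
        if name in PERMITTED_ELEMENTS:
            kept = {k: v for k, v in attrs.items() if k in PERMITTED_ATTRS}
            self.downstream.startElement(name, AttributesImpl(kept))

    def endElement(self, name):
        if name in PERMITTED_ELEMENTS:
            self.downstream.endElement(name)

    def characters(self, content):
        # Text is always forwarded; the generator escapes it on output.
        self.downstream.characters(content)


def clean(document):
    buf = io.StringIO()
    writer = XMLGenerator(buf, encoding="utf-8", short_empty_elements=True)
    xml.sax.parseString(document.encode("utf-8"), CleaningFilter(writer))
    return buf.getvalue()


print(clean('<html><body><p class="x" onclick="evil()">hi</p>'
            '<script src="blackhat.js"/>text<hr width="50%"/></body></html>'))
```

Swapping in a different writer object is all it takes to change the output form, which is the separation of filtering from output argued for above.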

would-love-to-have-time-to-write-an-xml-and-python-book-ly y'rs