I was rebuffed by a gentleman in the audience when I made the point
that you couldn't guarantee well-formedness on a site that accepts
random input from strangers, which is one of the big reasons we went
with HTML instead of XHTML. Being a good open-source boy, I told Niq
to submit a patch that made it all work and we would move to XHTML.
Niq did one better and wrote up his thoughts on how this can be
accomplished and published it via the apachetutor site. Here is the
link:
http://www.apachetutor.org/dev/online-edit
Read it, and let's talk about it. If we can do this, let's do it.
As one of the many people who have argued against XHTML (and XML in
general) endlessly, I had probably better partly defend my position.
At a basic level, almost every site that actually produces XHTML and
allows any user input whatsoever (even something as minor as a search
input) can be easily broken.
Now, before I continue, I will make a big disclaimer: this post
assumes background knowledge of XML and Unicode. I haven't the time
(I'm in a café as my home internet access is broken) to explain basic
concepts around those two standards.
To take a look at the write-up itself: it misses all kinds of basic
well-formedness constraints, not least the restrictions on the Char
production (U+FFFF breaks almost anything, any implementation of the
article included).
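
To make the Char restriction concrete, here is a minimal sketch of the
check (in Python rather than PHP, purely for illustration; the function
names are hypothetical and not taken from the article). It encodes the
Char production from XML 1.0 as a negated character class:

```python
import re

# Complement of the XML 1.0 Char production:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
#        | [#x10000-#x10FFFF]
_NON_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for characters outside Char."""
    return [
        (match.start(), "U+%04X" % ord(match.group()))
        for match in _NON_XML_CHAR.finditer(text)
    ]


def is_xml_char_safe(text):
    """True if every character in text matches the Char production."""
    return _NON_XML_CHAR.search(text) is None
```

Anything that fails this check cannot appear anywhere in a well-formed
XML 1.0 document, not even as a character reference, so user input
needs it applied before the text gets near a serialiser.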
It suggests using libxml2 (which we can do through the DOM extension
in PHP) to parse HTML: this isn't very useful in the real world as it
fails to parse things like <b><i>foo</b>bar</i> as any browser would
(and content relies on very specific error handling). It also breaks
on things like <a@></a@> (this is important as you need to tell apart
<a@> and <a$> in something like <a@><a$></a$></a@>). HTML 5
(<http://w3.org/TR/html5>) specifies error handling.
If you want to serialise HTML to XML you hit further issues with the
well-formedness constraints, with things such as <!-- foo -- bar -->
and <a@>.
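
The comment case can be sketched as a small check (again Python, with
a hypothetical function name). It assumes the comment's data has
already been extracted by an HTML parser, and it only covers the "--"
rule from the XML 1.0 Comment production, not the Char restriction:

```python
def comment_is_xml_serialisable(data):
    """Check comment data against the XML 1.0 Comment production.

    Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
    so the data may not contain "--" and may not end with "-".
    """
    return "--" not in data and not data.endswith("-")
```

For example, the data of <!-- foo -- bar --> (" foo -- bar ") fails
the check, so the comment would have to be dropped, rewritten or
reported before the document is written out as XML.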
The next point in the article mentions two basic approaches:
> There are two basic approaches:
>
> 1. Ensure only clean, safe markup gets stored on the server.
> 2. Clean up the markup as we serve it.
> Clearly (1) is the best solution where feasible, while (2) is a
> useful fallback for cases where we don't adequately control the
> contents.
>
I know that various other people who share my views believe that (2)
is the only way to do this (with (1), if one small bit of content
manages to get in unfiltered, everything falls apart, so you need
checks all over the code base). Anne van Kesteren
(<http://annevankesteren.nl>) does (1) by requiring all comments to be
valid XHTML fragments, though he only serves everything as HTML.
Realistically, for security, something that does the filtering when
parsing the HTML anyway (to serialise it to XML) is the only way to
deal with things without sacrificing performance.
Now to sound arrogant: if anyone thinks they have a site that is XHTML
(and served as XHTML) and cannot be broken while accepting user input,
do say. I will warn you that the probability of it succeeding (from
the number of sites that have failed in the past) is very low, though.
I wrote a post on getting Habari to use XHTML a while back, over at
<http://gsnedders.com/making-habari-use-xhtml>. If anyone has any
questions, do feel free to post them either here
or as a comment on that post (depending on quite what they are in
response to), but be warned, due to my internet access state, I will
likely be slow replying.
--
Geoffrey Sneddon
<http://gsnedders.com/>
> That said, I think the best way to do this would be filtering on POST,
> which could lead to a very uncomfortable user experience if it's not
> done right. These scenarios come to mind:
>
> 1) User with little HTML know-how uses a rich text editor plugin to
> submit invalid HTML - server balks, but the user has no idea why.
> ("what's a <strong>?")
You mean a WYSIWYG editor? Well, we should make sure that the editor
we ship with Habari (as I believe we should ship one, even if as a
plugin) doesn't produce invalid HTML.
> 2) User -with- HTML experience gets annoyed when he posts HTML that
> might be invalid, but he's okay with.
Real possible example: many sites that use <canvas> do so under a
DOCTYPE that doesn't allow it (as it first appears in any HTML
specification in HTML 5).
> I guess what it comes down to for me, is that it seems like the only
> real gain we'd get is bragging rights, at a possible loss in user
> experience. Definitely, code would be cleaner as a result, but I'm not
> convinced it's worth it. I'm very eager to be wrong, though. I loves
> me some XHTML. ;)
I think, realistically, if we allow XHTML output, we need two options
for input: HTML or XML. Under the HTML input mode, we should parse the
input document and store it as XML, only giving the user an error if
the document cannot be serialised as XML (e.g., if it contains
<!-- -- -->), though it may be a good idea to warn the user when any
validity constraints are broken (only as a non-fatal warning that
doesn't actually prohibit anything being done). Under the XML input
mode, we should refuse anything that isn't well-formed XML 1.0 (in
many ways the easiest way to do this would just be to try and parse it
as XML). Without question, HTML should be the default. It is probably
worthwhile to always store content in the database as XML, though.
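
A minimal sketch of the XML input mode's "just try and parse it" check
(Python's standard library stands in for whatever Habari would use in
PHP; the function name and the wrapper element are assumptions made
for illustration):

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment):
    """True if fragment parses as XML 1.0 content inside one element."""
    # Wrap the fragment so that several top-level elements or bare
    # text are accepted; a DOCTYPE is not allowed inside an element,
    # so the input cannot declare entities of its own.
    wrapped = "<wrapper>" + fragment + "</wrapper>"
    try:
        ET.fromstring(wrapped)
    except (ET.ParseError, ValueError):
        # ParseError: not well-formed. ValueError: text that cannot
        # be encoded at all, e.g. lone surrogates.
        return False
    return True
```

One consequence of this sketch is that named entities such as &nbsp;
are rejected, since nothing declares them; only the five predefined
XML entities and numeric character references get through.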
> Also, the beauty of open source is that if someone really wants to
> scratch that itch, they can have at it. Any takers? ;)
As I've said before, I'll try and tackle it sometime, but because of
the complexity it has all kinds of dependencies (which all involve
very large specifications such as Unicode and HTML 5). There's a start
on Unicode support for PHP 5 over at <http://hg.gsnedders.com/Unicode/>,
but HTML 5 is going to be far more complex to implement: we will
hit issues with the DOM extension requiring content of the DOM to be
well-formed XML (which is mandated by the DOM Level 3 specification).
I guess I'll find some way around the issues, though. What I linked to
in my previous post gives far more detail about what is needed to
implement it.