Prototype for tiddler input validation/sanitation in TiddlyWeb

chris...@gmail.com

Jun 15, 2009, 2:22:30 PM

to TiddlyWikiDev

trac ticket 866 <http://trac.tiddlywiki.org/ticket/866> discusses the
need for server-side sanitation and validation of tiddler content.

My latest tiddlyweb commit to github starts work on a prototype for
such things. I'd appreciate some comments from interested parties. The
commit can be viewed at:

<http://github.com/tiddlyweb/tiddlyweb/commit/
2e471d0bf8f6d81f9b3f91e37b2a535935ab683e>

One of my concerns was, as usual, to make this extensible and
flexible, while also making it possible to have it not there at all
(as I feel too much validation runs contrary to the Wiki way). I think
what I've built achieves this, but please confirm or deny as needed. I
also tried to take into account some of the comments in earlier
threads on this topic.

The basic architecture goes like this:

* There is a new constraint on policies called "accept". Like most of
the other constraints this is a list which can take roles, usernames,
and the special values "NONE" and "ANY". "accept" in this context
means "for the people in this list" accept the content without
sanitation or validation. The empty list means accept for everyone.
"ANY" means accept for any authenticated user. "NONE" means never
accept for anyone.

* When content is not accepted, it is passed into a validator system.
The tiddler and the current WSGI environment are provided to a list of
methods which either modify the current tiddler (e.g. disabling any
javascript, or cancelling a strange content type, or removing curse
words, etc.) or raise "InvalidTiddlerError" if the tiddler is not
worth having.

* If the exception is raised it is reraised as an HTTP 409 (conflict)
that is sent to the user agent.

* If no exception is raised the now modified tiddler is saved to the
store.

* The validator methods are kept in an extensible module level list
called TIDDLER_VALIDATORS. At the moment it is empty, but this will
change soon enough.
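
The flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in the commit: TIDDLER_VALIDATORS and InvalidTiddlerError are the names used in this message, while Tiddler, is_accepted, validate_tiddler, put_tiddler, the shape of the user dict, and the use of "GUEST" for an unauthenticated user are assumptions made for the example.

```python
class InvalidTiddlerError(Exception):
    """Raised by a validator when a tiddler is not worth having."""


class Tiddler:
    """Hypothetical stand-in for a tiddler, just enough for the sketch."""

    def __init__(self, title, text="", tags=None):
        self.title = title
        self.text = text
        self.tags = tags or []


# Extensible module-level list of validator callables.
# Each one takes (tiddler, environ) and either modifies the tiddler
# in place or raises InvalidTiddlerError.
TIDDLER_VALIDATORS = []


def is_accepted(accept, usersign):
    """Return True if this user's content skips validation.

    accept is the policy constraint list. usersign is assumed to be a
    dict like {"name": "chris", "roles": ["admin"]}.
    """
    if not accept:
        return True  # empty list: accept for everyone
    if "NONE" in accept:
        return False  # never accept for anyone
    if "ANY" in accept and usersign["name"] != "GUEST":
        return True  # any authenticated user
    if usersign["name"] in accept:
        return True
    return any(role in accept for role in usersign.get("roles", []))


def validate_tiddler(tiddler, environ, accept):
    """Run every validator unless the current user is accepted."""
    guest = {"name": "GUEST", "roles": []}
    usersign = environ.get("tiddlyweb.usersign", guest)
    if is_accepted(accept, usersign):
        return tiddler
    for validator in TIDDLER_VALIDATORS:
        validator(tiddler, environ)
    return tiddler


def put_tiddler(tiddler, environ, accept, store):
    """Save the tiddler, or report a 409 if a validator rejects it."""
    try:
        validate_tiddler(tiddler, environ, accept)
    except InvalidTiddlerError as exc:
        return "409 Conflict", str(exc)
    store.append(tiddler)  # store is a plain list in this sketch
    return "204 No Content", ""
```

A validator is then added with something like TIDDLER_VALIDATORS.append(my_validator), and anyone not matched by the "accept" list has their content passed through it.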

In the commit referenced above are two validators in the new test files
that demonstrate in the simplest way possible how things can work.

I hope this message makes some sense and the implementation as well.
If there is no sense to be found here, do let me know so I can
straighten things out.

mahemoff

Jun 15, 2009, 7:16:38 PM

to TiddlyWikiDev

Thanks for the update Chris. The flexible validation model makes sense
- in the medium-to-long term I would also like to see TiddlyWeb ship
with a default validator that allows for general HTML content, that
does the usual Javascript stripping, matching HTML tags, and so on.
Perhaps one that is easily configurable wrt which tags (and possibly
attributes) it can take. Last I checked, this is a fairly common thing
for sanitisation libraries to offer, so what I'm suggesting is a
validator that is simply an adaptor into an existing sanitisation
library.

chris...@gmail.com

Jun 16, 2009, 7:24:10 AM

to TiddlyWikiDev

On Jun 16, 12:16 am, mahemoff <mahem...@gmail.com> wrote:
> Thanks for the update Chris. The flexible validation model makes sense
> - in the medium-to-long term I would also like to see TiddlyWeb ship
> with a default validator that allows for general HTML content, that
> does the usual Javascript stripping, matching HTML tags, and so on.
> Perhaps one that is easily configurable wrt which tags (and possibly
> attributes) it can take. Last I checked, this is a fairly common thing
> for sanitisation libraries to offer, so what I'm suggesting is a
> validator that is simply an adaptor into an existing sanitisation
> library.

Based on poking around on the tinternets, the preferred tool is
html5lib which has a subclassable sanitizer built in, so I'll be
looking into that.

However, it's important to keep in mind that in the tiddler situation
we've got more than "general HTML content" to consider. Sometimes we
want to remove systemConfig tags, or reject tiddlers that look like
plugins, that sort of thing.
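
As a rough sketch of the systemConfig case, a validator could either strip the tag or reject the tiddler outright. The (tiddler, environ) signature and InvalidTiddlerError follow the description earlier in this thread; the Tiddler stand-in, the two function names, and the assumption that tags are a plain list of strings are made up for the example.

```python
class InvalidTiddlerError(Exception):
    """Raised when a tiddler should not be stored."""


class Tiddler:
    """Hypothetical stand-in with only the attributes the sketch needs."""

    def __init__(self, title, text="", tags=None):
        self.title = title
        self.text = text
        self.tags = tags or []


def strip_system_config(tiddler, environ):
    """Modify the tiddler in place so it no longer carries systemConfig."""
    tiddler.tags = [tag for tag in tiddler.tags if tag != "systemConfig"]


def reject_system_config(tiddler, environ):
    """Stricter variant: refuse any tiddler tagged systemConfig."""
    if "systemConfig" in tiddler.tags:
        raise InvalidTiddlerError(
            "tiddler %r is tagged systemConfig" % tiddler.title)
```

Which of the two is appropriate would depend on the installation; stripping keeps the content but neutralises it as a plugin, rejecting refuses it entirely.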

And more confusingly we've got input coming in that may be posing as
an image (having content type image/png) but not actually being that
content. I assume there are some vectors in which someone could
deposit a binary tiddler and wreak havoc. Basically things are made
more complicated by content being stored on the end of a generic PUT
rather than a CGI form submission.
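
One cheap first check for the "posing as an image" case is to compare the declared content type against the file signature. The sketch below does this for PNG only, using the standard eight-byte PNG signature. It assumes a tiddler object with a type attribute and the raw bytes in text; Tiddler and check_png_signature are hypothetical names, and a signature match does not prove the rest of the file is a well-formed or harmless image.

```python
class InvalidTiddlerError(Exception):
    """Raised when a tiddler should not be stored."""


class Tiddler:
    """Hypothetical stand-in: type is the content type, text the body."""

    def __init__(self, title, text=b"", type=None):
        self.title = title
        self.text = text
        self.type = type


# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png_signature(tiddler, environ):
    """Reject content that claims to be image/png but lacks the signature."""
    if tiddler.type != "image/png":
        return  # other content types are not this validator's concern
    body = tiddler.text
    if not isinstance(body, bytes) or not body.startswith(PNG_SIGNATURE):
        raise InvalidTiddlerError(
            "tiddler %r claims image/png but is not a PNG" % tiddler.title)
```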

These things are not really in my ken, which is part of why I'm making
the system easy to extend. I'm hoping that people with a bit more
expertise in this area will contribute. I am doing some reading up,
however.

As an aside: it's my expectation that there will also be loops for
validating PUT recipes and bags as well (one area that will need
sanitization is recipe and bag descriptions).

Martin Budden

Jun 16, 2009, 9:48:58 AM

to Tiddly...@googlegroups.com

Chris,

a (perhaps) stupid question

Why do you call the policy constraint "accept"? You call the
constraint "accept", and validate if it is not set.

Why not call the policy constraint "validate", and only validate when it is set?

I'm not arguing about which way the default is, just about how much
information I have to keep in my head. If you call the constraint
"accept" then I need to remember that this is what governs the
validation policy. If you call the constraint "validate", then it is
obvious what it does.

Not a big deal, but I am curious as to your underlying reasoning.

Martin

chris...@gmail.com

Jun 16, 2009, 10:28:27 AM

to TiddlyWikiDev

On Jun 16, 2:48 pm, Martin Budden <mjbud...@gmail.com> wrote:
> a (perhaps) stupid question

I thought we all agreed a long time ago that there are no stupid
questions?

> Why do you call the policy constraint "accept"? You call the
> constraint "accept", and validate if it is not set.
>
> Why not call the policy constraint "validate", and only validate when it is set?

Basically because:

* We are theoretically able to enumerate those users for whom we think
content should be _accepted_ without validation. Martin and Chris get
their content through without change, everyone else (the infinite list
of everyone else) gets their content validated. Or ANY user which is
authenticated gets their content through without validation, the
infinite unknowable everyone else does not.

* Where we can enumerate both those users that don't need validation,
and those that do, it is presumed safer to whitelist rather than
blacklist, as with blacklisting, the risk from making a mistake in the
list is higher: damage is done to content. If I fail to include Martin
in the whitelist, the only thing that happens is that Martin can't do
what he wants to do, but the integrity of the system is maintained.

That make sense?

Martin Budden

Jun 16, 2009, 10:45:22 AM

to Tiddly...@googlegroups.com

Makes sense. It's just a bit of a pity that we have to choose a
non-intuitive name to get the defaults the right way round.

Martin

Oveek

Jun 23, 2009, 3:16:08 AM

to TiddlyWikiDev

My progress was recently stymied for a while by a server related
issue, but I've got that sorted now.

Minor hiccup with the addition of the validation code. After its
introduction, the policies table (assuming sqlstore) is expected to
have the new 'accept' column. But when using an older sqlstore
instance without an 'accept' column, the latest tiddlyweb master
crashes on list_recipes because the 'accept' column doesn't exist.
Don't know if the crash is due to postgres' strictness.

I jumped in the database, manually added the column to the policies
table, and was on my way.
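
For anyone hitting the same thing, the manual fix amounts to something like the sketch below. It uses Python's sqlite3 module so that it is self-contained, whereas the store in question here was postgres. The table name "policies" and the column name "accept" are the ones mentioned above; the TEXT column type, the "owner" column, and the ensure_accept_column helper are assumptions for illustration only.

```python
import sqlite3


def ensure_accept_column(conn):
    """Add an 'accept' column to the policies table if it is missing.

    Returns True if the column was added, False if it already existed.
    """
    # PRAGMA table_info yields one row per column; index 1 is the name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(policies)")]
    if "accept" in columns:
        return False
    # The TEXT type is an assumption; match whatever the store expects.
    conn.execute("ALTER TABLE policies ADD COLUMN accept TEXT")
    conn.commit()
    return True


# Simulate an older store whose policies table predates 'accept'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (id INTEGER PRIMARY KEY, owner TEXT)")
added = ensure_accept_column(conn)
```

Running the helper a second time is harmless, since it checks for the column before altering the table.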

Just curious if you encountered the issue on an older store? I know
this problem won't affect many people, and is limited to pre-existing
stores, and might be further limited to just postgres.

I don't know if the underlying issue of new versions of tiddlyweb
demanding columns that don't exist in older store versions is going to
occur frequently enough to warrant attention.

Well this is just a problem I ran into in passing. Planning on posting
about some other stuff soonish.

On Jun 16, 10:45 pm, Martin Budden <mjbud...@gmail.com> wrote:
> Makes sense. It's just a bit of a pity that we have to choose a
> non-intuitive name to get the defaults the right way round.
>
> Martin

chris...@gmail.com

Jun 23, 2009, 5:09:38 AM

to TiddlyWikiDev

On Jun 23, 8:16 am, Oveek <mov...@gmail.com> wrote:
> Minor hiccup with the addition of the validation code. After its
> introduction, the policies table (assuming sqlstore) is expected to
> have the new 'accept' column. But when using an older sqlstore
> instance without an 'accept' column, the latest tiddlyweb master
> crashes on list_recipes because the 'accept' column doesn't exist.
> Don't know if the crash is due to postgres' strictness.

I didn't have this problem with the sqlite based store used for
tiddlyweb.peermore.com, but I was assuming it would happen.

> I don't know if the underlying issue of new versions of tiddlyweb
> demanding columns that don't exist in older store versions is going to
> occur frequently enough to warrant attention.

My hope is that it is not. One of the reasons (among others) I've been
holding off on making a 1.0 release is that I want the data model to
be as ironed out as possible before then. I'd prefer changes there to be
very rare. Such changes cascade throughout the code and become a real
PITA. For example, the recent discussion of doing recipes in recipes
is harder and messier than I expected because of some earlier assumptions.
