Re-thinking model validation


mrts

Oct 5, 2008, 5:11:55 PM
to Django developers
Looking at the path Honza has taken [1], it looks like it both
complicates things and causes overhead -- for every constraint on
a model object, an extra db call is made to check it,
instead of relying on the constraint checks enforced by the db
backend and the corresponding exceptions during `save()`
(e.g. for uniqueness validation, one db call is needed to check
the uniqueness constraint and a second one to actually save the
object).

The problem is conceptual: how aware should the validation layer be
of the storage layer? Python is known for the "It's easier to ask
forgiveness than permission" motto; this proposal is ideologically
based on that.

In short, db validation is shifted to `ModelForm.save()`. Outside of
ModelForms, model validation has to be done manually, as it is at
present. The changes are less disruptive and fewer lines of code
need to be updated.

Here's the proposed idiom in code, explanation follows:

from django.forms import ModelForm
from django.shortcuts import render_to_response

class FooForm(ModelForm):
    class Meta:
        model = Foo  # model with unique constraints

def view_foo(request):
    if request.method == 'POST':
        form = FooForm(request.POST)
        if form.is_valid():
            try:
                form.save()
                return render_to_response('done.html')
            except DBValidationError:  # the proposed exception, see below
                pass
    else:
        form = FooForm()

    return render_to_response('form.html', {'form': form})

`ModelForm.is_valid()` remains unaware of db constraints.

`ModelForm.save()`:
* examines all constraints the model has
* wraps `Model.save()` in a try/except block to catch all known
  constraint violations
* if an exception is caught,
  + `_errors` is updated with the relevant information (generally
    just `_errors[field] = exception.message`)
  + a `DBValidationError` subclass is re-raised

As a result, the db state remains unchanged and the form contains all
the necessary error information for the user.
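To make that concrete, here is a minimal sketch of such a `save()` --
DBValidationError and the error-to-field mapping are hypothetical,
nothing below exists in Django as of 1.0:

from django.db import IntegrityError
from django.forms import ModelForm

class DBValidationError(Exception):
    # Hypothetical base class; a fuller hierarchy is sketched below.
    def __init__(self, field, message):
        self.field, self.message = field, message

class ConstraintAwareModelForm(ModelForm):
    def save(self, commit=True):
        try:
            return super(ConstraintAwareModelForm, self).save(commit)
        except IntegrityError, e:
            # Mapping the backend error to a field is backend-specific
            # and only hard-coded here for illustration.
            field = 'name'
            self._errors[field] = self.error_class([str(e)])
            raise DBValidationError(field, str(e))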

What needs to be done:
* add `DBValidationError` and a corresponding exception hierarchy
  (`UniqueValidationError`, `UniqueTogetherValidationError` etc.) --
  a sketch follows below
* update model form and formset `.save()` as outlined above
* update contrib.admin to take advantage of the above
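The hierarchy from the first item could simply subclass the base
exception sketched earlier, one class per constraint type (all names
hypothetical):

class UniqueValidationError(DBValidationError):
    pass

class UniqueTogetherValidationError(DBValidationError):
    pass

class UniqueForDateValidationError(DBValidationError):
    pass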

The current constraints are the following:
* `unique`, `unique_together` and `unique_for_*`
* `null=False`
* `max_length` and `max_digits`
* `choices`

Out of these, `null`, `max_length`, `max_digits` and `choices`
can be checked in `is_valid()`. It follows that only uniqueness
constraints remain in scope for this.

mrts

Oct 5, 2008, 5:14:58 PM
to Django developers
mrts wrote:
> Looking at the path Honza has taken [1], it looks like it both

And the missing reference is
[1] http://code.djangoproject.com/ticket/6845

Malcolm Tredinnick

Oct 5, 2008, 7:03:43 PM
to django-d...@googlegroups.com

On Sun, 2008-10-05 at 14:11 -0700, mrts wrote:
> Looking at the path Honza has taken [1], it looks like it both
> complicates things and causes overhead -- for every constraint on
> a model object, an extra db call is made to check it,
> instead of relying on the constraint checks enforced by the db
> backend and the corresponding exceptions during `save()`
> (e.g. for uniqueness validation, one db call is needed to check
> the uniqueness constraint and a second one to actually save the
> object).

Except that very few validation requirements correspond to database
constraints (for example, any reg-exp matching field, or limits on an
integer, etc). We aren't about to require database check constraints for
all of those. So it's not really one-to-one in practice. You've
misread the patch quite badly, it sounds like: only the unique
requirements have to check the database (and a pre-emptive check is
reasonable there, since it's early error detection and there's only a
small class of applications that are going to have such highly
concurrent, overlapping write styles that they will pass that test and
fail at save time).

We might even be able to trim all the unique checks down to a single
database call. That's something that can be looked at further down the
line, since it depends on how the code falls out as well as a bunch of
profiling and some intelligent guesses about use-case likelihoods.
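As a rough sketch of that idea (field names invented, and mapping
returned rows back to the violated constraint is left out):

from django.db.models import Q

def find_unique_clashes(instance):
    # One OR-ed query instead of one query per unique field.
    lookups = Q(username=instance.username) | Q(email=instance.email)
    qs = type(instance)._default_manager.filter(lookups)
    if instance.pk is not None:
        qs = qs.exclude(pk=instance.pk)  # ignore the row being edited
    return list(qs)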

> The problem is conceptual: how aware should the validation layer be
> of the storage layer? Python is known for the "It's easier to ask
> forgiveness than permission" motto; this proposal is ideologically
> based on that.
>
> In short, db validation is shifted to `ModelForm.save()`. Outside of
> ModelForms, model validation has to be done manually, as it is at
> present. The changes are less disruptive and fewer lines of code
> need to be updated.

This isn't a very good idea, and the reasons for the separation have
been discussed before. There needs to be a clear phase prior to saving when
you can detect validation errors so that they can be correctly presented
back to the user. You see this already with forms where we check for
validity and, if it's not valid, we present the errors. If it is valid,
we move on to doing whatever we want with the data (saving it or
whatever).

The only time there's any kind of overlap is when there's a database
constraint, such as uniqueness, which we cannot guarantee will remain true
between the validation step and the saving step. So there's a chance
that save() will raise some kind of database integrity error. But that's
actually the edge case, and in a wide number of use-cases the probability is
practically zero, since you know that your application is the only thing
working with the database. So it's an acceptable trade-off.

Validation for models will look a lot like validation for forms in the
end. Honza is working on a slightly different approach to the last one,
taking into account what we've learnt from that approach. So take a pew
for a bit and wait until he has some code ready. A wild redesign isn't
required, though, and what you're proposing would actually make a lot of
code harder to write, since custom save methods work much more
easily when you know that the data is in a consistent format and valid.
Then there's one point (the final "talk to the database" save line) that
can raise an IntegrityError, depending on your data modelling and
problem domain, and that's it. Everything splits fairly nicely between
validation and saving, with a space in the middle to operate on the
results of validation.
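In view code, that split looks something like this (FooForm is the
form from earlier in the thread; the slug handling is purely
illustrative):

from django.db import IntegrityError
from django.shortcuts import render_to_response
from django.template.defaultfilters import slugify

def edit_foo(request):
    form = FooForm(request.POST or None)
    if request.method == 'POST' and form.is_valid():
        # Validation has passed: the data is consistent and valid.
        foo = form.save(commit=False)
        foo.slug = slugify(foo.name)  # operate on clean data
        try:
            foo.save()  # the single "talk to the database" point
            return render_to_response('done.html')
        except IntegrityError:
            form._errors['name'] = form.error_class(
                ['This name is already taken.'])
    return render_to_response('form.html', {'form': form})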

Regards,
Malcolm


mrts

Oct 6, 2008, 6:27:03 PM
to Django developers
On Oct 6, 2:03 am, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> On Sun, 2008-10-05 at 14:11 -0700, mrts wrote:
> > Looking at the path Honza has taken [1], it looks like it both
> > complicates things and causes overhead -- for every constraint on
> > a model object, an extra db call is made to check it,
> > instead of relying on the constraint checks enforced by the db
> > backend and the corresponding exceptions during `save()`
> > (e.g. for uniqueness validation, one db call is needed to check
> > the uniqueness constraint and a second one to actually save the
> > object).
>
> Except that very few validation requirements correspond to database
> constraints (for example, any reg-exp matching field, or limits on an
> integer, etc). We aren't about to require database check constraints for
> all of those. So it's not really one-to-one in practice. You've
> misread the patch quite badly,

Now that I re-read the first section of my post, I do see that I made
the mistake of being over-emphatic and making an incorrect
generalization ("for every constraint on a model object, an extra db
call is made to check it") -- sorry for that. My point was that
validating non-db constraints (like the regex matching of an
EmailField) could remain the responsibility of form validation, as it
is now. So, as at present, the forms layer could take care of
everything that can be handled without touching the database
(including max_length, not null and choices), and model forms would
additionally take care of IntegrityErrors by catching them, augmenting
_errors[] with the relevant data and re-raising them. And as
IntegrityErrors can only be reliably caught during save(), model forms
have to be handled differently.

I had mostly backwards compatibility and simplicity in fixing e.g.
#8882 in mind (as validate_unique() in forms/models.py lets
IntegrityErrors through anyway as of 1.0 and cannot be relied on).
But if you think this is a bad idea, so be it.

- copied from below: -

> There needs to be a clear phase prior to saving when
> you can detect validation errors so that they can be correctly presented
> back to the user. You see this already with forms where we check for
> validity and, if it's not valid, we present the errors. If it is valid,
> we move on to doing whatever we want with the data (saving it or
> whatever).

General model validation is of course conceptually correct and good --
if the responsibility assignment between form and model validation is
sorted out in a non-duplicated and elegant manner, unique checks are
handled efficiently, and the code lands within a month or two, then
all is well. Meanwhile, people who want to use inline formsets with
unique validation are stuck on #8882 (and it looks like it would take
quite an effort to fix that with the current codebase -- I'd like to
be wrong here).

> it sounds like: only the unique
> requirements have to check the database (and a pre-emptive check is
> reasonable there, since it's early error detection and there's only a
> small class of applications that are going to have such highly
> concurrent, overlapping write styles that they will pass that test and
> fail at save time).

- copied from below: -

> The only time there's any kind of overlap is when there's a database
> constraint, such as uniqueness, which we cannot guarantee will remain true
> between the validation step and the saving step. So there's a chance
> that save() will raise some kind of database integrity error. But that's
> actually the edge case, and in a wide number of use-cases the probability is
> practically zero, since you know that your application is the only thing
> working with the database.

And your take is just to ignore that case and actually let e.g. the
admin display 500 pages? It's an ordinary race condition (some of
which are indeed rare in practice but are nevertheless usually taken
care of), and you yourself have generally advocated pushing this kind
of problem down to the relevant backend -- which is nice and proper.
Applying the same principle to this case as well (and I can't think of
anything other than try save()/catch IntegrityError) seems to me the
only viable way to avoid the race.

And I don't buy the "your application is the only thing working with
the database" argument -- it is, but it happens to be multi-process or
multi-threaded. How am I supposed to prevent threads or processes A
and B from running validate_unique() for some x concurrently, both
finding that it doesn't exist and proceeding happily with saving, so
that one of them chokes on an IntegrityError?
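Schematically, the race in question is:

# thread A: validate_unique(x)  -> no clash
# thread B: validate_unique(x)  -> no clash
# thread A: x.save()            -> ok
# thread B: x.save()            -> IntegrityError

and the only airtight way to close the window is at save time, e.g.
with a hypothetical helper along these lines:

from django.db import IntegrityError

def save_or_report(obj, form):
    # Let the db decide at the last possible moment and translate
    # the failure into a form error instead of a 500 page.
    try:
        obj.save()
        return True
    except IntegrityError, e:
        form._errors['__all__'] = form.error_class([str(e)])
        return False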

Malcolm Tredinnick

Oct 7, 2008, 5:08:49 AM
to django-d...@googlegroups.com

On Mon, 2008-10-06 at 15:27 -0700, mrts wrote:
> On Oct 6, 2:03 am, Malcolm Tredinnick <malc...@pointy-stick.com>
> wrote:
[...]

> > The only time there's any kind of overlap is when there's a database
> > constraint, such as uniqueness, which we cannot guarantee will remain true
> > between the validation step and the saving step. So there's a chance
> > that save() will raise some kind of database integrity error. But that's
> > actually the edge case, and in a wide number of use-cases the probability is
> > practically zero, since you know that your application is the only thing
> > working with the database.
>
> And your take is just to ignore that case and actually let e.g. admin
> display 500 pages?

Obviously not, since we're building a proper framework here. How did you
get from "we should do checking as part of the validation step" to "we
should not handle database errors in save()"? They're different steps,
and the former is a good idea that is not done at the expense of the latter.

You were proposing letting the database be the only place to catch
uniqueness errors, which has flaws: it waits until save time to raise a
validation problem that could be caught earlier (at a point when we can
report it to the user), and it doesn't catch all the problems (when an
IntegrityError is raised, you won't find out that two fields failed
uniqueness constraints -- you'll get a single database constraint
violation).
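For example (field names illustrative), a validation pass can collect
every clash:

def collect_unique_errors(obj):
    # Check each unique field with a query; the db, by contrast,
    # reports only the first violated constraint.
    errors = {}
    for field in ('username', 'email'):
        lookup = {field: getattr(obj, field)}
        if type(obj)._default_manager.filter(**lookup).count():
            errors[field] = '%s already exists.' % field
    return errors  # every clashing field, not just the first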

Doing uniqueness detection in the validation phase is still a necessary
component, as is handling errors at save time.

> And I don't buy the "your application is the only thing working with
> the database" argument -- it is, but it happens to be multi-process or
> multi-threaded. How am I supposed to prevent threads or processes A
> and B from running validate_unique() for some x concurrently, both
> finding that it doesn't exist and proceeding happily with saving, so
> that one of them chokes on an IntegrityError?

No, you've missed the point. Quite often one knows (which is why I said
a wide number of use-cases) that simultaneous updates to the same bit of
data, even within the application, are very low probability -- even on
sites that are largely read-write as opposed to read-only. I'm not making
this stuff up out of thin air. I have experience working on such systems
with multiple-master database replication and watching the traffic
patterns to work out things like the likelihood of needing conflict
resolution on updates. Multiple simultaneous writes to the exact same
piece of data are only one corner of the full domain space. We plan for
the worst, but often optimise for other cases. Again, I'm not
dismissing anything.

Regards,
Malcolm

mrts

Oct 7, 2008, 6:58:23 AM
to Django developers
On Oct 7, 12:08 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
Ack, sounds good. Over and out for this, thanks for elaborating.