In the block for saving an existing model object, Model.save() uses
the non_pks list, which is generated by testing the primary_key flag on
fields; however, the block for inserting a new row tests whether the
field is an AutoField, not whether it has primary_key set (there's a
toy sketch of the two filters below).
Wouldn't this cause it to insert the primary key field regardless of
whether you'd specified it or not?
As I think about it, a non-auto primary_key probably has to be
specified in most situations, but it seems like it should still
check :)
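Here's a toy sketch of the divergence I mean -- the Field/AutoField
classes are simplified stand-ins I made up, not the real
django.db.models internals:

    # Simplified stand-ins for Django's field classes (illustrative only).
    class Field:
        def __init__(self, name, primary_key=False):
            self.name = name
            self.primary_key = primary_key

    class AutoField(Field):
        def __init__(self, name):
            Field.__init__(self, name, primary_key=True)

    # With an auto primary key, the two filters agree:
    auto = [AutoField("id"), Field("title")]
    print([f.name for f in auto if not f.primary_key])             # ['title']
    print([f.name for f in auto if not isinstance(f, AutoField)])  # ['title']

    # With a manual primary key they diverge -- the insert filter keeps
    # the pk field while the update filter drops it:
    manual = [Field("sku", primary_key=True), Field("title")]
    print([f.name for f in manual if not f.primary_key])             # ['title']
    print([f.name for f in manual if not isinstance(f, AutoField)])  # ['sku', 'title']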
I'm also curious to hear thoughts on an issue with the ticket I
mentioned above: should the specified list of fields be used for
inserting as well as updating? And if so, should the pk always be used
when it's set on the model, or should you have to include it in the
field list? (Or should I move this question to the ticket itself?)
Thanks for any insights (or corrections) :)
--
Collin Grady
I think you might be getting two things confused here, or, at least, I'm
not really sure that there is a problem.
The non_pks list is necessary in the first block because we can't
differentiate between a primary key field that has been set manually and
one that has been set because the object is already saved in the
database (and we retrieved the pk value from the database).
In the second block, when the primary key hasn't been set or we
otherwise know this is a new record (that second part is important), we
want to insert all non-auto fields, so the list filtering looks correct
to me. If you don't specify a manual primary key field, then it
may or may not be an error. Your database may still have a serial value
on that field, for example, or there may be a trigger that does the
value computation. Both of these are a little unrealistic, but not out
of the bounds of possibility (because people do some pretty weird stuff
with databases and then complain when you take away their toys). The
salient point is that the save() method does not do validation. This
is important to realise. Trying to save a model without a value for a
required, manually-specified primary key is a validation error -- it
might result in a database integrity error (e.g. inserting no value
into a non-NULL field), but that should be caught when validating the
data or constructing the model in the first place.
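For example (a made-up model, but using the ordinary model syntax):

    from django.db import models

    class Book(models.Model):
        # Manual, required primary key -- no AutoField here.
        isbn = models.IntegerField(primary_key=True)
        title = models.CharField(max_length=100)

    b = Book(title="no isbn given")
    b.save()   # save() itself won't complain; the INSERT tries to put
               # NULL into the isbn column and the database raises an
               # integrity error, which is really a validation failure
               # that should have been caught earlier.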
This will be more explicit when we also have model-aware validation
(when validating manual primary keys in a good way becomes something
we can solve), but the principle still holds at the moment. In the
future you should be able to call something like my_model.validate()
prior to save() to check all that logic explicitly.
> Wouldn't this cause it to insert the primary key field regardless of
> whether you'd specified it or not?
>
> As I think about it, a non-auto primary_key probably has to be
> specified in most situations, but it seems like it should still
> check :)
Validation is not the save() method's responsibility because it's not
always obvious what the right answer is. Took me a while to wrap my
brain around that when I started working in that part of the code, but
it's an idea that works quite well when we apply it consistently.
> I'm also curious to hear thoughts on an issue with the ticket I
> mentioned above: should the specified list of fields be used for
> inserting as well as updating? And if so, should the pk always be used
> when it's set on the model, or should you have to include it in the
> field list? (Or should I move this question to the ticket itself?)
There are going to be some interesting validation issues in that case.
You'll be able to save a model with only some of the fields set
correctly. I'm not sure that's a good idea. It's also a mark against the
whole concept, in some respects.
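For instance, with the sort of API the ticket proposes (the fields
argument here is hypothetical, not anything Django currently accepts):

    from django.db import models

    class Article(models.Model):
        title = models.CharField(max_length=100)
        body = models.TextField()

    a = Article(title="draft")
    # Hypothetical field-list save from the ticket:
    a.save(fields=["title"])   # body is silently left out of the SQL,
                               # whether you meant to skip it or not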
As a meta-issue...
Thinking about that ticket a bit more yesterday evening and just now,
I'm becoming less enthusiastic about the idea suggested there. I'd like
to see some real-world timing examples to show it's really more
efficient.
It's not universally true that two updates which avoid touching the
same fields can't lead to problems, because the two statements could
have overlapping WHERE clause constraints (or the WHERE clause of one
could overlap the updated columns of the other). So that part of the
justification doesn't support the inclusion of the patch. None of the
databases we widely support have column-level locking, as far as I'm
aware (it wouldn't make a lot of sense under the covers). What
speed-ups are you seeing on what sort of use-cases with this change?
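To illustrate the overlap point above (made-up table and statements):

    # Two updates that write disjoint columns but still interact,
    # because each WHERE clause reads the column the other one writes:
    stmt_a = "UPDATE t SET a = 1 WHERE b = 0"
    stmt_b = "UPDATE t SET b = 1 WHERE a = 0"
    # Run concurrently, the final state depends on who goes first, so
    # restricting each UPDATE to "its own" columns doesn't make the
    # pair safe on its own.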
Regards,
Malcolm
In those instances though, wouldn't you want to leave the primary key
field out of the insert? Or do non-specified fields get left out no
matter what since there's no data there?
> Validation is not the save() method's responsibility because it's not
> always obvious what the right answer is. Took me a while to wrap my
> brain around that when I started working in that part of the code, but
> it's an idea that works quite well when we apply it consistently.
I'm probably missing something simple, but I don't see how validation
ties into this - I'm not suggesting it needs to make sure valid values
are set anywhere; it just seems that it's using the wrong metric to
determine whether a field is a primary key or not :)
> There are going to be some interesting validation issues in that case.
> You'll be able to save a model with only some of the fields set
> correctly. I'm not sure that's a good idea. It's also a mark against the
> whole concept, in some respects.
Wouldn't it be fine though if the fields you leave out have valid
defaults in the database?
You'd have to be careful not to leave out a field you really want to
save, but that's not much different than remembering to set a field in
the first place :)
> It's not universally true that two updates which avoid touching the
> same fields can't lead to problems, because the two statements could
> have overlapping WHERE clause constraints (or the WHERE clause of one
> could overlap the updated columns of the other). So that part of the
> justification doesn't support the inclusion of the patch. None of the
> databases we widely support have column-level locking, as far as I'm
> aware (it wouldn't make a lot of sense under the covers).
While not universal, I think there are enough cases where it does work
to make it a valid consideration, even if it isn't a full
justification on its own :)
> What speed-ups are you seeing on what sort of
> use-cases with this change?
I haven't done any testing yet - the performance comment was based on
things others have said to me about how SELECTs involving columns with a
lot of data in them can be slower, so I figured that UPDATEs with
large blocks of text could be similarly impaired - if that's
incorrect, then scratch the performance idea :)
--
Collin Grady
Ah, okay ... I thought you were worried that it wasn't checking for the
existence of something.
The code uses different checks because it's looking for different
things. In the second half of Model.save(), the code doesn't care
whether the field is a primary key or not. It has already established
that the record doesn't exist, so it needs to do an insert and just
wants to include every field that isn't the serial field we might be
using for the primary key (the latter being assigned by the database
automatically). It's only in the
first half of that function that we care about the difference between
primary and non-primary keys because the code is trying to work out if
the record already exists in the database.
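A toy re-enactment of the flow, in case it helps (the names here are
made up; this isn't the actual django.db.models code):

    FIELDS = [
        {"name": "id", "primary_key": True, "auto": True},
        {"name": "title", "primary_key": False, "auto": False},
    ]
    TABLE = {}  # pk value -> stored row

    def save(obj):
        pk = [f for f in FIELDS if f["primary_key"]][0]
        pk_val = obj.get(pk["name"])
        if pk_val is not None and pk_val in TABLE:
            # First half: the pk/non-pk split decides existence and
            # keeps the key column out of the UPDATE.
            TABLE[pk_val].update(dict((f["name"], obj[f["name"]])
                                      for f in FIELDS
                                      if not f["primary_key"]))
        else:
            # Second half: new record, so include everything except the
            # serial (auto) field, which the database assigns itself.
            row = dict((f["name"], obj.get(f["name"]))
                       for f in FIELDS if not f["auto"])
            TABLE[pk_val if pk_val is not None else len(TABLE) + 1] = row

    save({"title": "first"})            # takes the insert branch
    save({"id": 1, "title": "second"})  # row 1 exists: update branch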
> > There are going to be some interesting validation issues in that case.
> > You'll be able to save a model with only some of the fields set
> > correctly. I'm not sure that's a good idea. It's also a mark against the
> > whole concept, in some respects.
>
> Wouldn't it be fine though if the fields you leave out have valid
> defaults in the database?
Yes, it would work. Why is it necessary to introduce the complexity (and
risk of screwing it up), though? We have the default attribute on models
for this type of thing. You're really leaking abstractions here.
Conceptually, Django works with Python models that just happen to be
stored (sometimes) in an SQL database. You are trying to bring the SQL
database more into the foreground here (which is admittedly something
that happens elsewhere too -- not ideal in any of those cases). I'm not
sure that you're solving a problem that needs solving here.
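To be concrete about the default attribute (a real Django feature,
shown on a made-up model):

    from django.db import models

    class Entry(models.Model):
        # The fallback value lives in Python, not in the database schema:
        status = models.CharField(max_length=10, default="draft")

    e = Entry()
    e.save()   # status goes in as "draft" with no database-side default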
> You'd have to be careful not to leave out a field you really want to
> save, but that's not much different than remembering to set a field in
> the first place :)
>
> > It's not universally true that two updates which avoid touching the
> > same fields can't lead to problems, because the two statements could
> > have overlapping WHERE clause constraints (or the WHERE clause of one
> > could overlap the updated columns of the other). So that part of the
> > justification doesn't support the inclusion of the patch. None of the
> > databases we widely support have column-level locking, as far as I'm
> > aware (it wouldn't make a lot of sense under the covers).
>
> While not universal, I think there are enough cases where it does work
> to make it a valid consideration, even if it isn't a full
> justification on its own :)
>
> > What speed-ups are you seeing on what sort of
> > use-cases with this change?
>
> I haven't done any testing yet - the performance comment was based on
> things others have said to me about how SELECTs involving columns with a
> lot of data in them can be slower, so I figured that UPDATEs with
> large blocks of text could be similarly impaired - if that's
> incorrect, then scratch the performance idea :)
As an aside, the information about SELECT statements is not really
accurate. In a moderately complex query, filtering the data is a much
larger component of the time budget than reading the results from disk,
regardless of the columns retrieved. One reason for this is that the
database doesn't just read the columns you want from disk. It has to
read in multiples of some contiguous block size (typically 8K or more),
so it will read a large portion of the contents of each row whenever you
access any random selection of data in the row -- ergo reading the data
for a table is relatively constant and working out which rows to
actually read is more important (disk access is sloooowww). You *can*
blow this rule out of the water by using horribly complex computed
columns or trivial constraints, but the first case is rare, particularly
in Django scenarios, and the second case is so fast anyway that the
proportion of the time budget spent on each part is effectively
irrelevant.
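Back-of-envelope version of that (numbers invented, but of the right
order of magnitude):

    PAGE = 8 * 1024   # typical database block size, in bytes
    WANTED = 4        # suppose we only want one 4-byte integer per row
    rows = 1000       # scattered rows: worst case, one page read each
    print(rows * PAGE)    # 8192000 bytes actually read from disk
    print(rows * WANTED)  # 4000 bytes we asked for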
I don't have any similar information for updates, although I suspect
there are going to be similar conditions at play, particularly because
the Django usage patterns will involve updating only a single row in
each table at a time -- so there aren't any complex constraints used to
filter the rows -- which again falls into the situation where the total
elapsed time is so small as to not be worth micro-optimising. Timing
the performance will be necessary.
Regards,
Malcolm
I'm not saying we need to have it on INSERTs; I mainly want it for
UPDATEs :)
> Conceptually, Django works with Python models that just happen to be
> stored (sometimes) in an SQL database. You are trying to bring the SQL
> database more into the foreground here (which is admittedly something
> that happens elsewhere too -- not ideal in any of those cases). I'm not
> sure that you're solving a problem that needs solving here.
Where else could it be solved though? The only other solution I can
see is locking the entire row to prevent conflicts, but that can
definitely slow things down :)
> As an aside, the information about SELECT statements is not really
> accurate. In a moderately complex query, filtering the data is a much
> larger component of the time budget than reading the results from disk,
> regardless of the columns retrieved. One reason for this is that the
> database doesn't just read the columns you want from disk. It has to
> read in multiples of some contiguous block size (typically 8K or more),
> so it will read a large portion of the contents of each row whenever you
> access any random selection of data in the row -- ergo reading the data
> for a table is relatively constant and working out which rows to
> actually read is more important (disk access is sloooowww). You *can*
> blow this rule out of the water by using horribly complex computed
> columns or trivial constraints, but the first case is rare, particularly
> in Django scenarios, and the second case is so fast anyway that the
> proportion of the time budget spent on each part is effectively
> irrelevant.
>
> I don't have any similar information for updates, although I suspect
> there are going to be similar conditions at play, particularly because
> the Django usage patterns will involve updating only a single row in
> each table at a time -- so there aren't any complex constraints used to
> filter the rows -- which again falls into the situation where the total
> elapsed time is so small as to not be worth micro-optimising. Timing
> the performance will be necessary.
Ok, I stand corrected on that aspect of performance :)
--
Collin Grady