I've run into a bit of an odd problem with the ORM. I have a model
without an order_by in its Meta. When I create a queryset and iter
through it, I get the items in a sensical fashion. However, when I
access the the queryset items by index (qs[0], qs[1], etc), not only
do I get the items back in a different order from iter, I occasionally
get the same item multiple times!
So I dug into the QuerySet class, and I think I understand why this is
happening. On __getitem__ (i.e. qs[k]), a limit/offset is applied to
a clone of the queryset. While I think the intent is to retrieve the
kth item from the queryset just as it would have been if reached in
the process of an iter (in other words, list(qs)[k] == qs[k] for all
k), the limit/offset method of retrieval is inherently different due
to a lack of ordering in the database.
Lest you criticize accessing all queryset items by index, Django is
currently doing just that in
django.forms.models.BaseModelFormSet._construct_form. This means that
when creating an inline for a model without an order_by, Django can
(and has for me) paired forms in the inline formset with the wrong
model instance!
As a solution to this problem, I propose that all querysets include an
order by on their primary key. This will not only guarantee that
duplicate items are not retrieved from getitem; it will also mean that
the answer from iter will match getitem's answer.
For those turned off by the idea of adding an order_by to all
querysets, perhaps a compromise can be reached by only adding it to
the cloned queryset created by getitem. That way, all items returned
by iter(qs) would also be returned in qs[k] for all k, even if we
could not guarantee that it would match iter's ordering.
In any case, I think it's necessary to ensure qs[k] consistently
returns the same object for a given qs. I've opened a ticket for the
issue:
On Fri, Apr 24, 2009 at 10:09 AM, David Gouldin <dgoul...@gmail.com> wrote:
> I've run into a bit of an odd problem with the ORM. I have a model
> without an order_by in its Meta. When I create a queryset and iter
> through it, I get the items in a sensical fashion. However, when I
> access the the queryset items by index (qs[0], qs[1], etc), not only
> do I get the items back in a different order from iter, I occasionally
> get the same item multiple times!
> So I dug into the QuerySet class, and I think I understand why this is
> happening. On __getitem__ (i.e. qs[k]), a limit/offset is applied to
> a clone of the queryset. While I think the intent is to retrieve the
> kth item from the queryset just as it would have been if reached in
> the process of an iter (in other words, list(qs)[k] == qs[k] for all
> k), the limit/offset method of retrieval is inherently different due
> to a lack of ordering in the database.
> Lest you criticize accessing all queryset items by index, Django is
> currently doing just that in
> django.forms.models.BaseModelFormSet._construct_form. This means that
> when creating an inline for a model without an order_by, Django can
> (and has for me) paired forms in the inline formset with the wrong
> model instance!
> As a solution to this problem, I propose that all querysets include an
> order by on their primary key. This will not only guarantee that
> duplicate items are not retrieved from getitem; it will also mean that
> the answer from iter will match getitem's answer.
> For those turned off by the idea of adding an order_by to all
> querysets, perhaps a compromise can be reached by only adding it to
> the cloned queryset created by getitem. That way, all items returned
> by iter(qs) would also be returned in qs[k] for all k, even if we
> could not guarantee that it would match iter's ordering.
> In any case, I think it's necessary to ensure qs[k] consistently
> returns the same object for a given qs. I've opened a ticket for the
> issue:
This is only ever an issue if you are using an unsorted queryset, and by
definition if it's unsorted there is no defined ordereding. Indeed
BaseModelFormset was changed just yesterday to ensure there is always an
ordering on the queryset to avoid just this issue.
Alex
-- "I disapprove of what you say, but I will defend to the death your right to
say it." --Voltaire
"The people's good is the highest law."--Cicero
There's a more subtle way to introduce the same bug. You can order on
a field that is not unique and run into the same issue: an undefined
return order due to duplicate values on the order column.
On Apr 24, 9:14 am, Alex Gaynor <alex.gay...@gmail.com> wrote:
> On Fri, Apr 24, 2009 at 10:09 AM, David Gouldin <dgoul...@gmail.com> wrote:
> > I've run into a bit of an odd problem with the ORM. I have a model
> > without an order_by in its Meta. When I create a queryset and iter
> > through it, I get the items in a sensical fashion. However, when I
> > access the the queryset items by index (qs[0], qs[1], etc), not only
> > do I get the items back in a different order from iter, I occasionally
> > get the same item multiple times!
> > So I dug into the QuerySet class, and I think I understand why this is
> > happening. On __getitem__ (i.e. qs[k]), a limit/offset is applied to
> > a clone of the queryset. While I think the intent is to retrieve the
> > kth item from the queryset just as it would have been if reached in
> > the process of an iter (in other words, list(qs)[k] == qs[k] for all
> > k), the limit/offset method of retrieval is inherently different due
> > to a lack of ordering in the database.
> > Lest you criticize accessing all queryset items by index, Django is
> > currently doing just that in
> > django.forms.models.BaseModelFormSet._construct_form. This means that
> > when creating an inline for a model without an order_by, Django can
> > (and has for me) paired forms in the inline formset with the wrong
> > model instance!
> > As a solution to this problem, I propose that all querysets include an
> > order by on their primary key. This will not only guarantee that
> > duplicate items are not retrieved from getitem; it will also mean that
> > the answer from iter will match getitem's answer.
> > For those turned off by the idea of adding an order_by to all
> > querysets, perhaps a compromise can be reached by only adding it to
> > the cloned queryset created by getitem. That way, all items returned
> > by iter(qs) would also be returned in qs[k] for all k, even if we
> > could not guarantee that it would match iter's ordering.
> > In any case, I think it's necessary to ensure qs[k] consistently
> > returns the same object for a given qs. I've opened a ticket for the
> > issue:
> This is only ever an issue if you are using an unsorted queryset, and by
> definition if it's unsorted there is no defined ordereding. Indeed
> BaseModelFormset was changed just yesterday to ensure there is always an
> ordering on the queryset to avoid just this issue.
> Alex
> --
> "I disapprove of what you say, but I will defend to the death your right to
> say it." --Voltaire
> "The people's good is the highest law."--Cicero
On Fri, Apr 24, 2009 at 9:09 AM, David Gouldin <dgoul...@gmail.com> wrote: > Lest you criticize accessing all queryset items by index, Django is > currently doing just that in > django.forms.models.BaseModelFormSet._construct_form. This means that > when creating an inline for a model without an order_by, Django can > (and has for me) paired forms in the inline formset with the wrong > model instance!
On Fri, Apr 24, 2009 at 10:24 AM, David Gouldin <dgoul...@gmail.com> wrote:
> There's a more subtle way to introduce the same bug. You can order on > a field that is not unique and run into the same issue: an undefined > return order due to duplicate values on the order column.
This is true, and that could lead to your model formset processing during POST incorrectly matching up POST data with model instances, still, even with the recent fix in place. The recent fix only addresses a subset (hopefully a large one) of the problem.
If we are going to consider fixing remaining problems in this area, I just want to note that there is an entirely different way to trigger mismatching POST data with model instances when using model formsets, that will continue to exist no matter how perfectly we solve this ordering issue. That is if the affected QuerySet changes between the time the formset is originally constructed during GET and when it is eventually POSTed back.
Say, for example, I am going to add my rating for a movie in some hypothetical Django app that lets people rate movies. I get the movie's rating page, which happens to have a model formset containing all existing ratings for that movie. Say there are already 5 ratings. However before I manage to fill in my data (maybe I'm interrupted by a question, go out to lunch, whatever) and hit "Save", some number of (lets say 4) other people have added their ratings.
Now when I come back and hit "Save" my POST data may not match up properly with the instance data from the current QuerySet. I post data for 6 entries (the 5 that existed plus mine), the QuerySet computed during POST and passed in to construct the formset will have 9. Things may turn out OK (though I haven't actually tried this) -- the data I post may be matched up properly with instances, if the ordering is set such that the 'new' ratings that have no matching POST data come last in the QuerySet. If not, though, my POSTed data may get matched to entirely different instances.
I believe the situation is worse if items are deleted in the interim. Maybe my app is tracking unresolved issues, and while I was off having lunch someone updated things such that when I get back there are only 2 unresolved issues plus the one I was adding, whereas when I got the page originally there were 5. In this case (though again I haven't tried it) I think the model formset creation code will index off the end of the QuerySet and cause a server error.
Essentially the model formset code seems to assume the QuerySet using during get and post processing is going to be exactly the same, but I think applications will have a hard time guaranteeing that. Unfortunately the failure modes when that assumption doesn't hold are likely to be pretty cryptic and hard to track down, at least at the moment.
I don't see how to solve the general problem at the framework level, but there may be some additional sanity checks we could add to help out here. For instance we could make the formset validation return False if the number of items in the passed QuerySet is not what we expect based on the data dictionary, and/or check (during validation?) that the pk values in the post data match up to the instance pk values after we've matched them up, etc. To anyone with a better knowledge of model fomsets than me: does this sort of additional validation sound reasonable?
Sorry for the tangent -- but this issue with indexing and ordering I think has only come up as many times as it has because of the model formset's use of this construct. If people are running into this index/order problem I fear they will also start running into these other potential problems with how model formsets match up data to instances. Also, if we look at addressing this other problem it may not be as necessary to worry about fixing whatever subset of cases isn't covered by the recent ordering fix -- that is, if the resulting error was clear enough, we could leave it up to the programmer to fix the situation (provide a unique ordering) rather than having to figure out inside model formset creation whether the ordering that exists is unique.
On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtra...@gmail.com> wrote:
> I don't see how to solve the general problem at the framework level, but > there may be some additional sanity checks we could add to help out here. > For instance we could make the formset validation return False if the number > of items in the passed QuerySet is not what we expect based on the data > dictionary, and/or check (during validation?) that the pk values in the post > data match up to the instance pk values after we've matched them up, etc. > To anyone with a better knowledge of model fomsets than me: does this sort > of additional validation sound reasonable?
I might as well put this out there in case someone crazy like Alex wants to try it. :) My thought for the hard way to fix the issue for formsets is to use the primary key as part of the form prefix and/or put the primary keys into hidden fields then use the data from the form to look up the objects. It's obviously going to be more complicated than that (was the object already deleted? is it allowed to be changed?), but it's an approach that could work I think.
> On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtra...@gmail.com> wrote:
>> I don't see how to solve the general problem at the framework level, but
>> there may be some additional sanity checks we could add to help out here.
>> For instance we could make the formset validation return False if the number
>> of items in the passed QuerySet is not what we expect based on the data
>> dictionary, and/or check (during validation?) that the pk values in the post
>> data match up to the instance pk values after we've matched them up, etc.
>> To anyone with a better knowledge of model fomsets than me: does this sort
>> of additional validation sound reasonable?
> I might as well put this out there in case someone crazy like Alex wants to
> try it. :) My thought for the hard way to fix the issue for formsets is to
> use the primary key as part of the form prefix and/or put the primary keys
> into hidden fields then use the data from the form to look up the objects.
> It's obviously going to be more complicated than that (was the object
> already deleted? is it allowed to be changed?), but it's an approach that
> could work I think.
> Joseph
We already have most of the pieces in place for this, we need to play with
the architecture a bit though. Any form in a ModelFormset already stores
the pk for it's object:
http://code.djangoproject.com/browser/django/trunk/django/forms/model....
However we don't use that when providing the instance variable:
http://code.djangoproject.com/browser/django/trunk/django/forms/model....
The issue is that that pk field doens't exist on the form until it's
been
constructed, so we need to duplicate a tiny bit of the logic to pull the pk
out of the data before the form has been constructed. I can take a look at
it this weekend(time permitting) unless someone else wants to do it. There
shouldn't be too much work that needs doing, but writing tests for this
might be a bit messy.
Alex
-- "I disapprove of what you say, but I will defend to the death your right to
say it." --Voltaire
"The people's good is the highest law."--Cicero
On Fri, Apr 24, 2009 at 4:41 PM, Alex Gaynor <alex.gay...@gmail.com> wrote: > On Fri, Apr 24, 2009 at 3:24 PM, Joseph Kocherhans <jkocherh...@gmail.com>wrote:
>> On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtra...@gmail.com> wrote:
>>> I don't see how to solve the general problem at the framework level, but >>> there may be some additional sanity checks we could add to help out here. >>> For instance we could make the formset validation return False if the number >>> of items in the passed QuerySet is not what we expect based on the data >>> dictionary, and/or check (during validation?) that the pk values in the post >>> data match up to the instance pk values after we've matched them up, etc. >>> To anyone with a better knowledge of model fomsets than me: does this sort >>> of additional validation sound reasonable?
>> I might as well put this out there in case someone crazy like Alex wants >> to try it. :) My thought for the hard way to fix the issue for formsets is >> to use the primary key as part of the form prefix and/or put the primary >> keys into hidden fields then use the data from the form to look up the >> objects. It's obviously going to be more complicated than that (was the >> object already deleted? is it allowed to be changed?), but it's an approach >> that could work I think.
We already have most of the pieces in place for this, we need to play with
> the architecture a bit though. Any form in a ModelFormset already stores > the pk for it's object: > http://code.djangoproject.com/browser/django/trunk/django/forms/model.... However we don't use that when providing the instance variable: > http://code.djangoproject.com/browser/django/trunk/django/forms/model.... The issue is that that pk field doens't exist on the form until it's been > constructed, so we need to duplicate a tiny bit of the logic to pull the pk > out of the data before the form has been constructed. I can take a look at > it this weekend(time permitting) unless someone else wants to do it. There > shouldn't be too much work that needs doing, but writing tests for this > might be a bit messy.
Note if we go with this approach (if I'm understanding it correctly) we are slightly changing the way in which we expect model formsets to be used. We currently document that the same QuerySet should be passed in during POST as GET:
Switching to an approach that uses the pk ids in the POST data for lookup means we'd:
- silently ignore a queryset argument when we're given a data dictionary? - issue some sort of deprecation warning about the now-unnecessary and unused parameter?
Also, it would be nice if we could accomplish the lookup in one DB call rather than looking up each individual item in a separate query. Having a model formset with hundreds of entries suddenly require hundreds of DB hits during POST processing instead of one isn't too nice. Not sure if you were planning that?
These two concerns pushed me more towards thinking that leaving the DB lookup as-is and adding some validation sanity checks for when the passed queryset doesn't match up with the POST data might be a "good enough" approach.
On Fri, Apr 24, 2009 at 5:49 PM, Karen Tracey <kmtra...@gmail.com> wrote:
> On Fri, Apr 24, 2009 at 4:41 PM, Alex Gaynor <alex.gay...@gmail.com>wrote:
>> On Fri, Apr 24, 2009 at 3:24 PM, Joseph Kocherhans <jkocherh...@gmail.com
>> > wrote:
>>> On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtra...@gmail.com>wrote:
>>>> I don't see how to solve the general problem at the framework level, but
>>>> there may be some additional sanity checks we could add to help out here.
>>>> For instance we could make the formset validation return False if the number
>>>> of items in the passed QuerySet is not what we expect based on the data
>>>> dictionary, and/or check (during validation?) that the pk values in the post
>>>> data match up to the instance pk values after we've matched them up, etc.
>>>> To anyone with a better knowledge of model fomsets than me: does this sort
>>>> of additional validation sound reasonable?
>>> I might as well put this out there in case someone crazy like Alex wants
>>> to try it. :) My thought for the hard way to fix the issue for formsets is
>>> to use the primary key as part of the form prefix and/or put the primary
>>> keys into hidden fields then use the data from the form to look up the
>>> objects. It's obviously going to be more complicated than that (was the
>>> object already deleted? is it allowed to be changed?), but it's an approach
>>> that could work I think.
> We already have most of the pieces in place for this, we need to play with
>> the architecture a bit though. Any form in a ModelFormset already stores
>> the pk for it's object:
>> http://code.djangoproject.com/browser/django/trunk/django/forms/model.... However we don't use that when providing the instance variable:
>> http://code.djangoproject.com/browser/django/trunk/django/forms/model.... The issue is that that pk field doens't exist on the form until it's been
>> constructed, so we need to duplicate a tiny bit of the logic to pull the pk
>> out of the data before the form has been constructed. I can take a look at
>> it this weekend(time permitting) unless someone else wants to do it. There
>> shouldn't be too much work that needs doing, but writing tests for this
>> might be a bit messy.
> Note if we go with this approach (if I'm understanding it correctly) we are
> slightly changing the way in which we expect model formsets to be used. We
> currently document that the same QuerySet should be passed in during POST as
> GET:
> Switching to an approach that uses the pk ids in the POST data for lookup
> means we'd:
> - silently ignore a queryset argument when we're given a data dictionary?
My approach would be to use only items in the iniital queryset argument.
> - issue some sort of deprecation warning about the now-unnecessary and
> unused parameter?
> Also, it would be nice if we could accomplish the lookup in one DB call
> rather than looking up each individual item in a separate query. Having a
> model formset with hundreds of entries suddenly require hundreds of DB hits
> during POST processing instead of one isn't too nice. Not sure if you were
> planning that?
My appraoch was basically going to be
pk = something
kwargs['instance'] = [o for o in self.queryset if o.pk == pk][0]
Which would only be 1 query, and basically has the same behavior as we do
now, but correctly :).
> These two concerns pushed me more towards thinking that leaving the DB
> lookup as-is and adding some validation sanity checks for when the passed
> queryset doesn't match up with the POST data might be a "good enough"
> approach.
> Karen
Alex
-- "I disapprove of what you say, but I will defend to the death your right to
say it." --Voltaire
"The people's good is the highest law."--Cicero
On Fri, Apr 24, 2009 at 4:49 PM, Karen Tracey <kmtra...@gmail.com> wrote:
> Note if we go with this approach (if I'm understanding it correctly) we are > slightly changing the way in which we expect model formsets to be used. We > currently document that the same QuerySet should be passed in during POST as > GET:
> Switching to an approach that uses the pk ids in the POST data for lookup > means we'd:
> - silently ignore a queryset argument when we're given a data dictionary? > - issue some sort of deprecation warning about the now-unnecessary and > unused parameter?
> Also, it would be nice if we could accomplish the lookup in one DB call > rather than looking up each individual item in a separate query. Having a > model formset with hundreds of entries suddenly require hundreds of DB hits > during POST processing instead of one isn't too nice. Not sure if you were > planning that?
> These two concerns pushed me more towards thinking that leaving the DB > lookup as-is and adding some validation sanity checks for when the passed > queryset doesn't match up with the POST data might be a "good enough" > approach.
My thought on this was to continue to use the queryset argument and do something like qs.filter(pk__in=(list_of_pks)) for access restriction in case someone decided it would be fun to play with the POST data and try to change objects they shouldn't be allowed to.
On Fri, Apr 24, 2009 at 5:52 PM, Alex Gaynor <alex.gay...@gmail.com> wrote: > On Fri, Apr 24, 2009 at 5:49 PM, Karen Tracey <kmtra...@gmail.com> wrote:
> Also, it would be nice if we could accomplish the lookup in one DB call >> rather than looking up each individual item in a separate query. Having a >> model formset with hundreds of entries suddenly require hundreds of DB hits >> during POST processing instead of one isn't too nice. Not sure if you were >> planning that?
> My appraoch was basically going to be
> pk = something > kwargs['instance'] = [o for o in self.queryset if o.pk == pk][0]
> Which would only be 1 query, and basically has the same behavior as we do > now, but correctly :).
Ah, OK I see now (I think). So you would still use the queryset arg, just make sure the results got matched up by pk correctly. (And still index off the end if an item had been deleted in between GET and POST....) The n repetitions of a linear search through the n-item queryset to find the right item doesn't thrill me, but I suppose it is better than n DB hits.
On Fri, Apr 24, 2009 at 6:02 PM, Karen Tracey <kmtra...@gmail.com> wrote:
> On Fri, Apr 24, 2009 at 5:52 PM, Alex Gaynor <alex.gay...@gmail.com>wrote:
>> On Fri, Apr 24, 2009 at 5:49 PM, Karen Tracey <kmtra...@gmail.com> wrote:
>> Also, it would be nice if we could accomplish the lookup in one DB call
>>> rather than looking up each individual item in a separate query. Having a
>>> model formset with hundreds of entries suddenly require hundreds of DB hits
>>> during POST processing instead of one isn't too nice. Not sure if you were
>>> planning that?
>> My appraoch was basically going to be
>> pk = something
>> kwargs['instance'] = [o for o in self.queryset if o.pk == pk][0]
>> Which would only be 1 query, and basically has the same behavior as we do
>> now, but correctly :).
> Ah, OK I see now (I think). So you would still use the queryset arg, just
> make sure the results got matched up by pk correctly. (And still index off
> the end if an item had been deleted in between GET and POST....) The n
> repetitions of a linear search through the n-item queryset to find the right
> item doesn't thrill me, but I suppose it is better than n DB hits.
> Karen
Yeah, that was a pretty lazy O(n^2) algorithm, the smart way to do it would
be:
if not hasattr(self, '_queryset_dict'):
self._queryset_dict = dict([(o.pk, o) for o in self.queryset])
pk = something
kwargs['instance'] = self._queryset_dict[pk]
Alex
-- "I disapprove of what you say, but I will defend to the death your right to
say it." --Voltaire
"The people's good is the highest law."--Cicero
> On Fri, Apr 24, 2009 at 4:49 PM, Karen Tracey <kmtra...@gmail.com> wrote:
>> Note if we go with this approach (if I'm understanding it correctly) we >> are slightly changing the way in which we expect model formsets to be used. >> We currently document that the same QuerySet should be passed in during POST >> as GET:
>> Switching to an approach that uses the pk ids in the POST data for lookup >> means we'd:
>> - silently ignore a queryset argument when we're given a data dictionary? >> - issue some sort of deprecation warning about the now-unnecessary and >> unused parameter?
>> Also, it would be nice if we could accomplish the lookup in one DB call >> rather than looking up each individual item in a separate query. Having a >> model formset with hundreds of entries suddenly require hundreds of DB hits >> during POST processing instead of one isn't too nice. Not sure if you were >> planning that?
>> These two concerns pushed me more towards thinking that leaving the DB >> lookup as-is and adding some validation sanity checks for when the passed >> queryset doesn't match up with the POST data might be a "good enough" >> approach.
> My thought on this was to continue to use the queryset argument and do > something like qs.filter(pk__in=(list_of_pks)) for access restriction in > case someone decided it would be fun to play with the POST data and try to > change objects they shouldn't be allowed to.
It'd be nice if that construct were guaranteed to return the objects is the order in list_of_pks...but I don't think that's guaranteed?
At any rate you're both continuing to use the queryset arg and doing the lookup in one query, so my concerns as to this approach were unfounded.
On Fri, Apr 24, 2009 at 6:05 PM, Alex Gaynor <alex.gay...@gmail.com> wrote: > Yeah, that was a pretty lazy O(n^2) algorithm, the smart way to do it would > be:
> if not hasattr(self, '_queryset_dict'): > self._queryset_dict = dict([(o.pk, o) for o in self.queryset]) > pk = something > kwargs['instance'] = self._queryset_dict[pk]
OK, I'm convinced you can make it more efficient for large n. (Still don't want to deal with items that have been deleted though, huh?) Anyway I opened a ticker for this -- #10922. Not sure it needs to be solved for 1.1...we've got a bunch of other open tickets still to fix and I'm not sure how common running into this problem is....