For QuerySet qs, list(qs)[k] does not always equal qs[k]

25 views
Skip to first unread message

David Gouldin

unread,
Apr 24, 2009, 10:09:45 AM4/24/09
to Django developers
I've run into a bit of an odd problem with the ORM. I have a model
without an order_by in its Meta. When I create a queryset and iter
through it, I get the items in a sensical fashion. However, when I
access the the queryset items by index (qs[0], qs[1], etc), not only
do I get the items back in a different order from iter, I occasionally
get the same item multiple times!

So I dug into the QuerySet class, and I think I understand why this is
happening. On __getitem__ (i.e. qs[k]), a limit/offset is applied to
a clone of the queryset. While I think the intent is to retrieve the
kth item from the queryset just as it would have been if reached in
the process of an iter (in other words, list(qs)[k] == qs[k] for all
k), the limit/offset method of retrieval is inherently different due
to a lack of ordering in the database.

Lest you criticize accessing all queryset items by index, Django is
currently doing just that in
django.forms.models.BaseModelFormSet._construct_form. This means that
when creating an inline for a model without an order_by, Django can
(and has for me) paired forms in the inline formset with the wrong
model instance!

As a solution to this problem, I propose that all querysets include an
order by on their primary key. This will not only guarantee that
duplicate items are not retrieved from getitem; it will also mean that
the answer from iter will match getitem's answer.

For those turned off by the idea of adding an order_by to all
querysets, perhaps a compromise can be reached by only adding it to
the cloned queryset created by getitem. That way, all items returned
by iter(qs) would also be returned in qs[k] for all k, even if we
could not guarantee that it would match iter's ordering.

In any case, I think it's necessary to ensure qs[k] consistently
returns the same object for a given qs. I've opened a ticket for the
issue:

http://code.djangoproject.com/ticket/10915

Alex Gaynor

unread,
Apr 24, 2009, 10:14:26 AM4/24/09
to django-d...@googlegroups.com
This is only ever an issue if you are using an unsorted queryset, and by definition if it's unsorted there is no defined ordereding.  Indeed BaseModelFormset was changed just yesterday to ensure there is always an ordering on the queryset to avoid just this issue.

Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." --Voltaire
"The people's good is the highest law."--Cicero

David Gouldin

unread,
Apr 24, 2009, 10:24:33 AM4/24/09
to Django developers
There's a more subtle way to introduce the same bug. You can order on
a field that is not unique and run into the same issue: an undefined
return order due to duplicate values on the order column.

On Apr 24, 9:14 am, Alex Gaynor <alex.gay...@gmail.com> wrote:

Jacob Kaplan-Moss

unread,
Apr 24, 2009, 11:05:39 AM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 9:09 AM, David Gouldin <dgou...@gmail.com> wrote:
> Lest you criticize accessing all queryset items by index, Django is
> currently doing just that in
> django.forms.models.BaseModelFormSet._construct_form.  This means that
> when creating an inline for a model without an order_by, Django can
> (and has for me) paired forms in the inline formset with the wrong
> model instance!

I actually fixed that yesterday: http://code.djangoproject.com/changeset/10625

1.1 now includes a ``QuerySet.ordered`` property which FormSet uses to
artificially apply ordering, if needed. You can -- and should -- do
the same.

Jacob

Karen Tracey

unread,
Apr 24, 2009, 3:01:59 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 10:24 AM, David Gouldin <dgou...@gmail.com> wrote:

There's a more subtle way to introduce the same bug.  You can order on
a field that is not unique and run into the same issue: an undefined
return order due to duplicate values on the order column.

This is true, and that could lead to your model formset processing during POST incorrectly matching up POST data with model instances, still, even with the recent fix in place.  The recent fix only addresses a subset (hopefully a large one) of the problem.

If we are going to consider fixing remaining problems in this area, I just want to note that there is an entirely different way to trigger mismatching POST data with model instances when using model formsets, that will continue to exist no matter how perfectly we solve this ordering issue.  That is if the affected QuerySet changes between the time the formset is originally constructed during GET and when it is eventually POSTed back. 

Say, for example, I am going to add my rating for a movie in some hypothetical Django app that lets people rate movies.  I get the movie's rating page, which happens to have a model formset containing all existing ratings for that movie.  Say there are already 5 ratings.  However before I manage to fill in my data (maybe I'm interrupted by a question, go out to lunch, whatever) and hit "Save", some number of (lets say 4) other people have added their ratings. 

Now when I come back and hit "Save" my POST data may not match up properly with the instance data from the current QuerySet. I post data for 6 entries (the 5 that existed plus mine), the QuerySet computed during POST and passed in to construct the formset will have 9.  Things may turn out OK (though I haven't actually tried this) -- the data I post may be matched up properly with instances, if the ordering is set such that the 'new' ratings that have no matching POST data come last in the QuerySet.  If not, though, my POSTed data may get matched to entirely different instances.

I believe the situation is worse if items are deleted in the interim.  Maybe my app is tracking unresolved issues, and while I was off having lunch someone updated things such that when I get back there are only 2 unresolved issues plus the one I was adding, whereas when I got the page originally there were 5.  In this case (though again I haven't tried it) I think the model formset creation code will index off the end of the QuerySet and cause a server error.

Essentially the model formset code seems to assume the QuerySet using during get and post processing is going to be exactly the same, but I think applications will have a hard time guaranteeing that.  Unfortunately the failure modes when that assumption doesn't hold are likely to be pretty cryptic and hard to track down, at least at the moment. 

I don't see how to solve the general problem at the framework level, but there may be some additional sanity checks we could add to help out here.  For instance we could make the formset validation return False if the number of items in the passed QuerySet is not what we expect based on the data dictionary, and/or check (during validation?) that the pk values in the post data match up to the instance pk values after we've matched them up, etc.  To anyone with a better knowledge of model fomsets than me: does this sort of additional validation sound reasonable? 

Sorry for the tangent -- but this issue with indexing and ordering I think has only come up as many times as it has because of the model formset's use of this construct.  If people are running into this index/order problem I fear they will also start running into these other potential problems with how model formsets match up data to instances.  Also, if we look at addressing this other problem it may not be as necessary to worry about fixing whatever subset of cases isn't covered by the recent ordering fix -- that is, if the resulting error was clear enough, we could leave it up to the programmer to fix the situation (provide a unique ordering) rather than having to figure out inside model formset creation whether the ordering that exists is unique.

Karen

Joseph Kocherhans

unread,
Apr 24, 2009, 3:24:55 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtr...@gmail.com> wrote:

I don't see how to solve the general problem at the framework level, but there may be some additional sanity checks we could add to help out here.  For instance we could make the formset validation return False if the number of items in the passed QuerySet is not what we expect based on the data dictionary, and/or check (during validation?) that the pk values in the post data match up to the instance pk values after we've matched them up, etc.  To anyone with a better knowledge of model fomsets than me: does this sort of additional validation sound reasonable? 

I might as well put this out there in case someone crazy like Alex wants to try it. :) My thought for the hard way to fix the issue for formsets is to use the primary key as part of the form prefix and/or put the primary keys into hidden fields then use the data from the form to look up the objects. It's obviously going to be more complicated than that (was the object already deleted? is it allowed to be changed?), but it's an approach that could work I think.

Joseph

Alex Gaynor

unread,
Apr 24, 2009, 4:41:04 PM4/24/09
to django-d...@googlegroups.com
We already have most of the pieces in place for this, we need to play with the architecture a bit though.  Any form in a ModelFormset already stores the pk for it's object: http://code.djangoproject.com/browser/django/trunk/django/forms/models.py#L489 .  However we don't use that when providing the instance variable: http://code.djangoproject.com/browser/django/trunk/django/forms/models.py#L398 .  The issue is that that pk field doens't exist on the form until it's been constructed, so we need to duplicate a tiny bit of the logic to pull the pk out of the data before the form has been constructed.  I can take a look at it this weekend(time permitting) unless someone else wants to do it.  There shouldn't be too much work that needs doing, but writing tests for this might be a bit messy.

Karen Tracey

unread,
Apr 24, 2009, 5:49:56 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 4:41 PM, Alex Gaynor <alex....@gmail.com> wrote:
On Fri, Apr 24, 2009 at 3:24 PM, Joseph Kocherhans <jkoch...@gmail.com> wrote:
On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtr...@gmail.com> wrote:

I don't see how to solve the general problem at the framework level, but there may be some additional sanity checks we could add to help out here.  For instance we could make the formset validation return False if the number of items in the passed QuerySet is not what we expect based on the data dictionary, and/or check (during validation?) that the pk values in the post data match up to the instance pk values after we've matched them up, etc.  To anyone with a better knowledge of model fomsets than me: does this sort of additional validation sound reasonable? 

I might as well put this out there in case someone crazy like Alex wants to try it. :) My thought for the hard way to fix the issue for formsets is to use the primary key as part of the form prefix and/or put the primary keys into hidden fields then use the data from the form to look up the objects. It's obviously going to be more complicated than that (was the object already deleted? is it allowed to be changed?), but it's an approach that could work I think.
 
We already have most of the pieces in place for this, we need to play with the architecture a bit though.  Any form in a ModelFormset already stores the pk for it's object: http://code.djangoproject.com/browser/django/trunk/django/forms/models.py#L489 .  However we don't use that when providing the instance variable: http://code.djangoproject.com/browser/django/trunk/django/forms/models.py#L398 .  The issue is that that pk field doens't exist on the form until it's been constructed, so we need to duplicate a tiny bit of the logic to pull the pk out of the data before the form has been constructed.  I can take a look at it this weekend(time permitting) unless someone else wants to do it.  There shouldn't be too much work that needs doing, but writing tests for this might be a bit messy.

Note if we go with this approach (if I'm understanding it correctly) we are slightly changing the way in which we expect model formsets to be used.  We currently document that the same QuerySet should be passed in during POST as GET:

http://docs.djangoproject.com/en/dev/topics/forms/modelforms/#using-a-custom-queryset

Switching to an approach that uses the pk ids in the POST data for lookup means we'd:

- silently ignore a queryset argument when we're given a data dictionary?
- issue some sort of deprecation warning about the now-unnecessary and unused parameter?

Also, it would be nice if we could accomplish the lookup in one DB call rather than looking up each individual item in a separate query.  Having a model formset with hundreds of entries suddenly require hundreds of DB hits during POST processing instead of one isn't too nice.  Not sure if you were planning that?

These two concerns pushed me more towards thinking that leaving the DB lookup as-is and adding some validation sanity checks for when the passed queryset doesn't match up with the POST data might be a "good enough" approach.

Karen

Alex Gaynor

unread,
Apr 24, 2009, 5:52:38 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 5:49 PM, Karen Tracey <kmtr...@gmail.com> wrote:
On Fri, Apr 24, 2009 at 4:41 PM, Alex Gaynor <alex....@gmail.com> wrote:
On Fri, Apr 24, 2009 at 3:24 PM, Joseph Kocherhans <jkoch...@gmail.com> wrote:
On Fri, Apr 24, 2009 at 2:01 PM, Karen Tracey <kmtr...@gmail.com> wrote:

I don't see how to solve the general problem at the framework level, but there may be some additional sanity checks we could add to help out here.  For instance we could make the formset validation return False if the number of items in the passed QuerySet is not what we expect based on the data dictionary, and/or check (during validation?) that the pk values in the post data match up to the instance pk values after we've matched them up, etc.  To anyone with a better knowledge of model fomsets than me: does this sort of additional validation sound reasonable? 

I might as well put this out there in case someone crazy like Alex wants to try it. :) My thought for the hard way to fix the issue for formsets is to use the primary key as part of the form prefix and/or put the primary keys into hidden fields then use the data from the form to look up the objects. It's obviously going to be more complicated than that (was the object already deleted? is it allowed to be changed?), but it's an approach that could work I think.
 
We already have most of the pieces in place for this, we need to play with the architecture a bit though.  Any form in a ModelFormset already stores the pk for it's object: http://code.djangoproject.com/browser/django/trunk/django/forms/models.py#L489 .  However we don't use that when providing the instance variable: http://code.djangoproject.com/browser/django/trunk/django/forms/models.py#L398 .  The issue is that that pk field doens't exist on the form until it's been constructed, so we need to duplicate a tiny bit of the logic to pull the pk out of the data before the form has been constructed.  I can take a look at it this weekend(time permitting) unless someone else wants to do it.  There shouldn't be too much work that needs doing, but writing tests for this might be a bit messy.

Note if we go with this approach (if I'm understanding it correctly) we are slightly changing the way in which we expect model formsets to be used.  We currently document that the same QuerySet should be passed in during POST as GET:

http://docs.djangoproject.com/en/dev/topics/forms/modelforms/#using-a-custom-queryset

Switching to an approach that uses the pk ids in the POST data for lookup means we'd:

- silently ignore a queryset argument when we're given a data dictionary?

My approach would be to use only items in the iniital queryset argument.
 

- issue some sort of deprecation warning about the now-unnecessary and unused parameter?

Also, it would be nice if we could accomplish the lookup in one DB call rather than looking up each individual item in a separate query.  Having a model formset with hundreds of entries suddenly require hundreds of DB hits during POST processing instead of one isn't too nice.  Not sure if you were planning that?

My appraoch was basically going to be

pk = something
kwargs['instance'] = [o for o in self.queryset if o.pk == pk][0]
 

Which would only be 1 query, and basically has the same behavior as we do now, but correctly :).



These two concerns pushed me more towards thinking that leaving the DB lookup as-is and adding some validation sanity checks for when the passed queryset doesn't match up with the POST data might be a "good enough" approach.

Karen


Joseph Kocherhans

unread,
Apr 24, 2009, 5:59:35 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 4:49 PM, Karen Tracey <kmtr...@gmail.com> wrote:

Note if we go with this approach (if I'm understanding it correctly) we are slightly changing the way in which we expect model formsets to be used.  We currently document that the same QuerySet should be passed in during POST as GET:

http://docs.djangoproject.com/en/dev/topics/forms/modelforms/#using-a-custom-queryset

Switching to an approach that uses the pk ids in the POST data for lookup means we'd:

- silently ignore a queryset argument when we're given a data dictionary?
- issue some sort of deprecation warning about the now-unnecessary and unused parameter?

Also, it would be nice if we could accomplish the lookup in one DB call rather than looking up each individual item in a separate query.  Having a model formset with hundreds of entries suddenly require hundreds of DB hits during POST processing instead of one isn't too nice.  Not sure if you were planning that?

These two concerns pushed me more towards thinking that leaving the DB lookup as-is and adding some validation sanity checks for when the passed queryset doesn't match up with the POST data might be a "good enough" approach.

My thought on this was to continue to use the queryset argument and do something like qs.filter(pk__in=(list_of_pks)) for access restriction in case someone decided it would be fun to play with the POST data and try to change objects they shouldn't be allowed to.

Joseph 

Karen Tracey

unread,
Apr 24, 2009, 6:02:11 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 5:52 PM, Alex Gaynor <alex....@gmail.com> wrote:
On Fri, Apr 24, 2009 at 5:49 PM, Karen Tracey <kmtr...@gmail.com> wrote:

Also, it would be nice if we could accomplish the lookup in one DB call rather than looking up each individual item in a separate query.  Having a model formset with hundreds of entries suddenly require hundreds of DB hits during POST processing instead of one isn't too nice.  Not sure if you were planning that?

My appraoch was basically going to be

pk = something
kwargs['instance'] = [o for o in self.queryset if o.pk == pk][0]
 

Which would only be 1 query, and basically has the same behavior as we do now, but correctly :).

Ah, OK I see now (I think). So you would still use the queryset arg, just make sure the results got matched up by pk correctly.  (And still index off the end if an item had been deleted in between GET and POST....)  The n repetitions of a linear search through the n-item queryset to find the right item doesn't thrill me, but I suppose it is better than n DB hits.

Karen

Alex Gaynor

unread,
Apr 24, 2009, 6:05:46 PM4/24/09
to django-d...@googlegroups.com
Yeah, that was a pretty lazy O(n^2) algorithm, the smart way to do it would be:

if not hasattr(self, '_queryset_dict'):
    self._queryset_dict = dict([(o.pk, o) for o in self.queryset])
pk = something
kwargs['instance'] = self._queryset_dict[pk]

Karen Tracey

unread,
Apr 24, 2009, 6:06:41 PM4/24/09
to django-d...@googlegroups.com

It'd be nice if that construct were guaranteed to return the objects is the order in list_of_pks...but I don't think that's guaranteed?

At any rate you're both continuing to use the queryset arg and doing the lookup in one query, so my concerns as to this approach were unfounded.

Karen

 

Karen Tracey

unread,
Apr 24, 2009, 6:27:16 PM4/24/09
to django-d...@googlegroups.com
On Fri, Apr 24, 2009 at 6:05 PM, Alex Gaynor <alex....@gmail.com> wrote:
Yeah, that was a pretty lazy O(n^2) algorithm, the smart way to do it would be:

if not hasattr(self, '_queryset_dict'):
    self._queryset_dict = dict([(o.pk, o) for o in self.queryset])
pk = something
kwargs['instance'] = self._queryset_dict[pk]


OK, I'm convinced you can make it more efficient for large n.  (Still don't want to deal with items that have been deleted though, huh?)  Anyway I opened a ticker for this -- #10922.  Not sure it needs to be solved for 1.1...we've got a bunch of other open tickets still to fix and I'm not sure how common running into this problem is....

Karen
Reply all
Reply to author
Forward
0 new messages