Issue happens when prefetching a model relation from another one, linked
by a M2M field. Sample case:
{{{#!python
class Picture(Model):
# some fields
class Collection(Model):
pictures = ManyToManyField(Picture, blank=True,
related_name='collections')
# some fields
class CollectionPermission(Model):
content_object = ForeignKey(Collection,
related_name='permissions')
# fk to users and fk to permissions
}}}
When prefetching collection permissions from a picture queryset, like
this:
{{{#!python
Picture.objects.prefetch_related('collections__permissions')
}}}
The third query (the second level of prefetching) looks like this:
{{{#!sql
SELECT "gallery_collectionpermission"."id",
"gallery_collectionpermission"."user_id",
"gallery_collectionpermission"."permission_id",
"gallery_collectionpermission"."content_object_id"
FROM "gallery_collectionpermission"
WHERE "gallery_collectionpermission"."content_object_id" IN (
11, 11, 11, 11,
11, 11, 11, 11,
11, 11, 10, 10,
10, 10, 10, 10,
10, 10, 10, 10,
10, 10, 10, 10,
10, 10, 10, 10,
10, 10, 10, 10,
10, 10, 9, 9,
9, 9, 9 )
}}}
The issue arises in two steps:
* As prefetch has no identity mapping, the first prefetching step will
build a different Collection instance for every picture that uses it.
* Then the second prefetching step will gather all instances, and dump
their ids to the database without looking for duplicates.
Ideally, this could be fixed at the first step: when a queryset prefetches
the same row several times, it would make sense to return a single
instance. In this example, all Pictures' cache should point to the same
Gallery instances.
The second step could also be fixed: the related manager could eliminate
duplicates when building the query. For some reason the
ReverseSingleRelatedObjectDescriptor does it, but none of the other
descriptors or managers do.
The cost seems huge. Typically, I have collections of hundreds of
pictures, but each picture will only appear in 4 or 5 collections. The
query thus sends the db server tens of thousands of ids instead of a few
dozens.
Is there something I missed? Or is this use case out of the normal scope
of `prefetch_related`?
--
Ticket URL: <https://code.djangoproject.com/ticket/25544>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* stage: Unreviewed => Accepted
* type: Uncategorized => Cleanup/optimization
* needs_tests: => 0
* needs_docs: => 0
Comment:
I'm not too familiar with this, but accepting on the basis for further
investigation.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:1>
* owner: nobody => Ian-Foote
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:2>
* stage: Accepted => Ready for checkin
Comment:
Patch is ready, IMO. Waiting for actual merge until the CI catches up...
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:3>
* needs_better_patch: 0 => 1
* has_patch: 0 => 1
* stage: Ready for checkin => Accepted
Comment:
`postgres_tests.test_array.TestQuerying.test_in` fails.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:4>
Comment (by carljm):
@IanFoote Oops, my fault. I didn't consider the possibility that non-
hashable values (e.g. lists for an ArrayField) could be passed to `__in`
lookups. I don't really see a way to preserve this fix in the face of that
possibility (going through the value and trying to convert everything to
something hashable doesn't seem like a good plan), so unless you have any
ideas I'm not seeing, I think we may need to revert to your previous fix
at the `prefetch_related` layer.
Sorry for the bad direction!
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:5>
Comment (by charettes):
@carljm, what do you think of simply special casing `ArrayField`'s `__in`
lookup which is something we already do for `__exact` anyway.
Something like the following comes to mind.
{{{#!python
@ArrayField.register_lookup
class ArrayInLookup(InLookup):
def get_prep_lookup(self):
values = super(ArrayInLookup, self).get_prep_lookup()
return [tuple(value) for value in values]
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:6>
Comment (by carljm):
@charettes I think that's a good idea! As long as we're confident that a)
there are no other valid cases outside ArrayField where an unhashable
value might be passed in an `__in` lookup rhs, and b) every value passed
to an ArrayField lookup should always be iterable. But I can't come up
with any counter-examples to either of those.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:7>
Comment (by akaariai):
A possibility is to try-except the deduping of the values. If it doesn't
succeed, just use the original values.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:8>
Comment (by spectras):
I do not know if it is related, but ̀ArrayField.get_db_prep_value` always
converts tuples to lists.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:9>
Comment (by charettes):
Might be worth pinging Mark about this.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:10>
* needs_better_patch: 1 => 0
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:11>
* stage: Accepted => Ready for checkin
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:12>
* needs_better_patch: 0 => 1
* stage: Ready for checkin => Accepted
Comment:
Latest comments on the PR indicates a few more tests should be added.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:13>
* needs_better_patch: 1 => 0
* version: 1.8 => master
Comment:
The comments have been addressed.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:14>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"86eccdc8b67728d84440a46e5bf62c78f2eddf6d" 86eccdc8]:
{{{
#!CommitTicketReference repository=""
revision="86eccdc8b67728d84440a46e5bf62c78f2eddf6d"
Fixed #25544 -- Removed duplicate ids in prefetch_related() queries.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:15>
Comment (by Nick Pope):
The fix to this resulted in a regression where parameters passed to `__in`
were reordered due to the use of `set()`. See #29503.
--
Ticket URL: <https://code.djangoproject.com/ticket/25544#comment:16>