{{{
class Movie(EditableBase, TimestampBase):
    year = models.PositiveSmallIntegerField(u"ano", null=True)
    country = models.CharField(u"país", max_length=100, blank=True,
        null=True)
    release_date = models.DateField(u"data de estréia", blank=True,
        null=True)
    length = models.PositiveSmallIntegerField(u"duração", blank=True,
        null=True)
    rating = models.PositiveIntegerField(u"avaliação", blank=True,
        null=True, choices=RATING_CHOICES)
    suitable_for = models.PositiveSmallIntegerField(u"censura",
        blank=True, null=True, choices=AGE_CHOICES)
    website_original = models.URLField(u"website", max_length=255,
        blank=True, null=True)
    website_national = models.URLField(u"website (nacional)",
        max_length=255, blank=True, null=True)
    synopsis = models.TextField(u"sinopse", blank=True)
    cover = models.ImageField(u"poster",
        upload_to=u'uploads/fichatecnica/', blank=True, null=True)
    genres = models.ManyToManyField(MovieGenre,
        verbose_name=u"gêneros")

class MovieTheaterSession(models.Model):
    start_date = models.DateField(u"início")
    end_date = models.DateField(u"fim")
    movie = models.ForeignKey(Movie, verbose_name=u"filme",
        related_name='sessions')
}}}
If we annotate the query to count the number of sessions:
{{{
>>> Movie.objects.all().annotate(sessions_count=Count('sessions')).order_by('release_date')[:60]
}}}
Django generates pretty inefficient SQL by grouping by every field of
the parent model:
{{{
SELECT "movies_movie"."id", "movies_movie"."creation_date",
"movies_movie"."modification_date", "movies_movie"."year",
"movies_movie"."country", "movies_movie"."release_date",
"movies_movie"."length", "movies_movie"."rating",
"movies_movie"."suitable_for", "movies_movie"."website_original",
"movies_movie"."website_national", "movies_movie"."synopsis",
"movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS
"sessions_count" FROM "movies_movie" LEFT OUTER JOIN
"theaters_movietheatersession" ON ("movies_movie"."id" =
"theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id",
"movies_movie"."creation_date", "movies_movie"."modification_date",
"movies_movie"."year", "movies_movie"."country",
"movies_movie"."release_date", "movies_movie"."length",
"movies_movie"."rating", "movies_movie"."suitable_for",
"movies_movie"."website_original", "movies_movie"."website_national",
"movies_movie"."synopsis", "movies_movie"."cover" ORDER BY
"movies_movie"."release_date" ASC LIMIT 60
Time: 892,122 ms
}}}
EXPLAIN shows the database is sorting all fields from the GROUP BY clause:
{{{
Limit  (cost=2250.37..2250.52 rows=60 width=502)
  ->  Sort  (cost=2250.37..2256.15 rows=2311 width=502)
        Sort Key: (count(theaters_movietheatersession.id)), movies_movie.release_date
        ->  GroupAggregate  (cost=1975.47..2170.56 rows=2311 width=502)
              ->  Sort  (cost=1975.47..1986.94 rows=4586 width=502)
                    Sort Key: movies_movie.id, movies_movie.creation_date, movies_movie.modification_date, movies_movie.year, movies_movie.country, movies_movie.release_date, movies_movie.length, movies_movie.rating, movies_movie.suitable_for, movies_movie.website_original, movies_movie.website_national, movies_movie.synopsis, movies_movie.cover
                    ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                          Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                          ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                          ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)
}}}
It suffices to GROUP BY the PK field:
{{{
SELECT "movies_movie"."id", "movies_movie"."creation_date",
"movies_movie"."modification_date", "movies_movie"."year",
"movies_movie"."country", "movies_movie"."release_date",
"movies_movie"."length", "movies_movie"."rating",
"movies_movie"."suitable_for", "movies_movie"."website_original",
"movies_movie"."website_national", "movies_movie"."synopsis",
"movies_movie"."cover", COUNT("theaters_movietheatersession"."id") AS
"sessions_count" FROM "movies_movie" LEFT OUTER JOIN
"theaters_movietheatersession" ON ("movies_movie"."id" =
"theaters_movietheatersession"."movie_id") GROUP BY "movies_movie"."id"
ORDER BY "movies_movie"."release_date" ASC LIMIT 60
Time: 16,285 ms
}}}
EXPLAIN shows the database no longer needs to sort on the GROUP BY fields:
{{{
Limit  (cost=786.42..786.57 rows=60 width=502)
  ->  Sort  (cost=786.42..792.20 rows=2311 width=502)
        Sort Key: movies_movie.release_date
        ->  GroupAggregate  (cost=0.00..706.62 rows=2311 width=502)
              ->  Merge Left Join  (cost=0.00..660.58 rows=4586 width=502)
                    Merge Cond: (movies_movie.id = theaters_movietheatersession.movie_id)
                    ->  Index Scan using movies_movie_pkey on movies_movie  (cost=0.00..283.14 rows=2311 width=498)
                    ->  Index Scan using theaters_movietheatersession_movie_id on theaters_movietheatersession  (cost=0.00..314.34 rows=4586 width=8)
}}}
I've tried fixing it by setting the undocumented `group_by` attribute, but
Django doesn't seem to pick it up:
{{{
class MovieManager(models.Manager):
    def with_sessions(self):
        qs = self.get_query_set().annotate(
            sessions_count=Count('sessions'))
        qs.group_by = ['id']
        return qs
}}}
I'm using Django 1.4 and PostgreSQL 9.1.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
Comment:
We already have logic to do this in MySQL, it's just a matter of making
the PostgreSQL backend set the right flag (note that this is new in a
recent version of PostgreSQL, 9.1?) so we need to make sure it's only set
there.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:1>
Comment (by hcarvalhoalves):
Would it be a matter of adding `allows_group_by_pk = True` to
`DatabaseFeatures` at
https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L111
?
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:2>
Comment (by Alex):
That's the right flag, not sure that's the right place off the top of my
head.
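As a rough illustration of the flag being discussed, here is a self-contained sketch, not Django's actual code. `BaseFeatures`, `MySQLFeatures`, and `collapse_group_by` are hypothetical stand-ins; only the flag name `allows_group_by_pk` comes from this ticket. The idea is that when a backend advertises the flag and the GROUP BY list covers every column of the model, the list can be collapsed to the primary key alone.

```python
class BaseFeatures:
    # Default assumption: every selected column must appear in GROUP BY.
    allows_group_by_pk = False


class MySQLFeatures(BaseFeatures):
    # Hypothetical stand-in for a backend that accepts GROUP BY on the PK only.
    allows_group_by_pk = True


def collapse_group_by(features, pk_column, model_columns, group_by):
    """Return the GROUP BY column list, collapsed to the PK when allowed.

    The collapse only happens when the backend sets the flag and the
    GROUP BY list is exactly the full set of the model's columns.
    """
    if features.allows_group_by_pk and set(group_by) == set(model_columns):
        return [pk_column]
    return list(group_by)


columns = ["id", "year", "country", "release_date"]
print(collapse_group_by(MySQLFeatures(), "id", columns, columns))
print(collapse_group_by(BaseFeatures(), "id", columns, columns))
```

With the flag set, only the primary key is returned; without it, the full column list is kept unchanged.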
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:3>
* has_patch: 0 => 1
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:4>
Comment (by hcarvalhoalves):
Thanks. I've tried your patch but it doesn't solve the issue, Django still
adds all fields to `GROUP BY` clause (the SQL is identical to the original
one).
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:5>
Comment (by akaariai):
In PostgreSQL you must include all the primary keys in the query, so if
you have joins that means you will need to add those PKs too. This isn't
true for MySQL, just one pk is enough.
This has a pre-existing ticket, #18016. I am not sure which ticket to
close as duplicate at this point.
Also, the MySQL code isn't currently working, see #17144. I should really
commit the patch in that ticket...
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:6>
Comment (by Harm Geerts <hgeerts@…>):
I've been discussing this on irc with hcarvalhoalves and I hit my head on
#17144 as well.
In addition the patch attached to this ticket does not work because the
sql can be constructed before the connection to the database is
established.
This can be worked around by forcing a connection when a new
DatabaseWrapper instance is created but that's a bit dirty and I don't
think a patch like that would be accepted.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:7>
Comment (by hcarvalhoalves):
Replying to [comment:6 akaariai]:
> In PostgreSQL you must include all the primary keys in the query, so if
> you have joins that means you will need to add those PKs too. This isn't
> true for MySQL, just one pk is enough.
>
> This has a pre-existing ticket, #18016. I am not sure which ticket to
> close as duplicate at this point.
>
> Also, the MySQL code isn't currently working, see #17144. I should
> really commit the patch in that ticket...
You might want to close #18016 as dup to keep the discussion here since
there's an example use case and the output of the query planner so we can
compare.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:8>
Comment (by akaariai):
#18016 was closed as duplicate.
I will commit the #17144 patch soon unless something unexpected is found.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:9>
* cc: hcarvalhoalves@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:10>
Comment (by akaariai):
For the version-dependent check you could probably use a cached_property
(django.utils.functional IIRC). The idea is that when accessed the getter
will make sure version information is available. Once the version
information has been checked, the value can be cached and there is no
longer need to do version information checks. This could mean opening a
connection in the getter.
Another approach is to generate inefficient SQL if the version information
isn't yet available.
I am leaning towards just generating inefficient SQL if the query happens
to be run first thing in a new ConnectionWrapper (not just as first thing
in a connection - first thing for the object). The reason is that
otherwise we have to close the connection used for version checking and
that is somewhat expensive. I am not feeling strongly at all about this...
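The two options above can be sketched roughly as follows. This is an illustration under stated assumptions, not a patch: it uses the standard library's `functools.cached_property` rather than the Django helper, `FakeConnection`, `LazyFeatures`, and `group_by_pk_if_known` are hypothetical names, and the `(9, 1)` threshold is the version guessed at earlier in this ticket.

```python
from functools import cached_property


class FakeConnection:
    """Hypothetical stand-in for a database wrapper; counts version lookups."""

    def __init__(self, server_version):
        self._server_version = server_version
        self.version_lookups = 0
        self.known_version = None  # filled in once a lookup has happened

    def get_server_version(self):
        # In a real backend this could mean opening a connection.
        self.version_lookups += 1
        self.known_version = self._server_version
        return self._server_version


class LazyFeatures:
    def __init__(self, connection):
        self.connection = connection

    @cached_property
    def allows_group_by_pk(self):
        # Option 1: look the version up on first access, then cache the answer.
        return self.connection.get_server_version() >= (9, 1)

    def group_by_pk_if_known(self):
        # Option 2: never force a lookup; fall back to the inefficient SQL
        # (return False) while the version is still unknown.
        version = self.connection.known_version
        return version is not None and version >= (9, 1)


conn = FakeConnection((9, 1))
features = LazyFeatures(conn)
print(features.group_by_pk_if_known())  # version not known yet
print(features.allows_group_by_pk, features.allows_group_by_pk)
print(conn.version_lookups)
```

The second option trades one inefficient query per new wrapper object for never having to open a connection just to check the version.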
The harder problem is that you can't group by the main PK alone. You will
need the primary keys of joined tables too, and that information isn't
directly available in group by clause generation. (Actually, it might be
in select and related_select_cols as "second" part of the tuple, not sure
of this). In any case the MySQL code will not work directly.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:11>
Comment (by hcarvalhoalves):
Replying to [comment:11 akaariai]:
> For the version dependant check you could probably use a cached_property
> (django.utils.functional IIRC). The idea is that when accessed the getter
> will make sure version information is available. Once the version
> information has been checked, the value can be cached and there is no
> longer need to do version information checks. This could mean opening a
> connection in the getter.
>
> Another approach is to generate inefficient SQL if the version
> information isn't yet available.
>
> I am leaning towards just generating inefficient SQL if the query
> happens to be ran first thing in a new ConnectionWrapper (not just as
> first thing in a connection - first thing for the object). The reason is
> that otherwise we have to close the connection used for version checking
> and that is somewhat expensive. I am not feeling strongly at all about
> this...
Since it's a chicken and egg problem, is it really necessary to
"autodetect" the feature then? I couldn't find any other example where
Django enables/disables features based on the database version. What about
just introducing a 'django.db.backends.postgresql9_psycopg2' backend? It's
a trivial patch.
> The harder problem is that you can't group by the main PK alone. You
> will need the primary keys of joined tables too, and that information
> isn't directly available in group by clause generation. (Actually, it
> might be in select and related_select_cols as "second" part of the tuple,
> not sure of this). In any case the MySQL code will not work directly.
That's a bigger problem, and I'm afraid I don't know the ORM code well
enough to figure it out.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:12>
* type: Uncategorized => Cleanup/optimization
* stage: Unreviewed => Design decision needed
Comment:
Adding a new backend might be trivial in code, but not in maintenance,
docs, user support, etc.
Marking as DDN because the problem is valid but there isn't a consensus on
how to fix it.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:13>
* stage: Design decision needed => Accepted
Comment:
Moving back to accepted: we should fix it, at least once we figure out how
:)
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:14>
Comment (by akaariai):
I think the GROUP BY clause could be constructed using the select and
related_select_cols' field information. If a select column's field is
available, then include the field only if field.primary_key is True. If
the field isn't available, include it always in the GROUP BY.
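A minimal sketch of that rule, assuming the select list can be modelled as `(sql, field)` pairs where `field` is `None` when no field information is available. The `Field` class and `group_by_from_select` are hypothetical simplifications, not the ORM's real structures, and the column names are only illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Field:
    """Hypothetical, stripped-down stand-in for a model field."""
    column: str
    primary_key: bool = False


def group_by_from_select(select: List[Tuple[str, Optional[Field]]]) -> List[str]:
    """Build the GROUP BY list from (sql, field) pairs.

    Columns with a known field are kept only when that field is a primary
    key; columns without field information are always kept.
    """
    group_by = []
    for sql, field in select:
        if field is None or field.primary_key:
            group_by.append(sql)
    return group_by


select = [
    ('"movies_movie"."id"', Field("id", primary_key=True)),
    ('"movies_movie"."year"', Field("year")),
    ('"movies_movie"."country"', Field("country")),
    # PK of a joined table: must stay in GROUP BY on PostgreSQL.
    ('"joined_table"."id"', Field("id", primary_key=True)),
    # Raw expression with no field information: always included.
    ('LOWER("movies_movie"."country")', None),
]
print(group_by_from_select(select))
```

Because primary keys of joined tables carry `primary_key=True` as well, they stay in the clause, which covers the PostgreSQL requirement raised earlier in this ticket.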
For the version dependency there are other examples of enabling/disabling
functionality based on version; regex searches on Oracle are at least one
such case. In any case, a separate backend for this is a no-go.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:15>
* cc: dev@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:16>
Comment (by shai):
Replying to [comment:15 akaariai]:
> I think the GROUP BY clause could be constructed using the select and
> related_select_cols' field information. If a select column's field is
> available, then include the field only if field.primary_key is True. If
> the field isn't available, include it always in the GROUP BY.
>
I think #20971 is related. In particular, when building the GROUP BY
clause, also exclude the field if it is deferred.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:17>
* needs_better_patch: 0 => 1
* needs_tests: 0 => 1
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:18>
* cc: charettes (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:19>
* version: 1.4 => master
Comment:
I started to work on a [https://github.com/django/django/pull/4397 PR].
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:20>
* owner: nobody => charettes
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:21>
* needs_better_patch: 1 => 0
* needs_tests: 1 => 0
Comment:
Patch passes the full test suite including the adjusted existing tests for
`allows_group_by_pk`.
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:22>
* stage: Accepted => Ready for checkin
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:23>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"dc27f3ee0c3eb9bb17d6cb764788eeaf73a371d7" dc27f3e]:
{{{
#!CommitTicketReference repository=""
revision="dc27f3ee0c3eb9bb17d6cb764788eeaf73a371d7"
Fixed #19259 -- Added group by selected primary keys support.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19259#comment:24>