[Django] #30130: Django .values().distinct() returns a lot more records than .values().distinct().count()

7 views
Skip to first unread message

Django

unread,
Jan 24, 2019, 3:55:00 PM1/24/19
to django-...@googlegroups.com
#30130: Django .values().distinct() returns a lot more records than
.values().distinct().count()
-------------------------------------+-------------------------------------
Reporter: James Lin | Owner: nobody
Type: Bug | Status: new
Component: Database | Version: 2.1
layer (models, ORM) |
Severity: Normal | Keywords:
Triage Stage: | Has patch: 0
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
I have a table virtualmachineresources, which has 100k+ rows, it has
columns 'machine` and `cluster`, some rows the cluster field is empty. it
has repeating rows of machine + with/without cluster, hence I want to use
the distinct() method.

using .values().distinct().count(), it return 2k rows


{{{
In [6]: VirtualMachineResources.objects.all().values('machine', 'cluster')
...: .distinct().count()
Out[6]: 2247
}}}


When I loop through the distinct query

{{{
for resource in VirtualMachineResources.objects.all().values('machine',
'cluster').distinct():
print(resource['machine'], resource['cluster'])
}}}


I observed it return 100k rows, with repeating rows that the same
'machine` with/without the cluster.

Here is the corresponding stackoverflow question
https://stackoverflow.com/questions/54354462/django-distinct-returns-more-
records-than-count

--
Ticket URL: <https://code.djangoproject.com/ticket/30130>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Jan 24, 2019, 3:56:45 PM1/24/19
to django-...@googlegroups.com
#30130: Django .values().distinct() returns a lot more records than
.values().distinct().count()
-------------------------------------+-------------------------------------
Reporter: James Lin | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 2.1
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Description changed by James Lin:

Old description:

> I have a table virtualmachineresources, which has 100k+ rows, it has
> columns 'machine` and `cluster`, some rows the cluster field is empty. it
> has repeating rows of machine + with/without cluster, hence I want to use
> the distinct() method.
>
> using .values().distinct().count(), it return 2k rows
>

> {{{
> In [6]: VirtualMachineResources.objects.all().values('machine',
> 'cluster')
> ...: .distinct().count()
> Out[6]: 2247
> }}}
>

> When I loop through the distinct query
>
> {{{
> for resource in VirtualMachineResources.objects.all().values('machine',
> 'cluster').distinct():
> print(resource['machine'], resource['cluster'])
> }}}
>

> I observed it return 100k rows, with repeating rows that the same
> 'machine` with/without the cluster.
>
> Here is the corresponding stackoverflow question
> https://stackoverflow.com/questions/54354462/django-distinct-returns-
> more-records-than-count

New description:

I have a table virtualmachineresources, which has 100k+ rows, it has
columns `machine` and `cluster`, some rows the cluster field is empty. it
has repeating rows of machine + with/without cluster, hence I want to use
the distinct() method.

using .values().distinct().count(), it return 2k rows


{{{
In [6]: VirtualMachineResources.objects.all().values('machine', 'cluster')
...: .distinct().count()
Out[6]: 2247
}}}


When I loop through the distinct query

{{{
for resource in VirtualMachineResources.objects.all().values('machine',
'cluster').distinct():
print(resource['machine'], resource['cluster'])
}}}


I observed it return 100k rows, with repeating rows that the same
'machine` with/without the cluster.

Here is the corresponding stackoverflow question
https://stackoverflow.com/questions/54354462/django-distinct-returns-more-
records-than-count

--

--
Ticket URL: <https://code.djangoproject.com/ticket/30130#comment:1>

Django

unread,
Jan 24, 2019, 3:57:18 PM1/24/19
to django-...@googlegroups.com
#30130: Django .values().distinct() returns a lot more records than
.values().distinct().count()
-------------------------------------+-------------------------------------
Reporter: James Lin | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 2.1
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Description changed by James Lin:

Old description:

> I have a table virtualmachineresources, which has 100k+ rows, it has


> columns `machine` and `cluster`, some rows the cluster field is empty. it
> has repeating rows of machine + with/without cluster, hence I want to use
> the distinct() method.
>
> using .values().distinct().count(), it return 2k rows
>

> {{{
> In [6]: VirtualMachineResources.objects.all().values('machine',
> 'cluster')
> ...: .distinct().count()
> Out[6]: 2247
> }}}
>

> When I loop through the distinct query
>
> {{{
> for resource in VirtualMachineResources.objects.all().values('machine',
> 'cluster').distinct():
> print(resource['machine'], resource['cluster'])
> }}}
>

> I observed it return 100k rows, with repeating rows that the same
> 'machine` with/without the cluster.
>
> Here is the corresponding stackoverflow question
> https://stackoverflow.com/questions/54354462/django-distinct-returns-
> more-records-than-count

New description:

I have a table virtualmachineresources, which has 100k+ rows, it has
columns `machine` and `cluster`, some rows the cluster field is empty. it
has repeating rows of machine + with/without cluster, hence I want to use
the distinct() method.

using .values().distinct().count(), it returned 2k rows


{{{
In [6]: VirtualMachineResources.objects.all().values('machine', 'cluster')
...: .distinct().count()
Out[6]: 2247
}}}


When I loop through the distinct query

{{{
for resource in VirtualMachineResources.objects.all().values('machine',
'cluster').distinct():
print(resource['machine'], resource['cluster'])
}}}


I observed it returned 100k rows, with repeating rows that the same
'machine` with/without the cluster.

Here is the corresponding stackoverflow question
https://stackoverflow.com/questions/54354462/django-distinct-returns-more-
records-than-count

--

--
Ticket URL: <https://code.djangoproject.com/ticket/30130#comment:2>

Django

unread,
Jan 24, 2019, 5:01:19 PM1/24/19
to django-...@googlegroups.com
#30130: Django .values().distinct() returns a lot more records than
.values().distinct().count()
-------------------------------------+-------------------------------------
Reporter: James Lin | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 2.1
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

Do you happen to have a `VirtualMachineResources.Meta.ordering` defined?

--
Ticket URL: <https://code.djangoproject.com/ticket/30130#comment:3>

Django

unread,
Jan 24, 2019, 7:40:45 PM1/24/19
to django-...@googlegroups.com
#30130: Django .values().distinct() returns a lot more records than
.values().distinct().count()
-------------------------------------+-------------------------------------
Reporter: James Lin | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 2.1
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

I'm pretty sure that's what's at play here as I managed to reproduce
locally. You have a defined `Meta.ordering` (or are performing an
`order_by`) which makes your non-count query something along the lines of

{{{#!sql
SELECT DISTINCT "machine", "cluster", "ordering" FROM "resources" ORDER BY
"ordering" ASC
}}}

The `ordering` column is added to the `SELECT` clause to prevent the
generation of invalid SQL.

{{{#!sql
SELECT DISTINCT "machine", "cluster" FROM "resources" ORDER BY "ordering"
ASC
> ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select
list
}}}

The `.count()` query happens to be the following

{{{#!sql
SELECT COUNT(*) FROM (SELECT DISTINCT "ticket_30130_foo"."bat" AS Col1,
"ticket_30130_foo"."baz" AS Col2 FROM "ticket_30130_foo") subquery
}}}

It looks like the issue is that
[https://github.com/django/django/blob/7e978fdc4228eb44cf97cb4243adc7b7bfd586c7/django/db/models/sql/query.py#L439-L443
we clear the ordering when we shouldn't] on the `.count()`. I think the
logic needs to be adjusted to branch of `not self.distinct` instead of
`self.distinct_fields`.

--
Ticket URL: <https://code.djangoproject.com/ticket/30130#comment:4>

Django

unread,
Jan 24, 2019, 8:01:34 PM1/24/19
to django-...@googlegroups.com
#30130: Django .values().distinct() returns a lot more records than
.values().distinct().count()
-------------------------------------+-------------------------------------
Reporter: James Lin | Owner: nobody
Type: Bug | Status: closed

Component: Database layer | Version: 2.1
(models, ORM) |
Severity: Normal | Resolution: duplicate
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Simon Charette):

* status: new => closed
* resolution: => duplicate


Comment:

There doesn't seem to be a consensus on how to deal with these type of
queries right now so I'm just going to close as a duplicate of #14357,
#16025, #28560.

--
Ticket URL: <https://code.djangoproject.com/ticket/30130#comment:5>

Reply all
Reply to author
Forward
0 new messages