Implement QuerySet.__contains__?


Johan Schiff

unread,
Jun 2, 2020, 4:57:31 AM6/2/20
to Django developers (Contributions to Django itself)
Is there a specific reason Django's QuerySet does not implement __contains__? It doesn't seem very complicated to me, but maybe I'm missing something.

When checking if an object is in a queryset I always use code like this:
if queryset.filter(pk=obj.pk).exists():

The pythonic way would be:
if obj in queryset:

The way I understand it this latter version will fetch all objects from queryset, while the first one is a much leaner db query.

So, is there a reason QuerySet doesn't have a __contains__ method, or would it be a good addition? My guess is that a lot of people use the "obj in queryset" syntax, believing it does "the right thing".

//Johan

Aymeric Augustin

unread,
Jun 2, 2020, 5:14:11 AM6/2/20
to django-d...@googlegroups.com
Hello Johan,

You explained the upside. There's one downside to be aware of. If you already fetched the queryset, `if obj in queryset` will make a new database query, which will be slower than walking through the already fetched data (unless the queryset is really large and the database really close in terms of network latency). Making this change may throw off some assertNumQueries tests. However, people who consider this a regression can do `obj in list(queryset)` or perhaps `obj in iter(queryset)` to reuse the cached list.

Judgment call: I think the upside is worth the downside and backwards-incompatibility, so I'm in favor of the change.

Best regards,

-- 
Aymeric.



--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/88c83b8d-658c-47cc-9162-fcfebebe9c4a%40googlegroups.com.

Adam Johnson

unread,
Jun 2, 2020, 5:28:34 AM6/2/20
to django-d...@googlegroups.com
If you already fetched the queryset, `if obj in queryset` will make a new database query

I don't see why we couldn't implement __contains__ to do a walk through _result_cache if it has been fetched?
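For concreteness, here's a rough sketch of that idea on a toy stand-in (plain Python, not Django's actual QuerySet; only the _result_cache name mirrors Django's internals, and the "database" here is just a list):

```python
class FakeQuerySet:
    """Toy stand-in for a QuerySet, to illustrate the cache-aware idea."""

    def __init__(self, rows):
        self._rows = list(rows)      # pretend database table
        self._result_cache = None    # None means "not fetched yet"
        self.num_queries = 0         # count simulated database hits

    def _fetch_all(self):
        if self._result_cache is None:
            self.num_queries += 1
            self._result_cache = list(self._rows)

    def __contains__(self, obj):
        if self._result_cache is not None:
            # Already fetched: walk the cached rows, no new query.
            return obj in self._result_cache
        # Not fetched yet: simulate a targeted exists()-style query.
        self.num_queries += 1
        return obj in self._rows


qs = FakeQuerySet(range(5))
assert (3 in qs) and qs.num_queries == 1   # targeted query, cache untouched
qs._fetch_all()                            # second query fills the cache
assert (3 in qs) and qs.num_queries == 2   # served from cache, no new query
```

With this behaviour, `obj in qs` costs one targeted query before the queryset is fetched and zero extra queries afterwards.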



--
Adam

Florian Apolloner

unread,
Jun 2, 2020, 7:30:03 AM6/2/20
to Django developers (Contributions to Django itself)
On Tuesday, June 2, 2020 at 11:28:34 AM UTC+2, Adam Johnson wrote:
If you already fetched the queryset, `if obj in queryset` will make a new database query

I don't see why we couldn't implement __contains__ to do a walk through _result_cache if it has been fetched?

We are doing the same for len(), IIRC. I have to say it sometimes makes it a little bit hard to debug, but if it is documented as such it should be okay.

Aymeric Augustin

unread,
Jun 2, 2020, 7:43:25 AM6/2/20
to django-d...@googlegroups.com
These are good points.

As far as I can tell, the current implementation of __len__ is to fetch the whole queryset:

    def __len__(self):
        self._fetch_all()
        return len(self._result_cache)

I'm almost sure doing an automatic .count() was discussed in the past. I don't remember why we rejected it — perhaps because in most cases, when you need the length of a queryset, you will also iterate it?

The suggestion for __contains__ is similar: doing an automatic .filter(pk=obj.pk).exists(). We could do it only if the queryset isn't fetched yet, but I'm afraid this may be too magical. Is it reasonable to have `total = len(queryset); found = obj in queryset` make one SQL query while `found = obj in queryset; total = len(queryset)` makes two queries?

Now I'm worried that we may be making inconsistent decisions between __len__ and __contains__... Let's make sure we're consistent across dunder methods.
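To make that ordering asymmetry concrete, here is a toy model (plain Python, not Django) that just counts simulated queries under the proposed cache-aware behaviour:

```python
class ToyQuerySet:
    """Toy model of the proposed cache-aware behaviour; counts queries."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._result_cache = None   # None means "not fetched yet"
        self.queries = 0

    def _fetch_all(self):
        if self._result_cache is None:
            self.queries += 1
            self._result_cache = list(self._rows)

    def __len__(self):
        self._fetch_all()
        return len(self._result_cache)

    def __contains__(self, obj):
        if self._result_cache is not None:
            return obj in self._result_cache
        self.queries += 1           # simulated .exists() query
        return obj in self._rows


qs1 = ToyQuerySet(range(10))
total = len(qs1); found = 3 in qs1   # len() fetches, __contains__ hits cache
assert qs1.queries == 1

qs2 = ToyQuerySet(range(10))
found = 3 in qs2; total = len(qs2)   # __contains__ queries, then len() fetches
assert qs2.queries == 2
```

Whether that one-query-vs-two-queries difference is acceptable magic is exactly the judgment call in question.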

-- 
Aymeric


Tim Graham

unread,
Jun 2, 2020, 8:53:07 AM6/2/20
to Django developers (Contributions to Django itself)
It may help to know that QuerySet.__contains__() was implemented until Django 1.6 when chunked reads were removed from QuerySet iteration:

Roger Gammans

unread,
Jun 2, 2020, 9:20:56 AM6/2/20
to django-d...@googlegroups.com
Hi,

Here are some performance numbers against a local SQLite in case any one is interested. GL has about 40,000 records for reference.
324 was just a randomly chosen index for the membership tests.


>>> sys.version
'3.7.2rc1 (default, Dec 12 2018, 06:25:49) \n[GCC 8.2.0]'
>>> django.__version__
'3.0.6'
>>> import myapp.models as m
>>> x = m.GL.objects.all()[324]
>>> x in m.GL.objects.all()
True

>>> import timeit
>>> timeit.timeit('m.GL.objects.filter(pk=x.pk)', setup='import myapp.models as m;x = m.GL.objects.all()[324]', number=100)
0.05818330496549606
>>> timeit.timeit('x in qs', setup='import myapp.models as m;qs = m.GL.objects.all(); x=qs[324]', number=100)
1.5688817161135375

Tim Graham

unread,
Jun 2, 2020, 1:42:17 PM6/2/20
to Django developers (Contributions to Django itself)
And here's some past discussion:
https://code.djangoproject.com/ticket/24141 - contains() method for QuerySets (closed as needsinfo due to no mailing list discussion to find a consensus)
https://github.com/django/django/pull/3906 - Efficient QuerySet.__contains__ (closed as wontfix due to the behavior change)

Javier Buzzi

unread,
Jun 2, 2020, 2:31:47 PM6/2/20
to Django developers (Contributions to Django itself)
My 2 cents, I think @johan's suggestion makes sense.

if obj in queryset:

It's very pythonic. It should do what __len__ does and cache it; if you want the single quick db query you can always use exists().


ps @roger


>>> timeit.timeit('m.GL.objects.filter(pk=x.pk)', setup='import myapp.models as m;x = m.GL.objects.all()[324]', number=100)
0.05818330496549606
is not doing anything, add a `.exists()` or `len(..)` or something to evaluate the queryset. That number is WAY too low.

Aymeric Augustin

unread,
Jun 2, 2020, 2:58:53 PM6/2/20
to django-d...@googlegroups.com
On 2 Jun 2020, at 19:42, Tim Graham <timog...@gmail.com> wrote:

And here's some past discussion:
https://code.djangoproject.com/ticket/24141 - contains() method for QuerySets (closed as needsinfo due to no mailing list discussion to find a consensus)
https://github.com/django/django/pull/3906 - Efficient QuerySet.__contains__ (closed as wontfix due to the behavior change)

Thanks. Your memory is amazing :-)

Since the issue keeps coming back, I think we should do one of the following:

1. make simultaneous and consistent changes to __bool__, __len__, and __contains__ to eliminate the "fetch the whole table" behavior when calling `if qs`, `len(qs)` and `if obj in qs` — and check if any other special methods require similar treatment
2. failing that, add .contains(), as suggested in https://code.djangoproject.com/ticket/24141

We're talking about a trade-off between preserving optimisations in existing code bases and expertise of advanced users versus doing the right thing by default for less experienced users. I think we're increasingly willing to consider changes that make Django safer to use, so perhaps what Carl said in 2015 could be open for discussion. If I forget what I know about querysets for a minute, the Principle of Least Astonishment says we should go for option 1.

if qs:  # test if qs isn't empty
if list(qs):  # fetch and cache qs, then test if it isn't empty
len(qs): # count the number of objects in qs
len(list(qs)):  # fetch and cache qs, then count the number of objects
if obj in qs:  # test if qs contains obj
if obj in list(qs):  # fetch and cache qs, then test if qs contains obj

objects = list(qs) is a well known pattern when you need to trigger the database query, perhaps because you'll do further manipulations in Python.

In https://code.djangoproject.com/ticket/18702#comment:8 Anssi says he was in favor of committing the optimized version (my option 1). Just before finalizing his patch, he changed his mind because:

- "This approach optimizes use cases that are rare": well, since they keep coming up, in fact they aren't so rare
- "Special casing some uses makes the code complex": sure, but the whole point of Django is to include _some_ complexity so that users have less to worry about.

I'm interested in what others think :-)

-- 
Aymeric.




Roger Gammans

unread,
Jun 2, 2020, 5:15:46 PM6/2/20
to django-d...@googlegroups.com
On Tue, 2020-06-02 at 11:31 -0700, Javier Buzzi wrote:


ps @roger

>>> timeit.timeit('m.GL.objects.filter(pk=x.pk)', setup='import myapp.models as m;x = m.GL.objects.all()[324]', number=100)
0.05818330496549606
is not doing anything, add a `.exists()` or `len(..)` or something to evaluate the queryset. That number is WAY too low.

Yes, of course. That makes me feel silly on both counts - the evaluations and the number of iterations. I made the number of iterations low because I started with a test where a single iteration was taking multiple seconds, and in that case the default number was way too high for the quick test I was offering. More critically, `x in qs` depends on where in the queryset the object is, and the size of the queryset doesn't matter if you only look at the beginning.

I reran it, and it shows different results; see the re-run below. The construction here is more complicated in order to show average microseconds per iteration. Note the varying iteration counts; actual runtime was aimed to be between 20 seconds and a minute or so.

>>> # Searching for first in queryset
>>> n=10_000_000 ;timeit.timeit('x in qs', setup='import myapp.models as m;qs = m.GL.objects.all(); qs2=qs;i=len(qs2);x=qs[0]', number=n)/(n/1_000_000)
0.4136358265765011
>>> n = 100_000 ; timeit.timeit('m.GL.objects.filter(pk=x.pk)', setup='import myapp.models as m;x = list(m.GL.objects.all())[0]',  number=n)/(n/1_000_000)
124.99445254914463
>>> # Searching for last in queryset.
>>> n=1000; timeit.timeit('x in qs', setup='import myapp.models as m;qs = m.GL.objects.all(); qs2=qs;i=len(qs2);x=qs[i-1]', number=n)/(n/1_000_000)
50741.098549216986
>>> n = 100_000 ; timeit.timeit('m.GL.objects.filter(pk=x.pk)', setup='import myapp.models as m;x = list(m.GL.objects.all())[-1]',  number=n)/(n/1_000_000)
118.99649207945913



-- 
Roger Gammans <rgam...@gammascience.co.uk>

Johan Schiff

unread,
Jun 3, 2020, 4:52:44 AM6/3/20
to django-d...@googlegroups.com
Thanks for great info.

First, I'm leaning towards Aymeric's proposition here. I do recognize that there is a lot to consider.

This seems to be important:
  1. Developers must be able to explicitly choose methods to optimize for their environment. (Considering database latency, dataset size, prefetch preferred, etc.)
  2. Behaviour should preferably be consistent across methods. (len(qs), bool(qs), obj in qs)
  3. Syntax should be pythonic.
I think what Aymeric is proposing would fulfill those demands.

I have done some timeit tests on a model with 100 000+ records, based on Roger's work, on a single-host setup using PostgreSQL. My finding is that a separate database query is generally faster than the current obj in qs behaviour, even on a prefetched queryset (unless your dataset is really small). This tells me that we should prefer the .exists() query unless the developer is explicit (i.e. obj in iter(qs)), even when we have prefetched data. For len(qs) and bool(qs) that should not be the case.

It would be interesting to get timings from other setups, so I'm including the code I used. (Also, am I doing this correctly?)

import timeit

def time_fn(title, stmt, number=10000, prefetch=False):
    setup = [
        'from tickets.models import Ticket',
        'qs=Ticket.objects.all()',
        'obj_first = qs.first()',
        'obj_mid = qs[50000]',
        'obj_last = qs.last()',
    ]
    if prefetch:
        setup.append('list(qs)')
    result_time = timeit.timeit(stmt, setup='\n'.join(setup), number=number)
    print(f'{title}: {result_time}')

time_fn('Database fetch', 'list(qs)')
time_fn('Prefetched obj_first in qs', 'obj_first in qs', prefetch=True)
time_fn('Prefetched obj_mid in qs', 'obj_mid in qs', prefetch=True)
time_fn('Prefetched obj_last in qs', 'obj_last in qs', prefetch=True)
time_fn('Database obj_first exists()', 'qs.filter(pk=obj_first.pk).exists()')
time_fn('Database obj_mid exists()', 'qs.filter(pk=obj_mid.pk).exists()')
time_fn('Database obj_last exists()', 'qs.filter(pk=obj_last.pk).exists()')

My results:
Database fetch: 25.667968138001015
Prefetched obj_first in qs: 0.027538340998944477
Prefetched obj_mid in qs: 1051.1691511649988
Prefetched obj_last in qs: 2369.5217889660016
Database obj_first exists(): 6.442390248001175
Database obj_mid exists(): 6.213641551999899
Database obj_last exists(): 5.831252460000542



--
Vänligen, Johan Schiff
Radkompaniet AB
072-229 61 19

Johan Schiff

unread,
Jun 3, 2020, 3:46:49 PM6/3/20
to django-d...@googlegroups.com
To answer my own question: No, I wasn't doing it correctly. I should have done a sanity check before posting.

New timeit code and results at bottom, now using a smaller dataset (~900 objects).

Notable:
  • An .exists() query takes about 1/100 the time of a full fetch in this case. This difference would of course be much bigger if I ran it on the 100 000+ objects.
  • If already prefetched, it should be worth it to search prefetched data if the dataset is <700 objects. It seems rare to prefetch that many objects. If so, a developer could easily be explicit and use .exists() method.
Based on this exact scenario: 100 times slower for using obj in queryset is quite a performance hit. On the other hand, an extra 1 % query time for a queryset that is later evaluated is marginal.

If a developer knows they want to evaluate the queryset later, it would be nice to have an .evaluate() that does self._fetch_all() and returns self. I think that's preferable to the list(queryset) method, because we don't need to create an extra list.
# Old explicit fetch method
queryset = Stuff.objects.all()
if obj in list(queryset):
    pass

# My suggestion
queryset = Stuff.objects.all().evaluate()
if obj in queryset:
    pass
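For what it's worth, such an evaluate() could be tiny. A sketch on a toy class (hypothetical: evaluate() is the name suggested here, not an existing Django API, and MiniQuerySet is just a stand-in):

```python
class MiniQuerySet:
    """Minimal stand-in to sketch the proposed evaluate() helper."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._result_cache = None

    def _fetch_all(self):
        if self._result_cache is None:
            self._result_cache = list(self._rows)

    def evaluate(self):
        # Fetch and cache the rows, then return self so it can be chained.
        self._fetch_all()
        return self

    def __contains__(self, obj):
        if self._result_cache is not None:
            return obj in self._result_cache   # cache walk
        return obj in self._rows               # would be exists() in Django


queryset = MiniQuerySet([1, 2, 3]).evaluate()
assert queryset._result_cache == [1, 2, 3]   # already fetched
assert 2 in queryset                          # served from the cache
```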


Updated timeit function and results:

import timeit

def time_fn(title, stmt, number=10_000):
    setup = [
        'from movies.models import Movie',
        'qs=Movie.objects.all()',
        'obj_first = qs.first()',
        'obj_mid = qs[300]',
        'obj_last = qs.last()',
        'list(qs)',
    ]

    result_time = timeit.timeit(stmt, setup='\n'.join(setup), number=number)
    print(f'{title}: {result_time}')

time_fn('Database fetch', 'qs._result_cache=None\nlist(qs)')
time_fn('Prefetched obj_first in qs', 'obj_first in qs')
time_fn('Prefetched obj_mid in qs', 'obj_mid in qs')
time_fn('Prefetched obj_last in qs', 'obj_last in qs')

time_fn('Database obj_first exists()', 'qs.filter(pk=obj_first.pk).exists()')
time_fn('Database obj_mid exists()', 'qs.filter(pk=obj_mid.pk).exists()')
time_fn('Database obj_last exists()', 'qs.filter(pk=obj_last.pk).exists()')

# Results
Database fetch: 616.097227364
Prefetched obj_first in qs: 0.030961711003328674
Prefetched obj_mid in qs: 6.6988333979970776
Prefetched obj_last in qs: 24.189914419999695
Database obj_first exists(): 6.468764332996216
Database obj_mid exists(): 6.167532913001196
Database obj_last exists(): 6.0190791100030765

Shai Berger

unread,
Jun 5, 2020, 2:04:26 AM6/5/20
to django-d...@googlegroups.com
On Tue, 2 Jun 2020 20:57:54 +0200
Aymeric Augustin <aymeric....@polytechnique.org> wrote:

>
> We're talking about a trade-off between preserving optimisations in
> existing code bases and expertise of advanced users versus doing the
> right thing by default for less experienced users.

I disagree.

The suggestion is to make

if obj in qs:
# do things which don't fetch qs

more performant, at the expense of

if obj in qs:
# do things that fetch qs

As well as

for obj in container:
if obj in qs:
# whatever

So it is not a net optimization even on its own terms.

Also, changing the performance characteristics of thousands, if not
millions of existing lines of code, and making a decade of online
comments and advice wrong (not just obsolete "there's better ways now"
but wrong "this points you away from truth and good"), is IMO a
non-starter.

Strong -1 on Aymeric's option 1. Strong +1 on adding ".contains()".

My 2 cents,
Shai.

Adam Johnson

unread,
Jun 5, 2020, 4:05:12 AM6/5/20
to django-d...@googlegroups.com
I'm with Shai, against changing bool() and len() behaviour. I think the "fetch the whole queryset" behaviour is normally helpful for beginners, who write things like:

if qs:
   for x in qs:
       ...

# really common for anyone new to Python:
for i in range(len(qs)):
    do_something(qs[i])

We have long highlighted this behaviour on the 'database access optimization' page: https://docs.djangoproject.com/en/3.0/topics/db/optimization/#use-queryset-count . There are alternatives if you want different behaviour: exists() and count().

+1 to .contains() since it would follow that same pattern.



--
Adam

אורי

unread,
Jun 5, 2020, 5:42:21 AM6/5/20
to Django developers (Contributions to Django itself)
Hi,

I'm thinking, maybe instead of:

if obj in queryset:

Users should write:

if obj in list(queryset):

Which I understand already works? Why should the first line (without list()) work anyway?

And if users want performance, why not write:

if queryset.filter(pk=obj.pk).exists():



Johan Schiff

unread,
Jun 5, 2020, 7:59:17 AM6/5/20
to django-d...@googlegroups.com
Shai does make a good point about changing a well documented behaviour. That argument wins me over.

Adding .contains() and updating the documentation goes a long way to make it easier for developers.

Best case would be that Django does not make assumptions about what the developer wants, but to implement __len__(), __bool__() and __contains__() we have to assume one method or the other. I still think the wrong call was made here, but changing it now is too much pain. (Assuming the queryset should be evaluated is probably correct in most cases, but sometimes adds a big performance hit when the code goes into production and the dataset grows - code that looks reasonable and works like a charm in dev will cause trouble in production. Assuming the opposite would have added a tiny performance hit in most cases, but avoided the big one.)

Since I'm new here:
If people agree QuerySet.contains() should be added, how do we move forward?
I'd be willing to write some code with tests and add a ticket, if that's helpful.

On Fri, Jun 5, 2020 at 11:42, אורי <u...@speedy.net> wrote:

Aymeric Augustin

unread,
Jun 5, 2020, 9:06:48 AM6/5/20
to django-d...@googlegroups.com
If people agree QuerySet.contains() should be added, how do we move forward?

Yes, if there's no major new argument in a couple days, we can assume consensus on adding .contains().

* Create a ticket in Trac. Point to this discussion and to the previous ticket where .contains() was suggested.
* Submit a pull request in GitHub with code, tests, docs. This is worth mentioning in the changelog.

There are a lot of details in the docs, in the Contributing Guide, about how to do all this right.

-- 
Aymeric.



Tim Graham

unread,
Jun 5, 2020, 1:04:51 PM6/5/20
to Django developers (Contributions to Django itself)
(You can reopen https://code.djangoproject.com/ticket/24141 rather than creating a new ticket.)

Shai Berger

unread,
Jun 5, 2020, 3:43:20 PM6/5/20
to django-d...@googlegroups.com
On Fri, 5 Jun 2020 13:58:47 +0200
Johan Schiff <jo...@radkompaniet.se> wrote:

> I still think the wrong call was made here [...]
> *(Assuming queryset should be evaluated is probably correct in
> most cases, but sometimes adds a big performance hit when the code
> goes into production and the dataset grows - code that looks
> reasonable and works like a charm in dev will cause trouble in
> production. Assuming the opposite would have added a tiny performance
> hit in most cases, but avoided the big one.)*
>

FWIW, when that call was made, the default behavior was to fetch only
100 records at a time, so the performance hit for the cases you
describe was quite limited. This was changed later, as Tim mentioned in
this thread: