Should ModelChoiceIterator cache its QuerySet?

159 views
Skip to first unread message

François Freitag

unread,
Apr 2, 2018, 1:36:20 PM4/2/18
to django-d...@googlegroups.com
Hi,

I'm having mixed feelings toward ModelChoiceIterator's implementation.
That custom iterator intends to fetch choices defined by the queryset
argument of a ChoiceField.

Issue
=====

Currently, the expectation seems to be that the latest data is queried
from the database each time the choices attribute of a ChoiceField
instance is evaluated. That's best described by
ModelFormBasicTests.test_runtime_choicefield_populated [1]. Besides,
#28312 reports reusing [...] the cached QuerySet in ModelChoiceIterator
as a bug [2].

My expectation is that once evaluated, the QuerySet stays cached for the
ModelChoiceIterator instance's lifetime (i.e. in
test_runtime_choicefield_populated the w_bernstein entry would not be
visible in the second f.as_ul() output). Unaware of #28312, I created
#29159 to avoid making a duplicate query when calling list() on a
ModelChoiceIterator's choices. The fix for #29159 [3] makes a step
toward caching the QuerySet results on the ModelChoiceIterator instance,
which goes against the existing expectation.

Questions
=========

Before to break the existing assumption further, I would like to gather
opinions on the following two options.

1. Should a new patch be submitted for #29159 [4] to use `.count()` for
`__len__`, thus reverting back to the original behaviour?

2. Should ModelChoiceIterator be updated to cache its QuerySet? (reverts
results of #23623 [5])
2a. Removing test_runtime_choicefield_populated
2b. Marking #28312 as fixed (because the len() and list() content would
be consistent)

My opinion (option 2)
=====================

AFAICT, reusing the ModelChoiceIterator queryset results, instead of
issuing a new query whenever the ModelChoiceIterator instance is
evaluated, is the only way to guarantee results consistency across calls
to the choices attribute of a ChoiceField instance.
If the QuerySet results were to change, there could be mismatches, for
example between the choices count and the actual number of choices
offered to the user. That would lead to subtle bugs under concurrent
activity, where a new entry is inserted between the counting and the
rendering.

However, this option means breaking the current assumption that choices
will be as "fresh" as possible. For instance, a choice created after the
ModelChoiceIterator instance was evaluated the first time would not be
available for selection, making the following pattern incorrect (untested):

```
f = MyChoiceField(Category.objects.all())
if 'form' not in {val for val, label in f.choices}: # Evaluate choices
Category.objects.create('Django')
f.as_ul() # Won't display Django, because the first choices evaluation
# created a cached queryset.
```

IMHO, the freshness isn't so much of an issue for most use cases,
because the ModelChoiceIterator will likely be consumed during the
template rendering, so the choices should be just as fresh.

This option also means consuming more memory, because Python objects
would stay cached on the QuerySet (we currently use .iterator() to load
the choices when possible). It don't think that's too bad, since
ModelChoiceIterator instances evaluation is lazy, thus the lifespan of
the choices cache should be short.

I don't expect that loading even thousands of choices would lead to a
major memory overhead. SQLite and MariaDB/MySQL adapters will store
database results in memory anyway, so the overhead would be on storing
Python instances versus storing text for these backends.
If memory really is a concern, I think significant savings can be
achieved using `.defer()` to reduce the size of the Python objects or
filtering/limiting the choices QuerySet. Another option is to use a
custom ModelChoiceIterator subclass to fetch the results using
`.iterator()`.

I would prefer Django to have consistent results by default, even though
that means more memory usage. What do you think?

Kind regards,
François Freitag

[1]
https://github.com/django/django/blob/73cb62a33197652a3c8261dbf052d7eb75e26139/tests/model_forms/tests.py#L1483-L1538
[2] https://code.djangoproject.com/ticket/28312
[3] https://code.djangoproject.com/ticket/29159
[4]
https://github.com/django/django/commit/a2e97abd8149e78071806a52282a24c27fe8236b
[5] https://code.djangoproject.com/ticket/23623

Carlton Gibson

unread,
Apr 10, 2018, 3:22:41 PM4/10/18
to Django developers (Contributions to Django itself)
Hi François, 

Thank you for a good write-up of the issue. 

Having looked at this from each direction I can't see a way of squaring all the desires here.

I see the possibility of an inconsistency between the __len__() and __iter__() calls. This would certainly be nice to avoid but I'm not sure it is the overriding concern here. 

I can easily see the memory issue being more significant than you're giving it credit for. 

The original issue (#23623) cited a ≈25% reduction in peak memory usage moving to the iterator. Choices are used in validation and not just template rendering. If not reset, the queryset is retained for the lifetime of the form instance, not just the ModelChoiceIterator iteration. With a large formset, with forms using ModelChoiceField on foreign keys with many thousands of records, the retained queryset would likely add up quickly. 

As such, I don't think we can just switch the default implementation. (I think we'd quickly see a lot of complaints.)

I **do see** the appeal of re-using the queryset for a whole lot of cases. Many times I would trade a bit more memory for the lifetime of a form instance for one less query. (Although, just personally, I'm not sure how often `if field.choices` actually comes up, for me.)

So, conclusions (with _I think we should…_ etc elided):

* Close your PR as is. https://github.com/django/django/pull/9867 — We can't just change the behaviour. 
* Look at Tim's PoC PR. https://github.com/django/django/pull/8649 — Can we implement __len__() to not _fetch_all()? Can we implement the custom __bool__ to avoid `count()`? Can we do that without an extra DB query? 
* If we **can't** avoid the extra query then we have to decide whether it's an acceptable price? 
    * wontfix: If it's not, the existing behaviour is _OK_: it's not incorrect for __len__() to _fetch_all(), and your previous fix makes sure we don't waste that effort if it has been done. 
    * Fix: I'd guess the extra query probably is OK, especially with an optimised __bool__ just checking existence. It's the price of consistency here. 
* Can we then document, and maybe provide, an alternate ModelChoiceIterator that just re-uses the queryset? 
    * Advantages of this are consistency, fewer DB hits etc. 
    * Cost is the higher memory use.
    
That's my take on the issue thus far. 

Kind Regards,

Carlton

Reply all
Reply to author
Forward
0 new messages