Changing deferred model attribute behavior

Adrian Holovaty

Apr 25, 2013, 1:06:06 PM
to django-d...@googlegroups.com
At the moment, if you call defer() or only() on a QuerySet, then access the deferred fields individually, *each* access of a deferred field will result in a separate query.

For example, assuming a User model with username/bio/location fields, this is what currently happens:

"""
>>> u = User.objects.only('username').get(id=1)
# Results in: SELECT id, username FROM users WHERE id=1

>>> u.bio
# Results in: SELECT bio FROM users WHERE id=1

>>> u.location
# Results in: SELECT location FROM users WHERE id=1
"""

I'd like there to be a way of retrieving *all* deferred fields the first time a deferred field is accessed. So with the above example, this would happen instead:

"""
>>> u = User.objects.only('username').get(id=1)
# Results in: SELECT id, username FROM users WHERE id=1

>>> u.bio
# Results in: SELECT bio, location FROM users WHERE id=1

>>> u.location
# No database query
"""

Here's my use case. In my app, I'm storing frequently accessed user attributes (id, username, etc.) in signed cookies, to prevent an unnecessary database hit on every page view. I create a User object with those attributes, so that I can take advantage of various methods on the User class, but I want the other User fields to be lazily loaded. I have this all working except for the last piece, which is to change lazy loading such that it loads *everything* else the first time a deferred field is accessed.

So that's the "what" and "why" -- here's the how...

The current implementation is in django/db/models/query_utils.py -- see the DeferredAttribute class. When you call only() or defer() on a QuerySet, the resulting model instance will have a DeferredAttribute instance for each deferred field. The problem is that DeferredAttributes only load their own column's data, not the data for all *other* DeferredAttributes on the given model instance. Hence, the individual SQL queries for each column.

To solve this, we would need to change DeferredAttribute to find all *other* DeferredAttributes on the given model and load them in a single query somehow.
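
To make the idea concrete, here is a minimal, self-contained sketch of a descriptor that looks up its sibling deferred attributes on first access and loads them together. It is not Django's actual DeferredAttribute: FAKE_DB, QUERY_LOG, fetch_columns and the load_all flag are hypothetical names used only for illustration, and the "query" is simulated with a dictionary.

```python
# Hypothetical stand-in for the users table; this is not Django code.
FAKE_DB = {
    1: {"id": 1, "username": "adrian", "bio": "Co-creator", "location": "Chicago"},
}
QUERY_LOG = []  # records the columns fetched by each simulated query


def fetch_columns(pk, columns):
    """Simulate SELECT <columns> FROM users WHERE id = <pk>."""
    QUERY_LOG.append(tuple(columns))
    return {name: FAKE_DB[pk][name] for name in columns}


class DeferredAttribute:
    """Non-data descriptor: once a value is in __dict__, it is bypassed."""

    def __init__(self, name, load_all=False):
        self.name = name
        self.load_all = load_all  # hypothetical opt-in flag

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        if self.load_all:
            # Find every deferred attribute on the class that is still unloaded.
            wanted = sorted(
                name
                for name, attr in vars(type(instance)).items()
                if isinstance(attr, DeferredAttribute)
                and name not in instance.__dict__
            )
        else:
            wanted = [self.name]
        instance.__dict__.update(fetch_columns(instance.id, wanted))
        return instance.__dict__[self.name]


class User:
    bio = DeferredAttribute("bio", load_all=True)
    location = DeferredAttribute("location", load_all=True)

    def __init__(self, id, username):
        self.id = id
        self.username = username


u = User(1, "adrian")
u.bio       # one simulated query that fetches both bio and location
u.location  # served from u.__dict__, no further query
```

With load_all=False each access would log its own single-column query, which mirrors the current behavior described above.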

Also, I should mention that this should be *optional* behavior, as the current behavior is reasonable for the common case. The API for specifying this "load everything" behavior is a separate discussion. Perhaps a keyword argument like: User.objects.only('username', loadall=True).

Thoughts?

Adrian

Alex Gaynor

Apr 25, 2013, 1:08:43 PM
to django-d...@googlegroups.com
This sounds like a reasonable request; I don't yet have an opinion on the API or anything. One tiny thing I'd like to note, though: "change DeferredAttribute to find all *other* DeferredAttributes". I don't think `finding` is the right way to think about it. A `DeferredAttribute` with loadall semantics should know about the specific set of objects that came from the queryset.

Alex


--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero
GPG Key fingerprint: 125F 5C67 DFE9 4084

Florian Apolloner

Apr 25, 2013, 2:10:40 PM
to django-d...@googlegroups.com
On Thursday, April 25, 2013 7:06:06 PM UTC+2, Adrian Holovaty wrote:
> Also, I should mention that this should be *optional* behavior, as the current behavior is reasonable for the common case. The API for specifying this "load everything" behavior is a separate discussion. Perhaps a keyword argument like: User.objects.only('username', loadall=True).

I could imagine a Meta attribute which introduces so-called "deferred groups" like SA has (http://docs.sqlalchemy.org/en/latest/orm/mapper_config.html#deferred-column-loading): accessing one column of a group will load all columns of the group. Not sure if we want that level of control, but I thought it would be worth looking at what SA can do in this regard.

Regards,
Florian

Alex Ogier

Apr 25, 2013, 4:44:25 PM
to django-d...@googlegroups.com
I think groups are a very good abstraction for this problem. The two common cases are probably "Load this column alone, because it's potentially a big honking blob of text or binary" and "Load everything we don't have on this object, because we are actually using it actively". Groups let you solve both problems flexibly. The downside is that they might not be very DRY, having to repeat group="everything" over and over if you just want to load it all on first access.
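
As a rough illustration of both cases, the sketch below gives each deferred column an optional group name. Accessing a grouped column loads the whole group, while an ungrouped column loads alone. GroupedDeferred, ROW and QUERIES are hypothetical names made up for this example; nothing here is Django or SQLAlchemy API, and the database is simulated with a dictionary.

```python
# Hypothetical sketch of "deferred groups"; not Django or SQLAlchemy API.
ROW = {
    "id": 1,
    "username": "adrian",
    "bio": "Co-creator",
    "location": "Chicago",
    "avatar": b"\x89PNG",
}
QUERIES = []  # each entry lists the columns fetched by one simulated query


class GroupedDeferred:
    """Non-data descriptor that loads every column sharing its group."""

    def __init__(self, name, group=None):
        self.name = name
        self.group = group

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        columns = [self.name]
        if self.group is not None:
            columns = sorted(
                name
                for name, attr in vars(type(instance)).items()
                if isinstance(attr, GroupedDeferred) and attr.group == self.group
            )
        QUERIES.append(columns)
        instance.__dict__.update({c: ROW[c] for c in columns})
        return instance.__dict__[self.name]


class User:
    bio = GroupedDeferred("bio", group="profile")
    location = GroupedDeferred("location", group="profile")
    avatar = GroupedDeferred("avatar")  # big blob: loads alone


u = User()
u.location  # loads the whole "profile" group in one simulated query
u.bio       # already loaded, no query
u.avatar    # separate query for the blob only
```

The DRY concern shows up here too: every field that should load together has to repeat the same group name.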

Best,
Alex Ogier

Anssi Kääriäinen

Apr 25, 2013, 5:37:10 PM
to Django developers
On 25 Apr, 23:44, Alex Ogier <alex.og...@gmail.com> wrote:
> On Thu, Apr 25, 2013 at 2:10 PM, Florian Apolloner <f.apollo...@gmail.com> wrote:
>
> > On Thursday, April 25, 2013 7:06:06 PM UTC+2, Adrian Holovaty wrote:
>
> >> Also, I should mention that this should be *optional* behavior, as the
> >> current behavior is reasonable for the common case. The API for specifying
> >> this "load everything" behavior is a separate discussion. Perhaps a keyword
> >> argument like: User.objects.only('username', loadall=True).
>
> > I could imagine a Meta attribute which introduces so called "deferred
> > groups" like SA has
> >http://docs.sqlalchemy.org/en/latest/orm/mapper_config.html#deferred-...,
> > accessing one column of a group will load all columns of the group. Not
> > sure if we want that level of control, but I thought it would be worth
> > looking at what SA can do in this regard.
>
> I think groups are a very good abstraction for this problem. The two common
> cases are probably "Load this column alone, because it's potentially a big
> honking blob of text or binary" and "Load everything we don't have on this
> object, because we are actually using it actively". Groups let you solve
> both problems flexibly. The downside is that they might not be very DRY,
> having to repeat group="everything" over and over if you just want to load
> it all on first access.

+1 to this approach.

IMO load everything is a better default than one field at a time. When
deferred field loading happens, something has already gone wrong. In
almost every case it is better to try to minimize the number of
queries than the number of loaded fields. The cases where you have
deferred multiple fields, need only one of them later on, and there is
a field that you can't load due to a potentially huge value are
hopefully rare.

But, I don't really care too much. If the objects come from DB queries
instead of something like the use case of Adrian then if you end up
doing deferred field loading you have already failed. So, even an
error on deferred field access would work for my use cases of defer...

- Anssi

Anssi Kääriäinen

Apr 25, 2013, 5:59:39 PM
to Django developers
On 25 Apr, 20:08, Alex Gaynor <alex.gay...@gmail.com> wrote:
> This sounds like a reasonable request, I don't yet have an opinion on API
> or anything. One tiny thing I'd like to note though, "change DeferredAttribute
> to find all *other* DeferredAttributes". I don't think `finding` is the
> right way to think about it, a `DeferredAttribute` with loadall semantics
> should know about the specific set of objects that came from the queryset.

Searching for deferred fields already happens in some places (at least
model.__init__ and model.save()). It would be possible to store the
set of deferred fields in model._state on load from DB, but why record
this information (with real possibility of messing it up if users do
something like direct __dict__ assignments after load from DB) when
there is a way to calculate the set of deferred fields on demand? One
possible reason is to avoid the expense of the calculation, but the
calculation costs very little compared to the DB query about to
happen. I actually tried to store the set of deferred fields per
instance when polishing the update_fields patch but the approach
turned out to be uglier at least for the update_fields case.

Now, if there were some way to avoid the dynamic model subclassing
used by deferred loading then storing the set of deferred fields per
instance would be more than welcome. One possibility might be defining
Model.__getattr__. If you end up in __getattr__ and the value isn't
present in the model's __dict__ but the attname is present in the
model._state.deferred_fields, then you know to load it from DB. I
assume there are some reasons why this wasn't done in the first
place... Maybe descriptors for custom fields would break or the above
mentioned direct __dict__ access case would fail?
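
A minimal sketch of that __getattr__ idea, under stated assumptions: ModelState, the deferred_fields set, ROW and QUERIES are hypothetical names for illustration only, and the database is simulated with a dictionary. It relies on the fact that Python calls __getattr__ only after normal attribute lookup has failed.

```python
# Hypothetical sketch; _state.deferred_fields and ROW are illustrative
# names, not an existing Django API.
ROW = {"id": 1, "username": "adrian", "bio": "Co-creator", "location": "Chicago"}
QUERIES = []  # each entry lists the columns fetched by one simulated query


class ModelState:
    def __init__(self, deferred_fields):
        self.deferred_fields = set(deferred_fields)


class User:
    def __init__(self, deferred=(), **values):
        self.__dict__.update(values)
        self._state = ModelState(deferred)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so loaded fields pay no cost.
        state = self.__dict__.get("_state")
        if state is None or name not in state.deferred_fields:
            raise AttributeError(name)
        missing = sorted(state.deferred_fields)
        QUERIES.append(missing)  # one simulated query for all deferred fields
        self.__dict__.update({f: ROW[f] for f in missing})
        state.deferred_fields.clear()
        return self.__dict__[name]


u = User(deferred=["bio", "location"], id=1, username="adrian")
u.username  # found in __dict__, __getattr__ is never called
u.bio       # triggers one simulated query for bio and location
u.location  # already in __dict__
```

No dynamic subclass is involved here: the set of deferred fields lives on the instance, which is the property the paragraph above is after.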

- Anssi

Alex Gaynor

Apr 25, 2013, 6:01:04 PM
to django-d...@googlegroups.com
Sorry, I misunderstood the original request. Yes, you're right Anssi and Adrian, finding them on demand is reasonable.

Alex


Shai Berger

Apr 26, 2013, 4:55:26 AM
to django-d...@googlegroups.com
On Friday 26 April 2013, Alex Gaynor wrote:
> Sorry, I misunderstood the original request. Yes, you're right Anssi and
> Adrian, finding them on demand is reasonable.
>
Reasonable, but not entirely necessary for this.

I have code that requires something like that; when it is time to "undefer"
the object, I just load it up anew from the db. I think this covers Adrian's
use case well, except (perhaps) for the part where it is triggered by
attribute access.

(my own use case is going through a list of objects and serializing all of
them; deferring is necessary because the query that selects the objects needs
to use distinct(), and some objects have TextFields -- which would make the
"distinct" so non-performant, that Oracle simply forbids it).

Shai.

Anssi Kääriäinen

Apr 26, 2013, 7:54:21 AM
to Django developers
On 26 Apr, 00:59, I wrote:
> Now, if there were some way to avoid the dynamic model subclassing
> used by deferred loading then storing the set of deferred fields per
> instance would be more than welcome. One possibility might be defining
> Model.__getattr__. If you end up in __getattr__ and the value isn't
> present in the model's __dict__ but the attname is present in the
> model._state.deferred_fields, then you know to load it from DB. I
> assume there are some reasons why this wasn't done in the first
> place... Maybe descriptors for custom fields would break or the above
> mentioned direct __dict__ access case would fail?

I tried the above approach. It works, but there are some potential
problems:
1. There is a need to alter the __init__ signature (I added a
__loaded_fields kwarg).
2. There is now a __getattr__ on Model, which will slow down
attribute access and hasattr() when the attribute being looked up
isn't found. When the attribute is found there is no speed
difference (as __getattr__ is only called for not-found cases).
3. Potential problems for descriptors: the descriptors need to be
programmed defensively. If the attribute isn't found in the
instance's __dict__, then AttributeError needs to be raised
instead of KeyError. See changes in fields/subclassing.py for an
example.
4. Doing del obj.someattr or obj.__dict__.pop(someattr) and then
accessing the same attr again will result in a reload from the DB
instead of AttributeError.
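
Point 3 can be shown with plain Python, independent of the patch. When a descriptor's __get__ raises AttributeError, Python falls back to __getattr__; when it raises KeyError, the exception simply propagates. The class names below are hypothetical, and __getattr__ here is only a stand-in for "load the deferred value from the DB".

```python
class KeyErrorField:
    """Data descriptor that reads __dict__ directly and lets KeyError escape."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


class DefensiveField(KeyErrorField):
    """Same descriptor, but a missing value is reported as AttributeError."""

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        try:
            return instance.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name) from None


class Model:
    fragile = KeyErrorField("fragile")
    safe = DefensiveField("safe")

    def __getattr__(self, name):
        # Stand-in for loading a deferred value from the database.
        value = "loaded:" + name
        self.__dict__[name] = value
        return value


m = Model()
safe_value = m.safe  # AttributeError from __get__ falls through to __getattr__

try:
    m.fragile        # KeyError is not translated, so __getattr__ never runs
    fragile_failed = False
except KeyError:
    fragile_failed = True
```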

But on the other hand, you get rid of the ugly dynamic subclassing
used by deferred loading entirely. This will resolve some
longstanding bugs (like signals not working when using deferred
models).

No. 1 above seems like the biggest problem. Personally, I don't see
overridden __init__ methods as very useful in the first place. This is
because there is no way to know whether a load from the DB happened or
whether this was a regular init. In addition, the semantics of __init__
are strange. The initialization might happen through *args, or through
**kwargs, or through both, and missing kwargs might be deferred or
not.

This is maybe going off topic, but I would actually like to add a
new _from_db class method which does model construction by direct
__dict__ assignment. So, model loading from DB wouldn't go through
__init__ at all. The reasoning for this is to get rid of the weird
__init__ semantics. Another reason is that load from DB is just a form
of deserialization, and deserialization shouldn't go through __init__,
instead the object should be constructed manually. This is how normal
Python pickling works. Finally, this way is much faster, up to around
4x faster for deferred loading (see https://code.djangoproject.com/ticket/19501
for some performance numbers). Unfortunately this change is hard to do
with any kind of backwards compatibility period.
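
A minimal sketch of what such a constructor could look like, assuming a plain Python class rather than a real Django model. The _from_db signature and the init_calls counter are hypothetical and only serve to show that __init__ is bypassed.

```python
class Person:
    init_calls = 0  # counts how often __init__ runs, for illustration only

    def __init__(self, name, age=0):
        type(self).init_calls += 1
        self.name = name
        self.age = age

    @classmethod
    def _from_db(cls, field_names, row):
        # Allocate the instance without running __init__, then fill
        # __dict__ directly from the row values.
        obj = cls.__new__(cls)
        obj.__dict__.update(zip(field_names, row))
        return obj


p = Person._from_db(("name", "age"), ("Ada", 36))
```

Because __init__ is never called, an overridden __init__ would not see objects loaded this way, which is the backwards-compatibility problem mentioned above.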

Here is the work-in-progress patch: https://github.com/akaariai/django/compare/defer_getattr.
Unfortunately I don't have time to work on this more until start of
June or so...

- Anssi

ptone

Apr 28, 2013, 5:09:21 PM
to django-d...@googlegroups.com
Just a couple of quick observations.

defer and only are tasks/concepts used when doing a query based on knowledge of your dataset - adding them to the model itself expands the number of places where this concept is considered. This has some good and some bad.

What happens if you have defined a group on a model, and use only a single field for 'only' in a QS? Does it fetch the only one I've asked for, or does it trigger the group?

Why couldn't one just define the group in the code using .only() and pass all the fields at the time you want?

In Adrian's case, there will always be at least 2 DB hits - one could define the group of "lazy fields" and do something like:

>>> u = User.objects.only(*lazygroup).get(id=u.id)

I guess for something like that to be more practical, we need to expose something on the model instance that makes it easy to see what fields are currently deferred? Something that could easily check whether the second load had been done, and the lazy fields were available or not.

These are mostly observations, I'm not against adding the idea of groups to the model definition, but do think that if it can be solved at the scope of the QS usage, where .only() and .defer() currently are used, that would be better - one less reason to check the model definition to see how it was set up.

-Preston

Shai Berger

Apr 28, 2013, 7:48:05 PM
to django-d...@googlegroups.com
Hi again,

On Friday 26 April 2013, Anssi Kääriäinen wrote:
> [...]
> In almost every case it is better to try to minimize the number of
> queries than the number of loaded fields. The cases where you have
> deferred multiple fields, need only one of them later on, and there is
> a field that you can't load due to potentially huge value are
> hopefully rare.
>
> But, I don't really care too much. If the objects come from DB queries
> instead of something like the use case of Adrian then if you end up
> doing deferred field loading you have already failed.

LOBs in Oracle always require at least a separate fetch operation to read
them. Thus, you get an extra database hit per field (per record). Getting them
before they are needed is almost always inefficient. As I already noted in this
thread, they can even get in the way of query execution if not deferred.

I think it would be better if, at least on Oracle, such fields could be
deferred by default. We are considering achieving this with an introspective
custom manager, applied to all models (or at least all models with TextFields)
in our project. If I could do it with a special field type, I would, and I'd
recommend every other user do the same.

Just a data point for this discussion,

Shai.

Anssi Kääriäinen

Apr 30, 2013, 5:52:09 AM
to Django developers
On 29 Apr, 00:09, ptone <pres...@ptone.com> wrote:
> Just a couple of quick observations.
>
> defer and only are tasks/concepts used when doing a query based on
> knowledge of your dataset - adding them to the model itself expands the
> number of places where this concept is considered. This has some good and
> some bad.
>
> What happens if you have defined a group on a model, and use only a single
> field for 'only' in a QS? Does it fetch the only one I've asked for, or
> does it trigger the group?
>
> Why couldn't one just define the group in the code using .only() and pass
> all the fields at the time you want?
>
> In Adrian's case, there will always be at least 2 DB hits - one could
> define the group of "lazy fields" and do something like:
>
> >>> u = User.objects.only(*lazygroup).get(id=u.id)
>
> I guess for something like that to be more practical, we need to expose
> something on the model instance that makes it easy to see what fields are
> currently deferred? Something that could easily check whether the second
> load had been done, and the lazy fields were available or not.
>
> These are mostly observations, I'm not against adding the idea of groups to
> the model definition, but do think that if it can be solved at the scope of
> the QS usage, where .only() and .defer() currently are used, that would be
> better - one less reason to check the model definition to see how it was
> set up.

How about taking a different approach? If a Model.refresh(*fields)
method is introduced and if deferred loading happens through
instance.refresh(), then by overriding the method you can alter the
deferred loading strategy in any way you wish. In addition, if
refresh() takes an all_deferred kwarg, then the original problem can
be solved in this way:

    def refresh(self, *fields, **kwargs):
        kwargs['all_deferred'] = True
        super(User, self).refresh(*fields, **kwargs)

There are other use cases where model.refresh() would be useful. The
currently recommended way to refresh an instance is to do obj =
SomeModel.objects.get(pk=obj.pk), but this doesn't work nicely in all
cases.

While having Meta.deferred_groups or QuerySet.only(*fields,
loadall=True/False) would be a bit nicer API, I am still of the
opinion that use cases where it actually matters how deferred loading
happens are rare. The above gives a way to customise deferred loading
strategy. For the concrete use cases I have seen so far
overridden .refresh() should be enough.
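
To show how the pieces could fit together, here is a self-contained sketch in which deferred loading is routed through refresh(). The refresh() method and all_deferred keyword follow the proposal above and are not an existing Django API; Model, ROW, QUERIES, field_names and deferred_fields() are hypothetical stand-ins, with the database simulated by a dictionary.

```python
# Hypothetical sketch of the refresh() proposal; not an existing Django API.
ROW = {"id": 1, "username": "adrian", "bio": "Co-creator", "location": "Chicago"}
QUERIES = []  # each entry lists the columns fetched by one simulated query


class Model:
    field_names = ("id", "username", "bio", "location")

    def __init__(self, **loaded):
        self.__dict__.update(loaded)

    def deferred_fields(self):
        # Computed on demand: any known field missing from __dict__.
        return [f for f in self.field_names if f not in self.__dict__]

    def refresh(self, *fields, all_deferred=False):
        if all_deferred:
            fields = self.deferred_fields()
        elif not fields:
            fields = self.field_names
        fields = list(fields)
        QUERIES.append(fields)
        self.__dict__.update({f: ROW[f] for f in fields})

    def __getattr__(self, name):
        # Deferred loading goes through refresh(), so subclasses can change
        # the loading strategy by overriding that one method.
        if name not in self.field_names:
            raise AttributeError(name)
        self.refresh(name)
        return self.__dict__[name]


class User(Model):
    def refresh(self, *fields, **kwargs):
        kwargs["all_deferred"] = True
        super().refresh(*fields, **kwargs)


u = User(id=1, username="adrian")
u.bio       # one simulated query for every deferred field
u.location  # already loaded

m = Model(id=1, username="adrian")
m.bio       # default strategy: one query per deferred field
m.location
```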

- Anssi