Proposed change in ORM model save logic


Barry Johnson

Oct 15, 2021, 2:50:20 AM
to django-d...@googlegroups.com

Trac ticket #33191 was recently closed as a “wontfix”, but Carlton encouraged bringing the matter to this group.  The actual issue is a small one, but it seems that Django could do better than it is doing today with one small change, avoiding the need for a counter-intuitive "x = x" statement in application code.

The claim is that Django is unnecessarily clearing the cached entry for an in-memory related object at a time when the cache does not need to be cleared, and that this cache clearing can result in unwanted lazy reads down the line.  A one-line fix for the problem is suggested.

Many apologies in advance for the extreme length of this post.  The requested change is small and subtle (although quite beneficial in some cases); we must dive deep into code to understand WHY it is important.

 

Background
———————

My team is developing a major retail point of sale application; it’s currently live in nearly 1,000 stores and we’re negotiating contracts that could bring it to another 10,000 stores across the US over the next three to five years before going world-wide.  The backend is Django and PostgreSQL.

One of the typical problems we face has to do with importing data from existing systems.  This data usually comes to us in large sets of flat ASCII files, perhaps a gigabyte or a few at a time.  We have to parse this incoming data and fill our tables as quickly as possible.  It’s not unusual for one such data import to create tens of millions of rows across more than a hundred tables.  Our processes to do this are working reasonably well.

In many cases, the incoming stream of data generates related records in several tables at a time.  For example, our product import populates related data across a dozen tables organized into parent/child hierarchies that are up to four levels deep.  The incoming data is grouped by product (not by functional data type); the rows and columns of the records for a single product will create ORM objects scattered across those dozen ORM models.  The top-level model, Product, has four child models with references to Product; and each of those child models may have other child tables referring back to THOSE models, etc.

If the database schema sounds surprisingly complex, it is.  Big box retail is more complex than most people realize.  Each store typically carries 30,000 to 70,000 individual products; some are even larger.  Some of the retail chains to which we market our systems have more than $1 billion USD per year in sales.

To be efficient, we process this incoming data in chunks:  We may instantiate ORM objects representing, say, 5,000 products and all of their related child objects.  For example, we’ll create a new instance of a Product model, then instantiate the various children that reference that product.  So a very typical pattern is something like

     product = Product(**values)
     pv = ProductVariant(product=product, **more_values)
     upc = UPC(product_variant=pv, **upc_values)

It’s not that simple, of course; we’re reading in sequential data that generates multiple instances of the children in loops, etc., but we are essentially building a list of products and the multi-level hierarchy of child objects that depend upon each of those products.  We then use bulk_create() to create all of the top-level products in one operation, then bulk_create() the various lists of first-level children, then bulk_create() the children of that second level, etc.  LOTS of bulk creates.
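The batching step itself is framework-independent.  As a minimal sketch (chunked() is our own illustrative helper here, not a Django or stdlib API), grouping the incoming stream into fixed-size chunks looks like:

```python
# Minimal sketch of the batching described above: group an incoming stream
# into fixed-size chunks so each level of the hierarchy can be bulk-created
# in one operation per chunk.  chunked() is an illustrative helper only.
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# e.g. 12,500 incoming product records processed 5,000 at a time:
sizes = [len(batch) for batch in chunked(range(12_500), 5_000)]
print(sizes)
```

In the real import, each batch is turned into Product instances plus their children before the layered bulk_create() calls run.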

 

Prior to Django 3.2, we had to explicitly set the associated “_id” field for each child’s reference to a parent after the list of parents was bulk_created.  Using our examples above, we would see things like:

Product.objects.bulk_create(list_of_products)

for variant in list_of_product_variants:
    variant.product_id = variant.product.id
ProductVariant.objects.bulk_create(list_of_product_variants)

for upc in list_of_upc_entries:
    upc.product_variant_id = upc.product_variant.id
UPC.objects.bulk_create(list_of_upc_entries)

            […]

Again, this is somewhat simplifying the code, but the key takeaway is that older versions of Django required us to manually pick up the primary key value from recently created instances and set the associated “_id” value in each instance pointing at the recently created objects.

As expected, setting the “_id” value of a foreign key field clears the internal cache entry containing the reference to the parent object.  We would fully expect that if we said “variant.product_id = 10”, that “variant.product” would be considered “not yet loaded”, and a reference to variant.product would generate a lazy read.  We don’t disagree with that interpretation or behavior, and that is the existing behavior that Carlton implicitly referenced when he closed the ticket in question as “wontfix”.
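To make that cache-clearing behaviour concrete, here is a toy model of it in plain Python.  These classes are illustrative stand-ins, not Django’s actual ones; the real logic lives in the ForeignKeyDeferredAttribute descriptor discussed further below.

```python
# Toy stand-ins (not Django's real classes) showing why assigning the raw
# "_id" attribute invalidates the cached related object.
class ForeignKeyIdAttribute:
    """Mimics the shape of Django's ForeignKeyDeferredAttribute.__set__."""
    def __init__(self, attname, cache_name):
        self.attname = attname        # e.g. "parent_id"
        self.cache_name = cache_name  # e.g. "parent"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.attname)

    def __set__(self, instance, value):
        # A changed id means the cached object may be stale: drop it.
        if instance.__dict__.get(self.attname) != value:
            instance._fields_cache.pop(self.cache_name, None)
        instance.__dict__[self.attname] = value

class Parent:
    def __init__(self, pk=None):
        self.pk = pk

class Child:
    parent_id = ForeignKeyIdAttribute("parent_id", "parent")

    def __init__(self, parent=None):
        self._fields_cache = {}
        self.parent_id = parent.pk if parent else None
        if parent is not None:
            self._fields_cache["parent"] = parent

    @property
    def parent(self):
        # A cache miss is where real Django would issue a lazy SELECT.
        return self._fields_cache.get("parent", "LAZY READ")

p = Parent()            # unsaved parent, pk is None
c = Child(parent=p)
p.pk = 10               # stand-in for save() assigning a primary key
c.parent_id = p.pk      # the pre-3.2 manual fix-up...
print(c.parent)         # ...and the cached reference is now gone
```

Note that assigning an unchanged value does not clear the cache; only a differing id does, exactly as one would hope.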

But…

 

Relatively Recent Django Changes
——————————————————

In the Django 3.2 timeframe, Ticket #29497 (“Saving parent object after setting on child leads to an unexpected data loss in bulk_create()”) affected code in this area.  Consider this sequence of code:

   my_parent = Parent()
   my_child = Child(parent=my_parent)
   my_parent.save()
   my_child.save()

(Yes, our real business case involves use of bulk_create() for thousands of objects at a time, but this simpler example using only .save() illustrates the problem in a stripped-down manner.)

What is the expected result in the database for the child’s foreign key to the parent?  Well…  We think most application programmers would expect the row for the child object to contain a foreign key reference to the parent object.

Instead, the pre-3.2 Django behavior was that the child would be saved with a null value in the foreign key reference to the parent…  Because, when we save the parent (and internally populate the parent’s primary key with a newly assigned ID), Django is unaware of the in-memory reference from child to parent and cannot automatically set child.parent_id = child.parent.id, so it remains a null value.

If the foreign key from child to parent is required, the save of the child object will fail because of the null value, and the developer will quickly realize something is wrong.  However, if the foreign key to the parent is NOT required, then the child WILL be quietly saved to the database with a null foreign key to the parent... and an unwary developer will wind up with unexpected null values.

This was considered an unintentional loss of data and reported on ticket #29497.

The solution to this problem was clever:  It recognizes that this is a case where my_child.parent_id and my_child.parent.pk are out of sync with one another (something we normally never see in Django), and that this particular case can be detected and solved.  In particular, the logic implemented as part of the ticket effectively changes the internal priority of these two related values.

Prior to 3.2, the code to build the SQL INSERT statement for saving the child would always use the “parent_id” attribute regardless of whether “parent” was an ORM instance or not.  The new 3.2 code does a quick initial test:  If the “parent” attribute contains an ORM object with a non-null primary key AND if “parent_id” does not have a value, then set parent_id = parent.pk and proceed with the remainder of the save.

In other words, if we don’t have a value in “parent_id” but we DO have a value in “parent.pk”, then fix parent_id before we move on to build the SQL.   Nice!  
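In toy form (our own names and simplified checks, not the actual Django source), that reconciliation step behaves roughly like this:

```python
# Illustrative sketch (our own names, not the actual Django source) of the
# reconciliation step added for ticket #29497.
class Parent:
    def __init__(self):
        self.pk = None

    def save(self):
        self.pk = 42  # stand-in for the database assigning an id

class Child:
    def __init__(self, parent=None):
        self.parent = parent
        # copied at construction time, so still None for an unsaved parent
        self.parent_id = parent.pk if parent else None

def prepare_related_fields_for_save(child):
    """If the cached object now has a pk but the "_id" attribute does not,
    trust the object and fix the id before building the INSERT."""
    obj = child.parent
    if obj is not None and child.parent_id is None and obj.pk is not None:
        child.parent_id = obj.pk

p = Parent()
c = Child(parent=p)
p.save()
prepare_related_fields_for_save(c)
print(c.parent_id)  # 42 rather than a silently NULL foreign key
```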

But…

 

The change we are requesting
——————————————————

For those that want to see the code in this area, check out the method _prepare_related_fields_for_save() in django/db/models/base.py.  The specific line of code in question is:
    setattr(self, field.attname, obj.pk)

It uses an internal “setter” to set “parent_id” to the primary key of the referenced object.

Note this point carefully:  The value of “obj” is the referenced instance of the parent in question.  Just a few lines earlier in the code, “obj” is set with this statement:
    obj = getattr(self, field.name, None)

When we chase down this getter, it is returning the value in self._state.fields_cache[fieldname].  In other words, it’s the actual instance of the parent referenced in memory from the child.  The application-level code has clearly and intentionally made this reference from child to parent, and the fact that we saved the parent to the database doesn’t (well, shouldn’t) affect that relationship.  It certainly would NOT affect the relationship if the parent was updated instead of inserted, and we would expect consistent behavior from .save() regardless of insert vs. update.

So, back to the narrow problem.  This line of code:
    setattr(self, field.attname, obj.pk)
causes the internal cached reference from child to parent to be cleared!

Well, yes, of course it does.  This setattr() operation is effectively doing the same thing as 
    child.parent_id = 10
which we’ve already said OUGHT to invalidate an existing reference to “child.parent”. 

But here’s the crux of the argument:  The code within _prepare_related_fields_for_save() ISN’T application-level code clearly altering the value of an ID field.  Instead, it’s internal Django code attempting to reconcile a mismatch between the cached ID of the parent (child.parent_id) and primary key value of the in-memory reference to a model instance (child.parent.pk).  And most importantly, it’s reconciling that difference by believing that the referenced instance is more accurate and should be preserved.  Yet the very act of preserving that referenced instance’s primary key is removing the reference to the very instance that is considered more accurate.

Wait a moment — to preserve a pointer to the referenced instance, we’re going to pluck one tiny value out of it AND THEN throw the referenced instance away!  To those of my generation, this reminds us of a 50+ year old infamous quotation, “To save the village we had to destroy it.”

 

Why does it matter?
————————————

If the sole use of the parent and child objects in this example was to get data saved to the database, it really wouldn’t matter.

However, in our specific use case, we have to record certain types of operations in an internal log.  (It’s an “inventory transaction register”; whenever we change something in inventory that has an effect on the value of inventory, we must log that change.)  That logging logic actually applies to ORM instances that are two levels down from the top of the hierarchy of parent/child objects being bulk created.  We have a method to which we pass each of these not-top-level objects, and the log entries include pieces of data from those objects’ parent and grandparent objects.

When we access the parent object through the child we’re logging, we wind up causing lazy reads of the objects we just saved to the database (and which we still have in memory, and which we THOUGHT were still referenced from our child objects).  In our case, we actually get 2 lazy reads for each store-specific product being imported; 50,000 items being imported for a four-store chain causes an extra 400,000 unnecessary SQL SELECT statements.  Removing these potential lazy reads is the performance improvement we’d like to share with the world.

And perhaps just as importantly, it avoids differences in behavior caused when a referenced object is updated vs. inserted into the database during a save.  When it is updated, there is no need to alter the primary key, so the referenced object remains referenced.  But if the referenced object is inserted, the internal code that updates the primary key for that instance removes the reference.  We normally don’t like these types of behavioral differences in Django – we want .save() to behave the same way regardless of inserting vs. updating.

 

There is a workaround
——————————————————

There is a fairly simple workaround, and it harkens back to the days of pre-3.2 Django.

   my_parent = Parent()
   my_child = Child(parent=my_parent)
   my_parent.save()
   my_child.parent = my_child.parent    # Add this line to work around the problem.
   my_child.save()

Yes, we can do this.  This rather unintuitive statement (which appears to be useless) obtains the internal cached reference to the parent object, then sets both child.parent_id and the internal cached reference back to that instance.  (The code in the setter first sets child.parent_id, which clears the cached entry, then inserts the originally referenced parent back into the internal cache.)

This workaround greatly resembles the statement from days of old: (“child.parent_id = child.parent.id”), but has the additional benefit of keeping our in-memory reference to parent (and to all of the objects that parent itself might be referencing as well), and avoiding lazy reads in the future.

 

Our suggested change
——————————————————

After this very long-winded explanation, we are proposing a small change that makes the workaround unnecessary while still avoiding the potential performance hit from lazy queries.

Let’s look back to our friend _prepare_related_fields_for_save().  

We argue there is a STRONG expectation that preparing an object for saving to the database should not substantially alter the content of that object.  Yet this method does not adhere to that expectation:  While the results saved to the database are correct, it has altered the in-memory object by discarding the internal reference to a parent object that the application-level code has carefully set.

In general, we like what _prepare_related_fields_for_save() is doing — we like the fact that it’s reconciling the difference between the copy of the referenced object’s primary key (child.parent_id) and the in-memory reference (child.parent.pk).  We agree that this method is doing the right thing by using the primary key value from the referenced object.  We simply do not think that the parent instance should be removed from the internal cache as part of this change.

Instead of resolving this difference in keys by executing:
    setattr(self, field.attname, obj.pk)

We believe it should instead:
    setattr(self, field.name, obj)

This would effectively replace the pseudo-code
    child.parent_id = child.parent.pk
with
    child.parent = child.parent

It replaces the value of child.parent_id with child.parent.pk, but without the side effect of discarding the in-memory relation of the child and parent instances.
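A toy comparison (plain-Python stand-ins, not Django internals) shows the difference between the two setattr targets:

```python
# Toy comparison (plain-Python stand-ins, not Django internals) of the two
# setattr targets.  Ref mimics a child model holding "parent_id" plus a
# cache of the related object.
class Obj:
    def __init__(self, pk):
        self.pk = pk

class Ref:
    def __init__(self):
        self._cache = {}
        self._id = None

    @property
    def parent_id(self):
        return self._id

    @parent_id.setter
    def parent_id(self, value):        # like setattr(self, field.attname, pk)
        if self._id != value:
            self._cache.pop("parent", None)  # cache invalidated
        self._id = value

    @property
    def parent(self):
        return self._cache.get("parent")

    @parent.setter
    def parent(self, obj):             # like setattr(self, field.name, obj)
        self.parent_id = obj.pk        # updates the id (clearing the cache)...
        self._cache["parent"] = obj    # ...then restores the reference

o = Obj(pk=7)
a, b = Ref(), Ref()
a._cache["parent"] = o                 # both start with a cached parent
b._cache["parent"] = o

a.parent_id = o.pk    # current behavior: id correct, cache gone
b.parent = o          # proposed behavior: id correct, cache kept
```

Both forms leave parent_id correct; only the second keeps the cached object, which is exactly what the application-level workaround achieves by hand.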

 

Finally, there is a small performance improvement that could be made to this suggestion.

If we look at what the current
    setattr(self, field.attname, obj.pk)
actually does, we have to look at the __set__() method of class ForeignKeyDeferredAttribute in django/db/models/fields/related_descriptors.py.

The code in this method is:

def __set__(self, instance, value):
    if instance.__dict__.get(self.field.attname) != value and self.field.is_cached(instance):
        self.field.delete_cached_value(instance)
    instance.__dict__[self.field.attname] = value

It does what we expect:  If the “parent_id” value is changing and there is a cached in-memory reference, then clear the cached entry before setting the integer value.

But note that the value of the “parent_id” attribute is really stored in the dictionary instance.__dict__[self.field.attname]

Our proposed change back in _prepare_related_fields_for_save() has the goal of setting the value of the integer “parent_id” while preserving the cached in-memory reference to the parent.

Instead of doing
    setattr(self, field.name, obj)
(which accomplishes the goal by essentially clearing the cache, then restoring it)

We could just do this:
    self.__dict__[field.attname] = obj.pk

It is a slight denormalization of the logic that manages the value of the deferred foreign key attribute, but it offers a tiny performance improvement by not removing and then replacing a dictionary entry.

 

What are WE doing about this problem?
—————————————————————

We’re going to be staying with Django 3.2 until the next long-term support release is available, so this change has no immediacy for our particular application.  However, we think it is a nice extension that removes a non-obvious behavior in Django 3.2 (the .save() method can sometimes discard in-memory references to other ORM objects) and seems like it enhances the magic of the ORM framework.

In the meantime, we’ll be plugging non-intuitive lines of code that look like
    child.parent = child.parent

into about fifty locations throughout our app, as well as training a team of 20 server-side developers to watch out for this edge case in any future code they write.  It’s all doable, and we’ll do it… but we really like Django because it does “automagically” handle so many little database oddities and nuances.  Over time, we’d like to get rid of “surprise” code like these lines that an inexperienced developer might remove because they look like they do nothing.

 

IF YOU’VE REACHED THE END
—————————————————

Thank you!  I’m amazed you stuck through it all.

What do you think?

Ian Foote

Oct 15, 2021, 1:39:52 PM
to django-d...@googlegroups.com
Hi Barry,

Based on this I'm cautiously in favour. You set out a good case, it's much easier to understand the "why" here than in the original ticket.

I'd avoid the extra optimisation of accessing __dict__ directly though - if __set__ gains extra functionality, I'd prefer not to accidentally miss it here.

Ian

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/596D3E68-4A37-423D-AEDF-64D2A5658DE8%40epicor.com.

Barry Johnson

Oct 16, 2021, 11:17:32 AM
to Django developers (Contributions to Django itself)
On Friday, October 15, 2021 at 12:39:52 PM UTC-5 pyt...@ian.feete.org wrote:

> You set out a good case, it's much easier to understand the "why" here than in the original ticket.

Yes.  It was hard to compress the matter down small enough to reasonably fit in a Trac entry, so I skipped the vast majority of the rationale.  Sorry that it took that much writing to make the case, but I had to set the groundwork first.

On later consideration, the two most compelling points seem to be:
* Internal machinations of Django shouldn't change object relationships that the application-level code has set
* save() shouldn't have different effects (-sometimes- uncaching the reference) if the parent object is inserted versus updated

> I'd avoid the extra optimisation of accessing __dict__ directly though - if __set__ gains extra functionality, I'd prefer not to accidentally miss it here.

Agreed!  Not knowing just how widespread accessing the underlying __dict__ is throughout the code, I wasn't sure which way to go.  Although it may be a -measurable- improvement in performance, it's minuscule.

baj

Aymeric Augustin

Oct 17, 2021, 3:59:18 AM
to django-d...@googlegroups.com
Hello,

On 15 Oct 2021, at 08:49, Barry Johnson <bajo...@epicor.com> wrote:

Instead of resolving this difference in keys by executing:
    setattr(self, field.attname, obj.pk)
We believe it should instead:
    setattr(self, field.name, obj)


I think the proposed change is correct. I reviewed this pull request and related ticket, as well as prior art in the area, and I'm not seeing a reason for assigning just the PK (= PK set, related cache not set) rather than the entire object (= PK and related cache both set).

Disclaimer — while I was involved in this area long ago, I didn't follow what happened after commit 5980b05c, where the "prevent data loss with unsaved related objects" checks were moved from the moment a related object is assigned to the moment the object is saved.

I reviewed Carlton's wontfix, which made sense to me, to understand why we didn't reach the same conclusion. Carlton thinks the code has been here for a very long time and would be tricky to touch; I believe that we're only touching a line introduced in 519016e5, 2.5 years ago, which doesn't feel _that_ old to me.

Hope this helps!

-- 
Aymeric.

Carlton Gibson

Oct 18, 2021, 6:22:48 AM
to Django developers (Contributions to Django itself)
OK, thanks all, let's reopen.  These kinds of wontfix+MailingList issues are more about getting more eyes on them than anything else, and the explanation you've given is super, Barry.

C. 

Adam Johnson

Oct 20, 2021, 12:14:10 PM
to django-d...@googlegroups.com
I am also in favour.  Thanks for explaining, Barry.
