Related manager remove() and clear() methods - backwards incompatible changes


Anssi Kääriäinen

Oct 24, 2013, 4:40:37 PM
to django-d...@googlegroups.com
Somewhat long post ahead - jump to the list at the end of the post to see the backwards incompatible changes.

Ticket #21169 deals with a bunch of different related manager improvements. Some are bug fixes, some make the related manager methods work consistently, and finally the ticket's original issue is solved: a custom related manager's filters are now taken into account when removing items from the relation or clearing it. These changes affect .clear() and .remove() of ManyToManyField, GenericForeignKey and reverse ForeignKey fields. The .add() methods aren't dealt with in the ticket.

A proposed patch is available from https://github.com/django/django/pull/1685

I'll start with the ticket's actual issue. Currently, if you have a custom related manager with default filtering (say, .filter(deleted=False) in get_queryset()), that limitation isn't taken into account when clearing the relation or when removing items from it. In short, you can remove items from the relation even if they aren't visible in obj.relation.all(). There is one exception: reverse foreign key's .clear() did take the filtering into account.
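
For illustration, here is a minimal sketch of the kind of setup affected (the model and manager names are hypothetical, using Django 1.6 era conventions):

from django.db import models

class ActiveItemsManager(models.Manager):
    # In this era, this flag makes the manager the one used for
    # related field access.
    use_for_related_fields = True

    def get_queryset(self):
        # Default filtering: soft-deleted items are hidden.
        return super(ActiveItemsManager, self).get_queryset().filter(deleted=False)

class Item(models.Model):
    parent = models.ForeignKey('Parent', null=True, related_name='items')
    deleted = models.BooleanField(default=False)
    objects = ActiveItemsManager()

Before the patch, parent.items.remove(item) removed the item even when item.deleted was True, that is, even when it wasn't visible in parent.items.all(). After the patch, the default filtering applies to .remove() and .clear() as well.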

There are two bug fixes in the proposed patch. First, some of the related manager methods issued multiple data-modifying queries without wrapping them in a transaction. Second, symmetric m2m .clear() had a concurrency issue - it was possible that only one side of a symmetric relation got cleared while the other half was left in place.

Consistency issues are:
  1) Some of the methods used consecutive save or delete calls, some used bulk deletion or direct update() calls.
  2) As mentioned earlier, most of the manager methods didn't respect the related manager's default filtering. Now they all do.
  3) Due to the different styles of operations used in the methods, sometimes signals were sent, sometimes not.

The actual changes fall mainly into two categories. First, always use bulk operations, and just one bulk operation per remove() and clear() call. This takes care of the transaction and concurrency issues, as well as making the API consistent. The other category of changes is applying default filtering to those bulk operations. Notably, in some cases (especially symmetric m2m with custom filtering) this means using complex subqueries. I wonder if some databases will choke on those subqueries.

Usage of bulk operations changes how signals work. For example, reverse foreign key .remove() used multiple save() calls, and thus signals were sent for each .save(); .clear() used an update() operation, so no signals were sent. The .remove() is changed to use .update(), too, so no signals any more in that case. The .add() method was left alone - it is still possible to add items that are outside the related manager's domain. This is consistent with QuerySet .create() and .delete(): you can create items outside the current query's domain, but delete naturally applies the query's filters to the operation. A rough sketch of the .remove() change follows below.
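
Roughly, the reverse ForeignKey .remove() change amounts to the following (a simplified sketch, not the actual patch; objs, field_name and manager are placeholders):

# Old behavior: one save() per object, so pre_save/post_save fire
# and an overridden Model.save() runs.
for obj in objs:
    setattr(obj, field_name, None)
    obj.save()

# New behavior: a single bulk UPDATE routed through the related
# manager's default queryset, so default filtering applies; no
# save() calls, hence no save signals.
manager.get_queryset().filter(
    pk__in=[obj.pk for obj in objs]).update(**{field_name: None})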

Loic Bistuer has written a patch that addresses all of these issues. His list of issues and changes is worth reading; see https://code.djangoproject.com/ticket/21169#comment:9

The problem here is that the above changes *will* result in backwards compatibility problems for some users, and those problems can amount to data corruption. The exact workings of these methods aren't documented, so I am proposing to use this as a loophole to avoid the backwards compatibility rules.

This is a nasty problem to solve. If we leave things as they are, the API will remain inconsistent. If we fix these issues, it is impossible to have a deprecation period for these changes. Even a check command seems hard to write.

The actual bug fixes will need to be done. It is possible to fix the bugs without doing backwards incompatible changes. But having a consistent related manager experience would of course be nice...

I am slightly in favour of doing these changes. If somebody has good ideas how to add deprecation periods to these changes I am all ears.

Here is the full list of changes that have the potential to break user code:
  - If the related object's default manager has default filtering, then .remove() and .clear() will not clear those items that are filtered out.
  - Reverse ForeignKey .remove() will no longer use .save() - it will use .update() - so no model save signals, and an overridden model.save() is skipped, too.
  - GFK.remove() and GFK.clear() will use queryset.delete() - so model.delete() is no longer called (signals are still sent in this case, as QuerySet.delete() sends them).

Loic's list of fixes & changes is also a good summary of this ticket: https://code.djangoproject.com/ticket/21169#comment:9

 - Anssi

Anssi Kääriäinen

Nov 16, 2013, 2:02:00 PM
to django-d...@googlegroups.com
On Thursday, October 24, 2013 11:40:37 PM UTC+3, Anssi Kääriäinen wrote:
> Here is the full list of changes that have the potential to break user code:
>   - If the related object's default manager has default filtering, then .remove() and .clear() will not clear those items that are filtered out.
>   - Reverse ForeignKey .remove() will no longer use .save() - it will use .update() - so no model save signals, and an overridden model.save() is skipped, too.
>   - GFK.remove() and GFK.clear() will use queryset.delete() - so model.delete() is no longer called (signals are still sent in this case, as QuerySet.delete() sends them).
>
> Loic's list of fixes & changes is also a good summary of this ticket: https://code.djangoproject.com/ticket/21169#comment:9

I haven't figured out any other way to deal with this ticket than just committing the changes. I hope the backwards incompatibilities aren't that severe. The second issue above seems to be the hardest one for users - if you relied on model save signals, your code will now be broken *and* you don't have any viable upgrade path if you relied on the post_save signal.

I am considering adding pre/post update signals. These signals would give an upgrade path for those hit by backwards incompatibility problems from #21169. The signals would of course also be a good addition for those who happen to need them, and they shouldn't affect the performance of your project if you don't use them.

The idea is that pre_update listeners get a queryset that isn't executed. Accessing that queryset might be costly, but if it isn't accessed, there isn't much cost to adding a pre_update signal. For post_update the signal handler would be given a list of PK values. (The original queryset doesn't work for post_update: the update might cause different instances to be returned than were updated - consider qs.filter(deleted=False).update(deleted=True).)

The problem here is that if you are updating a lot of rows (millions+), then even fetching the pk values can be too much. Maybe an update(__signals=False) flag could be added? It is also easy to optimize away the pk fetching if there aren't any post_update listeners. A sketch of the idea:
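
To make the idea concrete, here is a hypothetical sketch of what QuerySet.update() could look like with these signals, including the no-listeners optimization. None of this exists in Django; the signal names, their arguments and the do_the_actual_update() helper are made up:

from django.dispatch import Signal

pre_update = Signal(providing_args=['queryset', 'update_kwargs'])
post_update = Signal(providing_args=['pk_set', 'update_kwargs'])

def update(self, **kwargs):
    # Hand listeners an unevaluated clone; evaluating it costs a
    # query, leaving it alone costs (almost) nothing.
    pre_update.send(sender=self.model, queryset=self._clone(),
                    update_kwargs=kwargs)
    # Fetch the PKs only if somebody is actually listening.
    pk_set = None
    if post_update.has_listeners(self.model):
        pk_set = set(self.values_list('pk', flat=True))
    rows = do_the_actual_update(self, **kwargs)  # placeholder
    post_update.send(sender=self.model, pk_set=pk_set,
                     update_kwargs=kwargs)
    return rows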

Any objections to committing the fixes for https://code.djangoproject.com/ticket/21169? Any feedback for pre/post_update idea?

 - Anssi

Loic Bistuer

Nov 16, 2013, 11:14:25 PM
to django-d...@googlegroups.com

On Nov 17, 2013, at 2:02 AM, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:

> - Reverse ForeignKey .remove() will no longer use .save() - it will use .update() - so no model save signals, and an overridden model.save() is skipped, too.

+1 on the pre/post_update signals as they can be useful for a variety of purposes.

Although as an upgrade path, anyone who cares about the save signals and the save() method can replace rel.remove(*objs) with the following snippet to replicate the old behavior:

for obj in objs:
    obj.fk = None
    obj.save()

This is similar to the recommendation we document for custom delete() methods and QuerySet.delete():

"If you’ve provided a custom delete() method on a model class and want to ensure that it is called, you will need to “manually” delete instances of that model (e.g., by iterating over a QuerySet and calling delete() on each object individually) rather than using the bulk delete() method of a QuerySet."

--
Loic

Shai Berger

Nov 20, 2013, 10:17:11 AM
to django-d...@googlegroups.com
Hi,

On Saturday 16 November 2013 21:02:00 Anssi Kääriäinen wrote:
> Any feedback for pre/post_update idea?
>
As Loic said, the signals sound like they can be useful in a variety of
situations. A couple of notes, though:

> pre_update listeners get a queryset that isn't executed.

The "query" part, only, I assume; then they should also somehow get the
intended update parameters. Perhaps the actual keyword arguments to update().

> Accessing that queryset might be costly

Accessing objects from that queryset may be a footgun -- if they somehow stay
in memory after the update is executed, they will become stale. There should
be stern warnings about this in the documentation, or even -- and this is just
thinking out loud, I haven't considered all the evil effects -- holding weakrefs
to all these objects and using the new model reload mechanism to refresh them
after the update.

Shai.

Anssi Kääriäinen

Nov 21, 2013, 6:37:22 AM
to django-d...@googlegroups.com

The current idea is to only add post_update, with pk_set and the kwargs to update() as data.

Unfortunately such a signal doesn't cover all use cases - for example, if you want to do database change tracking using signals, you can't do it with the above approach. We (as in Loic Bistuer and I) have tried a lot of different approaches to the update signals problem. It seems no matter what idea is tried, it always fails in some cases or adds a lot of API/code complexity.

Why not post_update with a queryset? If you need a queryset, it is easy to construct one inside the signal handler from the pk_set. But if you need only the pk_set, you can't get it back from the queryset in any cheap way. In addition, the queryset must be constructed with pk__in=pk_set, and such querysets can be expensive or impossible to execute (SQLite's exception on more than 1000 query parameters, for example). Finally, the pk_set calculation will be cheap on databases which implement RETURNING. The asymmetry is easy to see in a handler - see the sketch below.
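
Going from pk_set to a queryset is one line, while extracting the PKs from an unevaluated queryset has no cheap equivalent. A hypothetical post_update handler (log_update and the signal arguments are assumptions based on the proposal above):

def log_update(sender, pk_set, update_kwargs, **kwargs):
    # Rebuild a queryset from the PKs only when one is needed.
    updated = sender._default_manager.filter(pk__in=pk_set)
    print('%s rows of %s updated with %r'
          % (updated.count(), sender.__name__, update_kwargs))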

Why not pre_update? It turns out this one would be expensive to implement in a race-free way. We must execute a .select_for_update() query before sending pre_update, and after that we need to use pk__in=pk_set_from_select_for_update for the update query. Add in the pk__in=large_set problem and this doesn't feel like a good addition.
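
A sketch of what that race-free flow would have to look like (pre_update as hypothesized above; qs and kwargs are placeholders):

from django.db import transaction

with transaction.atomic():
    # Lock the target rows first, so the affected set can't change
    # between the signal and the update.
    pk_set = set(qs.select_for_update().values_list('pk', flat=True))
    pre_update.send(sender=qs.model, pk_set=pk_set, update_kwargs=kwargs)
    # Then update exactly those rows - and here the pk__in=large_set
    # problem bites again for big updates.
    qs.model._default_manager.filter(pk__in=pk_set).update(**kwargs)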

The real problem here is that no matter how we design our signals framework, it will be either leaky or horrendously expensive. Just imagine what it means to update a field in a large table if you need to fetch both pre and post images of each row. A signals framework that actually works for database auditing would need to come with a disclaimer: "Not usable in most projects due to performance reasons".

So, now the question is, is the suggested post_update signal a good addition to Django?

 - Anssi