Adding signals to bulk update/create operations

3,065 views
Skip to first unread message

Todor Velichkov

unread,
Mar 30, 2017, 4:28:00 PM3/30/17
to Django developers (Contributions to Django itself)
Consider the following piece of code:

@receiver(pre_save, sender=MyModel)
def my_handler(sender, **kwargs):
    instance
= kwargs['instance']
   
if instance.verified:
        do_something
(instance)
   
else:
        do_else
(instance)



Its good, because it keeps `MyModel` decoupled from `do_something`

But there is one flaw. If we do:

MyModel.objects.filter(verified=False).update(verified=True)

we are screwed, `do_something` is not executed, and our models get out-of-sync.

If we try to get smart and manually fire the pre_save signal for each instance, we are gonna have a hard time.
Its gonna be slow.
And its gonna be memory inefficient.

We already experienced it in our app.

So our new approach is like this:

    pre_bulk_update = Signal(providing_args=["queryset", "update_kwargs"])
    post_bulk_update
= Signal(providing_args=["update_kwargs",])

    pre_bulk_create
= Signal(providing_args=["objs", "batch_size"])
    post_bulk_create
= Signal(providing_args=["objs", "batch_size"])


   
class MyModelQuerySet(models.QuerySet):
       
def update(self, **kwargs):
            pre_bulk_update
.send(sender=self.model, queryset=self, update_kwargs=kwargs)
            res
= super(MyModelQuerySet, self).update(**kwargs)
           
# The queryset will be altered after the update call
           
# so no reason to send it.
            post_bulk_update
.send(sender=self.model, update_kwargs=kwargs)
           
return res

       
def bulk_create(self, objs, batch_size=None):
            pre_bulk_create
.send(sender=self.model, objs=objs, batch_size=batch_size)
            res
= super(MyModelQuerySet, self).bulk_create(objs, batch_size)
            post_bulk_create
.send(sender=self.model, objs=objs, batch_size=batch_size)
           
return res


   
class MyModel(models.Model):
       
#...
        objects
= MyModelQuerySet.as_manager()



This gives us a nice interface to handle all kind of changes regarding `MyModel`
Our example usage looks like this:


@receiver(pre_save, sender=MyModel)
def my_handler(sender, **kwargs):
    instance
= kwargs['instance']
   
if instance.verified:
        do_something
(instance)
   
else:
        do_else
(instance)


@receiver(pre_bulk_update, sender=MyModel)
def my_bulk_update_handler(sender, **kwargs):
    update_kwargs
= kwargs['update_kwargs']
   
if 'verified' not in update_kwargs:
       
# no change im interested in
       
# no need to take any action
       
return

    queryset
= kwargs['queryset']
    pks_to_be_updated
= queryset.values_list('pk', flat=True)
   
if update_kwargs['verified']:
        do_something_bulk_update_implementation
(pks_to_be_updated)
   
else:
        bulk_update_do_else_implementation
(pks_to_be_updated)


@receiver(pre_bulk_create, sender=MyModel)
def my_bulk_create_handler(sender, **kwargs):
    objs
= kwargs['objs']
    group_1
= []
    group_2
= []
   
for obj in objs:
       
if obj.verified:
            group_1
.append(obj)
       
else:
            group_2
.append(obj)

   
if group_1:
        do_something_bulk_create_implementation
(group_1)
   
if group_2:
        bulk_create_do_else_implementation
(group_2)



I think this turns out to be a very clean approach.
It help us use the most optimal strategy to handle the change.
So I'm sharing this with the community to check your feedback.
I believe if this gets into the Django Internals, it can be a very powerful tool.
It will lose power as a 3rd party app.

Tim Graham

unread,
Mar 30, 2017, 8:51:34 PM3/30/17
to Django developers (Contributions to Django itself)
There's an accepted ticket about adding pre_update and post_update signals: https://code.djangoproject.com/ticket/21461. From a quick glance, I think this is what you're proposing.

Anssi Kääriäinen

unread,
Mar 31, 2017, 1:01:45 AM3/31/17
to Django developers (Contributions to Django itself)
The problem with passing the queryset is that it's possible that some object is added to or removed from the queryset between the pre_update and actual update execution. To avoid this the execution should go somewhere along the lines of:
   1) if there is pre_update or post_update do stages 2-5, if not, update as with current code
   2) fetch primary keys for models to be updated to a set, with .select_for_update() applied
   3) fire pre_update, give the primary key set as argument
   4) do the update against a queryset with .filter(pk__in=pk_set)
   5) fire post_update with primary key set as argument

This way the pre and post update signals will execute against a fixed set of instances. The bad part is that this can significantly slow down the .update() call, but to get actually safe signals, I don't see a way around that.

 - Anssi

Todor Velichkov

unread,
Mar 31, 2017, 3:50:54 AM3/31/17
to Django developers (Contributions to Django itself)
@Tim, sorry about that, I did a search before I posted, but it looks like it slipped away somehow.

@Anssi, thank you for your response.
I really haven't think about threat safety, but maybe its because in our specific usage it's close to impossible to happen.

What do you think about this:
1) Put everything into a transaction
2) Before update, count the queryset.
3) Fire the pre_update with the queryset, and the count.
4) Do the actual update -> get the updated rows
5) Assert the updated rows is equal to the queryset.count().
6) Revert the transaction if there is a difference.

No pk fetching, this is left to the be implemented by the listener if he needs it.

Anssi Kääriäinen

unread,
Apr 6, 2017, 1:53:55 AM4/6/17
to Django developers (Contributions to Django itself)
On Friday, March 31, 2017 at 10:50:54 AM UTC+3, Todor Velichkov wrote:
@Anssi, thank you for your response.
I really haven't think about threat safety, but maybe its because in our specific usage it's close to impossible to happen.

What do you think about this:
1) Put everything into a transaction
2) Before update, count the queryset.
3) Fire the pre_update with the queryset, and the count.
4) Do the actual update -> get the updated rows
5) Assert the updated rows is equal to the queryset.count().
6) Revert the transaction if there is a difference.

No pk fetching, this is left to the be implemented by the listener if he needs it.

I'm afraid this won't work well enough. It does work for many use cases, but in cases where concurrency is an issue, you'd get some loud failures, and worse, you could get cases where things work silently wrong (one item removed, one added concurrently). These cases would be near-impossible to debug.

The reason why we don't have update signals is that it's hard to find a solution that allows users to run actions when an instance is updated, works correctly in concurrent cases and with large amount of objects and doesn't turn bulk update in to a loop of save() calls.

Maybe the best approach is to look for a solution where users can customise update behaviour for remote models if they so wish. Then you could write your own update signals solution, one that doesn't need to care about all of the above constraints.

 - Anssi

gregship...@gmail.com

unread,
Nov 5, 2017, 12:54:48 PM11/5/17
to Django developers (Contributions to Django itself)
@Anssi, Perhaps we could use your approach above, except using rowids instead of primary keys? Many SQL flavors support retrieving the rowids of the created/updated rows within the same query as a bulk create/update (e.g. OUTPUT or RETURNING).

Greg
Reply all
Reply to author
Forward
0 new messages