#23646 Enhancement: Updating multiple values in a single sql update using Django ORM

Anshuman Aggarwal

Oct 20, 2014, 1:22:23 AM
to django-d...@googlegroups.com
I'm posting about this Django Project ticket, which I opened to track an enhancement request: updating multiple rows of a queryset with different values for the same field, in a single SQL query, without having to write raw SQL. Please see the discussion there.


I disagree with Russell that there will be no performance impact. Firing thousands of SQL queries to update single rows will lead to locking of the table, in Postgres at least, not to mention the overhead of the individual SQL queries, transactions, etc. If the database natively supports updating multiple values, then it can be done that way.


A possible SQL solution is outlined in the link (below), but it should ideally be supported via the Django manager rather than via raw SQL, which is much harder to maintain.
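
For illustration only, the kind of queryset-level entry point being asked for might look like the sketch below; the method name, its signature and the Book model are placeholders invented for illustration, not an existing or agreed Django API.

from books.models import Book   # assumed example model with a `price` field

# Hypothetical sketch only: update_values() does not exist in Django today.
Book.objects.filter(pk__in=[1, 2, 3]).update_values(
    'price', {1: 10, 2: 25, 3: 7})   # a different value per primary key
# The intent is for this to compile to one multi-row UPDATE statement
# rather than one UPDATE per row.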

If agreed, I would look to contribute a patch once we can finalize the API here.

Thanks,
Anshuman

Russell Keith-Magee

Oct 20, 2014, 1:47:00 AM
to Django Developers
Hi Anshuman,

As I mentioned when you raised this issue 2 days ago:


if you can provide evidence to back up your claim, I'll reverse my position on the need for an API entry point like the one you describe.

However, as I said in that thread, the comparison shouldn't be "Update 1000 rows in 1 statement" vs "Update 1000 rows using 1000 transactions" - it should be vs "Update 1000 rows in a *single* transaction". Using 1000 *transactions* will clearly be slower, but my gut tells me that 1000 individual SQL statements, amalgamated into a single transaction, should be just as efficient as a single large statement - or, at least, that the difference in performance will be sufficiently small as to be ignorable.

You're not going to advance your cause by starting a new thread every couple of days until someone agrees with you. I've given you a specific course of action - provide performance benchmarks. Until you provide them, the only thing reposting like this will do is annoy the people you need to help you.

Yours,
Russ Magee %-)


Anshuman Aggarwal

Oct 20, 2014, 2:04:13 AM
to django-d...@googlegroups.com
Hi Russ,
Notwithstanding the performance, the current solution of doing 1000 updates in a single transaction using the ORM appears to be (in pseudocode below):

<start manual transaction>
<Loop through all update values>
    <Issue individual django update statement>
<end manual transaction>

as opposed to:
<Issues multiple value update statement in a single line>
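
To make the first half of the pseudocode concrete, a rough Django rendering might be as follows (Book and its price field are assumed example names):

from django.db import transaction
from books.models import Book   # assumed example model with a `price` field

new_prices = {1: 10, 2: 25, 3: 7}   # pk -> new value, purely illustrative

# the "loop inside one manual transaction" half of the pseudocode:
with transaction.atomic():
    for pk, price in new_prices.items():
        Book.objects.filter(pk=pk).update(price=price)

# The second half -- one multi-value UPDATE issued as a single statement --
# has no queryset syntax today; that is what this thread proposes.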

My point is: if this is acceptable, then why have .update() in the ORM at all? I am sure that could also be done equally well by manually collating the updates in a single transaction.

The idea of having a .update() ORM construct is to be able to do this without having to fall back to a manual transaction every time; otherwise, why have a DB-level <qs>.update()? I am sure the performance of the above pseudocode would be about the same (or the difference sufficiently small as to be ignorable)...

The ORM is very much about convenience, and this additional syntax seems to provide the same without taking away much.

No intention to annoy, and thanks for looking into it, but I've only raised a ticket and started the discussion on the users/dev group, as you had suggested.

Regards,
Anshu




Javier Guerra Giraldez

Oct 20, 2014, 2:22:30 AM
to django-d...@googlegroups.com
On Mon, Oct 20, 2014 at 1:03 AM, Anshuman Aggarwal
<anshuman...@gmail.com> wrote:
> The idea of having a .update() ORM construct is to be able to do this
> without having to fall down to a manual transaction every time, otherwise
> why have a DB level <qs>.update()...I am sure the performance of above
> pseudo code would be about the same (or sufficiently small as to be
> ignorable)...


I'm sure it's far from ignorable. <qs>.update() generates an UPDATE ... WHERE ... statement, not specifying each record to be updated, just the criteria that let the DB engine choose them. Also, the new field content is either a constant or specified as a function of other fields (when using F() objects); again, it is not specified individually, but left to the DB engine to work out.

In sum, it's a _lot_ less data to prepare, send, and interpret, and it uses a heavily optimized SQL statement. Not having it would be a big missing feature.
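
For instance (model and field names assumed), a criteria-based update with an F() expression sends no per-row data at all:

from django.db.models import F
from books.models import Book   # assumed example model

Book.objects.filter(published_year__lt=2000).update(price=F('price') * 2)
# renders to roughly one statement such as:
#   UPDATE "books_book" SET "price" = "price" * 2 WHERE "published_year" < 2000
# The database both selects the rows and computes the new values itself.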

--
Javier

Anshuman Aggarwal

Oct 20, 2014, 2:34:53 AM
to django-d...@googlegroups.com
Thanks for the input, Javier. Wouldn't a similar argument hold for:

UPDATE books_book
SET price = 10 WHERE pk = 1, price = 25 WHERE pk = 2 .....

Such a single SQL statement would also have similar benefits: much less data to prepare, send and interpret, compared with multiple update queries collated into a single transaction.
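
PostgreSQL will not actually accept the per-row WHERE form above, but the same effect can be written (on PostgreSQL at least) with UPDATE ... FROM (VALUES ...); a minimal raw-SQL sketch, with the table and column names assumed from the example:

from django.db import connection

rows = [(1, 10), (2, 25)]                        # (id, new price) pairs
values_sql = ", ".join(["(%s, %s)"] * len(rows))
params = [value for row in rows for value in row]

cursor = connection.cursor()
cursor.execute(
    "UPDATE books_book AS b SET price = v.price "
    "FROM (VALUES %s) AS v (id, price) "
    "WHERE b.id = v.id" % values_sql,
    params)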



Alex Gaynor

Oct 20, 2014, 2:36:59 AM
to django-d...@googlegroups.com
No, it doesn't. The atomicity semantics of update() are impossible to implement without it; this isn't the case for a multi-object UPDATE, as far as I can tell.

Alex


--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero
GPG Key fingerprint: 125F 5C67 DFE9 4084

Anshuman Aggarwal

Oct 20, 2014, 2:51:31 AM
to django-d...@googlegroups.com
Wouldn't a manual transaction (suitably rolled back after a bunch of changes) retain atomicity as well?

The proposed solution, using the UPDATE ... FROM from_list ... syntax described in the Postgres documentation link (below), does not seem to be in any way inferior to a single UPDATE XYZ SET A = 1, so it is a valid candidate for the ORM to expose to applications that require it.


Russell Keith-Magee

Oct 20, 2014, 3:44:28 AM
to Django Developers
Hi Anshuman,

On Mon, Oct 20, 2014 at 2:03 PM, Anshuman Aggarwal <anshuman...@gmail.com> wrote:
Hi Russ,
Notwithstanding the performance, the current solution of doing 1000 updates in a single transaction using the ORM appears to be (in pseudocode below):

<start manual transaction>
<Loop through all update values>
    <Issue individual django update statement>
<end manual transaction>

as opposed to:
<Issues multiple value update statement in a single line>

My point is: if this is acceptable, then why have .update() in the ORM at all? I am sure that could also be done equally well by manually collating the updates in a single transaction.

The idea of having a .update() ORM construct is to be able to do this without having to fall back to a manual transaction every time; otherwise, why have a DB-level <qs>.update()? I am sure the performance of the above pseudocode would be about the same (or the difference sufficiently small as to be ignorable)...
 
Because:

1) a single UPDATE statement isn't the same as "get and save". Get and save requires *at least* 2 SQL queries, *and* a transaction wrapping them.

2) Even if this wasn't true, update is a sufficiently common task that it warrants first-class API representation. I'd argue that bulk update (of the kind you describe) isn't.

3) There are certain classes of update (e.g., "increment current value by one") that benefit from being expressed as a single statement.
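
As an example of (3), with an assumed counter field, the whole operation is one statement and needs no read-modify-write round trip:

from django.db.models import F
from books.models import Book   # assumed example model with a `times_sold` field

Book.objects.filter(pk=1).update(times_sold=F('times_sold') + 1)
# A single UPDATE; fetching the row, adding one in Python and saving it back
# would take two queries and could race with concurrent writers.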

The ORM is very much about convenience, and this additional syntax seems to provide the same without taking away much.
 
But it *does* take away. Complexity is a cost. The sort of API you've described will either make the API for update() more complex, or it will require a *new* API entry point (e.g., bulk_update(), or similar). And it adds new complexity to the implementation, since there's now a whole new update statement implementation path that needs to be maintained.

At some point, we have to make a decision over what we include, and what we exclude. I'm putting it to you that unless there's a *significant* performance benefit associated with bulk updates, then it's not worth the complexity to add this new API entry point. As for the "need" for this API - well... Django has been around for almost 10 years; at this point, if there was a pressing need for bulk updates, we would be aware of it.

No intention to annoy, and thanks for looking into it, but I've only raised a ticket and started the discussion on the users/dev group, as you had suggested.

My point was that you've started this conversation 3 times now. 

The first time was a misunderstanding about the right mailing list. I pointed you in the right direction, and you followed that direction. You did the right thing.

The second time you posted the right information in the right mailing list, and I responded with a request for a specific course of action. So far, you haven't provided any indication that you're planning to follow up on that course of action.

The third time was this thread - a repeat of *exactly* the same information as in the second post, without *any* of the information that was requested of you in the second thread. *That* is the source of my "annoyance".
 
Yours,
Russ Magee %-)

Anshuman Aggarwal

Oct 20, 2014, 4:11:11 AM
to django-d...@googlegroups.com
Hi Russ,
I will get back with benchmarks showing the performance benefits (if any).

I do appreciate the complexity argument. We have been missing this API for a few years now, but have been getting by, ignoring the individual timestamps of the updates, and have been too lazy to follow up on its feasibility. Now that we do 'require' it, we're making the effort. It's quite possible there are others who would also benefit once it is implemented.

Thanks,
Anshuman


Anssi Kääriäinen

Oct 20, 2014, 4:11:25 AM
to django-d...@googlegroups.com
I have long thought that a bulk_save(objs) QuerySet method would be a
useful addition to Django. The method would be defined as:

Save the objects to the database using the fastest available method.
The objs can contain both objects that already exist in the
database and thus need to be updated, and objects that need to
be inserted into the database. The method is allowed to fail due
to concurrent modifications to the database.

The basic implementation would be:
1. Check which objects have a primary key set
2. For those objects which have a primary key set, check which ones
exist in the database
3. Update the objects which exist in the database in bulk (this step
would be similar to the proposal in this thread)
4. Insert the rest of the objects in bulk
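
A very rough sketch of steps 1-4 (bulk_save() is not an existing Django API, and step 3 is shown with per-object saves where the real implementation would use a bulk UPDATE):

from django.db import transaction

def bulk_save(model, objs):
    with transaction.atomic():
        # 1. objects that already have a primary key set
        with_pk = [obj for obj in objs if obj.pk is not None]
        # 2. which of those primary keys actually exist in the database
        existing = set(model.objects.filter(
            pk__in=[obj.pk for obj in with_pk]).values_list('pk', flat=True))
        # 3. update the existing rows (ideally a single bulk UPDATE, per this thread)
        for obj in with_pk:
            if obj.pk in existing:
                obj.save(force_update=True)
        # 4. insert everything else in one bulk INSERT
        to_insert = [obj for obj in objs if obj.pk is None or obj.pk not in existing]
        model.objects.bulk_create(to_insert)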

On backends that support merge or upsert, Django could instead just do a
single statement upsert/merge to the table.

Use cases would be fixture loading, synchronizing data from external
data sources and other bulk loading cases.

If this method were available there would be no need for the multi-value
update statement.

I did a quick benchmark comparing a multi-row update of 1000 rows to
single-row updates. There is almost an order of magnitude difference in
speed: around 17-18 milliseconds compared to 115 milliseconds. The test
script was:

from datetime import datetime
from django.db import transaction
from django.db import connection

cursor = connection.cursor()
cursor.execute("create table foo1 (id integer primary key, val1 text, val2 date)")
with transaction.atomic():
    for i in range(0, 10000):
        cursor.execute("insert into foo1 values(%s, %s, now())", (i, str(i)))

for j in range(0, 10):
    cursor.execute("vacuum full foo1")
    with transaction.atomic():
        start = datetime.now()
        values_str = ','.join('(%s, %s)' for i in range(0, 1000))
        params = []
        for i in range(0, 1000):
            params.extend([str(i + i), i])
        cursor.execute("update foo1 set val1 = vals.iplusi "
                       "from (values %s) as vals(iplusi, i) where id = vals.i" % values_str,
                       params)
    print("update %s: %s" % (j, datetime.now() - start))

for j in range(0, 10):
    cursor.execute("vacuum full foo1")
    with transaction.atomic():
        start = datetime.now()
        for i in range(0, 1000):
            cursor.execute("update foo1 set val1 = %s, val2 = now() where id = %s", (str(i + i), i))
    print("update %s: %s" % (j, datetime.now() - start))

The results were:
[Multi-update]
update 0: 0:00:00.022665
update 1: 0:00:00.018951
update 2: 0:00:00.016230
update 3: 0:00:00.017642
update 4: 0:00:00.015061
update 5: 0:00:00.016559
update 6: 0:00:00.018298
update 7: 0:00:00.017187
update 8: 0:00:00.014943
update 9: 0:00:00.014960
[One row per query update]
update 0: 0:00:00.114907
update 1: 0:00:00.115965
update 2: 0:00:00.109886
update 3: 0:00:00.118936
update 4: 0:00:00.114295
update 5: 0:00:00.119926
update 6: 0:00:00.117856
update 7: 0:00:00.115348
update 8: 0:00:00.115445
update 9: 0:00:00.122262

The more complex the table, the less difference I expect there to be, as
the overhead of single-row update processing becomes small compared to all
the other work required by the query. On the other hand, if run against a
database on a remote host (as opposed to the localhost PostgreSQL used in
this test), the network overhead per query would be larger, and thus the
results could look a lot worse for single-row updates. Using a localhost
TCP connection instead of a socket connection changes the single-row
update execution time to 200 ms, while the multi-row update case's
execution time doesn't change noticeably.

- Anssi