Django Transactions Performance increase

Robin Fordham

Oct 29, 2013, 10:39:52 AM
to django...@googlegroups.com
Hi,

I have been doing some reading and am looking to increase the performance of updating existing database entries from incoming data feeds.

I am finding conflicting opinions on whether wrapping updates in a transaction context manager improves performance: some sources say it does, others say it only provides data integrity across the queryset within the transaction and gives no performance improvement, and others claim the transaction management overhead actually degrades performance.

for instance:

with transaction.commit_on_success():
    for row in updatedata:
        i = Item.objects.get(id=row[0])
        i.foo = row[1]
        i.baa = row[2]
        i.save()

for row in updatedata:
    i = Item.objects.get(id=row[0])
    i.foo = row[1]
    i.baa = row[2]
    i.save()

Some clarification on this matter would be greatly appreciated, as would any pointers to improve my updating efficiency (although I know I cannot do a single filter and .update() on the queryset, as each row's update data is distinct).

Thanks.

Regards,

Robin.

Tom Evans

Oct 29, 2013, 12:05:00 PM
to django...@googlegroups.com
This is inefficient. Each row you are updating requires two queries:
one to fetch the existing row and one to update it.

for row in updatedata:
    Item.objects.filter(id=row[0]).update(foo=row[1], baa=row[2])

This variant would execute only one query per row, and each per-row
update is atomic even without transactions, since the fetch/update/store
happens entirely in the database (although the update operation as a
whole is not atomic; to do that would require a transaction).
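If you did need the whole run to be atomic, a minimal sketch of that
(using the commit_on_success manager from your example; Django 1.6+
has transaction.atomic instead) would be:

from django.db import transaction

with transaction.commit_on_success():
    for row in updatedata:
        Item.objects.filter(id=row[0]).update(foo=row[1], baa=row[2])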

As to speed, adding transactions does have some cost; test it and see.
The variant I wrote above is extremely unlikely to ever throw an
exception, however: if the database server disappears or crashes, or
the table doesn't have the columns indicated, that could cause an
exception.

Since commit_on_success only skips the commit when an exception is
raised, there is highly unlikely to ever be an exception worth rolling
back. If the DB server crashes or goes away, you are going to find it
difficult to roll back the transaction (!); if the columns aren't
there, it will fail on the first query and nothing will have changed,
so there is nothing to roll back.

If you cannot update the database without first creating the model
instance - perhaps you need to call a method on the model instance in
order to get one of the values to update, for example - then you could
still speed it up slightly by reducing the number of queries to load
model data, by loading the model instances in bulk.

There is a limit to how many objects you would want to load in one go,
normally determined by the length of the query you are allowed to send
to the database server, so you might require some logic to work out
the batches to load from your long list of ids. Eg, you probably
cannot load a million objects at a time, but you can load 10,000 at a
time no bother, so instead of a million queries to look up items, you
have 100.

The DB server still does much the same work, so you are saving
(1000000 - 100) * query_overhead.
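A rough sketch of that batching, assuming updatedata is a list and
Item is the model from the examples above, using QuerySet.in_bulk()
for the lookups (the saves still go one per row; only the lookups are
batched):

def chunks(seq, size=10000):
    # Yield successive slices of at most `size` rows from the feed.
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

for batch in chunks(updatedata):
    # One lookup query per batch instead of one per row.
    items = Item.objects.in_bulk([row[0] for row in batch])
    for row in batch:
        i = items[row[0]]
        i.foo = row[1]
        i.baa = row[2]
        i.save()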

Cheers

Tom

Robin Fordham

Oct 29, 2013, 12:29:06 PM
to django...@googlegroups.com
Thanks Tom, that clears things up a lot and gives me something to work on.

Sorry for the noob question:

> If you cannot update the database without first creating the model
> instance - perhaps you need to call a method on the model instance in
> order to get one of the values to update, for example - then you could
> still speed it up slightly by reducing the number of queries to load
> model data, by loading the model instances in bulk.

What do you mean by this? 

By the nature of the data coming in, it is pre-validated, so I already know that row[0] is a valid id, row[1] is a valid foo value, and row[2] is a valid baa value.

Thanks.

Robin.

Apostolos Bessas

Oct 29, 2013, 12:12:13 PM
to django...@googlegroups.com
Hi Robin,

As far as I can tell, using one transaction should increase the
performance of the process. The reason is that you issue just one
COMMIT for the whole process instead of one per UPDATE. As an added
benefit, it helps with data integrity.

There are two main ways I know you can improve the performance:
- Use executemany
(http://initd.org/psycopg/docs/cursor.html#cursor.executemany), which
still issues O(n) queries but does the query planning only once; see
the sketch after this list. One app that uses executemany is
django-bulk: https://github.com/KMahoney/django-bulk, with some
updated code at https://github.com/transifex/django-bulk.
- or use COPY (for PostgreSQL), which uses three queries but has
other kinds of overhead. I guess other RDBMSs provide similar
functionality.
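A minimal sketch of the executemany route through Django's connection
(the table and column names here are assumptions based on the examples
earlier in the thread):

from django.db import connection

cursor = connection.cursor()
# One parametrised UPDATE, executed once per row of the feed.
cursor.executemany(
    "UPDATE myapp_item SET foo = %s, baa = %s WHERE id = %s",
    [(row[1], row[2], row[0]) for row in updatedata],
)

Wrapping this in a single transaction also gives you the one-COMMIT
behaviour mentioned above.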

The COPY method works as follows: you create a new table (probably a
temporary one), COPY all the entries into it and then do the update in
one query. You can find an implementation of COPY for Django at
https://github.com/mpessas/django-pg-extensions/blob/master/djangopg/copy.py
(that's mine).
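As a rough Python 3 illustration of that, assuming a psycopg2 cursor
underneath and placeholder table/column names (the copy.py linked
above handles the details properly):

from io import StringIO
from django.db import connection

cursor = connection.cursor()
# 1. Stage the incoming rows in a temporary table.
cursor.execute(
    "CREATE TEMPORARY TABLE item_updates (id integer, foo text, baa text)"
)
# 2. COPY the whole feed into it in one round trip.
buf = StringIO(
    "".join("%d\t%s\t%s\n" % (row[0], row[1], row[2]) for row in updatedata)
)
cursor.copy_from(buf, "item_updates", columns=("id", "foo", "baa"))
# 3. Apply everything with a single UPDATE ... FROM.
cursor.execute(
    "UPDATE myapp_item AS i SET foo = u.foo, baa = u.baa "
    "FROM item_updates AS u WHERE i.id = u.id"
)

Those are the three queries mentioned above: the CREATE, the COPY and
the final UPDATE.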

Which one is preferable depends on the number of queries you have.

Hope this helps a bit,
Apostolis

Tom Evans

Oct 29, 2013, 12:44:04 PM
to django...@googlegroups.com
On Tue, Oct 29, 2013 at 4:29 PM, Robin Fordham <ging...@gmail.com> wrote:
> What do you mean by this?
>
> By the nature of the data coming in it is pre validated, so I know already
> that row[0] is a valid id, row[1] is a valid foo value, row[2] is a valid
> baa value.
>

In case you were simplifying the example for the list, and what you
actually needed to do was this:

for row in updatedata:
    obj = Object.objects.get(id=row[0])
    obj.foo = run_some_calculation(obj, row[1])
    obj.baa = row[2]
    obj.save()

eg, so you required the object itself. If that was the case, you would
still get a benefit from loading multiple objects at once, since you
would issue fewer queries. However, to do so would require some magic
in order to chunk updatedata into sets small enough to load all at
once, ~10,000 objects at a time. (magic omitted :)
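For what it's worth, a minimal version of that magic, assuming
updatedata is a list and reusing the hypothetical Object model and
run_some_calculation from the example above, might be:

for start in range(0, len(updatedata), 10000):
    batch = updatedata[start:start + 10000]
    # One lookup query per 10,000-row chunk instead of one per row.
    objs = Object.objects.in_bulk([row[0] for row in batch])
    for row in batch:
        obj = objs[row[0]]
        obj.foo = run_some_calculation(obj, row[1])
        obj.baa = row[2]
        obj.save()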

Cheers

Tom