A db.multiple_update() function


Ben Hoyt

Mar 27, 2013, 4:41:40 PM
to we...@googlegroups.com
Hi folks,

In our own utility file, I've added a function called multiple_update() that executes multiple separate update statements in one query. Note that this is not the same as a single db.update() that updates multiple rows -- with this function, the columns and vars for each row can be different.

In my testing (Windows 64-bit, db is PostgreSQL running on another machine) I've found this to be about 3-4x as fast as individual db.update()s wrapped in a transaction, so a worthwhile speed increase. The speed increase factor seems to be about constant regardless of the number of rows (initially I was hoping it would go up with N, but it didn't in my tests).
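
For reference, the baseline I'm comparing against looks roughly like this (just a sketch; "rows" stands for the same list of (values, vars) pairs that multiple_update() takes, and db is the live database object):

    with db.transaction():
        for values, vars in rows:
            db.update('foo', where='id = $id', vars=vars, **values)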

Any thoughts on the API signature, and whether this would be a good addition to web.DB? Any comments on this approach in general? Is the way I'm doing the cast to unicode and then '\n'.join(updates) the correct approach?

Couple of notes:

* If this were part of web.py, obviously it'd be a method on DB instead of a stand-alone function.
* dbconn.db_connection is our "current database object". Again, this would go away if it were a DB method.

Code below:

-----
def multiple_update(table, where, values_vars, database=None, _test=False):
    """Execute multiple separate update statements in one query.

    >>> database = web.DB(None, {})
    >>> row1 = ({'name': 'Bob'}, {'id': 42})
    >>> row2 = ({'name': 'Sarah', 'age': 30}, {'id': 23})
    >>> print multiple_update('foo', 'id = $id', [row1, row2], _test=True, database=database)
    UPDATE foo SET name = 'Bob' WHERE id = 42;
    UPDATE foo SET age = 30, name = 'Sarah' WHERE id = 23;
    >>> print multiple_update('foo', 'id = $id', [], _test=True, database=database)
    None
    """
    if not values_vars:
        return
    if database is None:
        database = dbconn.db_connection

    updates = []
    for values, vars in values_vars:
        update = u'UPDATE {table} SET {values} WHERE {where};'.format(
                table=table,
                values=web.db.sqlwhere(values, ', '),
                where=database._where(where, vars))
        updates.append(unicode(update))

    query = '\n'.join(updates)
    if _test:
        return query
    database.query(query)
-----

Thanks,
Ben.

Ben Hoyt

Mar 27, 2013, 10:23:05 PM
to we...@googlegroups.com
Okay, so I was definitely doing the unicode and joining wrong. New version below. I *think* this is correct. (It definitely works now, at least.)

-----
def multiple_update(table, where, values_vars, database=None, _test=False):
    r"""Execute multiple separate update statements in one query.

    >>> database = web.DB(None, {})
    >>> row1 = ({'name': 'Bob'}, {'id': 42})
    >>> row2 = ({'name': 'Sarah', 'age': 30}, {'id': 23})
    >>> query = multiple_update('foo', 'id = $id', [row1, row2], _test=True, database=database)
    >>> query
    <sql: "UPDATE foo SET name = 'Bob' WHERE id = 42;\nUPDATE foo SET age = 30, name = 'Sarah' WHERE id = 23">
    >>> query.query()
    'UPDATE foo SET name = %s WHERE id = %s;\nUPDATE foo SET age = %s, name = %s WHERE id = %s'
    >>> query.values()
    ['Bob', 42, 30, 'Sarah', 23]

    >>> query = multiple_update('bar', 'a = $b', [({'c': decimal.Decimal(2)}, {'b': 3})], _test=True, database=database)
    >>> query
    <sql: "UPDATE bar SET c = Decimal('2') WHERE a = 3">
    >>> query.query()
    'UPDATE bar SET c = %s WHERE a = %s'
    >>> query.values()
    [Decimal('2'), 3]

    >>> print multiple_update('foo', 'id = $id', [], _test=True, database=database)
    None

    """
    if not values_vars:
        return
    if database is None:
        database = dbconn.db_connection

    updates = []
    for values, vars in values_vars:
        updates.append(
            'UPDATE ' + table +
            ' SET ' + web.db.sqlwhere(values, ', ') +
            ' WHERE ' + database._where(where, vars)
        )

    query = web.SQLQuery.join(updates, sep=';\n')
    if _test:
        return query
    database.query(query)
-----

-Ben

Anand Chitipothu

Mar 27, 2013, 10:46:10 PM
to we...@googlegroups.com
Nice!

Why do we need a different function? Can't the same update function handle this when vars is a list instead of a dict?

Anand

Ben Hoyt

Mar 27, 2013, 10:52:45 PM
to we...@googlegroups.com, Anand Chitipothu
No, I don't think so, because values needs to be a list of dicts too, and db.update() takes its values as keyword args.
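
To make the mismatch concrete (same made-up rows as in my doctest):

    # Existing API: one row at a time, column values as keyword arguments.
    db.update('foo', where='id = $id', vars={'id': 42}, name='Bob')

    # Proposed API: a list of (values, vars) tuples, so each row can set
    # different columns.
    multiple_update('foo', 'id = $id',
                    [({'name': 'Bob'}, {'id': 42}),
                     ({'name': 'Sarah', 'age': 30}, {'id': 23})])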

Also, I guess I don't love APIs with too much overloading for substantially different functionality.

-Ben




zbouboutchi

Mar 29, 2013, 5:00:58 AM
to we...@googlegroups.com
Hi Ben,

I'm confused about the utility of this function. Since update statements can reference an external table, you can certainly do the whole update in a single request.
I've prepared a little example that reproduces yours in pure SQL.

-- let's create a similar dataset:
create temporary table test (id integer, age integer, name text);
insert into test values (1, 10, 'Rose'), (42, 45, 'Bobby'), (23, 38, 'Sarrah');

-- Here are the values I want to update; in this example, null values won't overwrite the test table.
create temporary table source (id integer, age integer, name text);
insert into source values (42, null, 'Bob'), (23, 30, 'Sarah');

-- The "Bazinga" part:
update test t
set name = coalesce(s.name, t.name), -- a small workaround to keep the original value if the source is null
    age = coalesce(s.age, t.age)     -- feel free to use any function you want; it works exactly the same way as usual
from source s
where s.id = t.id;

-- Here it is:
select * from test order by id;

1;10;Rose
23;30;Sarah
42;45;Bob

-- Bazinga --

This method is the fastest I know of, even on the largest datasets, where you might otherwise waste a lot of time and energy executing millions and millions of requests; here you update your whole dataset with a single request and take advantage of all the database's optimisations (indexes etc.). With a modern DBMS you can even check the execution plan and get useful hints about needed optimisations before you execute the nightmare.
The only limitation I know of is that you can't update more than one table at a time, but transactions let you defer constraint checks to the end, so one update per table seems legitimate.
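
For example, in PostgreSQL (and only for constraints that are declared DEFERRABLE), something along these lines lets each table be updated by a single statement inside one transaction. This is just a sketch from the web.py side; the two *_update_sql strings stand for hypothetical UPDATE ... FROM statements, one per table:

    with db.transaction():
        db.query("SET CONSTRAINTS ALL DEFERRED")
        db.query(orders_update_sql)  # hypothetical UPDATE ... FROM for the first table
        db.query(items_update_sql)   # and for the second table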

I don't think every DBMS can handle this type of update, but it's part of the SQL standard. At least the well-known ones (Postgres, MSSQL, Oracle, MySQL, etc.) have this functionality, with more or less syntax variation.
I hope this helps.

Regards,

Christophe NINUCCI


Ben Hoyt

Mar 29, 2013, 5:17:18 PM
to we...@googlegroups.com
The main utility of multiple_update is performance. In my tests it's 2-4x faster than separate updates, and this is the main reason I'm using it in our application.

And putting this logic in a function means there's a nice Pythonic API to call, rather than hand-building SQL whenever you want to do this.

In terms of the approach you've given with the temporary table, that's interesting. I haven't used the UPDATE...FROM syntax or temporary tables much at all, so I didn't think of this approach.

It's not quite as simple as the separate updates in one query. However, I'd be happy to use that approach if it's significantly faster, and if it's factored out into the multiple_update() function. The nice thing about having a function is that you can change how it works under the covers, but the API can remain the same.

Are you interested in benchmarking my multiple_update (second version) against your approach and posting results?

-Ben




zbouboutchi

Mar 30, 2013, 3:42:24 AM
to we...@googlegroups.com
Hi Ben,

Thanks for your answer. I didn't think of the Pythonic freedom; that is indeed a really good reason ;).

Indeed, my approach is pure SQL, and it's quite complicated to come up with a generic function for every case, since the SQL approach often needs a complete rewrite of the function.
Depending on the DBMS you use there are more or fewer reasons to use this method, but there is a pros-and-cons balance to the UPDATE ... FROM syntax.

The pros are that you can optimise your requests efficiently: more than a thousand times faster than a standard update approach for large datasets and complex calculations.
The cons are that you have to design your database schema with this method in mind. It's impossible to imagine a truly standard function, since you can use inner joins, GROUP BY, custom functions, and inline SQL to "forge" your source dataset.

I have no idea when it's best to use the set-based "ensemblistic" (I don't know if that word even exists... :D) method, but I think it's total overkill to create a temporary table and feed it with data you already know on the Python side when you only have two or three rows to update.
This approach gives nearly magical results with large datasets, given that you can already use the insert into the temporary table to validate data domains and constraints before you send those rows into the final table. (The approach also applies to insert and delete statements.)

The main useful case is when all the data you need is already in the database and you just have to propagate it along with some computed results.
Let's look at an interesting example:
Imagine you work on a program that creates orders and estimates, and you want to let a user update an order after a last-minute change to the estimate. This method lets you update every order that differs from its estimate, reading and writing the information directly in the DBMS instead of pulling it out to a client and sending it back, and without introducing any parameters into your query. It could even be launched by an ON UPDATE trigger, hiding the SQL part from the client entirely and running some integrity checks before validating the transaction.

We can also consider a small trick for working on small datasets without creating temporary tables, because creating a temporary table before updating two or three rows is overkill, and you almost never need to validate data on the SQL side when you send it from Python. You can work with a "fake dataset" inline:

update mytable t
set name = s.name, city = s.city
from (select 1 as id, 'toto' as name, 'London' as city
      union select 2 as id, 'Roger' as name, 'Gandelnans-lès-moulin' as city) s
where s.id = t.id

Imho, Postgres enables a cleaner syntax, but the concept is the same:

with source as (
    select 1 as id, 'toto' as name, 'London' as city
    union all select 2 as id, 'Roger' as name, 'Gandelnans-lès-moulin' as city
)
update mytable t set name = s.name, city = s.city from source s
where s.id = t.id

I think it's possible to adapt your function to use this type of fake dataset; if I find some time this weekend I might try to write something. Then it would be easy to benchmark the two and see when it's overkill and when it's worthwhile. The main problem remains the never-uniform syntax around these little-known SQL habits and the differences between DBMSes.
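
Roughly what I have in mind is something like this (completely untested, PostgreSQL-flavoured, assuming every row updates the same set of columns and is keyed by a single id column; multiple_update_from is just a placeholder name):

    import web

    def multiple_update_from(db, table, rows, key='id'):
        """rows is a list of dicts that all share the same keys, including `key`."""
        columns = sorted(rows[0].keys())
        # Build the inline "fake dataset":
        #   SELECT 42 AS id, 'Bob' AS name UNION ALL SELECT 23, 'Sarah'
        selects = []
        for i, row in enumerate(rows):
            parts = []
            for col in columns:
                part = web.db.sqlquote(row[col])
                if i == 0:
                    part = part + (' AS ' + col)  # only the first SELECT needs aliases
                parts.append(part)
            selects.append(web.SQLQuery('SELECT ') + web.SQLQuery.join(parts, ', '))
        source = web.SQLQuery.join(selects, ' UNION ALL ')
        sets = ', '.join('%s = s.%s' % (c, c) for c in columns if c != key)
        sql = (web.SQLQuery('UPDATE %s t SET %s FROM (' % (table, sets)) +
               source +
               web.SQLQuery(') s WHERE s.%s = t.%s' % (key, key)))
        return db.query(sql)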

Regards,

Christophe.

Ben Hoyt

Mar 31, 2013, 6:59:55 PM
to we...@googlegroups.com
Good comments.

Yeah, I think as soon as one wants to get very fancy with SQL joins, temporary tables, etc., it's better just to use raw SQL with db.query().
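
For example, Christophe's update could be run as-is through db.query() (assuming the test/source tables from his earlier example):

    db.query("update test t "
             "set name = coalesce(s.name, t.name), age = coalesce(s.age, t.age) "
             "from source s where s.id = t.id")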

-Ben

Ben Hoyt

Mar 31, 2013, 7:01:06 PM
to we...@googlegroups.com, Anand Chitipothu
Anand, what do you think of

1) my response here (re having a separate function)?
2) do you think this would be a good addition to web.py proper, and do you want a patch for that?

-Ben

Anand Chitipothu

Apr 2, 2013, 2:00:44 AM
to Ben Hoyt, webpy
I'm fine with a separate function, but I'm not happy with the API.

    >>> row1 = ({'name': 'Bob'}, {'id': 42})
    >>> row2 = ({'name': 'Sarah', 'age': 30}, {'id': 23})
    >>> query = multiple_update('foo', 'id = $id', [row1, row2], _test=True, database=database)

Passing a row as a tuple of two elements looks very strange. It doesn't fit any of the existing usage patterns.

Wouldn't it be nice to have the transaction take care of it automatically? If required, with a separate keyword argument to transaction.

with db.transaction(combine_updates=True):
    db.update("foo", "id=$id", vars={"id": 42}, name="Bob")
    db.update("foo", "id=$id", vars={"id": 42}, name="Sarah", age=30)

or, something like:

with db.multiple_updater() as x:
    x.update("foo", "id=$id", vars={"id": 42}, name="Bob")
    x.update("foo", "id=$id", vars={"id": 42}, name="Sarah", age=30)

Anand

Ben Hoyt

Apr 2, 2013, 3:51:46 PM
to Anand Chitipothu, webpy
Aha, I love it -- that's really good thinking.

I prefer your second option. Partly because it's probably going to mean a simpler implementation, but also because overloading transaction (a very specific database concept) seems like it'd be confusing.

So for an API, how about "with db.multiple_updater() as mu: ...", where mu supports all the db write operations: .update(), .insert(), .multiple_insert(), and .delete(). The multiple_updater() instance stores all the <sql> objects (via _test=True?) and writes them on __exit__.

If we make it support all the write operations like this, should it be called something else, or is multiple_updater() still okay? multiple_writer() feels odd.

Let me know what you think, and then I can submit a patch.

-Ben


Ben Hoyt

Apr 3, 2013, 5:14:40 PM
to Anand Chitipothu, webpy
How's this? The doctests are a bit unwieldy, but at least it's tested. :-)

And for convenience, you could add a "multiple_updater" method to DB which just returns "MultipleUpdater(self)".
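
In other words, roughly this (a hypothetical method added to web.py's DB class):

    def multiple_updater(self, _test=False):
        return MultipleUpdater(self, _test=_test)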

Thanks again for the with-statement idea.

-----
class MultipleUpdater(object):
    r"""Used to group multiple database writes into one query to minimize I/O
    to and from the database. This can result in a speed increase of 2-4x on
    bulk updates. Intended usage:

    with MultipleUpdater(db) as mu:
        for row_id, name in lots_of_ids_and_names_to_update:
            mu.update('my_table', where='id = $id', vars={'id': row_id}, name=name)

    On the MultipleUpdater instance, you can use all the database write
    functions: update(), insert(), multiple_insert(), and delete(). If any
    exception occurs inside the with statement, no updates will be done.

    >>> import decimal
    >>> db = web.DB(None, {})
    >>> db.supports_multiple_insert = True
    >>> mu = MultipleUpdater(db, _test=True)

    >>> with mu:
    ...     mu.update('foo', where='id = $id', name='Bob', vars={'id': 42})
    ...     mu.update('foo', where='id = $id', name='Sarah', age=30, vars={'id': 23})
    >>> mu.sql
    <sql: "UPDATE foo SET name = 'Bob' WHERE id = 42;\nUPDATE foo SET age = 30, name = 'Sarah' WHERE id = 23">
    >>> mu.sql.query()
    'UPDATE foo SET name = %s WHERE id = %s;\nUPDATE foo SET age = %s, name = %s WHERE id = %s'
    >>> mu.sql.values()
    ['Bob', 42, 30, 'Sarah', 23]

    >>> with mu:
    ...     mu.update('bar', where='a = $b', c=decimal.Decimal(2), vars={'b': 3})
    >>> mu.sql
    <sql: "UPDATE bar SET c = Decimal('2') WHERE a = 3">
    >>> bool(mu.sql)
    True
    >>> mu.sql.query()
    'UPDATE bar SET c = %s WHERE a = %s'
    >>> mu.sql.values()
    [Decimal('2'), 3]

    >>> with mu:
    ...     pass
    >>> mu.sql
    <sql: ''>
    >>> bool(mu.sql)
    False

    >>> with mu:
    ...     mu.update('foo', where='id = 1', name='Bob')
    ...     mu.insert('bar', age=32)
    ...     mu.multiple_insert('baz', [{'name': 'John'}, {'name': 'Sally'}])
    ...     mu.delete('qux', where='id = 2')
    >>> print str(mu.sql)
    UPDATE foo SET name = 'Bob' WHERE id = 1;
    INSERT INTO bar (age) VALUES (32);
    INSERT INTO baz (name) VALUES ('John'), ('Sally');
    DELETE FROM qux WHERE id = 2
    >>> mu.sql.query()
    'UPDATE foo SET name = %s WHERE id = 1;\nINSERT INTO bar (age) VALUES (%s);\nINSERT INTO baz (name) VALUES (%s), (%s);\nDELETE FROM qux WHERE id = 2'
    >>> mu.sql.values()
    ['Bob', 32, 'John', 'Sally']

    >>> with MultipleUpdater(db, _test=True) as mu2:
    ...     mu2.update('foo', where='id = 1', name='Bob')
    >>> mu2.sql
    <sql: "UPDATE foo SET name = 'Bob' WHERE id = 1">
    """
    def __init__(self, database, _test=False):
        self.database = database
        self.test = _test

    def __enter__(self):
        self.updates = []
        self.sql = None
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None:
            # Don't update anything if an exception occurred
            return
        self.sql = web.SQLQuery.join(self.updates, sep=';\n')
        if self.sql and not self.test:
            self.database.query(self.sql)

    def update(self, *args, **kwargs):
        self.updates.append(self.database.update(*args, _test=True, **kwargs))

    def insert(self, *args, **kwargs):
        self.updates.append(self.database.insert(*args, _test=True, **kwargs))

    def multiple_insert(self, *args, **kwargs):
        self.updates.append(self.database.multiple_insert(*args, _test=True, **kwargs))

    def delete(self, *args, **kwargs):
        self.updates.append(self.database.delete(*args, _test=True, **kwargs))
-----
