Batching INSERT statements


vin...@benchling.com

Oct 9, 2017, 4:15:13 AM
to sqlalchemy
Hello! I've spent some time looking at SQLAlchemy's ability to batch inserts, and have a few questions about bulk_save_objects (and flushing in general).

Two statements that I think are true:
  1. Right now, bulk_save_objects does not fetch primary keys for inserted rows (unless return_defaults is passed in, which undercuts the performance gains by performing one insert statement per model).
  2. When running db.session.flush(), it looks like SQLAlchemy can batch model creation into a single multi-value insert statement *only if all of the primary keys are already defined on the model*. (Verified by looking at sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements)

Questions:
  1. Would it be possible for bulk_save_objects to fetch primary keys while still batching INSERTs into a single query?
  2. Or even better, would it be possible to batch INSERT statements during a flush even when models don't have their primary keys pre-defined? (This would help with e.g. inserting a bunch of rows into a table with an auto-incrementing primary key).
At least in Postgres (haven't verified for other databases), it looks like one can use RETURNING for inserts with multi-value VALUES clauses. At the surface, it seems like one could extend the current behavior for a single row INSERT:

INSERT INTO table ... VALUES (1) RETURNING id

to multi-value INSERTS:

INSERT INTO table ... VALUES (1), (2), ... RETURNING id

and get all the benefits of the ORM while still batching inserts.
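
For illustration, it seems the multi-row form is already expressible at the Core level, at least against Postgres; the table and column names below are just placeholders, and this is an untested sketch:

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

metadata = MetaData()
widgets = Table(
    "widgets", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

engine = create_engine("postgresql://user:pass@localhost/mydb")
with engine.begin() as conn:
    # one multi-valued INSERT, with RETURNING to collect the generated ids
    stmt = (
        widgets.insert()
        .values([{"name": "a"}, {"name": "b"}, {"name": "c"}])
        .returning(widgets.c.id)
    )
    new_ids = [row[0] for row in conn.execute(stmt)]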

I'm sure this is something that was thought about already, so I'd mostly love to hear any thoughts about what makes this hard. We have a lot of complex relationships and joined-table inheritance, so working with bulk_save_objects has been a bit challenging. From the comments in sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like SQLAlchemy is already aware that RETURNING for insert statements with multi-values VALUES clauses is supported, so it's possible there is a reason that this is hard or can't be done.

This is also something that, if it is just a “missing feature”, I would love to learn more and see if I could contribute.

Mike Bayer

Oct 9, 2017, 11:44:51 AM
to sqlal...@googlegroups.com
On Mon, Oct 9, 2017 at 4:15 AM, <vin...@benchling.com> wrote:
> Hello! I've spent some time looking at SQLAlchemy's ability to batch
> inserts, and have a few questions about bulk_save_objects (and flushing in
> general).
>
> Two statements that I think are true:
>
> Right now, bulk_save_objects does not fetch primary keys for inserted rows
> (unless return_defaults is passed in, which undercuts the performance gains
> by performing one insert statement per model).

The point of bulk save is that the objects passed to it are considered
throwaway after the operation, to avoid the overhead of maintaining the
objects' persistence state in memory. Using it with return_defaults is
nearly pointless, as this blows away the efficiency gains you might get
from the bulk operation; the option is there mostly to allow the
joined-table inheritance use case to be usable.


> When running db.session.flush(), it looks like SQLAlchemy can batch model
> creation into a single multi-value insert statement *only if all of the
> primary keys are already defined on the model*. (Verified by looking at
> sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements)
>
>
> Questions:
>
> Would it be possible for bulk_save_objects to fetch primary keys while still
> batching INSERTs into a single query?

What do you mean by "single query"? Do you mean executemany()? The
answer is no, because DBAPI drivers don't return result sets with
executemany(). Do you mean a single INSERT..VALUES with all the
parameters in one statement? The answer is also no, because SQLAlchemy
doesn't include multi-values as a transparent option; the DBAPI drivers
instead make use of the multi-values syntax within their executemany()
implementations, where, again, they don't return result sets.

Right now, only the MySQL drivers do this by default; the psycopg2
driver does it with a recently added option that SQLAlchemy does not
directly support, however you can set it up via a connection event. For
psycopg2 this can vastly speed up inserts, as psycopg2's normal
executemany() implementation has some performance issues.
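
For example, one possible (untested) way to wire something like that up is
the dialect-level do_executemany hook combined with
psycopg2.extras.execute_batch; the connection URL here is a placeholder:

import psycopg2.extras
from sqlalchemy import create_engine, event

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

@event.listens_for(engine, "do_executemany")
def _use_execute_batch(cursor, statement, parameters, context):
    # hand the whole parameter list to psycopg2's batching helper rather
    # than the default cursor.executemany()
    psycopg2.extras.execute_batch(cursor, statement, parameters)
    return True  # True = "already executed, skip the normal executemany()"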


> (This
> would help with e.g. inserting a bunch of rows into a table with an
> auto-incrementing primary key).

if you're using Postgresql, there's a vastly easier technique to use
which is just to pre-fetch from the sequence:

identities = [
    val for val, in session.execute(
        "select nextval('mytable_seq') from "
        "generate_series(1,%s)" % len(my_objects)
    )
]
for ident, obj in zip(identities, my_objects):
    obj.pk = ident


Now you don't need to do RETURNING or anything and the inserts can be
at their top efficiency.



>
> At least in Postgres (haven't verified for other databases), it looks like
> one can use RETURNING for inserts with multi-value VALUES clauses. At the
> surface, it seems like one could extend the current behavior for a single
> row INSERT:
>
> INSERT INTO table ... VALUES (1) RETURNING id
>
> to multi-value INSERTS:
>
> INSERT INTO table ... VALUES (1), (2), ... RETURNING id
>
> and get all the benefits of the ORM while still batching inserts.


This would be an enormous undertaking to implement, test, and release.
It would have to exist as an entirely additional series of codepaths
within persistence.py, since multi-valued INSERT is not available on
most databases, nor on earlier versions of the databases that do
support it. For all of that complexity, the approach would work on
exactly one backend: Postgresql. SQLite and MySQL don't support
RETURNING, and Oracle, SQL Server and others don't support
INSERT..VALUES with multiple sets.

If you are using Postgresql, you can instead pre-fetch the sequence up
front. This can even be integrated into a before_flush() event.
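
A rough, untested sketch of that before_flush() integration, where MyModel,
its "pk" attribute and the "mytable_seq" sequence are placeholders for your
own model and sequence:

from sqlalchemy import event
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def prefetch_identities(session, flush_context, instances):
    # pre-assign primary keys to pending MyModel objects so the flush can
    # batch the INSERTs without needing RETURNING
    pending = [obj for obj in session.new
               if isinstance(obj, MyModel) and obj.pk is None]
    if not pending:
        return
    identities = [
        val for val, in session.execute(
            "select nextval('mytable_seq') from "
            "generate_series(1,%s)" % len(pending)
        )
    ]
    for ident, obj in zip(identities, pending):
        obj.pk = ident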

So the complexity of adding multi-values insert with sequences would
benefit an extremely narrow set of use cases, would be very labor
intensive to implement and maintain, and is unnecessary for the single
target platform in which this case would work.


>
> I'm sure this is something that was thought about already, so I'd mostly
> love to hear any thoughts about what makes this hard. We have a lot of
> complex relationships and joined-table inheritance, so working with
> bulk_save_objects has been a bit challenging. From the comments in
> sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like
> SQLAlchemy is already aware that RETURNING for insert statements with
> multi-values VALUES clauses is supported, so it's possible there is a reason
> that this is hard or can't be done.
>
> This is also something that, if it is just a “missing feature”, I would love to
> learn more and see if I could contribute.

vin...@benchling.com

Oct 10, 2017, 1:27:12 AM
to sqlalchemy
> if you're using Postgresql, there's a vastly easier technique to use
> which is just to pre-fetch from the sequence:
>
> identities = [
>     val for val, in session.execute(
>         "select nextval('mytable_seq') from "
>         "generate_series(1,%s)" % len(my_objects)
>     )
> ]
> for ident, obj in zip(identities, my_objects):
>     obj.pk = ident

Wow, that's a great idea! I got it working for most of our models. I have some questions about how inserts for joined-table inheritance tables are batched together, but I'll ask them in a separate post since they're somewhat unrelated to this.

> So the complexity of adding multi-values insert with sequences would
> benefit an extremely narrow set of use cases, would be very labor
> intensive to implement and maintain, and is unnecessary for the single
> target platform in which this case would work.

That makes sense, thanks for the explanation!

 

Vineet Gopal

Feb 19, 2020, 9:28:42 PM
to sqlal...@googlegroups.com
Hi Mike,

Thanks for all of your help getting this working again. We've used this solution in production for two years now, and it's helped our performance significantly.

We just open-sourced the solution that we built so others can use it, and are also writing a blog post to cover some of the details. I'm attaching a copy of the blog post here. Obviously not expected, but if you are interested in taking a look, we are happy to incorporate any comments that you may have before publishing.

Here's a link to the repo as well: https://github.com/benchling/sqlalchemy_batch_inserts

Best,
Vineet

Attachment: BatchInsertsBlogPost.pdf

Mike Bayer

Feb 21, 2020, 9:49:54 AM
to noreply-spamdigest via sqlalchemy
Hi Vineet -

Glad that worked! I'll have to find some time to recall what we worked out here and how it came out for you. I wonder where on the site this kind of thing could be mentioned; we have 3rd party dialects listed out in the docs but not yet a place for extensions.



Cristian Bulgaru

Feb 12, 2021, 2:06:55 PM
to sqlalchemy
Hi Vineet, Mike,


A few questions:

1. Do we need to get the incremented IDs from Postgresql itself, or can we just fetch the current max ID on a table and increment IDs in Python without querying the DB for the incremented values?

2. I was intrigued by the following phrase:

> * P.S. execute_values in psycopg2 v2.8 supports getting returned values back, so it’s possible that SQLAlchemy may support batch inserting these models (with an auto-incrementing primary key) in the future.

@Mike @Vineet, do you know if this is the case, i.e. if bulk insert now works out of the box (without pre-assigning incremented IDs)?

3. Does this imply any change in the case of bulk inserts of multiple models with foreign keys referring to each other? This answer seems to suggest pre-assigning IDs for it to work.

Jonathan Vanasco

Feb 12, 2021, 3:22:38 PM
to sqlalchemy
I'm not familiar with this exactly, but have a bit of experience in this area.

I just took a look at this module (nice work!).  It's VERY well documented in the docstrings (even nicer work!)

I think the core bit of this technique looks to be in `_get_next_sequence_values` -  
https://github.com/benchling/sqlalchemy_batch_inserts/blob/master/sqlalchemy_batch_inserts/__init__.py#L51-L83

Vineet is obtaining the ids by running the SQL generate_series function over the nextval function.

When I've done large batches and migrations like this, I've used a somewhat dirty cheat/trick.  Assuming a window of 1000 inserts, I would just increment the serial by 1000 and use "new number - 1000" as the range for IDs.  That is somewhat closer to the "max id" concept.  Vineet's approach is better.
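
Roughly, that trick might be written like this (the sequence name and window
size are illustrative, it assumes an existing session, and it's only safe
when nothing else is drawing from the sequence concurrently):

from sqlalchemy import text

WINDOW = 1000
# consume one value, then jump the sequence ahead so this worker owns the
# whole block of ids
end = session.execute(
    text("select setval('mytable_seq', nextval('mytable_seq') + :jump)"),
    {"jump": WINDOW - 1},
).scalar()
reserved_ids = range(end - WINDOW + 1, end + 1)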

In terms of dealing with multiple foreign key constraints, pre-assigning IDs may or may not work depending on how your database constraints exist.

As a habit, I always create (or re-create) Postgres foreign key checks as deferrable. When dealing with batches, I (i) defer all the involved constraints [which can be on other tables!], (ii) process the batch, and (iii) set constraints back to immediate. If the migration is LARGE, I'll drop all the indexes on the tables too, and possibly drop the constraints as well and run multiple workers. This gets around the overhead of every insert populating rows+indexes, and the FKEY integrity checks on every row.
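
For concreteness, the defer / re-enable step might look roughly like this
(the constraint name and connection URL are placeholders, and the foreign key
must have been created DEFERRABLE for SET CONSTRAINTS to apply):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

with engine.begin() as conn:
    # postpone FK integrity checks within this transaction
    conn.execute(text("SET CONSTRAINTS child_parent_id_fkey DEFERRED"))

    # ... run the batch of inserts on this same connection/transaction ...

    # switch back to immediate checking; accumulated violations raise here
    conn.execute(text("SET CONSTRAINTS child_parent_id_fkey IMMEDIATE"))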




Mike Bayer

Feb 12, 2021, 5:16:36 PM
to noreply-spamdigest via sqlalchemy


On Fri, Feb 12, 2021, at 2:06 PM, Cristian Bulgaru wrote:
> Hi Vineet, Mike,
>
> A few questions:
>
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we
> just fetch the current max ID on a table and increment IDs in Python without
> querying the DB for the incremented values?
>
> 2. I was intrigued by the following phrase:
>
> > * P.S. execute_values in psycopg2 v2.8 supports getting returned values
> > back, so it’s possible that SQLAlchemy may support batch inserting these
> > models (with an auto-incrementing primary key) in the future.

SQLAlchemy now supports this in version 1.4, including that it's used by the ORM:



basically it's now the most efficient INSERT + autoincrement PK we can get with the psycopg2 driver.
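
As a rough illustration (MyModel is a placeholder mapped class with a serial
primary key):

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

with Session(engine) as session:
    objs = [MyModel(name="row %d" % i) for i in range(10000)]
    session.add_all(objs)
    # the flush batches the INSERTs through psycopg2's execute_values() and
    # still brings back the generated primary keys
    session.flush()
    print(objs[0].id)
    session.commit()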






