On Mon, Oct 9, 2017 at 4:15 AM, <vin...@benchling.com> wrote:
> Hello! I've spent some time looking at SQLAlchemy's ability to batch
> inserts, and have a few questions about bulk_save_objects (and flushing in
> general).
>
> Two statements that I think are true:
>
> Right now, bulk_save_objects does not fetch primary keys for inserted rows
> (unless return_defaults is passed in, which mitigates the performance gains
> by performing one insert statement per model).
The point of bulk save is that the objects passed to it are considered
throwaway after the operation, to avoid the overhead of maintaining the
objects' persistence state in memory.  Using it with return_defaults is
nearly pointless, as that blows away the efficiency gains you might get
from the bulk operation; the flag is there mostly to make the
joined-table inheritance use case usable.
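For illustration, here's a minimal sketch of that trade-off (assumes
SQLAlchemy 1.4+; the Widget model and names are made up), using an
in-memory SQLite database:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Widget(Base):
    __tablename__ = "widget"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)

# Without return_defaults, the bulk path skips fetching generated PKs;
# the objects are "throwaway" after the call:
fast = [Widget(name="a"), Widget(name="b")]
session.bulk_save_objects(fast)
print(fast[0].id)  # None

# With return_defaults=True the PKs come back, but at the cost of
# emitting one INSERT per row rather than a batched executemany():
slow = [Widget(name="c"), Widget(name="d")]
session.bulk_save_objects(slow, return_defaults=True)
print(slow[0].id is not None)  # True
session.commit()
```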
> When running db.session.flush(), it looks like SQLAlchemy can batch model
> creation into a single multi-value insert statement *only if all of the
> primary keys are already defined on the model*. (Verified by looking at
> sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements)
>
>
> Questions:
>
> Would it be possible for bulk_save_objects to fetch primary keys while still
> batching INSERTs into a single query?
What do you mean by "single query"?  Do you mean executemany()?  The
answer is no, because DBAPI drivers don't return result sets with
executemany().  Do you mean a single INSERT..VALUES with all the
parameters in one statement?  The answer is also no, because SQLAlchemy
doesn't include multi-values as a transparent option; the DBAPI drivers
instead make use of the multi-values syntax within their executemany()
implementations, where again they don't return result sets.
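The distinction between the two shapes shows up even at the stdlib
DBAPI level.  A quick sqlite3 sketch (table and names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

# executemany(): one statement template, many parameter sets.  No
# result rows come back from this call, which is why generated keys
# can't be fetched through it.
conn.executemany("INSERT INTO t (name) VALUES (?)", [("a",), ("b",)])

# A single multi-VALUES INSERT is a different statement shape entirely,
# with every parameter set spelled out in the one statement:
conn.execute("INSERT INTO t (name) VALUES (?), (?)", ("c", "d"))

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 4
```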
Right now, only the MySQL drivers do this by default.  The psycopg2
driver does it with a recently added option that SQLAlchemy does not
directly support, but which you can set up via a connection event.  For
psycopg2 this can vastly speed up inserts, as psycopg2's normal
executemany() implementation has some performance issues.
> (This would help with e.g. inserting a bunch of rows into a table with
> an auto-incrementing primary key).
if you're using Postgresql, there's a vastly easier technique to use
which is just to pre-fetch from the sequence:
    identities = [
        val for val, in session.execute(
            "select nextval('mytable_seq') "
            "from generate_series(1, %s)" % len(my_objects)
        )
    ]
    for ident, obj in zip(identities, my_objects):
        obj.pk = ident
Now you don't need RETURNING or anything, and the inserts can run at
top efficiency.
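The same shape can be sketched on any backend by assigning identities
up front; here's a hypothetical SQLite version where the identities
come from a plain range standing in for the Postgresql sequence (model
and names are made up, assumes SQLAlchemy 1.4+):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)

objs = [Item(name=n) for n in ("a", "b", "c")]

# Pre-assign identities, as the nextval() technique above would do on
# Postgresql:
for ident, obj in zip(range(1, len(objs) + 1), objs):
    obj.id = ident

session.add_all(objs)
# With every PK known up front, the unit of work can batch these
# INSERTs together rather than round-tripping per row for the PK:
session.flush()
print([o.id for o in objs])  # [1, 2, 3]
session.commit()
```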
>
> At least in Postgres (haven't verified for other databases), it looks like
> one can use RETURNING for inserts with multi-value VALUES clauses. At the
> surface, it seems like one could extend the current behavior for a single
> row INSERT:
>
> INSERT INTO table ... VALUES (1) RETURNING id
>
> to multi-value INSERTS:
>
> INSERT INTO table ... VALUES (1), (2), ... RETURNING id
>
> and get all the benefits of the ORM while still batching inserts.
This would be an enormous undertaking to implement, test, and release.
It would have to exist as an entirely separate series of codepaths
within persistence.py, since multi-valued INSERT is not available on
most databases, nor on earlier versions of the databases that do
support it.  For all of that complexity, the approach would work on
exactly one backend: Postgresql.  SQLite and MySQL don't support
RETURNING; Oracle, SQL Server, and others don't support INSERT..VALUES
with multiple parameter sets.
If you are using Postgresql, you can instead pre-fetch the sequence up
front. This can even be integrated into a before_flush() event.
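A hypothetical sketch of that event wiring, with a simple in-memory
counter standing in for the nextval()/generate_series query so it runs
on SQLite (model and names are made up):

```python
import itertools

from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Thing(Base):
    __tablename__ = "thing"
    id = Column(Integer, primary_key=True)
    name = Column(String)

# Stand-in for "select nextval('...') from generate_series(...)" on
# Postgresql; here just a process-local counter for illustration.
_counter = itertools.count(1)

@event.listens_for(Session, "before_flush")
def prefetch_identities(session, flush_context, instances):
    # Assign a pre-fetched identity to every pending Thing so the
    # flush never needs to fetch generated PKs after the INSERT.
    for obj in session.new:
        if isinstance(obj, Thing) and obj.id is None:
            obj.id = next(_counter)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)

things = [Thing(name=n) for n in ("x", "y")]
session.add_all(things)
session.commit()
print([t.id for t in things])  # [1, 2]
```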
So the complexity of adding multi-values INSERT with RETURNING would
benefit an extremely narrow set of use cases, would be very labor
intensive to implement and maintain, and is unnecessary on the single
target platform where it would work.
>
> I'm sure this is something that was thought about already, so I'd mostly
> love to hear any thoughts about what makes this hard. We have a lot of
> complex relationships and joined-table inheritance, so working with
> bulk_save_objects has been a bit challenging. From the comments in
> sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like
> SQLAlchemy is already aware that RETURNING for insert statements with
> multi-values VALUES clauses is supported, so it's possible there is a reason
> that this is hard or can't be done.
>
> This is also something that, if it is just a “missing feature”, I
> would love to learn more and see if I could contribute.
>