insert performance vs executemany and max_allowed_packet limit


Михаил Доронин

May 25, 2017, 5:49:06 AM
to sqlalchemy
I've tried to benchmark SQLAlchemy's performance when inserting a lot of data.
The results weren't that good for SQLAlchemy: the difference was up to three times in median values.

First of all, the more elements inserted, the bigger the difference between SQLAlchemy and executemany (mysqlclient).
I've profiled the code - most of the time is spent in visit_bind_param and the BindParam initializer. I've skimmed over the code and no places for optimization are obvious, but the logic seems overly complicated. There are a lot of conditions etc. Maybe this can be simplified in some way, or maybe there could be a parameter on the insert that the user can use to say that they don't want any complex logic - they're just inserting some data and take responsibility that the data is correct.

The next thing is that in executemany they keep an eye on the size of the string to be executed, and if it's more than the max_allowed_packet limit they split it into batches (they hardcoded this limit, though, instead of taking it from the database at runtime).
Not only is SQLAlchemy not doing that - it doesn't provide a way to know what the size of the string would be. So the only thing the user can do is catch the exception and use heuristics to split the data.
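
To illustrate, here's a rough sketch of the kind of heuristic batching I mean, against a plain DBAPI cursor (the size estimate and the 4 MB default are assumptions of the sketch, not anything the libraries provide):

    def insert_in_batches(cursor, sql, rows, max_packet=4 * 1024 * 1024):
        # heuristic client-side batching: estimate each row's encoded size
        # and flush a batch before the statement would exceed the budget
        batch, size = [], 0
        for row in rows:
            row_size = len(repr(row))  # rough stand-in for the encoded size
            if batch and size + row_size > max_packet:
                cursor.executemany(sql, batch)
                batch, size = [], 0
            batch.append(row)
            size += row_size
        if batch:
            cursor.executemany(sql, batch)

called as, e.g., insert_in_batches(cur, "INSERT INTO t (a, b) VALUES (%s, %s)", rows).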

mike bayer

May 25, 2017, 8:50:06 AM
to sqlal...@googlegroups.com
the topic of performance is one that many years of effort have gone
into. For background on this topic, particularly inserts, see:

http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

for a detailed test suite that will illustrate the many varieties of
INSERT, see
http://docs.sqlalchemy.org/en/latest/_modules/examples/performance/bulk_inserts.html.

as far as bound parameter slowness, this is a known condition which can
arise if you are, for example, passing a lot of unicode strings in and the
database driver and/or SQLAlchemy is spending lots of time encoding them
into a string encoding like utf-8 or similar. There's not really any
other bound parameter process that is known to take much time; other
than strings, pretty much all bound parameters are straight
pass-throughs to the DBAPI. The difference is often that the straight
DBAPI application doesn't set up unicode encoding whereas the SQLAlchemy
version does (and this can of course be changed).
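
For example, with the MySQLdb driver something like this sketch pushes string handling back to the driver (the DSN is hypothetical; check your driver's docs for these flags):

    from sqlalchemy import create_engine

    # hypothetical DSN; with MySQLdb, charset/use_unicode are passed
    # through to the DBAPI and skip Python-side unicode conversion
    engine = create_engine(
        "mysql+mysqldb://user:pass@localhost/test"
        "?charset=utf8&use_unicode=0")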


If you can share a Python profile as documented at
http://docs.sqlalchemy.org/en/latest/faq/performance.html#code-profiling
I can show you how to remove the bound parameter overhead.
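
As a minimal sketch of capturing such a profile (the FAQ shows the full recipe; run_test here is a hypothetical stand-in for your benchmark function):

    import cProfile
    import pstats

    # profile the benchmark and dump stats to a file you can attach
    cProfile.run("run_test()", "insert.prof")
    pstats.Stats("insert.prof").sort_stats("cumulative").print_stats(25)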

Михаил Доронин

May 25, 2017, 12:02:08 PM
to sqlalchemy
Sorry, I can't post a benchmark here because I'm not in the office. Maybe I will do that tomorrow. I use vmprof and can just share a link to the uploaded profile. Will that suit you?

But I can say right now that I've passed only integers, not strings. Also, mysqlclient encodes values if needed, as far as I can see.


And the fact that SQLAlchemy doesn't make batches automatically is much worse than the performance problems. Heuristics will stop working one day, which means any insert to the database has to be surrounded with try/catch etc.
SQLAlchemy makes SQL by ', '.join on values, which makes it impossible for it to see that the string size is too big until it's too late. I think it's better to use the approach that mysqlclient is using, where they encode the prefix and postfix query parts into a bytearray (which is mutable) and append arguments to it, checking that the bytearray size is less than the limit.

mike bayer

May 25, 2017, 12:26:29 PM
to sqlal...@googlegroups.com


On 05/25/2017 12:02 PM, Михаил Доронин wrote:
> Sorry, I can't post a benchmark here because I'm not in the office.
> Maybe I will do that tomorrow. I use vmprof and can just share a link to
> the uploaded profile. Will that suit you?

I would much prefer that you send a traditional cProfile file, as I
don't have the time to learn how to use vmprof.


>
> But I can say right now that I've passed only integers, not strings.
> Also, mysqlclient encodes values if needed, as far as I can see.
>
> https://github.com/PyMySQL/mysqlclient-python/blob/master/MySQLdb/cursors.py
> see _do_execute_many at line 286.
>
> And the fact that SQLAlchemy doesn't make batches automatically is much
> worse than the performance problems.

SQLAlchemy batches inserts in the ORM where possible, and with Core you
do this explicitly; both make use of cursor.executemany(), whose behavior
is then determined by how the DBAPI handles it.
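
For instance, a minimal sketch of the documented Core pattern ('table' and the column names here are illustrative):

    # a list of dicts as the second argument to execute() runs
    # via cursor.executemany()
    conn.execute(
        table.insert(),
        [{"a": 1, "b": 2}, {"a": 3, "b": 4}])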

> Heuristics will stop working one day, which
> means any insert to the database has to be surrounded with try/catch etc.
> SQLAlchemy makes SQL by ', '.join on values, which makes it impossible
> for it to see that the string size is too big until it's too late.

I don't know what "the string size is too big" refers to, but this also
sounds like a topic other than performance. If the ",".join() refers
to how an INSERT is composed, that has nothing to do with the data
you're passing in the insert; it is just building bound parameter
placeholders.


> I think it's better to use the approach that mysqlclient is using, where they
> encode the prefix and postfix query parts into a bytearray (which is mutable)
> and append arguments to it, checking that the bytearray size is less than
> the limit.

Sorry, I'm not familiar with what this refers to.

Михаил Доронин

May 25, 2017, 1:44:12 PM
to sqlalchemy
> SQLAlchemy batches inserts in the ORM where possible, and with Core you
> do this explicitly; both make use of cursor.executemany(), whose behavior
> is then determined by how the DBAPI handles it.

Ummm. If that is true, why is this line used when I pass a list of values (it's from visit_insert)? Clearly SQLAlchemy creates a big string `VALUES (%s, %s), (%s, %s)...` by itself, without executemany.


        elif insert_stmt._has_multi_parameters:
            text += " VALUES %s" % (
                ", ".join(
                    "(%s)" % (
                        ', '.join(c[1] for c in crud_param_set)
                    )
                    for crud_param_set in crud_params
                )
            )

I've written the benchmark again:
https://gist.github.com/purpleP/35c4920020ad95585419eb1d250200b7
TL;DR: SQLAlchemy is more than three times slower than executemany for inserting 10000 values.
For a million values SQLAlchemy fails with an OperationalError.

Here's a vmprof visualization for 10000 values: http://vmprof.com/#/c7f9a4fd-55f9-469d-b6e4-b7cac04473fa
vmprof is much better than cProfile; give it a try if you're still coding in Python.

If you need more data than just a visualization, I'll try to provide it. But you have the benchmark code in the gist.

Михаил Доронин

May 25, 2017, 2:36:47 PM
to sqlalchemy

For Christ's sake, how do I post a new message here and not just reply? And how do I edit an old post? Arghhhh!

mike bayer

May 25, 2017, 3:43:55 PM
to sqlal...@googlegroups.com


On 05/25/2017 01:44 PM, Михаил Доронин wrote:
>> SQLAlchemy batches inserts in the ORM where possible, and with Core you
>> do this explicitly; both make use of cursor.executemany(), whose behavior
>> is then determined by how the DBAPI handles it.
>
> Ummm. If that is true, why is this line used when I pass a list of
> values (it's from visit_insert)? Clearly SQLAlchemy creates a big string
> `VALUES (%s, %s), (%s, %s)...` by itself, without executemany.
>
>
>
>     elif insert_stmt._has_multi_parameters:
>         text += " VALUES %s" % (
>             ", ".join(
>                 "(%s)" % (
>                     ', '.join(c[1] for c in crud_param_set)
>                 )
>                 for crud_param_set in crud_params
>             )
>         )
>
>
> I've written the benchmark again:
>
> https://gist.github.com/purpleP/35c4920020ad95585419eb1d250200b7
>
> TL;DR: SQLAlchemy is more than three times slower than executemany for
> inserting 10000 values.

Hello -

I gave you links to detailed documentation that represents the many
years I have spent developing and documenting the performance
characteristics of SQLAlchemy.

The test you have posted here, along with your conclusions, does not
take this guidance into account nor does it indicate awareness of basic
API use as documented in the Core tutorial at:

http://docs.sqlalchemy.org/en/latest/core/tutorial.html#executing-multiple-statements

Here are the links again, which refer to the performance characteristics
in detail as well as illustrate how to make use of executemany(), along
with benchmarks comparing different API options:

http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

http://docs.sqlalchemy.org/en/latest/_modules/examples/performance/bulk_inserts.html.

SQLAlchemy's use of executemany() is a top-level, well documented
feature. It is not an obscure, hidden side effect. Years of effort
have gone into SQLAlchemy's documentation to ensure that behaviors like
these will not be mysterious.

Let's apply it to your script at:

https://gist.github.com/purpleP/35c4920020ad95585419eb1d250200b7#gistcomment-2106299
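
The essential change is one line, sketched here under the assumption that the gist builds a multi-VALUES insert (`table` and `rows` are stand-ins, not the gist's actual names):

    # incorrect: insert().values(rows) renders one giant multi-VALUES
    # statement, compiling every parameter into the SQL string
    conn.execute(table.insert().values(rows))

    # correct: pass the rows to execute(); SQLAlchemy then uses
    # cursor.executemany() with a single compiled statement
    conn.execute(table.insert(), rows)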

for 10K results:

--------------------------------------------------------------------------------------------------------------------------------------------------
benchmark: 3 tests (time in ms)
--------------------------------------------------------------------------------------------------------------------------------------------------
Name                              Min              Max              Mean             StdDev           Median           IQR              Outliers(*)  Rounds  Iterations
test_insert[raw]                  141.2799 (1.0)   326.9598 (1.0)   182.5684 (1.0)   80.9658 (1.55)   144.1729 (1.0)   57.0499 (1.16)   1;1          5       1
test_insert[alcehmy_correct]      179.7159 (1.27)  340.2121 (1.04)  224.5844 (1.23)  68.2835 (1.30)   188.8490 (1.31)  79.4864 (1.61)   1;0          5       1
test_insert[alchemy_incorrect]    495.6210 (3.51)  636.7581 (1.95)  550.5236 (3.02)  52.3564 (1.0)    537.3681 (3.73)  49.3491 (1.0)    2;0          5       1
--------------------------------------------------------------------------------------------------------------------------------------------------


for 100K rows (no OperationalError is raised):

------------------------------------------------------------------------------------------------------------------------------------------
benchmark: 3 tests (time in s)
------------------------------------------------------------------------------------------------------------------------------------------
Name                              Min            Max            Mean           StdDev         Median         IQR            Outliers(*)  Rounds  Iterations
test_insert[raw]                  1.3214 (1.0)   1.4007 (1.0)   1.3660 (1.0)   0.0307 (1.94)  1.3709 (1.0)   0.0448 (1.95)  2;0          5       1
test_insert[alcehmy_correct]      1.7432 (1.32)  1.7848 (1.27)  1.7645 (1.29)  0.0159 (1.0)   1.7655 (1.29)  0.0229 (1.0)   2;0          5       1
test_insert[alchemy_incorrect]    5.7412 (4.34)  5.9293 (4.23)  5.8330 (4.27)  0.0687 (4.33)  5.8319 (4.25)  0.0831 (3.63)  2;0          5       1
------------------------------------------------------------------------------------------------------------------------------------------

SQLAlchemy Core is therefore about 1.2 times slower than pure C
mysqlclient when used as documented. That is actually a little less
than the difference between SQLite's driver and Core described at
http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow,
where Core is more like 1.6 times slower.

The SQLAlchemy project is very welcoming of thoughtful criticism and is
in fact the product of over a decade of critique by a wide range of
users. I encourage you to contribute, but please ensure you take the
time to read the links that I provide in response to your questions. I
am not the brightest light on the planet, but I have been doing this for
a very long time. Thanks!

>
> For a million values SQLAlchemy fails with an OperationalError.
>
> Here's a vmprof visualization for 10000 values:
> http://vmprof.com/#/c7f9a4fd-55f9-469d-b6e4-b7cac04473fa
>
> vmprof is much better than cProfile; give it a try if you're still coding in Python.
>
> If you need more data than just a visualization, I'll try to provide it. But you have the benchmark code in the gist.
>

Михаил Доронин

May 25, 2017, 5:35:59 PM
to sqlalchemy

Wow, I never saw this thing about multiple statements. Thanks! RTFM.
Message has been deleted

Михаил Доронин

May 29, 2017, 8:54:19 AM
to sqlalchemy
I have a question. If the only way to issue executemany is to do conn.execute(table.insert(), list_of_dicts), how are users supposed to use Postgres ON CONFLICT and MySQL ON DUPLICATE KEY UPDATE with this? It seems like there would be no way to access .excluded in Postgres and the corresponding .values in MySQL, because no values have been passed at that point.



mike bayer

May 29, 2017, 12:36:44 PM
to sqlal...@googlegroups.com


On 05/29/2017 08:54 AM, Михаил Доронин wrote:
> I have a question. If the only way to issue executemany is to do
> conn.execute(table.insert(), list_of_dicts), how are users supposed to use
> Postgres ON CONFLICT and MySQL ON DUPLICATE KEY UPDATE with this? It seems
> like there would be no way to access .excluded in Postgres and the
> corresponding .values in MySQL, because no values have been passed at
> that point.


I'm assuming you're asking how to pass values to things like the WHERE
clause. You use bindparam(). See the examples for UPDATE in the Core
Tutorial at http://docs.sqlalchemy.org/en/latest/core/tutorial.html.
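
A minimal sketch of that pattern (table and column names illustrative; note the bindparam names must not collide with column names):

    from sqlalchemy import bindparam

    stmt = (
        table.update()
        .where(table.c.id == bindparam("b_id"))
        .values(data=bindparam("b_data")))

    # a list of parameter sets here runs as executemany()
    conn.execute(stmt, [
        {"b_id": 1, "b_data": "one"},
        {"b_id": 2, "b_data": "two"}])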


Message has been deleted

mike bayer

May 30, 2017, 9:37:40 AM
to sqlal...@googlegroups.com


On 05/29/2017 11:34 PM, Михаил Доронин wrote:
> Umm, what I meant is how to use the postgresql
>
> on_conflict_do_update in such a way that sqlalchemy would use executemany behind the scenes.
>
> In the examples its usage looks like this:
>
>     stmt = insert(table, values)
>     stmt = stmt.on_conflict_do_update(set_=dict(a=stmt.excluded.a))
>
> excluded is generated from values, right? If I don't pass values to the statement, how can I use excluded?


you would not use insert(table, values), which as we reviewed earlier is
not "executemany" syntax. The values are passed to execute() as the
second argument, as the list of parameter sets which are invoked for the
statement one at a time. .excluded is a server-side collection
generated by Postgresql and is based on the current row being operated
upon. These values are not returned to the client; they are only used
in the context of the statement.

think of executemany like this:


    from sqlalchemy import bindparam
    from sqlalchemy.dialects.postgresql import insert

    # build the statement once, with bindparam() placeholders for the data;
    # note on_conflict_do_update() needs a conflict target - 'x' here is
    # illustrative, not from the original post
    stmt = insert(table).values(x=bindparam('x'), y=bindparam('y'))
    stmt = stmt.on_conflict_do_update(
        index_elements=['x'],
        set_=dict(a=stmt.excluded.a))

    def executemany(stmt, values):
        for value in values:
            conn.execute(stmt, value)

    executemany(stmt, [{"x": 1, "y": 2}, {"x": 3, "y": 4}])


In other words: if your statement works for one execute() and one set of
parameters, it will work for any number of individual sets of parameters.

