On Fri, Oct 5, 2018 at 9:15 AM <
adam....@gmail.com> wrote:
>
> Simon,
>
> The MySQLdb module performs this operation with very little difference from native (mysql < data.sql) speed, so I'm not concerned about that; the delta was <0.5 seconds.
>
> In my original example (I didn't know I could pass a dict to execute() to have it fill in the parameters) I built the complete raw SQL by calling MySQLdb's escape-string function on each value, and formatting that SQL took slightly over 1 second, and I didn't even try to optimize that code.
>
> My bigger concern/curiosity is the overhead that SQLAlchemy Core is adding to these inserts, and just figuring out why...
SQLAlchemy Core adds very little overhead when used correctly. The one exception to this is when there is some processing of data occurring, which historically has involved coercing Python Unicode objects to strings, but there are other examples of this as well. These can be spotted by using Python profiling.
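For example, a quick generic way to do that with the standard library's cProfile looks roughly like this (just a sketch against an in-memory SQLite database, not the full recipe from the FAQ below):

# Sketch: profile a Core executemany INSERT to spot per-value
# processing overhead (e.g. unicode coercion) in the top functions.
import cProfile
import pstats

from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

engine = create_engine("sqlite://")
metadata = MetaData()
t = Table("t", metadata,
          Column("id", Integer, primary_key=True),
          Column("name", String(50)))
metadata.create_all(engine)

rows = [{"id": i, "name": "name %d" % i} for i in range(100000)]

profiler = cProfile.Profile()
profiler.enable()
with engine.connect() as conn:
    conn.execute(t.insert(), rows)   # list of dicts -> executemany
profiler.disable()

# the functions with the most cumulative time show where the overhead is
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)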
We know that Core's overhead is low because I profile it like crazy, and here is what typical results look like for inserts using executemany:
https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow
SQLAlchemy Core: Total time for 100000 records 0.21024107933 secs
sqlite3: Total time for 100000 records 0.137335062027 sec
Above, we can see that for a test which eliminates network and file overhead by using an in-memory SQLite database, the overhead for 100000 rows being INSERTed is less than 0.1 seconds greater than using sqlite3 directly. When using the MySQL driver instead, the Core doesn't have any additional work to do beyond type-specific issues.
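A stripped-down sketch of what that FAQ test is measuring looks roughly like this (not the exact suite, just the shape of it):

# Rough comparison: Core executemany vs. the sqlite3 DBAPI directly,
# both against an in-memory SQLite database.
import sqlite3
import time

from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

rows = [{"id": i, "name": "name %d" % i} for i in range(100000)]

# SQLAlchemy Core
engine = create_engine("sqlite://")
metadata = MetaData()
t = Table("t", metadata,
          Column("id", Integer, primary_key=True),
          Column("name", String(50)))
metadata.create_all(engine)
start = time.time()
with engine.connect() as conn:
    conn.execute(t.insert(), rows)
print("SQLAlchemy Core: %f sec" % (time.time() - start))

# raw sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name VARCHAR(50))")
start = time.time()
conn.executemany("INSERT INTO t (id, name) VALUES (:id, :name)", rows)
conn.commit()
print("sqlite3: %f sec" % (time.time() - start))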
The test suite I linked can be run against any database, so here's a run using a local MariaDB database with mysqlclient, for 10000 rows like you are doing (you can do the same thing yourself, just running from the SQLAlchemy source checkout):
$ python -m examples.performance --dburl mysql://scott:tiger@localhost/test bulk_inserts --num 10000

Tests to run: test_flush_no_pk, test_bulk_save_return_pks, test_flush_pk_given, test_bulk_save, test_bulk_insert_mappings, test_core_insert, test_dbapi_raw

test_flush_no_pk : Individual INSERT statements via the ORM, calling upon last row id (10000 iterations); total time 3.738233 sec
test_bulk_save_return_pks : Individual INSERT statements in "bulk", but calling upon last row id (10000 iterations); total time 3.571442 sec
test_flush_pk_given : Batched INSERT statements via the ORM, PKs already defined (10000 iterations); total time 0.966315 sec
test_bulk_save : Batched INSERT statements via the ORM in "bulk", discarding PKs. (10000 iterations); total time 0.496674 sec
test_bulk_insert_mappings : Batched INSERT statements via the ORM "bulk", using dictionaries. (10000 iterations); total time 0.283908 sec
test_core_insert : A single Core INSERT construct inserting mappings in bulk. (10000 iterations); total time 0.220689 sec
test_dbapi_raw : The DBAPI's API inserting rows in bulk. (10000 iterations); total time 0.189292 sec
Pretty much the same result: the Core insert is 0.22 seconds for 10000 rows, 0.18 seconds with the straight DBAPI. Here it is over a network connection to a MySQL 8.0 database:
$ python -m examples.performance --dburl mysql://scott:tiger@mysql80/test bulk_inserts --num 10000

Tests to run: test_flush_no_pk, test_bulk_save_return_pks, test_flush_pk_given, test_bulk_save, test_bulk_insert_mappings, test_core_insert, test_dbapi_raw

test_flush_no_pk : Individual INSERT statements via the ORM, calling upon last row id (10000 iterations); total time 30.921060 sec
test_bulk_save_return_pks : Individual INSERT statements in "bulk", but calling upon last row id (10000 iterations); total time 30.468841 sec
test_flush_pk_given : Batched INSERT statements via the ORM, PKs already defined (10000 iterations); total time 1.618901 sec
test_bulk_save : Batched INSERT statements via the ORM in "bulk", discarding PKs. (10000 iterations); total time 1.136148 sec
test_bulk_insert_mappings : Batched INSERT statements via the ORM "bulk", using dictionaries. (10000 iterations); total time 0.878011 sec
test_core_insert : A single Core INSERT construct inserting mappings in bulk. (10000 iterations); total time 0.871134 sec
test_dbapi_raw : The DBAPI's API inserting rows in bulk. (10000 iterations); total time 0.870091 sec
So we can note that two of those runs took 30 seconds: those are both ORM runs where the primary key of the row was not provided, which means the ORM has to push each row through an individual execute() and get the primary key back. The rest of the runs use executemany(), where the time for the ORM is in the 1.5 second area, and back at the Core level there is virtually no difference in speed between Core and the straight DBAPI.
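In DBAPI terms, the difference between those two patterns is roughly the following (a simplified sketch; the table and column names here are made up):

# Sketch contrasting the two DBAPI patterns with MySQLdb, against a
# hypothetical table "mytable(id AUTO_INCREMENT PRIMARY KEY, a, b)".
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="scott", passwd="tiger", db="test")
cursor = conn.cursor()
rows = [{"a": i, "b": "value %d" % i} for i in range(10000)]

# Slow pattern (what the ORM must do when it needs the new primary keys
# back): one execute() per row, reading cursor.lastrowid each time.
for row in rows:
    cursor.execute(
        "INSERT INTO mytable (a, b) VALUES (%(a)s, %(b)s)", row)
    row["id"] = cursor.lastrowid

# Fast pattern (what the Core / "bulk" paths use when no PKs are needed):
# a single executemany() for the whole batch, which the driver can
# optimize into far fewer round trips.
cursor.executemany(
    "INSERT INTO mytable (a, b) VALUES (%(a)s, %(b)s)", rows)
conn.commit()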
So if you want to figure out what the issue is on your end, here are things you can do:

1. Make sure you are using the C extensions (I don't recall if you said you were on Linux, or Windows, or what); a quick check is sketched after this list.

2. Run the above performance suite against your database.

3. Use the profiling recipe against your specific code: https://docs.sqlalchemy.org/en/latest/faq/performance.html#code-profiling

4. Provide *runnable* examples here, e.g. an MCVE, which illustrates *exactly* what is being called, and how, and with what data, datatypes, schema, etc. Specifics matter. If the rows here have 5 columns vs. 100, that matters. If the rows have large binary objects, Unicode strings, dates, or precision decimals, that all matters. But really, the profiling you get from #3 will show where the time is being spent.
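For #1, a quick heuristic check is just to try importing one of the compiled modules (this assumes the 1.x layout, where the C helpers live in sqlalchemy.cprocessors):

# If this import fails, the C extensions weren't built and the
# pure-Python fallbacks are in use, which is slower across the board.
try:
    from sqlalchemy import cprocessors  # noqa: F401
except ImportError:
    print("C extensions NOT installed; using pure-Python fallbacks")
else:
    print("C extensions are installed")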