pandas.DataFrame.to_sql method: how to speed up exporting to Microsoft SQL Server (6 minutes for 11 MB!)


John Doe

Apr 21, 2015, 1:47:53 PM
to sqlal...@googlegroups.com
I am using pandas 0.16 and sqlalchemy 0.99. Yes, I know I should probably upgrade, but I don't have admin rights on my PC.
I understand the pandas.DataFrame.to_sql() method relies on sqlalchemy.

I have a pandas dataframe with about 155,000 rows and 12 columns. If I export it to CSV with dataframe.to_csv, the output is an 11 MB file (which is produced instantly).

If, however, I export to Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes!

Reading the same table back from SQL Server into Python with pandas.read_sql_table takes 2 seconds.

No columns are text: only int, float, bool and dates. Also, there are no constraints on the table. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.

Any suggestions on how to speed up the export process? 

I appreciate that SQLAlchemy will probably never be the fastest method to insert data, but 6 minutes for 11 MBs of data is too slow...


Thanks!


My code looks like this:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()

metadata = MetaData(conn)

my_data_frame.to_sql(TableName,engine)

Mike Bayer

Apr 21, 2015, 4:52:01 PM
to sqlal...@googlegroups.com


On 4/21/15 1:47 PM, John Doe wrote:
I am using pandas 0.16 and sqlalchemy 0.99. Yes, I know I should probably upgrade, but I don't have admin rights on my PC.
I understand the pandas.DataFrame.to_sql() method relies on sqlalchemy.

I have a pandas dataframe with ca 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv , the output is an 11MB file (which is produced instantly).

If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes!

Reading the same table from SQL to Python with the pandas.read_sql_table takes 2 seconds.

No columns are text: only int, float, bool and dates. Also, there are no constraints on the table. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.

Any suggestions on how to speed up the export process? 

I appreciate that SQLAlchemy will probably never be the fastest method to insert data, but 6 minutes for 11 MBs of data is too slow...

Unfortunately I don't work with Pandas so this is dependent on how Pandas is doing their queries here.  

If you can at least set echo=True on the engine here, you'd see what queries are being emitted.   Whether pandas emits a handful of queries or thousands makes a difference, as does whether those queries return vast numbers of rows due to cartesian products or similar, and how the rows are fetched back.

It would be helpful to everyone if either you or someone on the Pandas development team could walk through the steps detailed at http://docs.sqlalchemy.org/en/rel_1_0/faq/performance.html#how-can-i-profile-a-sqlalchemy-powered-application to isolate where the performance issue is originating.
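For illustration (an editorial sketch, not part of the original reply), here is roughly what those two diagnostics could look like; the connection string and table name are the placeholders from the original post, and the frame is a small stand-in for the real data:

import cProfile
import pstats

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# echo=True logs every statement (and a sample of its parameters) as it is emitted.
engine = create_engine("mssql+pyodbc://myserver/mydatabase", echo=True)

# A small stand-in frame; the original had ~155,000 rows and 12 numeric columns.
df = pd.DataFrame(np.random.randn(1000, 12), columns=["c%d" % i for i in range(12)])

profiler = cProfile.Profile()
profiler.enable()
df.to_sql("mytable", engine, if_exists="replace", index=False)
profiler.disable()

# Sorting by cumulative time shows whether pandas, SQLAlchemy, or the DBAPI
# driver's executemany dominates the runtime.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)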




John Doe

Apr 22, 2015, 5:45:23 AM
to sqlal...@googlegroups.com
Is there a way to output all the SQL statements into a file? I don't mean a logging file with other information, I mean a file with **only** SQL statements, which could ideally be run as-is from within my database client.

I tried setting echo=True, and I also tried this: http://stackoverflow.com/questions/6350411/how-to-retrieve-executed-sql-code-from-sqlalchemy

but I can't spot anything out of the ordinary.

I profiled the code and saw that 99% of the time is spent in the do_executemany method of sqlalchemy\engine\default.py, which arguably doesn't reveal much per se unless I understand better how pandas.DataFrame.to_sql calls sqlalchemy.

Lastly, do you know of any settings / parameters / hacks which could speed up sqlalchemy's connection to a SQL server database?

Thanks!

Mike Bayer

Apr 22, 2015, 9:46:19 AM
to sqlal...@googlegroups.com


On 4/22/15 5:45 AM, John Doe wrote:
> Is there a way to output all the SQL statements into a file? I don't
> mean a logging file with other information, I mean a file with
> **only** SQL statements, which could ideally be run as it is from
> within my database client.
Python logging will get you there:


import logging

logger = logging.getLogger("sqlalchemy.engine")
logger.setLevel(logging.INFO) # use DEBUG to get result sets also
handler = logging.FileHandler("myfile.txt")
logger.addHandler(handler)

from sqlalchemy import create_engine
e = create_engine("sqlite://")
e.execute("select 1")



>
> I tried setting echo=True, and I also tried this:
> http://stackoverflow.com/questions/6350411/how-to-retrieve-executed-sql-code-from-sqlalchemy
>
> but I can't spot anything out of the ordinary.
>
> I profiled the code, and saw that 99% of the time is spent in the
> do_executemany method of sqlalchemy\engine\default.py , which arguably
> doesn't reveal much per se unless I understand better how
> pandas.DataFrame.to_sql calls sqlachemy
This is the main thing you'd want to see when data is being inserted
into the database and indicates the most efficient route is being taken.

But reading that surprised me.

I carefully re-read your first email, and now I see that you are trying
to write *TO* the database, yes? And you're comparing the speed of
writing *TO* with the speed of reading *FROM*?

Unfortunately, selecting data from a table and inserting data into that
table are two completely different activities, so it is not at all
surprising to see a large difference in speed. Inserts can be slow,
particularly if the table has a lot of indexing on it or we're putting in
very large data values. It also varies a great deal by database backend
and by the configuration and environment of those databases. SQL Server
can be a very fast database, but it is not always that easy to tune.

99% of the time being spent in do_executemany() actually means most of the
time is spent outside of SQLAlchemy, within the database driver inserting
those rows.

The echo output will show you more, though bulk INSERT statements will be
very long in the logfile because we log a segment of the parameters as well.



John Doe

Apr 22, 2015, 10:11:26 AM
to sqlal...@googlegroups.com


On Wednesday, April 22, 2015 at 2:46:19 PM UTC+1, Michael Bayer wrote:


I carefully re-read your first email, and now I see, that you are trying
to write *TO* the database, yes?    And you're comparing the speed of
writing *TO* to the speed of reading *FROM*?

Unfortunately, selecting data from some tables compared to inserting
data in those tables are two completely different activities.   It is
not at all surprising to see a large difference in speed.

Of course. I do not expect writing to the database to be as fast as reading from it.
However, 6 minutes to write 11 MB of data is ridiculous, because this is the time it takes even when:
1) there are no text columns (so the nvarchar(max) issue I mentioned and have seen in other cases cannot apply)
2) there are only floating-point numbers
3) there are no primary keys, no indices and no constraints of any kind

I trust you'll concur it's hard to justify such a long time to transfer such a small table, even under the conditions above.
I am not a DBA, but I cannot think of other reasons that would cause such terrible speeds. Any thoughts would be more than welcome!

Maybe the to_sql method is committing to the database after inserting every single row, and this slows things down? Even so, 6 minutes for 11 MB means about 31 KB per second. My internet connection in the '90s was faster than that :)

I have also been very surprised by the lack of info on the net. Either I am the only one using the to_sql method, or I am the only one who finds it so slow!

Mike Bayer

Apr 22, 2015, 10:20:13 AM
to sqlal...@googlegroups.com


On 4/22/15 10:11 AM, John Doe wrote:


On Wednesday, April 22, 2015 at 2:46:19 PM UTC+1, Michael Bayer wrote:


I carefully re-read your first email, and now I see, that you are trying
to write *TO* the database, yes?    And you're comparing the speed of
writing *TO* to the speed of reading *FROM*?

Unfortunately, selecting data from some tables compared to inserting
data in those tables are two completely different activities.   It is
not at all surprising to see a large difference in speed.

Of course. I do not expect that reading from will be as fast as writing to the database.
However, 6 minutes to write 11MBs of data is ridiculous, because this is the time it takes even when:
1) there are no text columns (so the nvarchjar(max) issue I mentioned and I saw in other cases cannot apply)
2) there are only floating-point numbers
3) there are no primary keys, no indices and no constraints of any kind

I trust you'll concur it's hard to justify such a long time to transfer such a small table even in the examples above.
I am not a DBA, but I cannot think of other reasons that would cause such terrible speeds. Any thoughts would be more than welcome!

Maybe the to_sql method is committing to the database after inserting every single row and this slows things down? Even so, 6 minutes for 11MBs means 31KB per second. My internet connection in the '90s was faster than that :)

a COMMIT every time would slow it down, yes, but then in your profiling you'd see the do_commit() method taking up that time as well.

if you can get your logging going, you'll see what SQL it's emitting, and you can also set up the formatting to include timestamps so you can measure the time spent between queries.

Also, if you can, try out the pymssql driver, instead of pyodbc.  pymssql is much better maintained these days.
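For illustration (an editorial sketch, not part of the original reply), the timestamp idea only needs a formatter on the handler from the earlier logging snippet, and switching drivers is just a change of URL; the server, database and file names are the placeholders used elsewhere in the thread:

import logging

from sqlalchemy import create_engine

logger = logging.getLogger("sqlalchemy.engine")
logger.setLevel(logging.INFO)

# Timestamp each logged statement so the gap between INSERT batches is visible.
handler = logging.FileHandler("myfile.txt")
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logger.addHandler(handler)

# pymssql instead of pyodbc: only the dialect portion of the URL changes
# (the pymssql package must be installed separately).
engine = create_engine("mssql+pymssql://myserver/mydatabase")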






John Doe

Apr 22, 2015, 12:07:33 PM
to sqlal...@googlegroups.com


On Wednesday, April 22, 2015 at 3:20:13 PM UTC+1, Michael Bayer wrote:

a COMMIT every time would slow it down, yes, but then in your profiling you'd see the do_commit() method taking up that time as well.

if you can get your logging going, you'll see what SQL its emitting, and you can also set up the formatting to have timestamps so you can measure the time spent between queries.

Also, if you can, try out the pymssql driver, instead of pyodbc.  pymssql is much better maintained these days.


Will try that, thanks.

In the meantime, I have tried your suggestions. The to_sql method creates a single INSERT statement and then commits at the end. I can't see anything that would justify such a low speed. The mystery remains!

If I can abuse your patience a bit more, is there a way to:
1) display the entire SQL statements created? The method you suggested shows only 10 records, and adds the text "displaying 10 of 1000 total bound parameter sets" when run on a 1000-record table
2) generate the SQL statements but without connecting to the server? If I can get a text file with the insert statement, then I can run that from within my database client, and it will be faster

Mike Bayer

Apr 22, 2015, 1:24:23 PM
to sqlal...@googlegroups.com


On 4/22/15 12:07 PM, John Doe wrote:


On Wednesday, April 22, 2015 at 3:20:13 PM UTC+1, Michael Bayer wrote:

a COMMIT every time would slow it down, yes, but then in your profiling you'd see the do_commit() method taking up that time as well.

if you can get your logging going, you'll see what SQL its emitting, and you can also set up the formatting to have timestamps so you can measure the time spent between queries.

Also, if you can, try out the pymssql driver, instead of pyodbc.  pymssql is much better maintained these days.


Will try that, thanks.

In the meanwhile, I have tried your suggestions. The to_sql method creates a single insert statement and then commits at the end. I can't see anything which would justify such a low speed. The mistery remains!

If I can abuse your patience a bit more, is there a way to:
1) display the entire SQL statements created? The method you suggested shows only 10 records, and adds the text "displaying 10 of 1000 total bound parameter sets" when run on a 1000-record table

use an event listener and do the logging, printing, and/or debugging that you want.  I recommend before_cursor_execute():  http://docs.sqlalchemy.org/en/rel_1_0/core/events.html?highlight=before_cursor_execute#sqlalchemy.events.ConnectionEvents.before_cursor_execute
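For illustration (an editorial sketch, not from the original reply), a minimal listener along those lines; the log file name is an arbitrary choice, and the connection string is the placeholder from the original post:

from sqlalchemy import create_engine, event

engine = create_engine("mssql+pyodbc://myserver/mydatabase")

@event.listens_for(engine, "before_cursor_execute")
def log_statement(conn, cursor, statement, parameters, context, executemany):
    # Unlike the engine echo, this hook sees the complete set of bound
    # parameters, so nothing is truncated to "displaying 10 of ...".
    with open("full_sql.log", "a") as logfile:
        logfile.write(statement + "\n")
        count = len(parameters) if executemany else 1
        logfile.write("-- %d parameter set(s)\n" % count)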



2) generate the SQL statements but without connecting to the server? If I can get a text file with the insert statement, then I can run that from within my database client, and it will be faster

Stick with #1.   This can be more challenging because the script needs to only emit statements that don't expect results.  There is a tool called the "mock strategy" that will do this which has a short example at http://docs.sqlalchemy.org/en/rel_1_0/faq/metadata_schema.html#how-can-i-get-the-create-table-drop-table-output-as-a-string.  
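For illustration (an editorial sketch adapted from that linked FAQ example, not part of the original reply), the mock strategy looks roughly like this; as noted above, anything that expects result rows will not work this way, and the strategy argument was later removed in SQLAlchemy 1.4:

from sqlalchemy import create_engine

statements = []

def dump(sql, *multiparams, **params):
    # Instead of hitting a server, collect each compiled statement as text.
    statements.append(str(sql.compile(dialect=engine.dialect)))

# No connection is made; every execute() call is routed to dump() instead.
engine = create_engine("mssql+pyodbc://myserver/mydatabase",
                       strategy="mock", executor=dump)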

Jonathan Vanasco

Apr 22, 2015, 2:19:08 PM
to sqlal...@googlegroups.com
I've run into similar issues before, but on different databases.

I think it's misleading to frame the problem as "transmitting 31 KB/s", which assumes a particular kind of issue; it's much better to think of it as "processing 31 KB/s", which gives a lot more room for interpretation.

Looking at Stack Overflow and its related sites, it seems that SQL Server also falls victim to the usual patterns:

• The database doesn't like the format of the data or configuration (it takes up too much memory, there are too many constraints, etc)
• The driver isn't appropriate for the type of data

I'm guessing that this issue is with the driver.

Here's a semi-related thread:
http://stackoverflow.com/questions/5693885/pyodbc-very-slow-bulk-insert-speed   It looks like the pyodbc driver handles "executemany" in a less-than-ideal manner.

There were also some threads where the culprit was ODBC tracing being turned on, and others noting that most Python drivers are just painfully slow.

In any event, if you're talking about a single INSERT statement... that sounds like an executemany context, and a driver issue.
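As a present-day aside (an editorial addition; this option post-dates the thread): pyodbc 4.0.19+ added a fast_executemany mode that batches parameter sets on the client instead of issuing one round trip per row, and SQLAlchemy 1.3+ exposes it directly on the engine. A minimal sketch, with the placeholder server/database names from the thread and an example ODBC driver name:

from sqlalchemy import create_engine

# Requires pyodbc >= 4.0.19 and SQLAlchemy >= 1.3; the driver name is an example.
engine = create_engine(
    "mssql+pyodbc://myserver/mydatabase?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)
# df.to_sql("mytable", engine, if_exists="append", index=False) then uses the
# driver's bulk parameter binding instead of a round trip per row.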

John Doe

Apr 22, 2015, 6:11:50 PM
to sqlal...@googlegroups.com


On Wednesday, April 22, 2015 at 7:19:08 PM UTC+1, Jonathan Vanasco wrote:
[...]


I'm guessing that this issue is with the driver.

Here's a semi-related thread:
http://stackoverflow.com/questions/5693885/pyodbc-very-slow-bulk-insert-speed   It looks like the pyodbc driver handles "executemany" in a not very ideal manner.

There were also some threads that noted ODBC tracing being on, and others that noted most python drivers are just painfully slow. 

In any event, if you're talking about a single insert statement.... that sounds like an executemany context and a driver issue.

Tracing is off.
If I had to place a bet, my money would be on pyodbc being extremely slow over the network connection, for some reason that's totally beyond me.
On my home PC I generated a dataframe of random numbers in Python, then used the to_sql() method to transfer it to a SQL Server Express instance running on the same machine, and it was fast. This suggests that SQL Server has no issue with the data per se.
When I ran the same code on my work PC, trying to export to a SQL Server 2014 machine that is part of the same company network and only a few miles away, it took ages.

I'll try getting Python installed on the SQL Server machine and running the script from there, to see if this theory is correct.

I also tried pymssql, but it took ages.
Some Stack Overflow users had luck with adodb, but SQLAlchemy no longer supports it.

I miss MATLAB's database toolbox! Yes, it's expensive, and MATLAB has tons of flaws, but at least its documentation is excellent and doesn't make you lose the will to live by wasting hours trying to figure out how to carry out a banal task like exporting a table...

The bottom line is that the pandas to_sql() method is basically unusable if you're trying to export more than 2 MB of data.

Anyway, thanks a lot for your help, Jonathan and Michael.

Sam Mourad

Apr 28, 2015, 5:42:42 PM
to sqlal...@googlegroups.com
Joe - I'm having the same issue. DataFrame.to_sql performance is fine with SQLite and MySQL, but with the SQL Server engine it is too slow. After a few days of research and trial and error, I was able to improve performance by using pymssql. However, the performance is still way below standard: 35,000 rows of an account relation table (accountid: int, relatedaccountid: int) take 2 minutes to finish.

After a lot of research and reading, I found an article explaining that the SQLAlchemy ORM used by df.to_sql is VERY SLOW because it uses a unit-of-work pattern, and that SQLAlchemy Core is supposed to be much faster. Check this link out:

My issue is still not solved, because my goal is to insert a dataframe into a SQL table. If I use SQLAlchemy Core, I'll have to iterate through the dataframe rows and insert them into the SQL table, which ends up being even slower than to_sql.
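For what it's worth (an editorial sketch, not part of the original post; the table and column names are placeholders matching the description above), a Core insert does not require iterating row by row: passing a list of dicts gives the driver a single executemany. Note that the driver-level executemany speed discussed earlier in the thread still applies.

import pandas as pd
from sqlalchemy import Column, Integer, MetaData, Table, create_engine

engine = create_engine("mssql+pymssql://myserver/mydatabase")
metadata = MetaData()

# The two-column account relation table described above (assumed to exist already).
relations = Table(
    "account_relation", metadata,
    Column("accountid", Integer),
    Column("relatedaccountid", Integer),
)

df = pd.DataFrame({"accountid": [1, 2, 3], "relatedaccountid": [10, 20, 30]})

# One execute() with a list of parameter dicts becomes a single executemany
# call in the driver; no Python-level loop over rows is needed.
records = df.to_dict(orient="records")
with engine.begin() as conn:
    conn.execute(relations.insert(), records)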

Did you find a solution? If so, please share.

Thanks,