I have a pandas dataframe with ca. 155,000 rows and 12 columns. If I export it to CSV with dataframe.to_csv, the output is an 11 MB file (which is produced instantly).
If, however, I export it to Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes!
Reading the same table from SQL to Python with pandas.read_sql_table takes 2 seconds.
No columns are text: only int, float, bool and dates. Also, there are no constraints on the table. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but that cannot be the case here.
Any suggestions on how to speed up the export process?
I appreciate that SQLAlchemy will probably never be the fastest method to insert data, but 6 minutes for 11 MB of data is too slow...
Thanks!
My code looks like this:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()
metadata = MetaData(conn)
my_data_frame.to_sql(TableName, engine)

I am using pandas 0.16 and SQLAlchemy 0.9.9. Yes, I know I should probably upgrade, but I don't have admin rights on my PC. I understand the pandas.DataFrame.to_sql() method relies on SQLAlchemy.
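As an aside, two to_sql keyword arguments are often worth timing on their own: index=False (skips writing the DataFrame index) and chunksize (batches the inserts). This is only a sketch continuing from the snippet above, not a fix confirmed in this thread, and chunksize requires a recent enough pandas.

# Sketch only: reuses TableName and engine from the snippet above.
# index=False skips the DataFrame index column; chunksize batches the
# INSERTs instead of sending all 155,000 rows in one go.
my_data_frame.to_sql(TableName, engine, if_exists='replace',
                     index=False, chunksize=1000)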
On Wednesday, April 22, 2015 at 2:46:19 PM UTC+1, Michael Bayer wrote:
I carefully re-read your first email, and now I see that you are trying to write *TO* the database, yes? And you're comparing the speed of writing *TO* to the speed of reading *FROM*?
Unfortunately, selecting data from a table and inserting data into that table are two completely different activities. It is not at all surprising to see a large difference in speed.
Of course. I do not expect that writing to the database will be as fast as reading from it. However, 6 minutes to write 11 MB of data is ridiculous, because this is the time it takes even when:
1) there are no text columns (so the nvarchar(max) issue I mentioned and saw in other cases cannot apply)
2) there are only floating-point numbers
3) there are no primary keys, no indices and no constraints of any kind
I trust you'll concur it's hard to justify such a long time to transfer such a small table, even under the conditions above.
I am not a DBA, but I cannot think of other reasons that would cause such terrible speeds. Any thoughts would be more than welcome!
Maybe the to_sql method is committing to the database after inserting every single row, and this slows things down? Even so, 6 minutes for 11 MB means about 31 KB per second. My internet connection in the '90s was faster than that :)
I have also been very surprised by the lack of info on the net. Either I am the only one using the to_sql method, or I am the only one who finds it so slow!
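One way to see whether the time is going into the driver rather than into pandas/SQLAlchemy is to time a bare pyodbc executemany against the same table. A sketch only: the connection string, table name and column names below are placeholders, not taken from this thread.

import time
import pyodbc

# Hypothetical connection string and a two-column float table; adjust to the real schema.
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=myserver;'
                      'DATABASE=mydatabase;Trusted_Connection=yes')
cursor = cnxn.cursor()

rows = [(float(i), float(i) * 2.0) for i in range(155000)]  # dummy float data

start = time.time()
cursor.executemany('INSERT INTO mytable (col1, col2) VALUES (?, ?)', rows)
cnxn.commit()
print('executemany took %.1f seconds' % (time.time() - start))

If this alone takes minutes, the bottleneck is the driver's executemany path rather than anything pandas or SQLAlchemy adds on top.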
On Wednesday, April 22, 2015 at 3:20:13 PM UTC+1, Michael Bayer wrote:
a COMMIT every time would slow it down, yes, but then in your profiling you'd see the do_commit() method taking up that time as well.
if you can get your logging going, you'll see what SQL it's emitting, and you can also set up the formatting to have timestamps so you can measure the time spent between queries.
Also, if you can, try out the pymssql driver, instead of pyodbc. pymssql is much better maintained these days.
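For reference, a minimal sketch of that logging setup, using only the standard library logging module and SQLAlchemy's sqlalchemy.engine logger:

import logging

# Attach a root handler with timestamps, then turn on SQL echo at INFO level.
logging.basicConfig(format='%(asctime)s %(levelname)s %(name)s %(message)s',
                    datefmt='%H:%M:%S')
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

# Any engine activity after this point (e.g. my_data_frame.to_sql(...))
# is logged with a timestamp, so gaps between statements become visible.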
Will try that, thanks.
In the meanwhile, I have tried your suggestions. The to_sql method creates a single insert statement and then commits at the end. I can't see anything which would justify such a low speed. The mystery remains!
If I can abuse your patience a bit more, is there a way to:
1) display the entire SQL statements created? The method you suggested shows only 10 records, and adds the text "displaying 10 of 1000 total bound parameter sets" when run on a 1000-record table;
2) generate the SQL statements but without connecting to the server? If I can get a text file with the insert statement, then I can run that from within my database client, and it will be faster.
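On point 2, one hedged sketch (assuming SQLAlchemy 0.9 or later, with a hypothetical two-column float table standing in for the real one) is to compile an insert statement with literal_binds, which renders the values inline so the text can be pasted into a database client:

from sqlalchemy import Table, Column, Float, MetaData
from sqlalchemy.dialects import mssql

metadata = MetaData()
# Hypothetical table definition standing in for the real 12-column table.
table = Table('mytable', metadata,
              Column('col1', Float),
              Column('col2', Float))

stmt = table.insert().values(col1=1.5, col2=2.5)
# Render the statement with values inlined, using the MSSQL dialect.
print(stmt.compile(dialect=mssql.dialect(),
                   compile_kwargs={'literal_binds': True}))

Looping over the DataFrame rows and writing each rendered statement to a file would give the text file mentioned above, though it says nothing about why to_sql itself is slow.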
[...]
I'm guessing that this issue is with the driver. Here's a semi-related thread:
* http://stackoverflow.com/questions/5693885/pyodbc-very-slow-bulk-insert-speed
It looks like the pyodbc driver handles "executemany" in a not very ideal manner. There were also some threads that noted ODBC tracing being on, and others that noted most Python drivers are just painfully slow.
In any event, if you're talking about a single insert statement.... that sounds like an executemany context and a driver issue.
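For completeness, a sketch of the driver swap suggested earlier in the thread: the same to_sql call routed through pymssql instead of pyodbc. The credentials are placeholders, pymssql is assumed to be installed, and nothing here confirms it is faster in this particular case.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credentials; mssql+pymssql is SQLAlchemy's pymssql dialect.
engine = create_engine('mssql+pymssql://user:password@myserver/mydatabase')

# Small stand-in DataFrame; the real my_data_frame would go here instead.
df = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [3.0, 4.0]})
df.to_sql('mytable', engine, if_exists='replace', index=False)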