SQLAlchemy Access to Native Bulk Loaders


Ben

Apr 19, 2020, 12:46:45 PM
to sqlalchemy
I hope this is the right place for this... I need to load large files into my database. As I understand it, I can do this in one of two ways with SQLAlchemy Core: 1) bring the data into Python and then write it out with the add method or, alternatively, 2) use SQLAlchemy to issue a command to the DB telling it to use its native bulk loader to read the data from my file. I would expect the second approach to be faster, to require less code, and to avoid issues such as trying to hold too much in memory at one time. However, it is DB-vendor-specific (i.e., I believe the command I'd send to a MySQL DB will differ from the one I'd send to a Postgres DB).
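To make this concrete, here's roughly what I have in mind for each (just a sketch; the engine URL and the table and file names are made up):

    import csv
    from sqlalchemy import create_engine, MetaData, Table

    engine = create_engine("postgresql://user:pass@localhost/mydb")
    metadata = MetaData()
    events = Table("events", metadata, autoload=True, autoload_with=engine)

    # 1) Read the file into Python, then write it out through Core
    #    (conn.execute() with a list of dicts does an executemany)
    with open("events.csv") as f, engine.begin() as conn:
        conn.execute(events.insert(), [dict(row) for row in csv.DictReader(f)])

    # 2) Ask the DB to load the file itself (Postgres COPY shown; not
    #    portable, and the file must be readable by the DB server itself)
    with engine.begin() as conn:
        conn.execute("COPY events FROM '/srv/data/events.csv' CSV HEADER")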

So, 
  • Do I properly understand SQLAlchemy's capabilities here or am I missing something?
  • If I do have this right, is generic access to bulk loaders something that is on the upgrade / new development list?
Sorry if this isn't the right place for this.
Thanks!
Ben

James Fennell

Apr 19, 2020, 1:23:06 PM
to sqlal...@googlegroups.com
Hi Ben,


Bulk operations provide a DB-agnostic API for doing large numbers of inserts and/or updates. The speed-up you see compared to session.add depends on your database and the database connection arguments. With Postgres, I've needed to enable batch mode to see the full benefit: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-batch-mode
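For example, with the psycopg2 dialect you can turn that on when creating the engine (the URL and page size below are placeholders; executemany_mode needs SQLAlchemy 1.3.7 or later):

    from sqlalchemy import create_engine

    # Route executemany calls through psycopg2's execute_batch()
    engine = create_engine(
        "postgresql+psycopg2://user:pass@localhost/mydb",
        executemany_mode="batch",
        executemany_batch_page_size=500,  # rows per round trip
    )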

Personally, I've seen huge performance improvements in my applications after migrating big operations from session.add over to the bulk API.

Of course, as you say, you can drop down to lower-level SQL calls to get it even faster, but then you run into a bunch of other issues.

James



Ben

Apr 19, 2020, 3:01:43 PM
to sqlalchemy
Hi, James, thank you for the info. I'll give this a try. Still, it does seem to require that the data be read into Python and then written out to the DB, albeit quickly. What I'd prefer is a way to issue a call to the DB to have it 'suck in' a CSV file directly, rather than having to read it into Python first. Most databases have a bulk loader to handle such jobs, so I would think it would be possible to provide a vendor-neutral way to call this functionality, but I could be missing something. Perhaps I should look at adding this to the SQLAlchemy GitHub project, but I'm not sure that the world would want to rely on my code :)

In any case, thank you so much for taking the time to reply.

Ben

Richard Damon

Apr 19, 2020, 3:59:48 PM
to sqlal...@googlegroups.com
On 4/19/20 3:01 PM, Ben wrote:
> Hi, James, thank you for the info. I'll give this a try. Still, it
> does seem to require that the data be read into Python and then
> written out to the DB, albeit quickly. What I'd prefer is a way to
> issue a call to the DB to have it 'suck in' some csv file directly,
> rather than having to read it into Python first. Most databases have a
> bulk loader to handle such jobs so I would think it would be possible
> to provide a vendor-neutral way to call this functionality but I could
> be missing something. Perhaps I should look at adding this to the
> SQLAlchemy github project but I'm not sure that the world would want
> to rely on my code :) 
>
> In any case, thank you so much for taking the time to reply.
>
> Ben

I can't think of any SQL engines where the engine itself can read a CSV
file to load a database (in many cases, the actual SQL engine is off on
another machine with the database and has no direct link to the local
file system). Many of them, like SQLite, have a command-line interface
that can read the file and insert it into the database.

Also, some higher-end wrappers might provide such a feature, but that is
NOT part of the base SQL language.

--
Richard Damon

Benjamin Taub

Apr 19, 2020, 4:44:56 PM
to sqlal...@googlegroups.com
Thanks for taking the time to respond, Richard. I may be thinking of the command-line option that you mentioned. However, I do see that MySQL has a LOAD DATA statement (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) that, I think, does what I have in mind. Similarly, Postgres has COPY (https://www.postgresql.org/docs/9.2/sql-copy.html) and SQL Server has BULK INSERT (https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15).

These seem to be embedded in the related SQL implementations but are clearly not ANSI standard. I'm not sure whether that disqualifies this as a SQLAlchemy feature request, or whether anyone else could even use it, but functionality like this is something that, at least for me, would make my implementation more DB-independent.
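For what it's worth, here's roughly how I imagine invoking one of these through SQLAlchemy today (a sketch only; MySQL with PyMySQL shown, and the URL, path, and table name are made up):

    from sqlalchemy import create_engine

    # LOAD DATA LOCAL INFILE must be enabled on the driver side; with
    # PyMySQL that's a connect argument
    engine = create_engine(
        "mysql+pymysql://user:pass@localhost/mydb",
        connect_args={"local_infile": True},
    )
    with engine.begin() as conn:
        conn.execute(
            "LOAD DATA LOCAL INFILE '/tmp/events.csv' "
            "INTO TABLE events "
            "FIELDS TERMINATED BY ',' IGNORE 1 LINES"
        )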

Anyhow, thanks again for your note and your work on SQLAlchemy. I appreciate it.

Ben

Richard Damon

Apr 19, 2020, 6:05:01 PM
to sqlal...@googlegroups.com
On 4/19/20 4:44 PM, Benjamin Taub wrote:
> Thanks for taking the time to respond, Richard. I may be thinking
> about the command line option that you mentioned. However, I do see
> that MySQL has a LOAD DATA statement
> (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) that, I
> think, does what I'm thinking about. Similarly, Postgres has COPY
> (https://www.postgresql.org/docs/9.2/sql-copy.html) and SQL Server has
> BULK INSERT
> (https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15).
>
> These seem to be embedded in the related SQL implementations but are
> clearly not ANSI standard. I'm not sure if that makes it disqualifying
> for a SQLAlchemy feature request, or if anyone else could even use it,
> but functionality like this is something that, at least for me, would
> make my implementation more DB independent.
>
> Anyhow, thanks again for your note and your work on SQLAlchemy. I
> appreciate it.
>
> Ben

I'll admit that wasn't a command I was familiar with, but being
DB-specific, it's the sort of thing I try to minimize the use of.

--
Richard Damon

Jonathan Vanasco

Apr 20, 2020, 12:39:54 PM
to sqlalchemy

On Sunday, April 19, 2020 at 4:44:56 PM UTC-4, Ben wrote:

> These seem to be embedded in the related SQL implementations but are clearly not ANSI standard. I'm not sure whether that disqualifies this as a SQLAlchemy feature request, or whether anyone else could even use it, but functionality like this is something that, at least for me, would make my implementation more DB-independent.

The major database servers and clients do have similar extensions to ANSI and the Python DBAPI here, but they're fairly different from one another in how they work and in their input/output formats (and capabilities). They also are not necessarily supported by the underlying database drivers. Psycopg2 has "copy_" methods on the cursor (https://www.psycopg.org/docs/usage.html#copy); there are several MySQL drivers, and of the two I looked at, neither has explicit support for this. I'm not sure how this would really work anyway - it's basically functionality designed to be implemented in the C clients against local files.
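As a sketch of the psycopg2 route (the URL, table, and file path are made up), the "copy_" methods are reachable by dropping down to the raw DBAPI connection:

    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

    # raw_connection() exposes the underlying psycopg2 connection, which
    # is where copy_expert() lives
    conn = engine.raw_connection()
    try:
        with conn.cursor() as cur, open("/tmp/events.csv") as f:
            cur.copy_expert("COPY events FROM STDIN WITH CSV HEADER", f)
        conn.commit()
    finally:
        conn.close()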

The "bulk loading" functionality in SqlAlchemy is largely in the realm of being a low-level DBAPI operations for insert statements, and SqlAlchemy's performance is about as close to using the driver itself - while still having the syntactic sugar and not needing to write to a database specifically.  I don't believe there is any code to invoke a database client to read a source file itself; I am often mistaken though as Mike has put a ton of work into SqlAlchemy over the years.

From experience with large imports, I can tell you the following:

* The best performance for large imports involves:
    1) take the database offline to clients
    2) drop all the foreign key and check constraints (saving their definitions first, so you can recreate them)
    3) import your data from a file with the native client that ships with your database
    4) add the foreign key and check constraints back in
    If you feel the need to script this, you could use `subprocess` to invoke the database's native client on the machine (see the sketch below).

* The second best performance is:
    use SQLAlchemy's bulk strategies
    segment your source data file into ranges
    run multiple processes in parallel, each processing a subset of the source data file

* The third best performance is:
    write a script that uses the SQLAlchemy ORM with the 'windowed query' option to iterate over a range of lines in your source file
    have the script use a third-party tool like Redis to mark that it has "checked out" a section of the source data and whether it has completed that range
    run multiple processes in parallel, each processing a subset of the source data file

In the second and third approaches, you can periodically run a query or script to calculate the right number of processes to run. It could be 7, or 20, or another number.
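A sketch of scripting step 3 of the first approach with Postgres (the connection URL, table, and path are placeholders; psql's client-side \copy means the file only needs to exist on the machine running the script):

    import subprocess

    subprocess.run(
        [
            "psql",
            "postgresql://user:pass@localhost/mydb",
            "-c",
            r"\copy events FROM '/tmp/events.csv' WITH (FORMAT csv, HEADER)",
        ],
        check=True,
    )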

Benjamin Taub

Apr 22, 2020, 4:03:44 PM
to sqlal...@googlegroups.com
Thank you for taking the time to lay that out, Jonathan. I'm not sure yet whether this will be a major issue for me, and I'd like to stay within SQLAlchemy to the greatest extent possible. Since I'm using the Core, I think I'll try your second option. If I run into issues, I can rethink at that time.
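For my own notes, here's roughly how I picture that second option (just a sketch; the URL, table, and file names are made up, the worker count is a guess, and executemany_mode assumes the psycopg2 batch mode mentioned earlier in the thread):

    import csv
    import itertools
    from multiprocessing import Pool

    from sqlalchemy import create_engine, MetaData, Table

    DB_URL = "postgresql+psycopg2://user:pass@localhost/mydb"
    CHUNK_SIZE = 10_000

    def load_chunk(rows):
        # each worker builds its own engine; engines shouldn't cross processes
        engine = create_engine(DB_URL, executemany_mode="batch")
        metadata = MetaData()
        events = Table("events", metadata, autoload=True, autoload_with=engine)
        with engine.begin() as conn:
            conn.execute(events.insert(), rows)
        engine.dispose()

    def chunks(path):
        # yield the CSV as lists of dicts, CHUNK_SIZE rows at a time
        with open(path) as f:
            reader = csv.DictReader(f)
            while True:
                block = [dict(r) for r in itertools.islice(reader, CHUNK_SIZE)]
                if not block:
                    return
                yield block

    if __name__ == "__main__":
        with Pool(4) as pool:
            # imap_unordered keeps only a few chunks in memory at a time
            for _ in pool.imap_unordered(load_chunk, chunks("/tmp/events.csv")):
                pass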

Thanks again!

Ben