Large RAM usage in bulk_insert_mappings

James Fennell

Nov 14, 2019, 12:16:56 PM11/14/19
to sqlal...@googlegroups.com
Hi all,

Just sharing some perf insights into the bulk operation function bulk_insert_mappings.

I was recently debugging a SQLAlchemy-powered web app that was crashing due to out-of-memory issues on a small Kubernetes node. It turned out to be "caused" by an over-optimistic invocation of bulk_insert_mappings. Basically, I'm reading a CSV file with ~500,000 entries into a list of dictionaries and then passing the whole list to bulk_insert_mappings in a single call. It turned out the SQLAlchemy work was using 750 MB of RAM, which was enough to OOM the small node the web app was running on.

A simple workaround is to split the list of 500,000 entries into chunks of 1,000 entries each and call bulk_insert_mappings on each chunk. When I do this, the extra memory usage is not even noticeable. It also seems that the chunked approach is actually faster! I might benchmark it to quantify the difference.
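
For reference, the chunked version looks roughly like this (just a sketch; `model` and `rows` stand in for the mapped class and the list of dicts parsed from the CSV, they're not the real names from my app):

    from sqlalchemy.orm import Session

    def insert_in_chunks(session: Session, model, rows, chunk_size: int = 1000):
        """Insert a large list of dicts in fixed-size chunks to cap memory usage."""
        for start in range(0, len(rows), chunk_size):
            session.bulk_insert_mappings(model, rows[start:start + chunk_size])
            # Committing per chunk lets the session release state between batches;
            # you could also commit once at the end if you need a single transaction.
            session.commit()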

Thought it was interesting. I wonder whether it would be worth adding this to the docs for bulk_insert_mappings? Given that the function is motivated by performance, it seems relevant.

James




Soumaya Mauthoor

Nov 14, 2019, 12:41:34 PM11/14/19
to sqlal...@googlegroups.com
What did you use to profile memory usage? I've recently been investigating memory usage when loading data, using memory_profiler, and would be interested to hear what the best approach is.
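
(For context, what I've been trying with memory_profiler is roughly the standard line-by-line profiling; the script and its contents below are just an illustration:

    # Run with: python -m memory_profiler load_data.py
    from memory_profiler import profile

    @profile
    def load_data():
        # Build a large list of dicts, similar to reading a big CSV.
        return [{"value": i} for i in range(500_000)]

    if __name__ == "__main__":
        load_data()

which prints per-line memory increments for the decorated function.)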


James Fennell

Nov 14, 2019, 2:42:03 PM11/14/19
to sqlal...@googlegroups.com
Because the memory spike was so bad (the application usually runs at about 250 MB of RAM, and it went up to roughly 1 GB during this process), I was able to find the problem by running htop and adding print statements to see where the Python code was in its execution when the RAM spike happened.

I unfortunately don't have any advice on good tools for tracking RAM usage in Python programs, but would love to hear if others do.



Elmer de Looff

Nov 15, 2019, 4:03:44 AM11/15/19
to sqlal...@googlegroups.com
I'm not even sure the problem is with the batch insert function itself; creating half a million dicts in Python is going to cause you a bit of a bad time on its own. That said, I've run into the same problem with a little toy project, which works around it with a 'bulk save' interface. With a minimal change you create the object/mapping to insert at the time you need it, push it into the bulk saver, and it flushes every time it hits a threshold. The actual threshold to use will depend a bit on your database and driver, and I imagine connection latency plays an important part as well.
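
The idea is roughly this (a minimal sketch, not the actual code from that project):

    class BulkSaver:
        """Collect mappings and insert them in batches once a threshold is reached."""

        def __init__(self, session, model, threshold=1000):
            self.session = session
            self.model = model
            self.threshold = threshold
            self._pending = []

        def add(self, mapping):
            self._pending.append(mapping)
            if len(self._pending) >= self.threshold:
                self.flush()

        def flush(self):
            if self._pending:
                self.session.bulk_insert_mappings(self.model, self._pending)
                self.session.commit()
                self._pending.clear()

You add mappings one at a time as you produce them, and call flush() once at the end to push whatever remains below the threshold.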




--

Elmer

Jonathan Vanasco

Nov 15, 2019, 5:41:26 PM11/15/19
to sqlalchemy


On Friday, November 15, 2019 at 4:03:44 AM UTC-5, Elmer de Looff wrote: 
That said, I've run into the same problem with a little toy project, which works around this with a 'bulk save' interface.

FWIW on something related... In my experience, if you need to focus on speed for bulk inserts / migrations, one of the fastest methods I've used is to have several Python scripts working in parallel. You can use a shared resource like a sqlite3 file or Redis to coordinate the workers. Usually I'll have workers "claim" a range of 10k items with Redis (if we're doing a migration, I'll use the Windowed Range Query recipe: https://github.com/sqlalchemy/sqlalchemy/wiki/WindowedRangeQuery).

It takes a little bit of extra work to determine the right number of workers to spin up, but parallel workers will usually make a migration or bulk insert task 5-15x faster for me.
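
The coordination piece is simple in principle; something like this (a hypothetical sketch, not my actual scripts; it assumes redis-py and a shared counter key):

    import redis

    BATCH_SIZE = 10_000
    r = redis.Redis()

    def claim_next_range():
        # INCR is atomic, so two workers can never claim the same block.
        block = r.incr("bulk_insert:next_block") - 1
        start = block * BATCH_SIZE
        return start, start + BATCH_SIZE

    def worker_loop(process_range, total_rows):
        # Each worker repeatedly claims a block and processes it
        # until there is nothing left to claim.
        while True:
            start, end = claim_next_range()
            if start >= total_rows:
                break
            process_range(start, end)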