SQLAlchemy taking too much time to process the result


Saylee M.

Jul 6, 2020, 2:14:33 PM
to sqlalchemy
Hello all, 
Hope you are fine and safe in these times!

I am very much a novice with SQLAlchemy.
I am trying to pull data from a MySQL database, from a table with around 20 columns and more than 10 million rows.
The table is partitioned and indexed. I am using a complex query, consisting of a UNION of multiple sub-queries, in the format:

with cte1 as <subquery1>, 
cte2 as <subquery2>, 
.. 
select * from cte1
union all
select * from cte2
;

The code demands a dynamic query, and the number of sub-queries depends on the user input.
For the sake of illustration, I took a query having two sub-queries.
So, when I ran the query against MySQL directly, it took very little time (around 0.016 seconds), but when I ran the same query through the SQLAlchemy connector, it took around 600 seconds.
I tried generating the query using only a text() clause in one approach, and I also tried a combination of the .select(), .where() and and_() query constructs (both sketched below).
However, both were taking a very long time.
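
For reference, here is roughly how I built the statement in each approach; the table, columns, and connection URL below are stand-ins rather than my real schema:

import random

from sqlalchemy import (
    Column, Integer, MetaData, Table, and_, create_engine, select, text, union_all,
)

engine = create_engine("mysql://user:pass@localhost/mydb")  # stand-in URL
metadata = MetaData()
data = Table(
    "data", metadata,
    Column("category", Integer),
    Column("value", Integer),
)

# Approach 1: assemble the whole CTE / UNION ALL statement as a raw SQL
# string and wrap it in text().
subqueries = [
    "select * from data where category = 1",
    "select * from data where category = 2",
]
cte_defs = ", ".join(
    "cte%d as (%s)" % (i, sq) for i, sq in enumerate(subqueries)
)
unions = " union all ".join(
    "select * from cte%d" % i for i in range(len(subqueries))
)
stmt_text = text("with %s %s" % (cte_defs, unions))

# Approach 2: Core constructs, one select()/where() per user-driven
# condition, each turned into a CTE and combined with union_all().
ctes = [
    select([data])
    .where(and_(data.c.category == i, data.c.value > 0))
    .cte("cte%d" % i)
    for i in (1, 2)
]
stmt_core = union_all(*[select([cte]) for cte in ctes])

with engine.connect() as conn:
    rows = conn.execute(stmt_core).fetchall()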

I am not sure what the issue can be. It would be great if I could get any pointers to reduce the time, preferably to under 10 seconds!

Thanks in advance!

Regards,
Saylee

Jonathan Vanasco

Jul 6, 2020, 3:19:46 PM
to sqlalchemy


On Monday, July 6, 2020 at 2:14:33 PM UTC-4, Saylee M. wrote:
 
So, when I ran the query against MySQL directly, it took very little time (around 0.016 seconds), but when I ran the same query through the SQLAlchemy connector, it took around 600 seconds.

"query ... MySQL directly"

Do you mean using the MySQL command-line client?  Assuming yes, the 0.016 seconds reflects only the time MySQL spent processing the query and generating the result set; the SQLAlchemy time includes that, plus transferring all the data over the wire, plus building Python data structures (which could be SQLAlchemy ORM objects or generic Python data structures).
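
As a rough way to see that split, you can time the execute and fetch phases separately; the connection URL and table name here are hypothetical:

import time

from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:pass@localhost/mydb")

with engine.connect() as conn:
    t0 = time.perf_counter()
    result = conn.execute(text("select * from mytable"))
    t1 = time.perf_counter()
    rows = result.fetchall()  # transfer + build a Python tuple per row
    t2 = time.perf_counter()

# Note: mysqlclient's default cursor is buffered, so much of the data
# transfer actually happens inside execute(); the split is approximate.
print("execute: %.3fs  fetchall: %.3fs  rows: %d" % (t1 - t0, t2 - t1, len(rows)))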

There are also external factors that can account for time changes, like server load, index loading, and cache utilization.

I am not sure what the issue can be. It would be great if I could get any pointers to reduce the time, preferably to under 10 seconds!

Showing a short, self-contained, correct example (SSCCE) of your code would let others troubleshoot it more effectively.  The most likely explanation, though, is that you are loading all the rows; the query time itself should be no different.
 

Mike Bayer

Jul 6, 2020, 3:48:05 PM
to noreply-spamdigest via sqlalchemy


On Mon, Jul 6, 2020, at 2:14 PM, Saylee M. wrote:
Hello all, 
Hope you are fine and safe in these times!

I am very much a novice with SQLAlchemy.
I am trying to pull data from a MySQL database, from a table with around 20 columns and more than 10 million rows.
The table is partitioned and indexed. I am using a complex query, consisting of a UNION of multiple sub-queries, in the format:

with cte1 as <subquery1>, 
cte2 as <subquery2>, 
.. 
select * from cte1
union all
select * from cte2
;

The code demands a dynamic query, and the number of sub-queries depends on the user input.
For the sake of illustration, I took a query having two sub-queries.
So, when I ran the query against MySQL directly, it took very little time (around 0.016 seconds), but when I ran the same query through the SQLAlchemy connector, it took around 600 seconds.

hi -

when you say "mysql directly", what are you referring to: the command-line client, or a Python driver?  Also, which Python driver are you using?

A baseline for how much time the operation actually needs should be produced by using the Python driver that you have with the *exact* SQL query you're starting with, then fully fetching all rows in Python.  It sounds a little off that the mysqlclient driver can fetch ten million rows in 0.016 seconds.  At the end of this message is a test script that fetches 5 million rows of just two columns each, using the raw mysqlclient driver; mysqlclient is a native driver that is as fast as you can get.  Running it on my laptop with *no network*, against a local MySQL server, it takes 7 seconds to fetch that many rows.  Ten million rows is a lot, and it's not going to be very fast in any case.

That said, when the ORM fetches full Python objects it is *much* slower than fetching rows, because there is a lot of Python overhead in building up the objects.  If you have 10M rows, just setting aside the memory for 10M heavy Python objects at once will take Python many seconds, and once it churns into swap space the time will grow dramatically.  It's unlikely you need the full-blown business object functionality on 10M rows at once, so I would fetch columns instead of objects.  There's discussion about this in the FAQ at https://docs.sqlalchemy.org/en/13/faq/performance.html#result-fetching-slowness-orm .  Additionally, the ORM can yield objects out in batches using yield_per(), keeping in mind that the database driver has probably buffered the raw rows in any case.
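
As a rough sketch of both ideas; the mapped class here is hypothetical, standing in for your real table:

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class MyData(Base):  # hypothetical mapped class
    __tablename__ = "data"
    id = Column(Integer, primary_key=True)
    x = Column(Integer)
    y = Column(Integer)

engine = create_engine("mysql://scott:tiger@localhost/test")
Session = sessionmaker(bind=engine)
session = Session()

# Fetch plain column tuples instead of full ORM objects: far less
# per-row Python overhead than building mapped instances.
for x, y in session.query(MyData.x, MyData.y):
    pass  # process each row

# If full objects are genuinely needed, yield them out in batches so
# the ORM doesn't materialize millions of objects up front.
for obj in session.query(MyData).yield_per(1000):
    pass  # process each object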

What I would suggest is to take a look at the example fetching suite at https://docs.sqlalchemy.org/en/13/_modules/examples/performance/large_resultsets.html .  This illustrates all the different ways you can fetch rows with SQLAlchemy and compares the speed of each.  You can run this suite straight from the distribution, and it will show the relative differences in speed between the different kinds of fetches.  The ORM in particular has a lot of work to do in both generating Python objects and populating them, and if you fetch rows with columns instead of objects, that will save most of the time.

To get a feel for this suite, here's a run from my own laptop:

$ python -m examples.performance large_resultsets --dburl mysql://scott:tiger@localhost/test
Running setup once...
Tests to run: test_orm_full_objects_list, test_orm_full_objects_chunks, test_orm_bundles, test_orm_columns, test_core_fetchall, test_core_fetchmany_w_streaming, test_core_fetchmany, test_dbapi_fetchall_plus_append_objects, test_dbapi_fetchall_no_object
test_orm_full_objects_list : Load fully tracked ORM objects into one big list(). (500000 iterations); total time 6.135940 sec
test_orm_full_objects_chunks : Load fully tracked ORM objects a chunk at a time using yield_per(). (500000 iterations); total time 3.340366 sec
test_orm_bundles : Load lightweight "bundle" objects using the ORM. (500000 iterations); total time 0.949388 sec
test_orm_columns : Load individual columns into named tuples using the ORM. (500000 iterations); total time 0.560157 sec
test_core_fetchall : Load Core result rows using fetchall. (500000 iterations); total time 0.466407 sec
test_core_fetchmany_w_streaming : Load Core result rows using fetchmany/streaming. (500000 iterations); total time 0.339930 sec
test_core_fetchmany : Load Core result rows using Core / fetchmany. (500000 iterations); total time 0.470984 sec
test_dbapi_fetchall_plus_append_objects : Load rows using DBAPI fetchall(), generate an object for each row. (500000 iterations); total time 0.476398 sec
test_dbapi_fetchall_no_object : Load rows using DBAPI fetchall(), don't make any objects. (500000 iterations); total time 0.330984 sec
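
For reference, the "fetchmany_w_streaming" case above corresponds roughly to this Core pattern; a sketch, assuming the same data table as the script below:

from sqlalchemy import create_engine, text

engine = create_engine("mysql://scott:tiger@localhost/test")

with engine.connect() as conn:
    # stream_results asks the driver for an unbuffered / server-side
    # cursor, so rows arrive as they are consumed rather than up front.
    result = conn.execution_options(stream_results=True).execute(
        text("select x, y from data")
    )
    while True:
        chunk = result.fetchmany(10000)  # pull rows in fixed-size batches
        if not chunk:
            break
        # process the chunk here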





import random
import time

from sqlalchemy import create_engine

e = create_engine("mysql://scott:tiger@localhost/test")

with e.connect() as conn:
    # build a 5M-row table of two integer columns
    conn.execute("drop table if exists data")
    conn.execute("create table data (x integer, y integer)")
    conn.execute(
        "insert into data (x, y) values (%s, %s)",
        [
            (random.randint(1, 10), random.randint(1, 10))
            for i in range(5000000)
        ],
    )

    # reach past SQLAlchemy's connection pool proxy to the raw
    # mysqlclient DBAPI connection, taking SQLAlchemy out of the picture
    raw_mysql_connection = conn.connection.connection

    cursor = raw_mysql_connection.cursor()

    # time a full fetch of all 5M rows using the bare driver
    now = time.perf_counter()
    cursor.execute("select x, y from data")
    cursor.fetchall()
    cursor.close()
    total = time.perf_counter() - now
    print("total time %s" % total)




I tried generating the query using only a text() clause in one approach, and I also tried a combination of the .select(), .where() and and_() query constructs.
However, both were taking a very long time.

I am not sure what the issue can be. It would be great if I could get any pointers to reduce the time, preferably to under 10 seconds!

Thanks in advance!

Regards,
Saylee


