I am using SQLAlchemy like this:

entries = session.query(User)
for entry in entries:
    entry.check_it_is_all_right()  # includes changing values of columns if necessary
session.flush()
As you might have noticed, I am iterating over all rows in the
database. Since there are thousands of rows, each containing about
10 kB of data (and I _do_ need all of the data for each entry), it is
inconvenient for me how SQLAlchemy treats this: it loads _all_
objects into memory. This takes approx. 30 seconds on a dedicated
machine at 100% CPU usage and consumes tens of MB of RAM. (On my
virtual testing machine that is fine, although who wants to wait 30
seconds only to discover that the processing fails with an exception
on the first entry :-), but I most probably can't use such a
resource-eater on a production server.)
It would, however, suffice if SQLAlchemy provided sequential access
to the result, without loading more than is really needed at the
moment. I thought the Python iterator protocol (the __iter__ and next
methods) is tailored exactly for this kind of efficient access; e.g.
iterating over the lines of a file object (for line in
open("file.txt", "r")) works exactly as efficiently as I would want
SQLAlchemy to in my case :-)
Is there a way to iterate efficiently, "on demand", over all rows
when using ORM object mapping, or do I have to use a more low-level
protocol (like SQLAlchemy without the ORM (fetchone()), or even the
Python DB-API itself)?
Thanks for your suggestions,
Boris Dušek
>
> Hi,
>
> I am using SQLAlchemy like this:
>
> entries = session.query(User)
> for entry in entries:
>     entry.check_it_is_all_right()  # includes changing values of columns if necessary
> session.flush()
>
> It would, however, suffice if SQLAlchemy provided sequential access
> to the result, without loading more than is really needed at the
> moment. I thought the Python iterator protocol (the __iter__ and
> next methods) is tailored exactly for this kind of efficient
> access; e.g. iterating over the lines of a file object (for line in
> open("file.txt", "r")) works exactly as efficiently as I would want
> SQLAlchemy to in my case :-)
I will show you some ways to do it; however, yes, I think you should
consider using SQL-level results instead of ORM-level results to do
what you want. The 30 seconds of overhead you're experiencing is
probably not so much due to memory consumption as it is due to the
overhead ORM loads require in order to instantiate objects,
initialize and populate their attributes, and perform various
decision-making with regard to extension hooks and the like.
Additionally, SQLAlchemy does place some overhead on SQL-level
results as well, since we do things like decode utf-8 into unicode
and similar result processing, but this overhead is much smaller
(though still not as fast as the raw DBAPI).
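To make the SQL-level path concrete, here's a minimal sketch (with
'users' standing in for the Table your User class is mapped to, and
process_row() a placeholder for your own logic):

result = engine.execute(users.select())
for row in result:
    process_row(row)   # rows are lightweight RowProxy objects, e.g. row['name']
result.close()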
So first, my thoughts on a "streaming" Query object. When looking to
"stream" results from a Query (which ultimately comes from its
instances() method), you need to consider the ORM's behavior and
design regarding sessions and units of work; when you load objects,
the full result is stored within a session, with the assumption that
you're going to manipulate and work with these objects. To iterate
through pieces of data and not hold onto them, you'd want to expunge
as you load. There is no functionality built directly into
query.instances() to achieve this right now. A major reason it's
difficult is that it's not very easy to tell when an individual
instance is fully "loaded"; many subsequent rows may apply to a
single result instance as it loads in related collections due to a
join. It's also problematic to ensure the uniqueness of instances
for some kinds of queries, particularly those which eagerly load
related items (object #1 references object A, then is returned and
expunged; object #2 also references object A, but now you get a
*different* instance of A, since the previous copy of it was
expunged. Surprises ensue).
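(For completeness, expunging by hand after processing looks like the
sketch below, with process() standing in for your own logic; note
this only releases instances after the fact, and does not avoid the
up-front load described above.)

for obj in session.query(MyObject):
    process(obj)          # per-row work, possibly mutating 'obj'
    session.expunge(obj)  # remove the instance from the session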
So I fear that embedding this feature directly into Query would lead
to many confused users trying to get a "quick" way to be more
efficient without really understanding the consequences... and as I
mentioned, I think the overhead is primarily just populating the
objects themselves, not the memory allocation part of it, so this
feature would probably not solve too many problems (not to mention
that the increased complexity would slow it down even more).
Externally, the most straightforward way to achieve this with Query
is to use LIMIT and OFFSET (typically by applying array slices to a
Query) so that you query only some results at a time:
query = session.query(MyObject).filter(<whatever>)
start = 0
while True:
    result = query[start:start + 100]
    <process result>
    session.clear()   # or session.expunge() each member in 'result'
    if len(result) < 100:
        break
    else:
        start += 100
The above approach is also compatible with queries which use eager
loading, since the LIMIT/OFFSET is applied inside of a subquery (in
the case of eager loads being present) so that the eager LEFT OUTER
JOINs are tacked onto the correct core rowset.
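If you'd like to keep the plain for-loop syntax from your original
code, the same clumping logic can be wrapped in a generator (again
untested, and the clumped() name is just for illustration):

def clumped(session, query, clump_size=100):
    """Yield instances from 'query' in clumps, clearing the
    session between clumps so instances don't accumulate."""
    start = 0
    while True:
        result = query[start:start + clump_size]
        for instance in result:
            yield instance
        session.flush()   # save pending changes before clear() discards them
        session.clear()
        if len(result) < clump_size:
            break
        start += clump_size

for entry in clumped(session, session.query(User)):
    entry.check_it_is_all_right()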
But the above approach issues many queries, namely (number of rows /
"clump" size) of them. To work around issuing clumped queries with
LIMIT/OFFSET, here is a variant on that idea (not tested, may need
adjusting) which issues just one SQL statement, but is not
compatible with eager loading (unless you really tweaked it):
class FakeResult(object):
    def __init__(self, result):
        self.result = result

    def fetchall(self):
        """Query.instances() calls fetchall() to retrieve results;
        return only a 'slice' of results at a time."""
        return self.result.fetchmany(100)

q = session.query(MyObject).filter(<whatever>)
result = FakeResult(engine.execute(q.compile()))
while True:
    results = q.instances(result)
    <process results>
    session.clear()   # or session.expunge() each member in 'results'
    if len(results) < 100:
        break
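The trick here is that the single ResultProxy stays open for the
whole loop: each fetchall() call on the wrapper actually performs a
fetchmany(100), so instances() materializes at most 100 objects per
pass while the underlying cursor keeps advancing.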
>
> Is there a way to iterate efficiently, "on demand", over all rows
> when using ORM object mapping, or do I have to use a more low-level
> protocol (like SQLAlchemy without the ORM (fetchone()), or even the
> Python DB-API itself)?
So, on to the raw SQL idea. Yes, if you can adjust your processing
functions in this case to deal with rows and not ORM objects, you
will be *way* better off, since there is a lot of memory and
processing overhead associated with ORM row loading whether or not
it's caching everything. ResultProxy objects are much faster and
don't cache anything (by default). Even with the approaches I've
outlined above, the ORM adds a lot of overhead. If you are just
loading simple objects, you can still get your objects back using a
simple paradigm such as:
for row in result:
    instance = MyClass.__new__(MyClass)
    for key in row.keys():
        instance.__dict__[key] = row[key]
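(Note that __new__(MyClass) creates a bare instance without running
__init__; the loop then fills its __dict__ directly from the row, so
no constructor logic executes at all.)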
*However*. From a memory/buffering standpoint, be aware that **many
DBAPIs pre-buffer all the rows anyway, before you ever get them**.
So from a memory standpoint, you still might have difficulty scaling
this in an unlimited fashion unless you break the request into
multiple queries, since even though raw rows take up far less memory
than a mapped instance, there's still a limit. If you are using
cx_Oracle, I believe you get the rows unbuffered, but with psycopg2,
the result sets are fully buffered by default; you have to use
SQLAlchemy's "server_side_cursors=True" flag with the PGDialect,
which uses some less-documented features of psycopg2 to work around
it and jumps through some extra hoops to deal with PG
idiosyncrasies. Other DBAPIs I haven't tested recently to see
whether they're buffering or not.
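For example (untested sketch; substitute your own connection URL),
the flag is passed when the engine is created:

from sqlalchemy import create_engine

# with server_side_cursors=True, psycopg2 uses a named (server-side)
# cursor, so rows are streamed rather than fully buffered up front
engine = create_engine('postgres://scott:tiger@localhost/test',
                       server_side_cursors=True)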
So that's the deal with buffering; hope it helps.
- mike
Thank you so much for such an extensive answer. It has given me much
better insight into the topic, so that I can now make a qualified
decision on how to proceed.
Best regards,
Boris