Working with large IN lists


Vlad K.

Feb 21, 2012, 8:07:50 AM
to sqlal...@googlegroups.com

Hi all,

I have to read thousands of rows from a table and compile some data, but
under certain conditions also update those rows, all with the same value. The
ratio of reads to writes here is as wide as possible: sometimes no rows,
sometimes a few, and sometimes every row that is read has to be updated.
That last scenario is the one that concerns me.


for row in query.yield_per(100):
    # Do something with data

    if some_condition:
        row.some_column = 123
        session.flush()


I am thinking about just adding the row's ID to a list:

list_of_ids = []
for row in query.yield_per(100):
    # Do something with data

    if some_condition:
        list_of_ids.append(row.primary_key)

and near the end of the transaction do:

session.query(Model).filter(Model.primary_key.in_(list_of_ids))\
    .update({"some_column": 123}, synchronize_session=False)

Yes, I'm aware of the increased memory requirements for storing the ID list on
the application side, and no, I don't need to lock the rows for update; the
single atomic update at the end is sufficient for my case.

But I think, and real-use benchmarks will probably show it (I haven't tested
yet), that a single update query will be faster. I need the lowest transaction
processing time on the application side for the entire call, even if it takes
more memory and more database iron.

What I'm concerned with here is whether there are any limits or significant
overheads with "large" .in_() lists.

The backend is PostgreSQL via psycopg2.

Thanks

--

.oO V Oo.

Manav Goel

Feb 21, 2012, 10:24:24 AM
to sqlalchemy
This depends on the execution plan of the query and is really more of a
PostgreSQL question. Google "postgresql IN performance" and you will get
a good idea of it.
By the look of your code, the second option would obviously be faster, as
it hits the database once, whereas the first one flushes after every change.

Regards,

Simon King

Feb 21, 2012, 10:35:17 AM
to sqlal...@googlegroups.com

A *long* time ago (SQLAlchemy 0.3), I had some performance problems
with large "IN" clauses, and the reason turned out to be SQLAlchemy
taking a lot of time to build long lists of bindparam objects. I've no
idea whether this is still the case these days. The best thing you can do
is just try it.
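If you want to see what gets built, rendering the statement shows one bound
parameter per element of the list (a quick sketch based on the code in your
message):

stmt = Model.__table__.update()\
    .where(Model.primary_key.in_(list_of_ids))\
    .values(some_column=123)

# each element of list_of_ids is rendered as its own bind parameter
print str(stmt)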

Simon

Vlad K.

Feb 21, 2012, 8:30:42 PM
to sqlal...@googlegroups.com

Thanks for your replies.

Using the IN list definitely speeds up the process, but I hate the
resulting query, which uses a bound variable for each and every element of
the list.

But I have another problem with this: there's a massive memory leak
somewhere. Take a look at this model:

class GatewayTransaction(Base):
    __tablename__ = "gateway_transactions"

    realestate_id = Column(Integer, ForeignKey("realestate.realestate_id",
                                               ondelete="set null",
                                               onupdate="cascade"),
                           primary_key=True)
    portal_id = Column(Text, primary_key=True)
    realestate_portal_id = Column(Unicode)
    operation = Column(Text, nullable=False)
    agency_id = Column(Integer, ForeignKey("agencies.agency_id",
                                           ondelete="set null",
                                           onupdate="cascade"),
                       nullable=False)
    agency_portal_id = Column(Unicode, nullable=False)
    agency_export_token = Column(Unicode, nullable=False)
    user_id = Column(Integer, ForeignKey("users.user_id",
                                         ondelete="set null",
                                         onupdate="cascade"),
                     nullable=False)
    mod_images = Column(Boolean)

    agency = relationship("Agency", lazy="joined")
    realestate = relationship("Realestate", lazy="joined")
    user = relationship("User", lazy="joined")


Now, when I do this:


for row in some_query.all():
    gt = session.query(GatewayTransaction)\
        .filter(GatewayTransaction.realestate_id == row.realestate_id)\
        .filter(GatewayTransaction.portal_id == k)\
        .first() or GatewayTransaction()

    # Do some data processing
    # ...

    # Update existing or insert as new
    gt = session.merge(gt)
    session.flush()


It is very, very slow: it takes minutes to process 2000 rows, and memory
usage skyrockets into the multi-GB range, so I have to terminate it
before it starts swapping like hell. With lazy="select" it flies, done
in a couple of seconds with very little memory consumed, because at
this point there are no rows in the table, so nothing is additionally
selected, only inserted. Still, why would a join slow things down so
drastically and push Python memory usage (not the DB's) sky-high?

Also, even if I try session.expunge(gt) or expunge_all() (after preparing
the "rows" to be loaded one by one from a list of IDs), the memory always
keeps growing, as if the instances never die and never get garbage
collected...

.oO V Oo.

Vlad K.

Feb 22, 2012, 8:06:51 AM
to sqlal...@googlegroups.com

Okay, it seems two things are going on here. First, I am probably abusing
relationships and should reconsider their use. Second, the "memory leak"
is still happening: memory usage just keeps growing, despite the fact
that in each of the 2000 iterations I spawn an instance, modify it,
flush it back to disk and/or even commit, and expunge or explicitly
delete the objects at the end of the iteration.

I must note that I'm using this in a Pyramid app, which uses Transaction
to wrap SQLA's transaction management. So maybe that's where the
problem is, although I don't see how, because Transaction does not know or
care about individual session model instances, no?

I also seem to misunderstand the relationships. By default the model
defines lazy="joined", which is what I need normally. In this
particular loop I do not need the relationships, so I add
.options(noload(...)), where for ... I have tried everything: relationship
names as strings, as class attributes, even "*". It still constructs
massive joined queries, as if either noload() can't override
lazy="joined" or something else overrides the noload() given directly
on the query:

gt = session.query(GatewayTransaction)\
    .options(noload("*"))\
    .filter(GatewayTransaction.realestate_id == row.realestate_id)\
    .filter(GatewayTransaction.portal_id == k)\
    .first() or GatewayTransaction()
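
For the record, the per-attribute form I tried looks roughly like this (a
sketch of the same query, with the relationships spelled out):

from sqlalchemy.orm import noload

gt = session.query(GatewayTransaction)\
    .options(noload(GatewayTransaction.agency),
             noload(GatewayTransaction.realestate),
             noload(GatewayTransaction.user))\
    .filter(GatewayTransaction.realestate_id == row.realestate_id)\
    .filter(GatewayTransaction.portal_id == k)\
    .first() or GatewayTransaction()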

.oO V Oo.

Vlad K.

Feb 22, 2012, 9:46:37 AM
to sqlal...@googlegroups.com

Okay, after several test cases, with various join combinations with or
without relationships, and with or without cherry-picking the columns that
are actually used from the joined models, I've come to the conclusion that
the only problem I'm having here is that there is no garbage collection.
Python memory use just keeps growing at a rate that, of course, depends
on the size of the models used and the data queried, but it keeps growing
regardless of releasing/deleting instances or isolating each row's
processing in its own committed transaction.

I also found this:

http://permalink.gmane.org/gmane.comp.python.sqlalchemy.user/30087


So it appears I'm having the same problem.


Am I understanding correctly that, because of this, the SQLAlchemy ORM is
useless in my case if I have to process thousands of rows, because the
memory used to process each row (along with the corresponding joined models
etc.) will not be released? So basically I'd have to use SQLA without
the ORM for this particular use case?

Or is this some memory leak bug?

If so, any suggestions or examples on how to switch from ORM to non-ORM
use, if I want to retain the named tuples returned by queries and avoid
rewriting half the app?


Thanks.


.oO V Oo.


Michael Bayer

Feb 22, 2012, 1:23:38 PM
to sqlal...@googlegroups.com

When we want to test whether a Python program has a "leak", we do that by seeing how many uncollected objects are present. This is done via gc:

import gc
print "total number of objects:", len(gc.get_objects())

That's the only real way to measure if the memory used by Python objects is growing unbounded. Looking at the memory usage on "top" shows what the interpreter takes up - the CPython interpreter in more modern releases does release memory back, but only occasionally. Older versions don't.

If you're doing an operation that loads thousands of rows, those rows are virtually always loaded entirely into memory by the DBAPI, before your program or SQLAlchemy is ever given the chance to fetch a single row. I haven't yet looked closely at your case here, but that's often at the core of scripts that use much more memory than expected.

There's ways to get *some* DBAPIs to not do this (particularly psycopg2, if you're using Postgresql, see http://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per and http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=stream_results#sqlalchemy.engine.base.Connection.execution_options), though the better solution is to usually try loading chunks of records in at a time (one such recipe that I use for this is here: http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery) . Or better yet consider if the problem can be solved entirely on the SQL side (this entirely depends on exactly what you're trying to do with the data in question).
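
As a rough sketch of the chunking idea (not the actual recipe, which uses window functions; I'm reusing the Model/primary_key names from your first message):

def windowed_rows(session, chunk_size=1000):
    # walk the table in primary key order, one chunk at a time, so neither
    # the DBAPI nor the session holds the whole result set at once
    last_id = 0
    while True:
        rows = session.query(Model)\
            .filter(Model.primary_key > last_id)\
            .order_by(Model.primary_key)\
            .limit(chunk_size)\
            .all()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1].primary_key
        session.expunge_all()   # let the processed chunk be garbage collected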


Vlad K.

Feb 22, 2012, 2:12:55 PM
to sqlal...@googlegroups.com

Hi,

thanks for your reply. I haven't yet tested this with a profiler to see
exactly what is happening, but the bottom line is that the overall memory
use grows with each iteration (or transaction processed), to the point of
grinding the server to a halt, and top shows the Python process involved
consuming all the memory.

I've already modified the code to read one row at a time, by first creating
a list of IDs to be affected, then going through that list and selecting
and updating/inserting one transaction at a time.

I suppose I could solve the problem entirely on the SQL side with a stored
function, but that's a maintenance overhead I'd like to avoid if possible.

Meanwhile I've gotten rid of the "convenience" relationships and in some
places decided on lazy="select" instead of "subquery" or "joined", which has
brought down total memory use; the entire process can now finish within the
amount of RAM available on the server, but memory still shows linear
growth from the start to the end of the process.

.oO V Oo.

Vlad K.

Feb 22, 2012, 2:19:04 PM
to sqlal...@googlegroups.com

Yes, gc.get_objects() is definitely growing, at a rate of 700-800 objects per iteration.

.oO V Oo.



Michael Bayer

Feb 22, 2012, 2:29:30 PM
to sqlal...@googlegroups.com

On Feb 22, 2012, at 2:12 PM, Vlad K. wrote:

>
> Hi,
>
> thanks for your reply. I haven't yet tested this with a profiler to see exactly what exactly is happening, but the bottom line is that the overall memory use grows with each iteration (or transaction processed), to the point of grinding the server to a halt, and top shows only the Python process involved consuming all the memory.

yeah like I said that tells you almost nothing until you start looking at gc.get_objects(). If the size of gc.get_objects() grows continuously for 50 iterations or more, never decreasing even when gc.collect() is called, then it's a leak. Otherwise it's just too much data being loaded at once.
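
In loop form, that's something along these lines (a sketch; do_one_iteration() stands in for whatever one unit of your processing is):

import gc

for i in xrange(50):
    do_one_iteration()
    gc.collect()
    print "iteration", i, "live objects:", len(gc.get_objects())

# a real leak shows the count climbing on every iteration and never leveling off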


Claudio Freire

Feb 22, 2012, 2:46:20 PM
to sqlal...@googlegroups.com
On Wed, Feb 22, 2012 at 4:29 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>> thanks for your reply. I haven't yet tested this with a profiler to see exactly what exactly is happening, but the bottom line is that the overall memory use grows with each iteration (or transaction processed), to the point of grinding the server to a halt, and top shows only the Python process involved consuming all the memory.
>
> yeah like I said that tells you almost nothing until you start looking at gc.get_objects().  If the size of gc.get_objects() grows continuously for 50 iterations or more, never decreasing even when gc.collect() is called, then it's a leak.  Otherwise it's just too much data being loaded at once.

I've noticed compiling queries (either explicitly or implicitly) tends
to *fragment* memory. There seem to be long-lived caches in the PG
compiler at least. I can't remember exactly where, but I could take
another look.

I'm talking about rather old versions of SQLA, 0.3 and 0.5.

Michael Bayer

Feb 22, 2012, 3:18:34 PM
to sqlal...@googlegroups.com


0.3's code is entirely gone, years ago. I wouldn't even know what silly things it was doing.

In 0.5 and beyond, there's a cache of identifiers for quoting purposes. If you are creating perhaps thousands of tables with hundreds of columns, all names being unique, then this cache might start to become a blip on the radar. For the expected use case of a schema with at most several hundred tables, this should not be a significant size.

I don't know much what it means for a Python script to "fragment" memory, and I don't really think there's some kind of set of Python programming practices that deterministically link to whether or not a script fragments a lot. Alex Martelli talks about it here: http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python . The suggestion there is if you truly need to load tons of data into memory, doing it in a subprocess is the only way to guarantee that memory is freed back to the OS.

As it stands, there are no known memory leaks in SQLAlchemy itself and if you look at our tests under aaa_profiling/test_memusage.py you can see we are exhaustively ensuring that the size of gc.get_objects() does not grow unbounded for all sorts of awkward situations. To illustrate potential new memory leaks we need succinct test cases that illustrate a simple ascending growth in memory usage.


Claudio Freire

Feb 22, 2012, 3:28:13 PM
to sqlal...@googlegroups.com
On Wed, Feb 22, 2012 at 5:18 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>
> On Feb 22, 2012, at 2:46 PM, Claudio Freire wrote:
>
>> On Wed, Feb 22, 2012 at 4:29 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
>>>> thanks for your reply. I haven't yet tested this with a profiler to see exactly what exactly is happening, but the bottom line is that the overall memory use grows with each iteration (or transaction processed), to the point of grinding the server to a halt, and top shows only the Python process involved consuming all the memory.
>>>
>>> yeah like I said that tells you almost nothing until you start looking at gc.get_objects().  If the size of gc.get_objects() grows continuously for 50 iterations or more, never decreasing even when gc.collect() is called, then it's a leak.  Otherwise it's just too much data being loaded at once.
>>
>> I've noticed compiling queries (either explicitly or implicitly) tends
>> to *fragment* memory. There seem to be long-lived caches in the PG
>> compiler at least. I can't remember exactly where, but I could take
>> another look.
>>
>> I'm talking of rather old versions of SQLA, 0.3 and 0.5.
>
>
> 0.3's code is entirely gone, years ago.  I wouldn't even know what silly things it was doing.

Like I said, I would have to take another look into the matter to
validate it against 0.7.

> In 0.5 and beyond, theres a cache of identifiers for quoting purposes.   If you are creating perhaps thousands of tables with hundreds of columns, all names being unique, then this cache might start to become  a blip on the radar.   For the expected use case of a schema with at most several hundred tables this should not be a significant size.

Fixed schema, but the code did create lots of aliases for dynamic queries.

> I don't know much what it means for a Python script to "fragment" memory, and I don't really think there's some kind of set of Python programming practices that deterministically link to whether or not a script fragments a lot.  Alex Martelli talks about it here: http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python .    The suggestion there is if you truly need to load tons of data into memory, doing it in a subprocess is the only way to guarantee that memory is freed back to the OS.

I wrote a bit about that[0].

The issue is with long-lived objects, big or small, many or few.

> As it stands, there are no known memory leaks in SQLAlchemy itself

I can attest to that. I have a backend running 24/7 and, barring some
external force, it can keep up for months with no noticeable leaks.

> and if you look at our tests under aaa_profiling/test_memusage.py you can see we are exhaustively ensuring that the size of gc.get_objects() does not grow unbounded for all sorts of awkward situations.    To illustrate potential new memory leaks we need succinct test cases that illustrate a simple ascending growth in memory usage.

Like I said, it's not so much a leak situation as a fragmentation
situation, where long-lived objects at high memory addresses can
prevent the process's heap from shrinking.

[0] http://revista.python.org.ar/2/en/html/memory-fragmentation.html

Michael Bayer

Feb 22, 2012, 3:40:53 PM
to sqlal...@googlegroups.com

On Feb 22, 2012, at 3:28 PM, Claudio Freire wrote:

>
> Like I said, it's not a leak situation as much of a fragmentation
> situation, where long-lived objects in high memory positions can
> prevent the process' heap from shrinking.
>
> [0] http://revista.python.org.ar/2/en/html/memory-fragmentation.html

Saw that a bit, but looking at the "tips" at the bottom, concrete implementation changes are not coming to mind. "Eternal structures" are ubiquitous in any programming language. sys.modules is a big registry of all the Python modules that have been imported, each one full of functions, classes and other data; these are all "eternal structures" - sys.modules is normally never cleaned out. I'm not seeing at what point you move beyond the things that live in these modules into things that are so-called "eternal structures" leading to inappropriate memory fragmentation.


Claudio Freire

Feb 22, 2012, 3:51:57 PM
to sqlal...@googlegroups.com
On Wed, Feb 22, 2012 at 5:40 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
> Saw that a bit, but looking at the "tips" at the bottom, concrete implementation changes are not coming to mind.   An "eternal structure" is ubiquitous in any programming language.  sys.modules is a big list of all the Python modules that have been imported, each one full of functions, classes, other data, these are all "eternal structures" - sys.modules is normally never cleaned out.    I'm not seeing at what point you move beyond things that are in these modules into things that are so-called "eternal structures" that lead to inappropriate memory fragmentation.

The thing to be careful about is when those eternal structures are created.

If they're created at the beginning (as sys.modules, which is
populated with imports, which most of the time happen in the preamble
of .py files), then the resulting objects will have lower memory
locations and thus not get in the way.

But if those structures are created after the program has had time to fill
its address space with transient objects (say, lazy imports, caches),
then when the transient objects are deleted, the eternal structures
(with their high addresses) prevent the heap from shrinking.

Such caches, for instance, are better made limited in lifespan (say,
giving them a finite lifetime, making them expire, actively cleaning
them from time to time). Structures that are truly required to be
eternal are better populated at load time, early in the program's
lifecycle. In my backend, for instance, queries are precompiled at
startup, to make sure they have lower memory addresses. This has
mostly solved SQLA-related memory fragmentation issues for me.
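
By "precompiled at startup" I mean something along these lines (a sketch; "orders_by_user" is just a hypothetical statement built with bindparam()):

from sqlalchemy.dialects import postgresql

# compiled once at import/startup time, so the long-lived compiled structures
# end up early (low) in the heap instead of being rebuilt per request
compiled_orders = orders_by_user.compile(dialect=postgresql.dialect())

def fetch_orders(connection, uid):
    return connection.execute(compiled_orders, user_id=uid).fetchall()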

Claudio Freire

Feb 22, 2012, 3:57:59 PM
to sqlal...@googlegroups.com
On Wed, Feb 22, 2012 at 5:51 PM, Claudio Freire <klauss...@gmail.com> wrote:
> Such caches, for instance, are better made limited in lifespan (say,
> giving them a finite lifetime, making them expire, actively cleaning
> them from time to time). Structures that are truly required to be
> eternal are better populated at load time, early in the program's
> lifecycle. In my backend, for instance, queries are precompiled at
> startup, to make sure they have lower memory addresses. This has
> mostly solved SQLA-related memory fragmentation issues for me.

One source of trouble I've had here is the inability to use bind
parameters inside .in_(...).

For queries that accept variable lists, then, I had to "precompile to
string" and replace the inside of the condition via string
interpolation.

An ugly hack, but it has served me well.
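
Roughly like this, in case anyone is curious (a sketch; the __IDS__ marker, table and column names are made up, and the interpolation is only tolerable because the IDs are trusted integers):

# the statement is rendered to a string once, at startup, with a literal
# __IDS__ marker in place of the IN list
base_sql = "SELECT id, status FROM transactions WHERE id IN (__IDS__)"

def run_with_ids(conn, ids):
    sql = base_sql.replace("__IDS__", ", ".join(str(int(i)) for i in ids))
    return conn.execute(sql).fetchall()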

Michael Bayer

Feb 22, 2012, 4:21:33 PM
to sqlal...@googlegroups.com


IMHO the whole point of using a high-level, interpreted language like Python is that we don't have to be bogged down thinking like C programmers. How come I've never had a memory fragmentation issue before? I've made "precompilation" an option for folks who really wanted it, but I've never had a need for such a thing. And you can be sure I work on some very large and sprawling SQLAlchemy models these days.

There are some caches here and there like the identifier cache as well as caches inside of TypeEngine objects, but these caches are all intended to be of limited size.

Claudio Freire

Feb 22, 2012, 4:32:45 PM
to sqlal...@googlegroups.com
On Wed, Feb 22, 2012 at 6:21 PM, Michael Bayer <mik...@zzzcomputing.com> wrote:
> IMHO the whole point of using a high level, interpreted language like Python is that we don't have to be bogged down thinking like C programmers.   How come I've never had a memory fragmentation issue before ?      I've made "precompilation" an option for folks who really wanted it but I've never had a need for such a thing.   And you can be sure I work on some very large and sprawling SQLAlchemy models these days.

Maybe you never used big objects.

Memory fragmentation arises only when the application handles a
mixture of big and small objects, such that holes created by small
objects being freed don't serve big memory requirements.

If your application handles a homogeneous workload (i.e. every request
is pretty much the same), as is usual, then you probably won't
experience fragmentation.

My application does the usual small-object work, interspersed with
intense computation on big objects, hence my troubles.

Python's garbage collector has been a pending issue for a long time,
but, as I noted in the linked page, past architectural decisions
prevent some widely desired improvements.

Vlad K.

Feb 22, 2012, 6:36:55 PM
to sqlal...@googlegroups.com

Okay, thanks to this article:

http://neverfear.org/blog/view/155/Investigating_memory_leaks_in_Python


I made a similar plot of object counts over time, showing the top 50 types.
The resulting PDF is here (you might wish to download it first; Google
messes it up for me):

https://docs.google.com/open?id=0ByLiBlA59qDwYTY1MGIzYWEtYjMxZi00ZDVlLTk0OTEtOGI2ZjA3NDgyM2Y3


Everything seems to grow linearly in count. Something is keeping all
those objects referenced somewhere. What could possibly be the cause?
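
The counting behind the plot is basically this (a sketch of what the script does, following the article):

import gc
from collections import Counter

def top_object_types(n=50):
    # snapshot of live objects grouped by type name, taken once per iteration
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(n)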


.oO V Oo.


Michael Bayer

Feb 22, 2012, 8:41:57 PM
to sqlal...@googlegroups.com


Can you provide a self-contained, single-file test case that illustrates the memory growth?


Vlad K.

Feb 23, 2012, 8:18:57 AM
to sqlal...@googlegroups.com

And yet again the problem is not in SQLAlchemy but in Pyramid... Sorry,
from now on I'll first try writing test scripts without Pyramid to see
where the problem is; I just don't take the time for that kind of
debugging, so I end up wasting yours. :)


Thanks. I'll write to the Pylons Discuss list, but the test script is here
nevertheless. Comment out line 266 (and remove the imports) to disable
Pyramid and see it work okay. Without Pyramid (but even with Transaction
and the ZopeTransactionExtension), the gcdelta is 0 or negative per batch
of 200 iterations. With Pyramid bootstrapped, the gcdelta is in the
thousands per batch of 200 iterations.

https://gist.github.com/d669e958c54869c69831


.oO V Oo.

Michael Bayer

Feb 23, 2012, 8:50:49 AM
to sqlal...@googlegroups.com
I'd note that those subqueryload()s render the effectiveness of yield_per() almost nil.

This is how subqueryload works:


query(Parent).options(subqueryload(Parent.child)).all()

will give you:


"SELECT * FROM parent"

then, the moment a "parent" row is fetched and the first "child" collection is referenced for loading, the query emitted is:

"SELECT * FROM child JOIN (SELECT * FROM parent) AS a on a.id=child.id_a"

that is, the second query loads all child rows for all parents in the entire result.

So let's say we do yield_per(), so that SQLAlchemy only processes the first 100 rows before handing them out. As soon as you hit either of those two subqueryloaded collections, the yield_per() is mostly thrown out the window - they will each load the entire list of child objects for the whole result, which right there will grow memory to be as big as your entire result.
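
So if you want yield_per() to actually help, the collections need to not be eagerly loaded for that particular query, e.g. (a sketch; process() is just a stand-in):

from sqlalchemy.orm import lazyload

for parent in session.query(Parent)\
        .options(lazyload(Parent.child))\
        .yield_per(100):
    # the ORM materializes only 100 Parent objects at a time; the child
    # collection loads lazily if and when it's actually touched
    process(parent)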

The next thing I'd look at is that second query for ModelC. You can also load those upfront so that you don't need to do a query each time:

modelcs = dict(
    sess.query(ModelC.id_a, ModelC).join(ModelC.a)
        .filter(ModelA.parent_id == <same integer you're using against ModelA>)
)

then you have a dictionary of "id_a->ModelC" as you loop through your ModelA records. All the ModelC's are already in the session, so you don't need to use merge(), which is also not a super quick operation. If an "id_a" is not in the dictionary then you know to create a new ModelC and use Session.add() to put it in. If you're limiting the ModelA rows using a LIMIT or window recipe like I mentioned earlier, you'd apply that same criteria to the loading of the "modelcs".
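
That is, roughly (a sketch reusing the names above; I'm assuming ModelA.id is the column that id_a refers to, and "modela_rows" is whatever ModelA result you're looping over):

for a in modela_rows:
    c = modelcs.get(a.id)
    if c is None:
        # not preloaded, so it's a new record; plain add() is enough here,
        # no merge() needed
        c = ModelC(id_a=a.id)
        sess.add(c)
        modelcs[a.id] = c
    # ...update c from a here...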

Later on you're doing something with query(ModelC).first() in a loop, which is also something I hope isn't in the real application - looking at it there, it seems like you'd only need to say query(ModelC).delete().

Watching the SQL emitted with echo=True, and in some cases also seeing how large the results coming in are using echo='debug', is something I strongly recommend when first profiling an application.
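
That's just the engine flag (the URL here is a placeholder):

from sqlalchemy import create_engine

# echo=True logs the SQL statements; echo="debug" additionally logs result rows
engine = create_engine("postgresql://user:pass@localhost/dbname", echo="debug")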

Vlad K.

Feb 23, 2012, 9:10:43 AM
to sqlal...@googlegroups.com

.oO V Oo.


On 02/23/2012 02:50 PM, Michael Bayer wrote:
> I'd note that those subqueryloads() render the effectiveness of yield_per() to be almost nil.

I know. I've replicated the real use case from the application, which has
yet to see proper optimization, including better query planning and a
reduction of unnecessary joins and relationships. I'd rather look into
implementing prepared statements and basically use lazy="select"
instead of subqueries.


> The next thing I'd look at is that second query for ModelC. You can also load those upfront so that you don't need to do a query each time:
>
> modelcs = dict(
> sess.query(ModelC.id_a, ModelC).join(ModelC.a).filter(ModelA.parent_id=<same integer you're using against ModelA>)
> )
>
> then you have a dictionary of "id_a->ModelC" as you loop through your ModelA records. All the ModelC's are already in the session, so you don't need to use merge(), which is also not a super quick operation. If an "id_a" is not in the dictionary then you know to create a new ModelC and use Session.add() to put it in. If you're limiting the ModelA rows using a LIMIT or window recipe like I mentioned earlier, you'd apply that same criteria to the loading of the "modelcs".
>
> Later on you're doing something with query(ModelC).first() in a loop which is also something I hope isn't in the real application - looking there is seems like you'd only need to say query(ModelC).delete().

It might not be visible from this test script, but the scenario is this:
ModelA represents certain data that has to be exported to external
services (XMLRPC, REST and similar). In an ideal situation I would just
select all ModelA rows that have to be exported (by comparing the timestamp
of last modification with the timestamp of the process run), but I can't do
that, because if such a transaction fails it has to be "remembered" for the
next batch run. So I use the ModelC table, which logs these pending transactions.

So the first phase selects rows from ModelA that are up for export and
creates transaction logs in ModelC.

The second phase then loads and exports rows from ModelC one by one
(joined with ModelA and everything else required for the export).
However, if a single transaction fails, the entire script exits and
continues when called the next time. This I have to do for other reasons
(preventing overload on a possibly downed external service, etc.), so I
can't "skip" that row and fetch the next one.

It may happen, and regularly does, that on subsequent runs of the
process there are no new ModelA rows to load, but there are ModelC rows
left over from the last call (since they're still in the table, it means
they were not processed).

And that's the logic behind query(ModelC).first(), the processing, and then
delete(). Also note that each row has to be an individual transaction
(load, send to the external service, remove from the ModelC table), which
means I can't rely on session/identity "caching" by pre-loading data
instead of using joins and subqueries.


> Watching the SQL emitted with echo=True, and in some cases also seeing how large the results coming in are using echo='debug', is something I strongly recommend when first profiling an application.

Yes, I use logging and see all the SQL emitted.

Thanks for your input, I appreciate all the help and advice I can get.
Still a ton of stuff to learn about SQLA.

V
