session memory management

kris

Dec 3, 2011, 2:31:10 PM
to sqlal...@googlegroups.com
I have a database migration using sqlalchemy-migrate that needs to migrate several million rows.
After running for several days, I see the job has grown to consume all available memory and
is swapping.

Can someone point me to a description of the best way to manage memory in the session
during a long transaction?
i.e.

1.  Does session.flush() remove instances of objects from the session?
2.  Should I break the migration into several subtransactions?
3.  Should I use expunge() at certain points?

Any help appreciated.

Thanks

Michael Bayer

Dec 3, 2011, 4:40:04 PM
to sqlal...@googlegroups.com

You should at least be calling flush() every thousand records or so, and ensure that none of the objects involved in that flush remain strongly referenced afterwards. When an object is "clean", it is only weakly referenced by the Session, so it will fall out of scope naturally. The Session also maintains a list of objects that came into the current transaction as "new", so that if a rollback occurs they can be evicted, but that list is also self-cleaning once the newly inserted objects are dereferenced.

Making sure you flush() periodically and don't maintain references in memory should give you an app that runs indefinitely without memory growth.
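A minimal sketch of that pattern, assuming a simple declarative model - the Node class, connection URL and row counts here are made up for illustration:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

for i in xrange(2000000):
    # nothing outside the Session holds a strong reference to the new object
    session.add(Node(name='node %d' % i))
    if i % 1000 == 0:
        # emits the pending INSERTs; once flushed ("clean"), the objects are
        # only weakly referenced by the Session and can be garbage collected
        session.flush()

session.commit()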

The other thing to watch out for is very large selects. Most DBAPIs fully buffer a result set in memory, and the ORM by default does this as well. If you select tens of thousands of rows in one result set, the memory usage of the Python interpreter will grow to fit that space, and that memory is generally never returned to the OS (CPython behavior). So try to avoid very large single result sets - there are ways to get an ORM result to "stream", but they come with a bunch of limitations (only certain backends, can't use eager loading, etc.), so it's usually better to avoid them.
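If you really do need to walk a big table, the "streaming" route is Query.yield_per(); a rough sketch, reusing the illustrative Node model above (process() is just a placeholder), with the backend and eager-loading caveats in mind:

# fetches rows in chunks of 1000 instead of buffering the whole result in the
# ORM; the DBAPI may still buffer unless the backend supports server-side
# cursors (e.g. psycopg2 with the stream_results execution option)
for node in session.query(Node).yield_per(1000):
    process(node)
    # don't accumulate references to node in a list/dict out here, or you're
    # back to holding the whole result in memory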


kris

Dec 5, 2011, 1:40:31 AM
to sqlal...@googlegroups.com
I have added the flush every 1024 loads, and also set DBSession.autoflush = False, but I see that I have
used up all available memory again. There are no references to the objects other than the ones in the session.

A bit more information: I am getting stuck in a routine that adds a 'root' pointer to a set of records,
so that I can use a loading pattern similar to the elementtree/optimized_al example but without
the separate document table. It visits all the nodes in the hierarchy and sets the 'root' node on each.
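The routine is roughly along these lines (a simplified sketch, not my exact code; find_roots() stands in for however the top-level nodes are selected):

count = 0
for root in find_roots(DBSession):
    stack = [root]
    while stack:
        node = stack.pop()
        node.root_id = root.id        # set the 'root' pointer via the FK column
        stack.extend(node.children)   # walking the tree loads each children collection
        count += 1
        if count % 1024 == 0:
            DBSession.flush()

The table and mapper definitions look like: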

nodes = Table('nodes', metadata,
    ...
    Column('parent_id', Integer, ForeignKey('nodes.id')),
    Column('root_id', Integer, ForeignKey('nodes.id')),
)

mapper(Node, nodes, properties={
    ...
    'children': relation(Node, lazy=True, cascade="all, delete-orphan",
                         backref=backref('parent', enable_typechecks=False,
                                         remote_side=[nodes.c.id])),
    'docnodes': relation(Node, lazy=True,
                         cascade="all, delete-orphan",
                         post_update=True,
                         primaryjoin=(nodes.c.id == nodes.c.root_id),
                         backref=backref('document', post_update=True,
                                         enable_typechecks=False,
                                         remote_side=[nodes.c.id])),
})

Could this be making a cycle that the session cannot get rid of?       
 

Michael Bayer

Dec 5, 2011, 11:36:17 AM
to sqlal...@googlegroups.com


Mmm, not if you're on a recent 0.5 version or later. While there certainly can be reference cycles, the Session/collection system is extremely carefully designed and tested like crazy not to produce any uncollectable structures.

Though dealing with a hierarchical structure can certainly make it tough to keep only a part of it in memory at a time. You'd probably want to make sure you don't ever load any "children" collection since that ultimately refers to the whole list of subnodes, recursively down through the whole thing. It might be worth it to see what happens if you change those relationships to 'lazy="dynamic"', but overall you'd want to be aware of the structure you have in memory and how much of it is loaded. Sometimes I will pdb through obj.__dict__ to see what's present without tripping off any lazyloaders.
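Something like this sketch, for instance - it's your mapping from above, stripped down to just the loader strategy change (other options omitted), with some_node standing in for whatever instance you're inspecting:

from sqlalchemy.orm import mapper, relation, backref

mapper(Node, nodes, properties={
    # 'dynamic' makes the collection a Query rather than loading the whole
    # subtree into memory when it's first touched
    'children': relation(Node, lazy='dynamic',
                         backref=backref('parent', remote_side=[nodes.c.id])),
})

# the collection is now a Query; iterate or slice it to load only what you need
for child in some_node.children.limit(100):
    pass

# and to see what's actually loaded on an instance without tripping any
# lazy loaders, look at __dict__ directly:
print(some_node.__dict__)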

