SQLAlchemy ORM Object caching with relationships and invalidation


Michael Kowalchik

Sep 27, 2012, 1:28:12 AM
to sqlal...@googlegroups.com
Unlike most of the SQLAlchemy caching examples, I'm not using query-level caching but rather caching at the instance / object level. I'm using the ORM and I have a SQLAlchemy object that I load and occasionally store in a second-level cache. I have a custom 'get' loading function that checks the second-level cache by primary key, returns the cached copy on a hit, and populates the cache on a miss.

I also have an event handler on 'after_commit' that checks the session for dirty objects that are cached (with a simple class property of __cached__=True) and does either an invalidate or write-back to the cache when these objects are dirty and cached.
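To make the pattern concrete, here is a minimal, framework-free sketch of the 'get' side (all names are illustrative; `load_from_db` stands in for the real ORM primary-key lookup, and in the SQLAlchemy version the hit path would also `session.merge(obj, load=False)` to re-associate the cached copy with the current session):

```python
# Minimal cache-aside "get" by primary key. The dict stands in for
# memcached or another second-level cache backend.
_cache = {}

def load_from_db(cls, pk):
    # placeholder for the real ORM lookup, e.g. session.query(cls).get(pk)
    return cls(pk)

def cache_get(cls, pk):
    """Return the instance for (cls, pk), preferring the cache."""
    key = (cls.__name__, pk)
    obj = _cache.get(key)
    if obj is None:            # miss: load from the database and populate
        obj = load_from_db(cls, pk)
        _cache[key] = obj
    return obj
```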

This pattern is pretty simple and works great.

My problem is that I'd like to be able to use this same (or similar) pattern for more complex SQLAlchemy objects containing relationships. I'd like the cache to contain not only the 'base' object but all (lazily) loaded related objects. I have no problem storing and retrieving these objects (and relationships) from the cache; my problem comes from the invalidation/write-back part.

Let's say I have an object A that's in the cache, and it has a relationship A.B that was stored along with it in the cache. If I retrieve A from the cache I can get A.B, and I get the cached copy of B. If B is modified, however, then my simple cache invalidator event handler doesn't see that B is cached (no __cached__ property on B), and B gets committed to the database without the cached copy of A being invalidated. Now subsequent cache hits of A will have a stale A.B relationship.

So my question is, is there a clean / simple way to mark A for invalidation when B is modified? I've come up with a few schemes, but all of them seem brittle and complicated, and my intuition is telling me that I'm reinventing the wheel; some facility in SQLAlchemy itself may be useful in walking this relationship graph to find loaded, connected relationships whose cached representations might need to be invalidated. Alternatively, is there another pattern that would be better suited to this kind of object-level caching?

Thanks!
-Mike

David McKeone

Sep 27, 2012, 6:04:32 AM
to sqlal...@googlegroups.com
Hi Mike,

I'll let the others add more detail about your questions, but for the broad strokes I thought I'd add that I ran into similar issues with my simple caching method and ultimately ended up using the new dogpile.cache stuff that Mike recommended on his blog. (The example file is here: https://groups.google.com/d/msg/sqlalchemy/MrKA6AygZ14/o95dmUdLS70J ) It is far more integrated with the internals of SQLAlchemy relationship management, so it may behave better.

Michael Bayer

Sep 29, 2012, 12:17:39 AM
to sqlal...@googlegroups.com

On Sep 27, 2012, at 1:28 AM, Michael Kowalchik wrote:

> Let's say I have an object A that's in the cache, and it has a relationship A.B that was stored along with it in the cache. If I retrieve A from the cache I can get A.B, and I get the cached copy of B. If B is modified, however, then my simple cache invalidator event handler doesn't see that B is cached (no __cached__ property on B), and B gets committed to the database without the cached copy of A being invalidated. Now subsequent cache hits of A will have a stale A.B relationship.

> So my question is, is there a clean / simple way to mark A for invalidation when B is modified? I've come up with a few schemes, but all of them seem brittle and complicated, and my intuition is telling me that I'm reinventing the wheel; some facility in SQLAlchemy itself may be useful in walking this relationship graph to find loaded, connected relationships whose cached representations might need to be invalidated. Alternatively, is there another pattern that would be better suited to this kind of object-level caching?

Well, you're checking __cached__ in an after_commit() hook. Usually the technique here, if you want relationships, is to *cascade* the operation at that point. So assuming it's OK that you'd be leading from A: when A is __cached__, checking it for "dirty" also means you need to traverse down its relationships. This is a straight-ahead graph traversal problem, but you're right that SQLA obviously has this function built into it. You can use the cascade_iterator() feature of mapper():

from sqlalchemy.orm.attributes import instance_state

state = instance_state(myobject)
mapper = state.mapper
for related_obj, rel_mapper, rel_state, rel_dict in mapper.cascade_iterator(
        "save-update", state):
    # do check on related_obj here
    ...

where above you're cascading along the "save-update" cascade setting which is enabled on all relationships by default.
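For readers without the ORM handy, the traversal amounts to a plain graph walk. A framework-free sketch, with a hand-declared `RELATIONS` map standing in for what cascade_iterator() derives from the mapper's save-update cascade settings (all names here are illustrative):

```python
# Maps a class to the relationship attribute names to follow,
# e.g. {A: ['bs'], B: ['c']}. cascade_iterator() derives the same
# information from the configured relationships.
RELATIONS = {}

def iter_cascade(obj, seen=None):
    """Yield every object reachable from obj via the declared relations."""
    seen = set() if seen is None else seen
    for attr in RELATIONS.get(type(obj), ()):
        value = getattr(obj, attr)
        # a relationship may be a collection or a scalar reference
        children = value if isinstance(value, (list, tuple, set)) else [value]
        for child in children:
            if child is not None and id(child) not in seen:
                seen.add(id(child))
                yield child
                yield from iter_cascade(child, seen)
```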


The other pattern might be, to drive from how you phrased it, to "mark A for invalidation when B is modified". That is, use event handlers so that when a B is attached to an A, it gets associated in some way such that a change on B propagates up to the A (which would mean more event handlers... we don't as yet have an event for "any change on class X"; they are all per-attribute). This is more involved, but could save you from lots of traversals in your after_commit hook, depending on the circumstances.
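A framework-free sketch of that "propagate up" alternative: when a child is attached to a cached parent, remember the parent; when the child changes, mark the parent for invalidation. In SQLAlchemy this would hang off the per-attribute 'append' and 'set' events mentioned above; the class and attribute names here are illustrative only:

```python
class Child:
    def __init__(self):
        self._parents = set()

    def __setattr__(self, name, value):
        # stand-in for a per-attribute 'set' event listener
        super().__setattr__(name, value)
        if not name.startswith('_'):
            for parent in self._parents:   # propagate the change upward
                parent.needs_update = True

class Parent:
    def __init__(self):
        self.needs_update = False
        self.children = []

    def attach(self, child):
        # stand-in for an 'append' event listener on the relationship
        self.children.append(child)
        child._parents.add(self)
```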


mikepk

Sep 29, 2012, 6:36:08 PM
to sqlal...@googlegroups.com
Thanks Michael, I really appreciate the response. This broke something of a mental log-jam I was having.

The original use I had in mind for this was for my web framework. I was using it to tag "Session" objects for caching and then attach user-specific data and frequently accessed data to the session, to minimize database requests for common data across all of a user's requests. The goal was to have the caching be fairly transparent to the application so that I wouldn't have to track invalidations manually. This object cache pattern has been in the framework for a while, but I haven't had lots of time to really make it work properly (kind of a side project).

I'd looked at using the cascade, but I was fixated (for some reason) on checking only the child items with changes (session.dirty/new/deleted) in the before_flush and after_commit hooks. I was looking for some way to annotate children in relationships with their top-level parent container, so that on a child modification I could operate on the container (A.B.C: when C is modified, mark A for invalidation), or to somehow walk the relationship backwards from child to parent. That last bit was where I got stuck and abandoned this approach. I had some implementations where I was trying to pass parent references down the chain, but it got complicated when multiple 'containers' were referencing the same child, and I never got the passing to work properly. (Any tips on the proper event to listen to for attribute attach and append?)

Your comment gave me a forehead-slap moment. I realized I could just check *all* of the objects tagged as cache container objects in the current session, cascade down from each, and mark the container for write-through or invalidation if any related objects were modified. The pattern I'm using only has a few of these 'container' cached objects per web request (like the session above), so checking them all isn't crazy-painful. I did some quick timeit experiments, and it looks like the overhead of these extra checks is pretty negligible; it will potentially save me a lot of complexity and db lookups in my current application.

Another optimization I'm thinking of is to actually record the related children object list on fetch from the cache and then compare it to the new related object list on flush. That way newly loaded relationships would get written through to the cache to be re-used on subsequent requests as well. The current implementation can't see when an object has been lazy-loaded (but not modified) on the container. Any thoughts on this approach?
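That snapshot-and-compare idea could be sketched like this (function names are hypothetical; `iter_related` stands in for the cascade traversal, and a real version would key on (class, primary key) rather than id(), so that objects merged into different sessions still compare equal):

```python
def snapshot(container, iter_related):
    """Record which related objects are loaded right now (e.g. at cache-get)."""
    container.__related_snapshot__ = {id(o) for o in iter_related(container)}

def has_new_children(container, iter_related):
    """True if related objects appeared since the snapshot (e.g. lazy loads)."""
    before = getattr(container, '__related_snapshot__', set())
    return any(id(o) not in before for o in iter_related(container))
```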

Just for reference, here's a (slightly modified) version of how I'm doing the invalidation / modification check now. In my quick and dirty testing it seems to cover all of the stale data cases I was encountering before.

The only other pieces needed to make this work are a cache_key attribute and a __cached__ tag on objects that will be cached and act as containers. I then have a cache.get method that I use to request these special containers. The get function checks memcached first, merges on a hit, and requests from the database on a miss.


session = cache.get(Session, session_id="XXXXX")

The item gets written through to the cache on successful commit if it's new or modified (and now, has related children that are new or modified). Putting it in after_commit also guarantees that invalid data that's rolled back doesn't accidentally get written to the cache.


from sqlalchemy import event
from sqlalchemy.orm.attributes import instance_state

def is_modified(obj):
    '''Check if SQLAlchemy believes this instance is modified.'''
    return instance_state(obj).modified

def child_relationships(obj):
    '''Yield instances related to obj along the save-update cascade.'''
    state = instance_state(obj)
    mapper = obj.__mapper__
    for related_obj, rel_mapper, rel_state, rel_dict in mapper.cascade_iterator(
            "save-update", state):
        yield related_obj

# Caching / invalidation event handlers and functions.
def cached_objects(objects):
    '''Return objects from a session/iterator that are tagged for caching'''
    for obj in objects:
        # check if the sqlalchemy object is tagged for caching
        if hasattr(obj, '__cached__'):
            yield obj

def cached_objects_with_updates(objects):
    '''Return objects from a session/iterator that are tagged for caching and
    that need an update.'''
    for obj in objects:
        if getattr(obj, "__needs_update__", False):
            yield obj

def check_needs_update(session, flush_context, instances):
    '''Check for objects in the SQLAlchemy session that are modified and
    tag them for cache update.'''
    for obj in cached_objects(session):
        if obj in session.new or obj in session.dirty:
            obj.__needs_update__ = True
        for related_obj in child_relationships(obj):
            if ((related_obj in session.new
                    or related_obj in session.dirty
                    or related_obj in session.deleted)
                    and is_modified(related_obj)):
                obj.__needs_update__ = True

def update_cached_objects(session):
    '''Write out (or invalidate) cache entries for modified entities'''
    for obj in cached_objects_with_updates(session):
        obj.__needs_update__ = False
        cache.write(obj.cache_key, obj)

event.listen(models.session, "before_flush", check_needs_update)
event.listen(models.session, "after_commit", update_cached_objects)