Combining yield_per and eager loading


Martijn van Oosterhout

Jul 15, 2016, 12:08:25 PM
to sqlalchemy
Ok, so SQLAlchemy has this nice feature where you can eager load relationships to significantly reduce the number of queries during processing.

On the other hand, you can use yield_per() (on Postgres) to significantly reduce memory usage by not loading the entire result set into memory at once.

For very good reasons mentioned in the documentation you can't use both of these in the same query, yet that is kind of my goal. What I'd like to achieve, for a given query which goes over a big table:

while not end of resultset:
   take 1000 results
   eagerload all the relationships
   process them
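The batching part of the loop above can be sketched with nothing but the standard library. This is a hedged sketch: the `in_batches()` helper is hypothetical (not part of SQLAlchemy) and simply slices any iterator into fixed-size chunks, standing in for "take 1000 results":

```python
from itertools import islice

def in_batches(iterable, size):
    """Yield successive lists of up to `size` items from any iterator."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Each batch would then get one extra query per relationship before processing.
batches = list(in_batches(range(10), 4))
print(batches)  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```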

Now, the eager loading part is posing difficulties (or I'm not reading the documentation carefully enough). I found the attributes.set_committed_value() function which solves half the problem, but I still need to generate the actual query to return the necessary objects. So perhaps (pseudo-code):

def eagerload_for_set(object_list, relationship):
    ids = set(o.get(relationship.left_id) for o in object_list)
    lookup = Query(relationship.right_table).filter_by(relationship.right_column.in_(ids)).all()
    for o in object_list:
        attributes.set_committed_value(o, relationship.left, lookup[relationship.left_id])

Before I go diving into SQLAlchemy to make the above actually work, does it seem reasonable? Are there any handy utils somewhere that might help?

Thanks for any ideas,

Have a nice day,

--
Martijn van Oosterhout



Mike Bayer

Jul 15, 2016, 12:46:37 PM
to sqlal...@googlegroups.com


On 07/15/2016 12:08 PM, Martijn van Oosterhout wrote:
> Ok, so SQLAlchemy has this nice feature where you can eager load
> relationships to significantly reduce the number of queries during
> processing.
>
> On the other hand, to reduce memory usage you can use yield_per() (on
> Postgres) to significantly reduce the memory usage by not loading the
> entire database in memory at once.
>
> For very good reasons mentioned in the documentation you can't use both
> of these in the same query, yet that is kind of my goal. What I'd like
> to achieve, for a given query which goes over a big table:
>
> while not end of resultset:
> take 1000 results
> eagerload all the relationships
> process them


Here's the problem that can't be solved:

1. fetch rows 0-1000

2. start loading Foo objects:

Foo(1) -> eagerload Foo.related with 400 members
Foo(2) -> eagerload Foo.related with 500 members
Foo(3) -> eagerload Foo.related with the first 100 of 250 members

3. yield. That is, stop reading more rows. Send out Foo(1), Foo(2),
Foo(3). The application now can go nuts with these. It can read them,
write them, iterate through .related.

Right there, we're broken. Foo(3) is out in the world with less than
half of its ".related" collection - they are still sitting on the cursor!

So let's try: don't actually yield Foo(3) until we've seen Foo(4), or
the result has ended. That is, let's try to be clever and look at the
next row to see that we're definitely done with Foo(3). This is
thoroughly complicated to do, but I have made it work when I tried to
implement this.

This is more insidious, because it will actually work most of the time.
However, it won't work when we don't actually get objects in order
like that. If Foo.related is a many-to-many, we could see a Bar in one
yield batch, then see it all over again in another batch later. What if
the user application did all kinds of things to that Bar() the first
time around? Now we're seeing it again, but the identity map has already
been garbage collected. That Bar() is stale *before it ever left its
result set*, not to mention that if that Bar() has more eagerloaded
collections, they're all wrong too! It's a disaster.
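The garbage-collection hazard described here can be seen with a toy identity map. This is a hedged, stdlib-only sketch: the `Bar` class and the WeakValueDictionary merely stand in for the real ORM machinery, whose identity map is keyed differently:

```python
import gc
import weakref

class Bar:
    """Stand-in for a mapped object; a real identity map keys on (mapper, pk)."""
    def __init__(self, id):
        self.id = id

identity_map = weakref.WeakValueDictionary()

bar = Bar(1)
identity_map[1] = bar
assert identity_map.get(1) is bar  # while referenced, the same object comes back

del bar        # the application drops its reference...
gc.collect()
stale = identity_map.get(1)
print(stale)   # → None: the entry is gone, so a Bar(1) seen again in a later
               # batch would be a brand-new object, not the one already mutated
```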

Vastly simpler is to use the recommended window functions to do
pagination of any scale with no chance of complex failures.
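For reference, the windowed idea can be sketched against plain sqlite3 (the table and column names here are made up): fetch the ordered primary keys once, cut them into windows, then page through the table with BETWEEN queries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO bigtable (id, payload) VALUES (?, ?)",
                 [(i, "row %d" % i) for i in range(1, 11)])

def windowed(conn, batch):
    """Yield lists of rows, `batch` at a time, by ranges of the primary key."""
    ids = [r[0] for r in conn.execute("SELECT id FROM bigtable ORDER BY id")]
    for i in range(0, len(ids), batch):
        lo, hi = ids[i], ids[min(i + batch, len(ids)) - 1]
        yield conn.execute(
            "SELECT id, payload FROM bigtable WHERE id BETWEEN ? AND ? ORDER BY id",
            (lo, hi)).fetchall()

batches = list(windowed(conn, 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```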







Martijn van Oosterhout

Jul 15, 2016, 5:14:33 PM
to sqlal...@googlegroups.com
On 15 July 2016 at 18:46, Mike Bayer <mik...@zzzcomputing.com> wrote:


Here's the problem that can't be solved:

1. fetch rows 0-1000

2. start loading Foo objects:

     Foo(1) ->  eagerload Foo.related with 400 members
     Foo(2) ->  eagerload Foo.related with 500 members
     Foo(3) -> eagerload Foo.related with the first 100 of 250 members

3. yield.  That is, stop reading more rows.  Send out Foo(1), Foo(2), Foo(3).  The application now can go nuts with these.  It can read them, write them, iterate through .related.

Right there, we're broken.  Foo(3) is out in the world with less than half of its ".related" collection - they are still sitting on the cursor!


Ok, I think I see what you're getting at, but I don't think that's a problem here. It looks like you're assuming a query which joins all the relationships in one go. That's not what I'm doing; I'm relying on the fact that I can interleave the queries. At the SQL level it would look like this:

DECLARE my_cursor CURSOR FOR SELECT * FROM bigtable;
FETCH 1000 FROM my_cursor;
SELECT * FROM related_table WHERE bigtable_id IN (1,2,3,4,5,....)
SELECT * FROM related_table2 WHERE bigtable_id IN (1,2,3,4,5,....)
SELECT * FROM related_table JOIN second_order_relationship USING (pkey) WHERE bigtable_id IN (1,2,3,4,5,....)

FETCH 1000 FROM my_cursor;
SELECT * FROM related_table WHERE bigtable_id IN (11,12,13,14,15,....)
SELECT * FROM related_table2 WHERE bigtable_id IN (11,12,13,14,15,....)
SELECT * FROM related_table JOIN second_order_relationship USING (pkey) WHERE bigtable_id IN (11,12,13,14,15,....)

<... repeat ...>
(This may be a PostgreSQL-specific feature, I'm not sure.)

Yes, this may mean that some of the related objects get fetched multiple times, but this is not a problem for me (the whole thing is read-only anyway). What I need, however, is for the secondary queries to populate the relationships in the original BigTable objects.
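The interleaved IN() step can be sketched with sqlite3 (table names hypothetical, matching the SQL above): after each batch of parents, a single query collects all their children, keyed by parent id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigtable (id INTEGER PRIMARY KEY);
    CREATE TABLE related_table (id INTEGER PRIMARY KEY, bigtable_id INTEGER);
""")
conn.executemany("INSERT INTO bigtable (id) VALUES (?)", [(i,) for i in (1, 2, 3)])
conn.executemany("INSERT INTO related_table (id, bigtable_id) VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 3)])

def load_related(conn, parent_ids):
    """One IN() query per batch; returns {parent_id: [child_id, ...]}."""
    marks = ",".join("?" * len(parent_ids))
    related = {pid: [] for pid in parent_ids}
    for pid, cid in conn.execute(
            "SELECT bigtable_id, id FROM related_table"
            " WHERE bigtable_id IN (%s)" % marks, parent_ids):
        related[pid].append(cid)
    return related

print(load_related(conn, [1, 2, 3]))  # → {1: [10, 11], 2: [], 3: [12]}
```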

Vastly simpler is to use the recommended window functions to do pagination of any scale with no chance of complex failures.


A bit of googling suggests you are referring to this: https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/WindowedRangeQuery
which indeed looks very neat and much easier than what I was thinking. It doesn't say explicitly, but it looks like it will work transparently with eager loading. It basically does the above, but skips the cursor and replaces it with queries on ranges of the primary key (which is shorter and probably more efficient to boot).

Thanks for the tip!
--

Mike Bayer

Jul 15, 2016, 5:57:09 PM
to sqlal...@googlegroups.com


On 07/15/2016 05:14 PM, Martijn van Oosterhout wrote:
>
> On 15 July 2016 at 18:46, Mike Bayer <mik...@zzzcomputing.com
OK, but the IN loading you're doing is good too; I'd like to add a
loader which uses that someday, and you're right, if we did, we could
make it work with yield_per too due to the nature of it. If you want to
use that approach, take a look at
https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/DisjointEagerLoading.

>
> Thanks for the tip!
> --
> Martijn van Oosterhout <kle...@gmail.com <mailto:kle...@gmail.com>>
> http://svana.org/kleptog/

Martijn van Oosterhout

Jul 17, 2016, 2:47:11 AM
to sqlal...@googlegroups.com

On 15 Jul 2016 23:57, "Mike Bayer" <mik...@zzzcomputing.com> wrote:
>
>
>
> OK, but the IN loading you're doing, that's good too, I'd like to add a loader which uses that someday, and you're right, if we did, we could make it work with yield_per too due to the nature of it.  If you want to use that approach take a look at https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/DisjointEagerLoading.

Wow, that's another really nice example. Those usage examples really need some more prominence in the Google search results.

The windowed approach has the advantage of simpler queries, but requires scanning the table twice (although the first scan might be cheaper). The IN() approach is perhaps more general, but the queries would be much longer, meaning more traffic to the server. It would be nice if it could all be made to Just Work.

I'll play a bit and see what I can get to work. Thanks again.

Martijn

Martijn van Oosterhout

Jul 19, 2016, 12:33:13 PM
to sqlalchemy
On Sunday, July 17, 2016 at 8:47:11 AM UTC+2, Martijn van Oosterhout wrote:


I'll play a bit and see what I can get to work. Thanks again.


So, I had a chance to play and got something that actually works quite nicely, see below. Two things:

- I switched to referencing the primary key of the original object directly, because some of our relationships are a bit more complex.
- Chained relationships don't work.

But in any case, even this improves performance greatly.


from itertools import groupby, islice
from sqlalchemy.orm import attributes, object_session
from sqlalchemy import tuple_


def yielded_load(query, attrs, N=1000):
    # Note: query must return only a single object (for now anyway)
    main_query = query.yield_per(N)
    main_res = iter(main_query)

    while True:
        # Fetch a block of results from the query
        objs = list(islice(main_res, N))
        if not objs:
            break

        for attr in attrs:
            target = attr.prop.mapper
            pk = attr.prop.parent.primary_key

            # Generate a query that joins against the original table
            child_q = object_session(objs[0]).query(target, *pk).order_by(*pk)
            if attr.prop.order_by:
                child_q = child_q.order_by(*attr.prop.order_by)

            keys = [[getattr(obj, col.key) for col in pk] for obj in objs]
            child_q = child_q.join(attr).filter(tuple_(*pk).in_(keys))

            collections = dict((k, [r[0] for r in v]) for k, v in groupby(
                child_q,
                lambda x: tuple(getattr(x, c.key) for c in pk)))

            for obj in objs:
                attributes.set_committed_value(
                    obj,
                    attr.key,
                    collections.get(
                        tuple(getattr(obj, c.key) for c in pk),
                        ()))

        for obj in objs:
            yield obj
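The dict/groupby step in the middle of yielded_load() relies on the child query being ordered by the parent's primary key. A tiny standalone illustration, with plain tuples standing in for the (child, *pk) result rows:

```python
from itertools import groupby

# (parent_key, child) pairs, already ordered by parent_key -- as the
# ORDER BY on the child query guarantees
rows = [((1,), "a"), ((1,), "b"), ((2,), "c")]

collections = dict((k, [child for _, child in group])
                   for k, group in groupby(rows, lambda r: r[0]))
print(collections)  # → {(1,): ['a', 'b'], (2,): ['c']}
```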


 
--
Martijn

Mike Bayer

Jul 19, 2016, 12:42:39 PM
to sqlal...@googlegroups.com


On 07/19/2016 12:33 PM, Martijn van Oosterhout wrote:
> On Sunday, July 17, 2016 at 8:47:11 AM UTC+2, Martijn van Oosterhout wrote:
>
>
> I'll play a bit and see what I can get to work. Thanks again.
>
>
> So, I have a chance to play and got something that actually works quite
> nicely, see below. Two things:
>
> - I switched to referencing the primary key of the original object
> directly, because some of our relationships are a bit more complex.
> - Chained relationships don't work.
>
> But in any case, even this improves performance greatly.

This is adequate for a new recipe if you are interested in adding it.
Also, I think sqlalchemy-utils provides a feature along these lines, but
I'm not sure how well it works or handles those harder cases like
chained relationships.



Martijn van Oosterhout

Jul 19, 2016, 5:20:22 PM
to sqlal...@googlegroups.com
On 19 July 2016 at 18:42, Mike Bayer <mik...@zzzcomputing.com> wrote:


On 07/19/2016 12:33 PM, Martijn van Oosterhout wrote:
On Sunday, July 17, 2016 at 8:47:11 AM UTC+2, Martijn van Oosterhout wrote:

But in any case, even this improves performance greatly.

this is adequate for a new recipe if you are interested in adding it. Also I think sqlalchemy-utils provides a feature along these lines but I'm not sure how well it works or handles those harder cases like chained relationships.


Thanks. On the way home, though, I had a thought: wouldn't it be simpler to run the original query with yield_per(), and then after each block run a query with a filter on the primary keys returned, adding all the joinedload/subqueryload/etc options to that query, and rely on the identity map to fix things up for the objects returned the first time? Or is that something we cannot rely on?

Have a nice day,
--

Mike Bayer

Jul 19, 2016, 5:22:56 PM
to sqlal...@googlegroups.com


On 07/19/2016 05:20 PM, Martijn van Oosterhout wrote:
>
> On 19 July 2016 at 18:42, Mike Bayer <mik...@zzzcomputing.com
> <mailto:mik...@zzzcomputing.com>> wrote:
>
>
>
> On 07/19/2016 12:33 PM, Martijn van Oosterhout wrote:
>
> On Sunday, July 17, 2016 at 8:47:11 AM UTC+2, Martijn van
> Oosterhout wrote:
>
> But in any case, even this improves performance greatly.
>
>
> this is adequate for a new recipe if you are interested in adding
> it. Also I think sqlalchemy-utils provides a feature along these
> lines but I'm not sure how well it works or handles those harder
> cases like chained relationships.
>
>
> Thanks. On the way home though I had a thought: wouldn't it be simpler
> to run the original query with yield_from(), and then after each block
> run the query with a filter on the primary keys returned, and add all
> the joinedload/subqueryload/etc options to this query, run it and rely
> on the identity map to fix it for the objects returned the first time.
> Or is that something we cannot rely on?

it works for the loading you're doing, where the primary keys of what's
been fetched are fed into the subsequent query. But it doesnt work for
current subquery loading which does not make use of those identifiers,
nor for joined loading which does OUTER JOIN onto the original query at
once (doing "joinedload" as a separate query is essentially what
subquery loading already is).



>
> Have a nice day,
> --
> Martijn van Oosterhout <kle...@gmail.com <mailto:kle...@gmail.com>>
> http://svana.org/kleptog/

Martijn van Oosterhout

Jul 20, 2016, 3:03:05 AM
to sqlal...@googlegroups.com
On 19 July 2016 at 23:22, Mike Bayer <mik...@zzzcomputing.com> wrote:


On 07/19/2016 05:20 PM, Martijn van Oosterhout wrote:


Thanks. On the way home though I had a thought: wouldn't it be simpler
to run the original query with yield_from(), and then after each block
run the query with a filter on the primary keys returned, and add all
the joinedload/subqueryload/etc options to this query, run it and rely
on the identity map to fix it for the objects returned the first time.
Or is that something we cannot rely on?

it works for the loading you're doing, where the primary keys of what's been fetched are fed into the subsequent query.  But it doesnt work for current subquery loading which does not make use of those identifiers, nor for joined loading which does OUTER JOIN onto the original query at once (doing "joinedload" as a separate query is essentially what subquery loading already is).


Ah, good point. Pity. I like the whole generative interface for joinedload/subqueryload/etc and would have liked to reuse that machinery somehow. Given that I'm trying to eager load several levels of relationships, it'd be nice to automate that...
 
Have a nice day,
--

Martijn van Oosterhout

Jul 20, 2016, 11:04:59 AM
to sqlalchemy
Ok, so this is what I have for today. It works and handles all kinds of corner cases, yet it's not quite what I want: it does everything as a joinedload. It's much easier to use now, though.

You can do things like:

q = Session.query(Animal)

for animal in yielded_load(q, (joinedload(Animal.owner).joinedload(Human.family),
                               joinedload(Animal.species).joinedload(Species.phylum))):
    do_something(animal)

It says joinedload(), but it doesn't actually pay attention to that; it just uses it to determine the path. It would be really nice to be able to specify that some things should be fetched using subqueryload(), but that would require unpacking/manipulating the Load objects, and I don't think there's a supported interface for that. Additionally, it would be nice if it could notice that paths share a common prefix and only fetch those once. Still, for the amount of code it's pretty effective.

from itertools import groupby, islice
from sqlalchemy.orm import attributes, Load, aliased
from sqlalchemy import tuple_


def yielded_load(query, load_options, N=1000):
    # Note: query must return only a single object (for now anyway)
    main_query = query.yield_per(N)
    main_res = iter(main_query)

    while True:
        # Fetch a block of results from the query
        objs = list(islice(main_res, N))
        if not objs:
            break

        for load_option in load_options:
            # Get the path of attributes to follow
            path = load_option.path
            pk = path[0].prop.parent.primary_key

            # Generate a query that joins against the original table
            child_q = main_query.session.query().order_by(*pk)

            for i, attr in enumerate(path):
                if i == 0:
                    # For the first relationship we add the target and the pkey columns
                    # Note: add_columns() doesn't work here? with_entities() does....
                    next_table = attr.prop.target
                    child_q = child_q.join(next_table, attr)
                    child_q = child_q.with_entities(attr.prop.mapper).add_columns(*pk)

                    if attr.prop.order_by:
                        child_q = child_q.order_by(*attr.prop.order_by)

                    opts = Load(attr.prop.mapper)
                else:
                    # For all relationships after the first we can use contains_eager
                    # Note: the aliasing is to handle cases where the relationships loop
                    next_table = aliased(attr.prop.target)
                    child_q = child_q.join(next_table, attr, isouter=True)
                    opts = opts.contains_eager(attr, alias=next_table)

            child_q = child_q.options(opts)

            keys = [[getattr(obj, col.key) for col in pk] for obj in objs]
            child_q = child_q.filter(tuple_(*pk).in_(keys))

            # Here we use the fact that the first column is the target object
            collections = dict((k, [r[0] for r in v]) for k, v in groupby(
                child_q,
                lambda x: tuple(getattr(x, c.key) for c in pk)))

            for obj in objs:
                # We can traverse both many-to-one and one-to-many
                if path[0].prop.uselist:
                    attributes.set_committed_value(
                        obj,
                        path[0].key,
                        collections.get(
                            tuple(getattr(obj, c.key) for c in pk),
                            ()))
                else:
                    attributes.set_committed_value(
                        obj,
                        path[0].key,
                        collections.get(
                            tuple(getattr(obj, c.key) for c in pk),
                            [None])[0])

        for obj in objs:
            yield obj


--
Martijn