The attributes.set_committed_value() function solves half the problem, but I still need to generate the actual query that returns the necessary objects. So perhaps (pseudo-code):

def eagerload_for_set(object_list, relationship):
    ids = set(o.get(relationship.left_id) for o in object_list)
    lookup = group_by_key(
        Query(relationship.right_table)
            .filter(relationship.right_column.in_(ids)).all())
    for o in object_list:
        set_committed_value(o, relationship.left,
                            lookup[o.get(relationship.left_id)])
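For what it's worth, the set_committed_value() half is already workable today. A minimal, self-contained sketch of the idea (toy User/Address models on in-memory SQLite, modern declarative imports; all names here are illustrative, not from the thread):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, attributes, declarative_base, relationship

Base = declarative_base()


class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    # lazy="raise" proves below that no lazy load sneaks in
    addresses = relationship("Address", lazy="raise")


class Address(Base):
    __tablename__ = "address"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"))
    email = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
session.add_all([
    User(id=1), User(id=2),
    Address(user_id=1, email="a@example.com"),
    Address(user_id=1, email="b@example.com"),
    Address(user_id=2, email="c@example.com"),
])
session.commit()

users = session.query(User).order_by(User.id).all()

# One IN() query for all the children, grouped by foreign key...
lookup = {}
for addr in session.query(Address).filter(
        Address.user_id.in_([u.id for u in users])):
    lookup.setdefault(addr.user_id, []).append(addr)

# ...then installed as the "already loaded" value of each collection.
for user in users:
    attributes.set_committed_value(user, "addresses", lookup.get(user.id, []))

counts = [len(u.addresses) for u in users]
print(counts)
```

Accessing .addresses afterwards emits no SQL at all; with lazy="raise" it would blow up if it tried.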
Here's the problem that can't be solved:
1. fetch rows 0-1000
2. start loading Foo objects:
Foo(1) -> eagerload Foo.related with 400 members
Foo(2) -> eagerload Foo.related with 500 members
Foo(3) -> eagerload Foo.related with the first 100 of 250 members
3. yield. That is, stop reading more rows and send out Foo(1), Foo(2), Foo(3). The application can now go nuts with these: read them, write them, iterate through .related.
Right there, we're broken. Foo(3) is out in the world with less than half of its ".related" collection; the rest are still sitting on the cursor!
Vastly simpler is to use the recommended window-function approach, which paginates at any scale with no chance of these failure modes.
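For reference, that windowed approach can be condensed to something like the following. This is a sketch in the spirit of the WindowedRangeQuery recipe, not the recipe itself; the Foo model and window size are illustrative, and running it on SQLite needs a build with window functions (3.25+):

```python
from sqlalchemy import Column, Integer, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Foo(Base):
    __tablename__ = "foo"
    id = Column(Integer, primary_key=True)


def windowed_query(session, model, pk_col, window_size):
    """Yield all rows of ``model`` in chunks bounded by primary-key ranges."""
    # First scan: number every row by pk and keep every window_size-th
    # key; these become the lower bounds of the windows.
    rownum = func.row_number().over(order_by=pk_col).label("rownum")
    subq = session.query(pk_col.label("pk"), rownum).subquery()
    starts = [row.pk for row in session.query(subq.c.pk).filter(
        (subq.c.rownum - 1) % window_size == 0)]
    # Second scan, window by window: each window is a plain range filter,
    # so joinedload()/subqueryload() options behave normally within it.
    for i, start in enumerate(starts):
        q = session.query(model).order_by(pk_col).filter(pk_col >= start)
        if i + 1 < len(starts):
            q = q.filter(pk_col < starts[i + 1])
        yield from q


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
session.add_all(Foo(id=i) for i in range(1, 11))
session.commit()
ids = [foo.id for foo in windowed_query(session, Foo, Foo.id, window_size=3)]
print(ids)
```

The table is scanned twice, but the first scan touches only the primary key, and no window can ever be split across a cursor boundary.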
On 15 Jul 2016 23:57, "Mike Bayer" <mik...@zzzcomputing.com> wrote:
> OK, but the IN loading you're doing, that's good too, I'd like to add a loader which uses that someday, and you're right, if we did, we could make it work with yield_per too due to the nature of it. If you want to use that approach take a look at https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/DisjointEagerLoading.
Wow, that's another really nice example. Those usage examples really need some more prominence in the google search results.
The windowed approach has the advantage of the queries being simpler, but requires scanning the table twice (although the first might be cheaper). Using the IN() approach is more general perhaps, but the queries would be much longer so more traffic to the server. It would be nice if it could all be made to Just Work.
I'll play a bit and see what I can get to work. Thanks again.
Martijn
from itertools import groupby, islice
from sqlalchemy.orm import attributes, object_session
from sqlalchemy import tuple_


def yielded_load(query, attrs, N=1000):
    # Note: query must return only a single object (for now anyway)
    main_query = query.yield_per(N)
    main_res = iter(main_query)
    while True:
        # Fetch block of results from query
        objs = list(islice(main_res, N))
        if not objs:
            break
        for attr in attrs:
            target = attr.prop.mapper
            pk = attr.prop.parent.primary_key
            # Generate query that joins against original table
            child_q = object_session(objs[0]).query(target, *pk).order_by(*pk)
            if attr.prop.order_by:
                child_q = child_q.order_by(*attr.prop.order_by)
            keys = [[getattr(obj, col.key) for col in pk] for obj in objs]
            child_q = child_q.join(attr).filter(tuple_(*pk).in_(keys))
            collections = dict(
                (k, [r[0] for r in v])
                for k, v in groupby(
                    child_q,
                    lambda x: tuple(getattr(x, c.key) for c in pk),
                )
            )
            for obj in objs:
                attributes.set_committed_value(
                    obj,
                    attr.key,
                    collections.get(
                        tuple(getattr(obj, c.key) for c in pk), ()
                    ),
                )
        for obj in objs:
            yield obj
On 07/19/2016 12:33 PM, Martijn van Oosterhout wrote:
On Sunday, July 17, 2016 at 8:47:11 AM UTC+2, Martijn van Oosterhout wrote:
But in any case, even this improves performance greatly.
this is adequate for a new recipe if you are interested in adding it. Also I think sqlalchemy-utils provides a feature along these lines but I'm not sure how well it works or handles those harder cases like chained relationships.
On 07/19/2016 05:20 PM, Martijn van Oosterhout wrote:
Thanks. On the way home, though, I had a thought: wouldn't it be simpler to run the original query with yield_per(), and then after each block run the query again with a filter on the primary keys returned, adding all the joinedload()/subqueryload()/etc. options to that second query, run it, and rely on the identity map to fix things up on the objects returned the first time? Or is that something we cannot rely on?
It works for the loading you're doing, where the primary keys of what's been fetched are fed into the subsequent query. But it doesn't work for current subquery loading, which does not make use of those identifiers, nor for joined loading, which does an OUTER JOIN onto the original query all at once (doing "joinedload" as a separate query is essentially what subquery loading already is).
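Since that identity-map variant is confirmed to work for this kind of keys-fed-forward loading, here is a minimal, self-contained sketch of it (toy Parent/Child schema on in-memory SQLite; all names are illustrative): stream the bare query with yield_per(), and for each chunk re-select the same primary keys with the eager options attached, letting the identity map populate the instances about to be yielded.

```python
from itertools import islice

from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship, subqueryload

Base = declarative_base()


class Parent(Base):
    __tablename__ = "parent"
    id = Column(Integer, primary_key=True)
    children = relationship("Child")


class Child(Base):
    __tablename__ = "child"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey("parent.id"))


def chunked_eager(session, model, options, n=2):
    rows = iter(session.query(model).order_by(model.id).yield_per(n))
    while True:
        chunk = list(islice(rows, n))
        if not chunk:
            break
        # Re-select the same rows by primary key with the eager options
        # attached; the identity map routes the loaded collections onto
        # the instances we are about to yield.
        session.query(model).options(*options).filter(
            model.id.in_([obj.id for obj in chunk])).all()
        yield from chunk


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
session.add_all([
    Parent(id=1), Parent(id=2), Parent(id=3),
    Child(parent_id=1), Child(parent_id=1), Child(parent_id=2),
])
session.commit()

counts = [len(p.children) for p in
          chunked_eager(session, Parent, [subqueryload(Parent.children)])]
print(counts)
```

Each chunk costs one extra SELECT plus the subquery load, but no object ever leaves the generator with a half-filled collection.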
q = Session.query(Animal)
for animal in yielded_load(q, (joinedload(Animal.owner).joinedload(Human.family),
                               joinedload(Animal.species).joinedload(Species.phylum))):
    do_something(animal)

from itertools import groupby, islice
from sqlalchemy.orm import attributes, Load, aliased
from sqlalchemy import tuple_


def yielded_load(query, load_options, N=1000):
    # Note: query must return only a single object (for now anyway)
    main_query = query.yield_per(N)
    main_res = iter(main_query)
    while True:
        # Fetch block of results from query
        objs = list(islice(main_res, N))
        if not objs:
            break
        for load_option in load_options:
            # Get path of attributes to follow
            path = load_option.path
            pk = path[0].prop.parent.primary_key
            # Generate query that joins against original table
            child_q = main_query.session.query().order_by(*pk)
            for i, attr in enumerate(path):
                if i == 0:
                    # For the first relationship we add the target and the pkey columns
                    # Note: add_columns() doesn't work here? with_entities() does....
                    next_table = attr.prop.target
                    child_q = child_q.join(next_table, attr)
                    child_q = child_q.with_entities(attr.prop.mapper).add_columns(*pk)
                    if attr.prop.order_by:
                        child_q = child_q.order_by(*attr.prop.order_by)
                    opts = Load(attr.prop.mapper)
                else:
                    # For all relationships after the first we can use contains_eager
                    # Note: the aliasing is to handle cases where the relationships loop
                    next_table = aliased(attr.prop.target)
                    child_q = child_q.join(next_table, attr, isouter=True)
                    opts = opts.contains_eager(attr, alias=next_table)
            child_q = child_q.options(opts)
            keys = [[getattr(obj, col.key) for col in pk] for obj in objs]
            child_q = child_q.filter(tuple_(*pk).in_(keys))
            # Here we use the fact that the first column is the target object
            collections = dict(
                (k, [r[0] for r in v])
                for k, v in groupby(
                    child_q,
                    lambda x: tuple(getattr(x, c.key) for c in pk),
                )
            )
            for obj in objs:
                # We can traverse both many-to-one and one-to-many
                if path[0].prop.uselist:
                    attributes.set_committed_value(
                        obj,
                        path[0].key,
                        collections.get(
                            tuple(getattr(obj, c.key) for c in pk), ()
                        ),
                    )
                else:
                    attributes.set_committed_value(
                        obj,
                        path[0].key,
                        collections.get(
                            tuple(getattr(obj, c.key) for c in pk), [None]
                        )[0],
                    )
        for obj in objs:
            yield obj