[Q][SA 0.7.9] yield_per() vs. joinedload() and subqueryload()


Ladislav Lenart

unread,
Jun 4, 2013, 8:18:09 AM6/4/13
to sqlal...@googlegroups.com
Hello.

I have a hard time understanding the following comment for Query.yield_per():

Yield only ``count`` rows at a time.

WARNING: use this method with caution; if the same instance is present
in more than one batch of rows, end-user changes to attributes will be
overwritten.

In particular, it's usually impossible to use this setting with
eagerly loaded collections (i.e. any lazy='joined' or 'subquery')
since those collections will be cleared for a new load when
encountered in a subsequent result batch. In the case of 'subquery'
loading, the full result for all rows is fetched which generally
defeats the purpose of :meth:`~sqlalchemy.orm.query.Query.yield_per`.

Also note that many DBAPIs do not "stream" results, pre-buffering
all rows before making them available, including mysql-python and
psycopg2. :meth:`~sqlalchemy.orm.query.Query.yield_per` will also
set the ``stream_results`` execution
option to ``True``, which currently is only understood by psycopg2
and causes server side cursors to be used.

Suppose I have code like this:

q = session.query(cls).filter(...)
q = q.options(
    joinedload(cls.foo),
    subqueryload(cls.bars),
)
for each in q.yield_per(50):
    # access each.foo or each.bars
    ...

Does it work? If so, why? If not, why not?

I am using PostgreSQL 9.1 with psycopg2.


Thank you in advance,

Ladislav Lenart

Michael Bayer

unread,
Jun 4, 2013, 10:19:52 AM6/4/13
to sqlal...@googlegroups.com
assuming cls.foo is a many-to-one, it will produce the correct result, but it will be far worse in terms of memory and performance, as the subqueryload() will be invoked for each distinct batch of 50 rows, across the *full* result set. So if your result has 1000 rows, and the number of "bars" total is 10000, you will load 10000 additional rows for *each* yield_per() batch of 50.

Ladislav Lenart

unread,
Jun 4, 2013, 10:45:10 AM6/4/13
to sqlal...@googlegroups.com, Michael Bayer
Hello.

> assuming cls.foo is a many-to-one, it will produce the correct result,
> but will be far worse in terms of memory and performance, as the
> subqueryload() call will be invoked for each distinct batch of 50 rows,
> across the *full* result set. So if your result has 1000 rows, and the
> number of "bars" total is 10000, you will load 10000 additional rows for
> *each* yield per of 50.

Ok, I think I get it. Is there a way to make it all work without the performance
penalty of subqueryload? For example, what will happen if I replace it with
joinedload(cls.bars)?

Ladislav Lenart

Michael Bayer

unread,
Jun 4, 2013, 11:18:33 AM6/4/13
to sqlal...@googlegroups.com

On Jun 4, 2013, at 10:45 AM, Ladislav Lenart <lena...@volny.cz> wrote:

> Hello.
>
>> assuming cls.foo is a many-to-one, it will produce the correct result,
>> but will be far worse in terms of memory and performance, as the
>> subqueryload() call will be invoked for each distinct batch of 50 rows,
>> across the *full* result set. So if your result has 1000 rows, and the
>> number of "bars" total is 10000, you will load 10000 additional rows for
>> *each* yield per of 50.
>
> Ok, I think I get it. Is there a way to make it all work without the performance
> penalty of subqueryload? For example, what will happen if I replace it with
> joinedload(cls.bars)?

You will then get the wrong results. The docstring tries to explain this - a joinedload uses a JOIN. For each "cls" instance, there are many rows, one for each "bar". If you cut off the results in the middle of populating that collection, the collection is incomplete, and you'll see the wrong collection on your cls.bars. On the next load, cls.bars will be wiped out and populated with the remaining "bar" objects.

Don't use yield_per. Use windowing instead, see http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery.
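The windowing approach can be sketched like this. This is a minimal, self-contained illustration (not the exact wiki recipe) using an in-memory SQLite database and a hypothetical Widget model; the idea is to number the rows once, take the primary-key value that starts each window, and then run one small bounded query per window:

```python
from sqlalchemy import create_engine, Column, Integer, String, func
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base
except ImportError:  # SQLAlchemy < 1.4
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Widget(Base):  # hypothetical model for illustration
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([Widget(name='w%d' % i) for i in range(1, 101)])
session.commit()

def windowed_query(session, cls, windowsize):
    """Yield objects in primary-key order, one bounded window at a time."""
    # Number every row, then keep the id that starts each window.
    rownum = func.row_number().over(order_by=cls.id).label('rownum')
    sub = session.query(cls.id, rownum).subquery()
    starts = [r.id for r in
              session.query(sub.c.id).filter(sub.c.rownum % windowsize == 1)]
    for i, start in enumerate(starts):
        # Each window is a plain range query; eager-load options could be
        # applied here without the yield_per() pitfalls.
        q = session.query(cls).filter(cls.id >= start)
        if i + 1 < len(starts):
            q = q.filter(cls.id < starts[i + 1])
        for obj in q.order_by(cls.id):
            yield obj

names = [w.name for w in windowed_query(session, Widget, 30)]
```

Because each window is an independent, fully-consumed query, collections are never split across result batches.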




Ladislav Lenart

unread,
Jun 4, 2013, 11:41:06 AM6/4/13
to sqlal...@googlegroups.com, Michael Bayer
Hello.

> You will then get the wrong results. The docstring tries to explain this -
> a joinedload uses a JOIN. For each "cls" instance, there are many rows, one
> for each "bar". If you cut off the results in the middle of populating that
> collection, the collection is incomplete, you'll see the wrong collection on
> your cls.bars. On the next load, cls.bars will be wiped out and populated
> with the remaining "bar" objects.

Ok, I think I understand this too.

I've tried WindowedRangeQuery. It looked promising at first but it is (much)
slower than yield_per() with all its quirks, at least for my use case. If I
understand the WindowedRangeQuery recipe correctly, it does a full scan of the
target table first to read all the ids and calculate the bounds of all the
windows. I don't want to do it like this. I am working with relatively large
datasets, but still far fewer rows than the whole table, something like
10-50000 rows from a table with 1-2 million rows. The windowed query iterates
over many completely empty windows.

Can I modify the recipe so it preserves the filtering and creates windows only
for the interesting subset of the table?

Thank you,

Ladislav Lenart

Michael Bayer

unread,
Jun 4, 2013, 11:57:30 AM6/4/13
to sqlal...@googlegroups.com

On Jun 4, 2013, at 11:41 AM, Ladislav Lenart <lena...@volny.cz> wrote:

> Hello.
>
>> You will then get the wrong results. The docstring tries to explain this -
>> a joinedload uses a JOIN. For each "cls" instance, there are many rows, one
>> for each "bar". If you cut off the results in the middle of populating that
>> collection, the collection is incomplete, you'll see the wrong collection on
>> your cls.bars. On the next load, cls.bars will be wiped out and populated
>> with the remaining "bar" objects.
>
> Ok, I think I understand this too.
>
> I've tried WindowedRangeQuery. It looked promising at first but it is (much)
> slower than yield_per() with all its quirks, at least for my usecase.

OK, but with yield_per() you want to use eager loading also, so yield_per() is not fast enough either, I guess....


> If I
> understand the WindowedRangeQuery recipe, it does a full scan of the target
> table first to read all the ids and calculate the bounds of all the windows. I
> don't want to it like this. I am working with relatively large datasets but it
> is still far less than all rows in the table. Something like 10-50000 rows from
> a table with 1-2 million rows. The windowed query iterates over many completely
> empty windows.
>
> Can I modify the recipe so it preserves the filtering and creates windows only
> for the interesting subset of the table?

Absolutely, you should do whatever you have to in order to get the range you want, in fact the recipe even says this:

Enhance this yourself ! Add a "where" argument
so that windows of just a subset of rows can
be computed.

if your situation is even simpler than that, such as just querying from PKs 50-1000, you could just make up your own integer ranges within those two endpoints.
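That simplest case can be sketched as follows. This is a hypothetical, self-contained example (in-memory SQLite, made-up Item model): walk a known integer PK range in fixed-size chunks, with no row_number() pass at all; a chunk that happens to be empty costs only one cheap indexed query:

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base
except ImportError:  # SQLAlchemy < 1.4
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Item(Base):  # hypothetical model for illustration
    __tablename__ = 'item'
    id = Column(Integer, primary_key=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([Item(id=i) for i in range(50, 1001)])
session.commit()

def iterate_pk_ranges(session, cls, lo, hi, size):
    """Walk primary keys in [lo, hi] in fixed-size chunks."""
    for start in range(lo, hi + 1, size):
        q = (session.query(cls)
             .filter(cls.id >= start, cls.id < start + size)
             .order_by(cls.id))
        for obj in q:
            yield obj

ids = [item.id for item in iterate_pk_ranges(session, Item, 50, 1000, 100)]
```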


Ladislav Lenart

unread,
Jun 4, 2013, 12:28:22 PM6/4/13
to sqlal...@googlegroups.com, Michael Bayer
Hello.

> OK, but with yield_per() you want to use eagerloading also, so yield_per()
> not fast enough either, I guess....

No. I use yield_per() on complex queries with join(), filter() and both
joinedload() and subqueryload(). It is possible that they sometimes return
wrong results because of yield_per(). I am not sure about that, but it is
definitely much faster than the original WindowedRangeQuery recipe. I can only
speculate that postgres caches the subqueryload results...


> Absolutely, you should do whatever you have to in order to get the range
> you want, in fact the recipe even says this...

Ok. What I want to do is basically the following:
* Suppose a complex query q with join(...), filter(...) and options(...).
* I need to create q2 from q such that:
  * It has no options.
    * Can I reset the options with q = q.options(None)?
  * It has a select with the over by magic taken from the WindowedRangeQuery recipe.
    * I know I can use with_entities() for this.
* I will use q2 to obtain the window ranges.
* I will iterate over the window ranges, apply each to the original q and
  execute it in a loop.

Can this strategy work?
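A sketch of the "derive window starts from the filtered query" part of that plan, under the assumption that the base query is built without options so nothing needs resetting (hypothetical model, in-memory SQLite):

```python
from sqlalchemy import create_engine, Column, Integer, String, func
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base
except ImportError:  # SQLAlchemy < 1.4
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Contact(Base):  # hypothetical model for illustration
    __tablename__ = 'contact'
    id = Column(Integer, primary_key=True)
    kind = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
# ids 1..20; odd ids get kind 'a', even ids get kind 'b'
session.add_all([Contact(kind='a' if i % 2 else 'b') for i in range(1, 21)])
session.commit()

def window_starts(q, pk_col, windowsize):
    """Given a filtered Query q, return the pk values that start each window.
    with_entities() swaps the selected columns while keeping q's filters."""
    rownum = func.row_number().over(order_by=pk_col).label('rownum')
    sub = q.with_entities(pk_col.label('pk'), rownum).subquery()
    return [r.pk for r in
            q.session.query(sub.c.pk).filter(sub.c.rownum % windowsize == 1)]

q = session.query(Contact).filter(Contact.kind == 'a')  # the filtered base query
starts = window_starts(q, Contact.id, 4)
```

Each start value then bounds one window of the original (optioned) query, so only the interesting subset of the table is windowed.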


Thank you,

Ladislav Lenart

Michael Bayer

unread,
Jun 4, 2013, 12:49:00 PM6/4/13
to sqlal...@googlegroups.com

On Jun 4, 2013, at 12:28 PM, Ladislav Lenart <lena...@volny.cz> wrote:

>
>> Absolutely, you should do whatever you have to in order to get the range
>> you want, in fact the recipe even says this...
>
> Ok. What I want to do is basically the following:
> * Suppose a complex query q with join(...), filter(...) and options(...).
> * I need to create q2 from q such that:
> * It has no options.
> * Can I reset the options with q = q.options(None)?
> * It has select with the over by magic taken from WindowedRangeQuery recipe.
> * I know I can use with_entities() for this.
> * I will use q2 to obtain the window ranges.
> * I will iterate over the window ranges and apply each to the original q and
> execute it in a loop.
>
> Can this strategy work?

there's not a publicly supported feature to "reset" the options right now, so you'd probably need to apply them after you get your window ranging query. You can probably remove the effect of eager loaders by assigning a blank _attributes dictionary to the new query, though (just no long-term guarantees).

other than that, I use the windowing concept extensively and it works very well.



Ladislav Lenart

unread,
Jun 4, 2013, 1:15:20 PM6/4/13
to sqlal...@googlegroups.com, Michael Bayer
Ok, I will try it.

Thank you very much for your invaluable insights,

Ladislav Lenart

Ladislav Lenart

unread,
Jun 5, 2013, 5:26:39 AM6/5/13
to sqlal...@googlegroups.com, Michael Bayer
Hello.

I have one more question about my approach to WindowedRangeQuery:

> * Suppose a complex query q with join(...), filter(...) and options(...).
> * I need to create q2 from q such that:
> * It has no options.
> * Can I reset the options with q = q.options(None)?
> * It has select with the over by magic taken from WindowedRangeQuery recipe.
> * I know I can use with_entities() for this.
> * I will use q2 to obtain the window ranges.
> * I will iterate over the window ranges and apply each to the original q and
> execute it in a loop.

Suppose I have a complex query (without options() / order_by()) to obtain the
objects I want. I use this query to calculate window intervals. I don't have to
use it again to fetch the objects, because I already have their id intervals. Am
I right?


Thank you,

Ladislav Lenart

Ladislav Lenart

unread,
Jun 5, 2013, 8:03:22 AM6/5/13
to sqlal...@googlegroups.com, Michael Bayer
Hello.

I've tried an experiment to verify that yield_per() with subqueryload() behaves
as badly as you described, but according to my practical observation, it issues
ONE subqueryload() and everything behaves as I would expect. It emits two
SELECTs, one to fetch the objects and a second one to fetch the related data,
and that's about it, see below. So back to square one: what is wrong with it?


from sqlalchemy import Column, Integer, Unicode, ForeignKey
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Contact(Base):
    # Just a sketch, I hope you get the picture.
    __tablename__ = 'contact'
    id = Column(Integer, primary_key=True)
    name = Column(Unicode)
    phones = relationship('Phone', backref='contact')  # many

class Phone(Base):
    # Just a sketch, I hope you get the picture.
    __tablename__ = 'phone'
    id = Column(Integer, primary_key=True)
    number = Column(Unicode)
    contact_id = Column(Integer, ForeignKey('contact.id'))  # one contact

# Setup engine with echo set to True.

phones = [u'123456789', u'987654321', u'555777999']
for i in range(1, 11):
    c = Contact(name=u' '.join([u'Contact', unicode(i)]))
    session.add(c)
    session.add_all(Phone(contact=c, number=e) for e in phones)
session.flush()
session.expunge_all()

q = session.query(Contact).options(subqueryload(Contact.phones))
for each in q.yield_per(2):
    print each.name
    for e in each.phones:
        print e.number

The output is like this:
SA info about all the inserts after session.flush().
SA info about select for contacts.
SA info about select for their phones.
Contact 10
123456789
987654321
555777999
Contact 9
123456789
987654321
555777999
[... the same three phone numbers for each of Contact 8 through Contact 2 ...]
Contact 1
123456789
987654321
555777999


Thank you,

Ladislav Lenart

Ladislav Lenart

unread,
Jun 5, 2013, 10:20:40 AM6/5/13
to sqlal...@googlegroups.com, Michael Bayer
Hello.

One more note. I've just tried the below experiment with joinedload() instead of
subqueryload() and that does NOT work, just as you predicted. One contact is
returned several times and the first occurrences have incomplete phones.

However my experiments with subqueryload() suggest that it works just fine with
yield_per(). Can you elaborate on that one please?


Thank you,

Ladislav Lenart

Michael Bayer

unread,
Jun 5, 2013, 10:27:27 AM6/5/13
to sqlal...@googlegroups.com


the Query emitted by subqueryload does not use yield_per. So if your total result set is 1000 rows, and the total rows represented by all the collections is 10000, the first time that query is emitted, 10000 rows will be fully loaded and processed into memory at once. This would typically occur somewhere in the first few rows of your 50-row yield_per batch. So the intent of yield_per, which is to conserve memory and upfront loading overhead, would be defeated entirely by this.

The subqueryload is emitting once, if I recall correctly, because I probably at some point had the query result memoized in the query context to prevent it from being emitted many times in a yield_per scenario.

Ladislav Lenart

unread,
Jun 5, 2013, 10:37:38 AM6/5/13
to sqlal...@googlegroups.com, Michael Bayer
Ok, so yield_per() is useless when subqueryload() is used.

Thank you,

Ladislav Lenart