LIMIT the number of children in relationship dynamically

3,048 views
Skip to first unread message

Isaac Martin

unread,
May 1, 2017, 8:05:42 PM5/1/17
to sqlalchemy

I am building an api which can return children of resources if the user requests it. For example, user has messages. I want the query to be able to limit the number of message objects that are returned.

I found a useful tip about limiting the number of objects in child collections here. Basically, it indicates the following flow:

class User(...):
    # ...
    messages = relationship('Messages', order_by='desc(Messages.date)', lazy='dynamic')

user = User.query.one()
users.messages.limit(10)

My use case involves returning sometimes large numbers of users.

If I were to follow the advice in that link and used .limit() then I would need to iterate over the entire collection of users calling .limit() on each one. This is much less efficient then, say, using LIMITin the original sql expression which created the collection.

My question is whether it is possible using declarative to efficiently(N+0) load a large collection of objects while limiting the number of children in their child collections using sqlalchemy?


To be clear, the below is what I am trying to avoid.

users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()

I want to do something more like:

users = User.query.option(User.messages.limit(10)).all()

SO post here: http://stackoverflow.com/questions/43727268/limit-child-collections-in-initial-query-sqlalchemy/43727705?noredirect=1#comment74499641_43727705

mike bayer

unread,
May 1, 2017, 10:34:29 PM5/1/17
to sqlal...@googlegroups.com


On 05/01/2017 08:05 PM, Isaac Martin wrote:
>
> I am building an api which can return children of resources if the user
> requests it. For example, |user| has |messages|. I want the query to be
> able to limit the number of |message| objects that are returned.
>
> I found a useful tip about limiting the number of objects in child
> collections here
> <http://stackoverflow.com/questions/9148316/how-to-limit-offset-sqlalchemy-orm-relations-result?noredirect=1&lq=1>.
> Basically, it indicates the following flow:
>
> |classUser(...):# ...messages
> =relationship('Messages',order_by='desc(Messages.date)',lazy='dynamic')user
> =User.query.one()users.messages.limit(10)|
>
> My use case involves returning sometimes large numbers of users.
>
> If I were to follow the advice in that link and used |.limit()| then I
> would need to iterate over the entire collection of users calling
> |.limit()| on each one. This is much less efficient then, say, using
> |LIMIT|in the original sql expression which created the collection.
>
> My question is whether it is possible using declarative to
> efficiently(N+0) load a large collection of objects while limiting the
> number of children in their child collections using sqlalchemy?
>
>
> To be clear, the below is what I am trying to /avoid/.
>
> |users =User.query.all()messages ={}foruser
> inusers:messages[user.id]=user.messages.limit(10).all()|
>
> I want to do something more like:
>
> |users =User.query.option(User.messages.limit(10)).all()|

so you should ignore whether or not it uses "declarative", which has
nothing to do with querying, and in fact at first ignore Query too,
because first and foremost this is a SQL problem. You want one SQL
statement that does this. What query in SQL would load lots of rows
from the primary table, joined to the first ten rows of the secondary
table for each primary?

LIMIT is tricky because it's not actually part of the usual "relational
algebra" calculation. It's outside of that because it's an artificial
limit on rows. For example, my first thought on how to do this was wrong:

select * from users left outer join (select * from messages limit 10)
as anon_1 on users.id = anon_1.user_id

This is wrong because it only gets the first ten messages in the
aggregate, disregarding user. We want to get the first ten messages for
each user, which means we need to do this "select from messages limit
10" individually for each user. That is, we need to correlate somehow.
A correlated subquery though is not usually allowed as a FROM element,
and is only allowed as a SQL expression, it can only return a single
column and a single row; we can't normally JOIN to a correlated subquery
in plain vanilla SQL. We can however, correlate inside the ON clause
of the JOIN to make this possible in vanilla SQL.

But first, if we are on a modern Postgresql version, we *can* break that
usual rule of correlation and use a keyword called LATERAL, which allows
correlation in a FROM clause. LATERAL is only supported by modern
Postgresql versions, and it makes this easy:

select * from users left outer join lateral
(select * from message where message.user_id = users.id order
by messages.date desc limit 10) as anon1 on users.id = anon_1.user_id

we support the LATERAL keyword. The query above looks like this:


subq = s.query(Messages).\
filter(Messages.user_id == User.id).\
order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
options(contains_eager(User.messages, alias=subq))

Note that above, in order to SELECT both users and messages and produce
them into the User.messages collection, the "contains_eager()" option
must be used and for that the "dynamic" has to go away. This is not the
only option, you can for example build a second relationship for
User.messages that doesn't have the "dynamic" or you can just load from
query(User, Message) separately and organize the result tuples as needed.

if you aren't using Postgresql, or a version of Postgresql that doesn't
support LATERAL, the correlation has to be worked into the ON clause of
the join instead. The SQL looks like:

select * from users left outer join messages on
users.id = messages.user_id and messages.date > (select date
from messages where messages.user_id = users.id order by date desc limit
1 offset 10)

Here, in order to jam the LIMIT in there, we are actually stepping
through the first 10 rows with OFFSET and then doing LIMIT 1 to get the
date that represents the lower bound date we want for each user. Then
we have to join while comparing on that date, which can be expensive if
this column isn't indexed and also can be inaccurate if there are
duplicate dates.

This query looks like:

subq = s.query(Messages.date).\
filter(Messages.user_id == User.id).\
order_by(Messages.date.desc()).\
limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
Messages,
and_(User.id == Messages.user_id, Messages.date >= subq)
).options(contains_eager(User.messages))


These kinds of queries are the kind that I don't trust without a good
test, so POC below includes both versions including a sanity check.


from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base
import datetime

Base = declarative_base()


class User(Base):
__tablename__ = 'user'
id = Column(Integer, primary_key=True)
messages = relationship(
'Messages', order_by='desc(Messages.date)')

class Messages(Base):
__tablename__ = 'message'
id = Column(Integer, primary_key=True)
user_id = Column(ForeignKey('user.id'))
date = Column(Date)

e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
Base.metadata.drop_all(e)
Base.metadata.create_all(e)

s = Session(e)

s.add_all([
User(id=i, messages=[
Messages(id=(i * 20) + j, date=datetime.date(2017, 3, j))
for j in range(1, 20)
]) for i in range(1, 51)
])

s.commit()

top_ten_dates = set(datetime.date(2017, 3, j) for j in range(10, 20))


def run_test(q):
all_u = q.all()
assert len(all_u) == 50
for u in all_u:

messages = u.messages
assert len(messages) == 10

for m in messages:
assert m.user_id == u.id

received = set(m.date for m in messages)

assert received == top_ten_dates

# version 1. no LATERAL

s.close()

subq = s.query(Messages.date).\
filter(Messages.user_id == User.id).\
order_by(Messages.date.desc()).\
limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
Messages,
and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))

run_test(q)

# version 2. LATERAL

s.close()

subq = s.query(Messages).\
filter(Messages.user_id == User.id).\
order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
options(contains_eager(User.messages, alias=subq))

run_test(q)
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy+...@googlegroups.com
> <mailto:sqlalchemy+...@googlegroups.com>.
> To post to this group, send email to sqlal...@googlegroups.com
> <mailto:sqlal...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Isaac Martin

unread,
May 2, 2017, 12:39:08 AM5/2/17
to sqlal...@googlegroups.com
Thank you very much for your response. This solution isn't working for me, and I'm 99% sure I'm not translating what you've written into something that works for my use case. My situation is somewhat more complicated than I initially wrote. My first question was in the vain hope that there was some easy way using declarative to tell anything loaded through a relationship() to load with a LIMIT. Obviously that isn't the case.

In my situation, I have a relationship which I have defined by hand. There are not foreign keys between the messages and the objects being loaded (Connections). When I try to use your lateral solution I get an error thrown that there are no foreign key relationships, which is true. 

Here is my model edited down to what I believe are the minimal relevant fields. I'm trying to return Connections with a limited number of objects in the Connection.messages field:

class Connection(Base):
    __tablename__ = 'Connection'

    user_1_id = Column(Integer, ForeignKey('User.uid'))
    user_2_id = Column(Integer, ForeignKey('User.uid'))
    messages =  relationship('Message',
                             secondary="join(MessageRecipient, Message, MessageRecipient.message_id == Message.uid)",
                             primaryjoin= 'or_(Connection.user_1_id == MessageRecipient.recipient_id,'
                                           'Connection.user_2_id == MessageRecipient.recipient_id)',
                             order_by="Message.created.desc()")

class MessageRecipient(Base):
    __tablename__ = 'MessageRecipient'

    recipient_id = Column(Integer, ForeignKey('User.uid'))

class Message(Base):
    __tablename__ = 'Message'

    created = Column(DateTime, default=func.current_timestamp())
    body = Column(String(2000))


Now, I'm using Postgres, so I did attempt to translate your lateral example: 

        subq = Message.query.\
            filter(and_(Message.uid == MessageRecipient.message_id, or_(MessageRecipient.uid == Connection.user_1_id,
                                            MessageRecipient.uid == Connection.user_2_id))).\
            order_by(Message.created.desc()).limit(1).subquery().lateral()

        q = Connection.query.outerjoin(subq). \
            options(contains_eager(Connection.messages, alias=subq))

The above throws an error:
Can't find any foreign key relationships between 'Connection' and '%(140122530861688 anon)s'.

I also tried your non-lateral example
        subq = self.db.session.query(Message.created). \
            filter(and_(Message.uid == MessageRecipient.message_id, or_(MessageRecipient.recipient_id == Connection.user_1_id,
                            MessageRecipient.recipient_id == Connection.user_2_id))).\
            order_by(Message.created.desc()). \
            limit(1).offset(10).correlate(Connection).as_scalar()

        q = self.db.session.query(Connection).join(
            MessageRecipient,
            or_(MessageRecipient.recipient_id == Connection.user_1_id,
                MessageRecipient.recipient_id == Connection.user_2_id)).\
            join( Message,
            and_(MessageRecipient.message_id == Message.uid, Message.created > subq)
        ).options(contains_eager(Connection.messages)).all()

The above does not throw an error, but it also does not return any entities. I'm going to keep working on the non-lateral example because I don't understand your example well enough to confidently say I'm being faithful to it. If you have other advice based on this new information then I'd really appreciate it.  If I figure out what I was doing wrong I'll post here for posterity.

To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscribe@googlegroups.com <mailto:sqlalchemy+unsubscribe@googlegroups.com>.
To post to this group, send email to sqlal...@googlegroups.com <mailto:sqlalchemy@googlegroups.com>.
--
SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- You received this message because you are subscribed to a topic in the Google Groups "sqlalchemy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sqlalchemy/lOedjqu61G4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sqlalchemy+unsubscribe@googlegroups.com.
To post to this group, send email to sqlal...@googlegroups.com.

Isaac Martin

unread,
May 2, 2017, 2:35:04 AM5/2/17
to sqlal...@googlegroups.com
Well that was a very silly mistake. End of the day oversight is what happened here. I didn't notice that "offset" was set to 10 and my unit test only had 2 values. When I set offset to 0 it worked great. 

Sincerest thanks for this! I hadn't seen as_scalar used before. I'm looking forward to trying subqueries like this myself in the future.

mike bayer

unread,
May 2, 2017, 9:08:48 AM5/2/17
to sqlal...@googlegroups.com


On 05/02/2017 12:38 AM, Isaac Martin wrote:
> Thank you very much for your response. This solution isn't working for
> me, and I'm 99% sure I'm not translating what you've written into
> something that works for my use case. My situation is somewhat more
> complicated than I initially wrote. My first question was in the vain
> hope that there was some easy way using declarative to tell anything
> loaded through a relationship() to load with a LIMIT. Obviously that
> isn't the case.

we can absolutely do that:


def message_join():
subq = select([Messages.date]).\
where(Messages.user_id == User.id).\
order_by(Messages.date.desc()).\
limit(1).offset(10).correlate(User).as_scalar()

return and_(User.id == Messages.user_id, Messages.date > subq)


class User(Base):
__tablename__ = 'user'
id = Column(Integer, primary_key=True)
messages = relationship(
'Messages', order_by='desc(Messages.date)',
primaryjoin=message_join)


class Messages(Base):
__tablename__ = 'message'
id = Column(Integer, primary_key=True)
user_id = Column(ForeignKey('user.id'))
date = Column(Date)



done, works great with joinedload(), lazy loading, etc:

q = s.query(User).options(joinedload(User.messages))



>
> In my situation, I have a relationship which I have defined by hand.
> There are not foreign keys between the messages and the objects being
> loaded (Connections). When I try to use your lateral solution I get an
> error thrown that there are no foreign key relationships, which is true.
>
> Here is my model edited down to what I believe are the minimal relevant
> fields. I'm trying to return Connections with a limited number of
> objects in the Connection.messages field:
>
> /class Connection(Base):/
> / __tablename__ = 'Connection'/
> /
> /
> / user_1_id = Column(Integer, ForeignKey('User.uid'))/
> / user_2_id = Column(Integer, ForeignKey('User.uid'))/
> / messages = relationship('Message',/
> / secondary="join(MessageRecipient, Message,
> MessageRecipient.message_id == Message.uid)",/
> / primaryjoin= 'or_(Connection.user_1_id ==
> MessageRecipient.recipient_id,'/
> / 'Connection.user_2_id ==
> MessageRecipient.recipient_id)',/
> / order_by="Message.created.desc()")/
> /
> /
> /class MessageRecipient(Base):/
> / __tablename__ = 'MessageRecipient'/
> /
> /
> / recipient_id = Column(Integer, ForeignKey('User.uid'))/
> /
> /
> /class Message(Base):/
> / __tablename__ = 'Message'/
> /
> /
> / created = Column(DateTime, default=func.current_timestamp())/
> / body = Column(String(2000))
> /
> /
> /
>
> Now, I'm using Postgres, so I did attempt to translate your lateral
> example:
>
> / subq = Message.query.\/
> / filter(and_(Message.uid == MessageRecipient.message_id,
> or_(MessageRecipient.uid == Connection.user_1_id,/
> / MessageRecipient.uid ==
> Connection.user_2_id))).\/
> / order_by(Message.created.desc()).limit(1).subquery().lateral()/
> /
> /
> / q = Connection.query.outerjoin(subq). \/
> / options(contains_eager(Connection.messages, alias=subq))/
> /
> /
> The above throws an error:
> Can't find any foreign key relationships between 'Connection' and
> '%(140122530861688 anon)s'.
>
> I also tried your non-lateral example
> / subq = self.db.session.query(Message.created). \/
> / filter(and_(Message.uid == MessageRecipient.message_id,
> or_(MessageRecipient.recipient_id == Connection.user_1_id,/
> / MessageRecipient.recipient_id ==
> Connection.user_2_id))).\/
> / order_by(Message.created.desc()). \/
> / limit(1).offset(10).correlate(Connection).as_scalar()/
> /
> /
> / q = self.db.session.query(Connection).join(/
> / MessageRecipient,/
> / or_(MessageRecipient.recipient_id == Connection.user_1_id,/
> / MessageRecipient.recipient_id == Connection.user_2_id)).\/
> / join( Message,/
> / and_(MessageRecipient.message_id == Message.uid,
> Message.created > subq)/
> / ).options(contains_eager(Connection.messages)).all()/
> /
> /
> <http://user.id>]=user.messages.limit(10).all()|
>
> I want to do something more like:
>
> |users =User.query.option(User.messages.limit(10)).all()|
>
>
> so you should ignore whether or not it uses "declarative", which has
> nothing to do with querying, and in fact at first ignore Query too,
> because first and foremost this is a SQL problem. You want one SQL
> statement that does this. What query in SQL would load lots of rows
> from the primary table, joined to the first ten rows of the
> secondary table for each primary?
>
> LIMIT is tricky because it's not actually part of the usual
> "relational algebra" calculation. It's outside of that because it's
> an artificial limit on rows. For example, my first thought on how
> to do this was wrong:
>
> select * from users left outer join (select * from messages
> limit 10) as anon_1 on users.id <http://users.id> = anon_1.user_id
>
> This is wrong because it only gets the first ten messages in the
> aggregate, disregarding user. We want to get the first ten messages
> for each user, which means we need to do this "select from messages
> limit 10" individually for each user. That is, we need to correlate
> somehow. A correlated subquery though is not usually allowed as a
> FROM element, and is only allowed as a SQL expression, it can only
> return a single column and a single row; we can't normally JOIN to a
> correlated subquery in plain vanilla SQL. We can however,
> correlate inside the ON clause of the JOIN to make this possible in
> vanilla SQL.
>
> But first, if we are on a modern Postgresql version, we *can* break
> that usual rule of correlation and use a keyword called LATERAL,
> which allows correlation in a FROM clause. LATERAL is only
> supported by modern Postgresql versions, and it makes this easy:
>
> select * from users left outer join lateral
> (select * from message where message.user_id = users.id
> <http://users.id> order by messages.date desc limit 10) as anon1 on
> users.id <http://users.id> = anon_1.user_id
>
> we support the LATERAL keyword. The query above looks like this:
>
>
> subq = s.query(Messages).\
> filter(Messages.user_id == User.id).\
> order_by(Messages.date.desc()).limit(10).subquery().lateral()
>
> q = s.query(User).outerjoin(subq).\
> options(contains_eager(User.messages, alias=subq))
>
> Note that above, in order to SELECT both users and messages and
> produce them into the User.messages collection, the
> "contains_eager()" option must be used and for that the "dynamic"
> has to go away. This is not the only option, you can for example
> build a second relationship for User.messages that doesn't have the
> "dynamic" or you can just load from query(User, Message) separately
> and organize the result tuples as needed.
>
> if you aren't using Postgresql, or a version of Postgresql that
> doesn't support LATERAL, the correlation has to be worked into the
> ON clause of the join instead. The SQL looks like:
>
> select * from users left outer join messages on
> users.id <http://users.id> = messages.user_id and messages.date >
> (select date from messages where messages.user_id = users.id
> <http://users.id> order by date desc limit 1 offset 10)
> user_id = Column(ForeignKey('user.id <http://user.id>'))
> date = Column(Date)
>
> e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
> Base.metadata.drop_all(e)
> Base.metadata.create_all(e)
>
> s = Session(e)
>
> s.add_all([
> User(id=i, messages=[
> Messages(id=(i * 20) + j, date=datetime.date(2017, 3, j))
> for j in range(1, 20)
> ]) for i in range(1, 51)
> ])
>
> s.commit()
>
> top_ten_dates = set(datetime.date(2017, 3, j) for j in range(10, 20))
>
>
> def run_test(q):
> all_u = q.all()
> assert len(all_u) == 50
> for u in all_u:
>
> messages = u.messages
> assert len(messages) == 10
>
> for m in messages:
> assert m.user_id == u.id <http://u.id>
> it, send an email to sqlalchemy+...@googlegroups.com
> <mailto:sqlalchemy%2Bunsu...@googlegroups.com>
> <mailto:sqlalchemy+...@googlegroups.com
> <mailto:sqlalchemy%2Bunsu...@googlegroups.com>>.
> To post to this group, send email to sqlal...@googlegroups.com
> <mailto:sqlal...@googlegroups.com>
> <mailto:sqlal...@googlegroups.com
> <mailto:sqlal...@googlegroups.com>>.
> <https://groups.google.com/group/sqlalchemy>.
> For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
>
> --
> SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve
> <http://stackoverflow.com/help/mcve> for a full description.
> --- You received this message because you are subscribed to a topic
> in the Google Groups "sqlalchemy" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/sqlalchemy/lOedjqu61G4/unsubscribe
> <https://groups.google.com/d/topic/sqlalchemy/lOedjqu61G4/unsubscribe>.
> To unsubscribe from this group and all its topics, send an email to
> sqlalchemy+...@googlegroups.com
> <mailto:sqlalchemy%2Bunsu...@googlegroups.com>.
> To post to this group, send email to sqlal...@googlegroups.com
> <mailto:sqlal...@googlegroups.com>.
> <https://groups.google.com/group/sqlalchemy>.
> For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
>
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy+...@googlegroups.com
> <mailto:sqlalchemy+...@googlegroups.com>.
> To post to this group, send email to sqlal...@googlegroups.com
> <mailto:sqlal...@googlegroups.com>.

Isaac Martin

unread,
May 2, 2017, 5:40:14 PM5/2/17
to sqlalchemy
Yep, I ended up doing exactly what you describe. Thanks again!
Reply all
Reply to author
Forward
0 new messages