Best practices for sub-dividing a large collection relationship


Dane K Barney

Dec 30, 2021, 8:10:50 PM
to sqlalchemy
Given the following example model:

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True, autoincrement=True)

class Edge(Base):
    __tablename__ = "edge"
    id = Column(Integer, primary_key=True, autoincrement=True)
    color = Column(Enum(Color))
    src_node_id = Column(Integer, ForeignKey("node.id"))
    dst_node_id = Column(Integer, ForeignKey("node.id"))
    src_node = relationship(
        "Node",
        foreign_keys=[src_node_id],
        backref=backref("downstream_edges")
    )
    dst_node = relationship(
        "Node",
        foreign_keys=[dst_node_id],
        backref=backref("upstream_edges")
    )

The collections Node.downstream_edges and Node.upstream_edges are so large that it is detrimental to load them entirely on a regular basis (although I still need the ability to do so). So for performance reasons, I am trying to come up with a way of sub-dividing these relationships based on the "color" attribute. One way I have tried to do this is by defining additional relationships on the Node class which filter the query using a custom primaryjoin. For example:

blue_downstream_edges = relationship(
    "Edge",
    primaryjoin="and_(Node.id == Edge.src_node_id, Edge.color == 'blue')",
    viewonly=True,
)

The downside to this approach is that the viewonly=True attribute means I cannot directly make modifications to this collection. When I want to add new Edges, I have to do it on the complete downstream_edges relationship, which will trigger the full load that I'm trying to avoid.

Another idea I had was to use the lazy="dynamic" attribute on the downstream_edges and upstream_edges relationships, so that I could do something like:

blue_downstream_edges = node.downstream_edges.filter(Edge.color == 'blue')

As I understand it, I would then be able to append items to this subset collection. However, those changes would not be seen in the complete downstream_edges collection.
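To make that concrete, here is roughly what I mean (a self-contained sketch of the model above, with a plain String column standing in for Enum(Color)):

```python
from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, backref, Session

Base = declarative_base()

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True, autoincrement=True)

class Edge(Base):
    __tablename__ = "edge"
    id = Column(Integer, primary_key=True, autoincrement=True)
    color = Column(String)  # String in place of Enum(Color) to keep the sketch self-contained
    src_node_id = Column(Integer, ForeignKey("node.id"))
    dst_node_id = Column(Integer, ForeignKey("node.id"))
    src_node = relationship(
        "Node",
        foreign_keys=[src_node_id],
        backref=backref("downstream_edges", lazy="dynamic"),
    )
    dst_node = relationship("Node", foreign_keys=[dst_node_id])

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
n1, n2 = Node(), Node()
session.add_all([n1, n2])
session.add_all([
    Edge(src_node=n1, dst_node=n2, color="blue"),
    Edge(src_node=n1, dst_node=n2, color="red"),
])
session.commit()

# the dynamic relationship returns a Query, so the filter runs
# in the database rather than loading every edge into memory
blue = n1.downstream_edges.filter(Edge.color == "blue").all()
```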

In short, what I'm trying to achieve is a way to provide subsets of a large collection, which are modifiable, and those modifications are visible in both the larger collection and the subset collection.

I'm guessing this would need to involve some sort of shared access to the same in-memory ORM mapper class, because a modification made to one representation of the collection that has not yet been flushed/persisted to the database would not be visible to another representation of that collection, assuming it requires a database query to load.

Is this possible at all?

Mike Bayer

Dec 30, 2021, 8:33:38 PM
to noreply-spamdigest via sqlalchemy
I would try to add new Edge objects without appending them to any list; just create a new Edge(src_node=n1, dst_node=n2).



Another idea I had was to use the lazy="dynamic" attribute on the downstream_edges and upstream_edges relationships, so that I could do something like:

blue_downstream_edges = node.downstream_edges.filter(Edge.color == 'blue')

yes, you should put "dynamic" on these relationships.


As I understand it, I would then be able to append items to this subset collection. However, those changes would not be seen in the complete downstream_edges collection.

They would?  They need to be flushed first, but that's how it works: if you want to see the "complete collection" then you are loading that complete collection.

OTOH, if you want to have the "complete collection" all in memory at once, OK, then ...you wouldn't use dynamic?   Is it kind of like two different modes of operation you want to use, one where collections are fully in memory and another where they aren't?



In short, what I'm trying to achieve is a way to provide subsets of a large collection, which are modifiable, and those modifications are visible in both the larger collection and the subset collection.

I'm guessing this would need to involve some sort of shared access to the same in-memory ORM mapper class, because a modification made to one representation of the collection that has not yet been flushed/persisted to the database would not be visible to another representation of that collection, assuming it requires a database query to load.

Is this possible at all?



Dane K Barney

Dec 30, 2021, 9:04:22 PM
to sqlalchemy
Hey Mike, thanks for your speedy reply.

I guess I left out some key details and simplified my examples a little too much, which is probably why you're confused.

For one, I have actually implemented downstream_edges and upstream_edges as dictionary collections, i.e.:

src_node = relationship(
    "Node",
    foreign_keys=[src_node_id],
    backref=backref(
        "downstream_edges",
        collection_class=attribute_mapped_collection("downstream_key"),
    )
)

@property
def downstream_key(self):
    return self.dst_node, self.color

The purpose of this is to provide an easy way of doing "upserts" -- that is, checking if an edge already exists before inserting, since the edge table has a UNIQUE KEY (`src_node_id`,`dst_node_id`,`color`) that would be violated otherwise.
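For reference, that unique key is declared at the table level; schematically (with String standing in for Enum(Color)):

```python
from sqlalchemy import Column, Integer, String, ForeignKey, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True)

class Edge(Base):
    __tablename__ = "edge"
    # the UNIQUE KEY (`src_node_id`, `dst_node_id`, `color`) mentioned above
    __table_args__ = (
        UniqueConstraint("src_node_id", "dst_node_id", "color"),
    )
    id = Column(Integer, primary_key=True)
    color = Column(String)  # String in place of Enum(Color) for brevity
    src_node_id = Column(Integer, ForeignKey("node.id"))
    dst_node_id = Column(Integer, ForeignKey("node.id"))
```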

This allows me to do:

downstream_key = (dst_node, color)
try:
    edge = node.downstream_edges[downstream_key]
except KeyError:
    edge = Edge(dst_node=dst_node, color=color)
    node.downstream_edges[downstream_key] = edge

I realize that this means self.downstream_edges is being fully loaded every time I do this "upsert". This is also why I can't simply follow your suggestion of adding Edges with Edge(src_node=src_node, dst_node=dst_node, color=color): without the dictionary lookup, there's no check to prevent violating the UNIQUE KEY. So the question is: what is the most performant way of achieving this "upsert" behavior without triggering a full load of downstream_edges?
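Concretely, the alternative I'm considering is an existence check scoped to the one key rather than a full collection load, something like this (a self-contained sketch of the same model, String in place of the Color enum; upsert_edge is a hypothetical helper name):

```python
from sqlalchemy import Column, Integer, String, ForeignKey, UniqueConstraint, create_engine
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True)

class Edge(Base):
    __tablename__ = "edge"
    __table_args__ = (UniqueConstraint("src_node_id", "dst_node_id", "color"),)
    id = Column(Integer, primary_key=True)
    color = Column(String)
    src_node_id = Column(Integer, ForeignKey("node.id"))
    dst_node_id = Column(Integer, ForeignKey("node.id"))
    src_node = relationship("Node", foreign_keys=[src_node_id])
    dst_node = relationship("Node", foreign_keys=[dst_node_id])

def upsert_edge(session, src_node, dst_node, color):
    # one targeted SELECT on the unique key instead of loading the whole collection
    edge = (
        session.query(Edge)
        .filter_by(src_node_id=src_node.id, dst_node_id=dst_node.id, color=color)
        .one_or_none()
    )
    if edge is None:
        edge = Edge(src_node=src_node, dst_node=dst_node, color=color)
        session.add(edge)
    return edge

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
n1, n2 = Node(), Node()
session.add_all([n1, n2])
session.commit()

e1 = upsert_edge(session, n1, n2, "blue")
session.commit()
e2 = upsert_edge(session, n1, n2, "blue")  # finds the existing row
```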

There is a second problem I'm running into when trying to make use of dynamic relationships, or any relationship definition that requires a flush + query to load. When I add a new edge to a collection, I want event listeners to respond to those modifications, for example via @event.listens_for(Session, "before_flush"). But inside that listener I don't seem to be allowed to reference any relationship that requires a database query to fetch. I'm assuming this is because the "before_flush" listener cannot itself load a relationship that would require a flush to load, because we are already mid-flush.

Given that, if I've added a new edge to downstream_edges, which is lazy="dynamic" let's say, how can I see all the items in that collection, or in a subset of that collection (which could be a lazy="dynamic" relationship or a primaryjoin + viewonly=True relationship), without requiring an additional flush at a point in time when one is not allowed?
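Schematically, the kind of listener I'm talking about looks like this (self-contained sketch; inside the listener, only objects already in memory via session.new are safe to look at, which is exactly the limitation I'm describing):

```python
from sqlalchemy import event, Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, backref, Session

Base = declarative_base()

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True)

class Edge(Base):
    __tablename__ = "edge"
    id = Column(Integer, primary_key=True)
    color = Column(String)
    src_node_id = Column(Integer, ForeignKey("node.id"))
    dst_node_id = Column(Integer, ForeignKey("node.id"))
    src_node = relationship(
        "Node", foreign_keys=[src_node_id],
        backref=backref("downstream_edges", lazy="dynamic"),
    )
    dst_node = relationship("Node", foreign_keys=[dst_node_id])

seen = []

@event.listens_for(Session, "before_flush")
def before_flush(session, flush_context, instances):
    # safe: inspect pending objects already in memory; touching
    # obj.src_node.downstream_edges here would need a query mid-flush
    for obj in session.new:
        if isinstance(obj, Edge):
            seen.append(obj)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
s = Session(engine)
n1, n2 = Node(), Node()
s.add_all([n1, n2])
s.add(Edge(src_node=n1, dst_node=n2, color="blue"))
s.commit()  # flush fires the listener once
```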

Note that I have also tried to accomplish this using "after_flush" and "after_flush_postexec" event listeners, but the problem with these is that I don't seem to be allowed to make further ORM modifications inside them which would require more UPDATE statements, as I get an SAWarning telling me that what I'm doing is not recommended.
Thanks,
-Dane

Mike Bayer

Dec 31, 2021, 1:21:16 PM
to noreply-spamdigest via sqlalchemy
if you need an in-memory dictionary of all the edges at once, then where is the part where you don't want the whole thing loaded at once?  Or is the idea that this dictionary is checking for uniqueness within the scope of what's being added in the current session (I'd use a separate dictionary for that, if that's the case)?

your message has broken formatting (no line wrapping) so it's hard to read.   I don't have a clear picture of what's needed.  You can always maintain separate relationships where one is "dynamic" and the other is not, or you could maintain a separate dictionary of Edge objects entirely.   Such as, when you make a new Edge object, just put it in a global weak-referencing dictionary with a key of (srcnode, destnode), and that's how you can check for duplicate Edge objects.
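a rough sketch of that idea, using plain Python objects to keep it self-contained (get_or_create_edge is just a name for the illustration; note the registry's key tuples do hold strong references to the nodes):

```python
import weakref

class Edge:
    def __init__(self, src_node, dst_node, color):
        self.src_node = src_node
        self.dst_node = dst_node
        self.color = color

# global registry of live Edge objects; an entry disappears
# automatically once its Edge is garbage collected
edge_registry = weakref.WeakValueDictionary()

def get_or_create_edge(src_node, dst_node, color):
    key = (src_node, dst_node, color)
    edge = edge_registry.get(key)
    if edge is None:
        edge = Edge(src_node, dst_node, color)
        edge_registry[key] = edge
    return edge
```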