Possible regression?


lyla...@gmail.com

Apr 28, 2019, 8:56:40 AM
to sqlalchemy
Hi!

I recently came across some confusing behavior in relationships and cascading using SQLite, and I was hoping I might get some help understanding what the behavior is here. I put together a minimum failing script below. I'm trying to commit one instance of each of two classes, but what ends up happening is that I commit two copies of the "many" side of a one-to-many relationship. I suspect this has something to do with cascading, but I found a bug report for similar behavior that claims to have been fixed several years ago, and I'm wondering if there was some kind of regression? I'm running SQLAlchemy 1.3.1 on Ubuntu and I'm still using SQLite at this stage of development.

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Text, ForeignKey
from sqlalchemy.orm import relationship

import os

db_engine = create_engine('sqlite:///sample.db', convert_unicode=True)
db_session = scoped_session(sessionmaker(autocommit=False,
                                         autoflush=False,
                                         bind=db_engine))

Base = declarative_base()
Base.query = db_session.query_property()


class Head(Base):
    __tablename__ = 'head'
    id = Column(String, primary_key=True)
    tassel_threads = relationship("TasselThread", back_populates="head")

    def __init__(self, id):
        self.id = id


class TasselThread(Base):
    __tablename__ = 'tassel_thread'
    id = Column(Integer, primary_key=True)
    head_id = Column(Integer, ForeignKey('head.id'), nullable=False)
    head = relationship("Head", back_populates="tassel_threads")

    def __init__(self, head):
        self.head = head


def init_db():
    Base.metadata.create_all(bind=db_engine)


def do_db_work():
    my_head = Head(id="foobar")
    my_head = db_session.merge(my_head)
    db_session.commit()

    my_tassel_thread = TasselThread(head=my_head)
    db_session.merge(my_tassel_thread)
    db_session.commit()


if os.path.exists("sample.db"):
    os.remove("sample.db")
init_db()
do_db_work()
a = db_session.query(TasselThread).all()
print(len(a))
# output: 2, should be 1

Thanks for any help you might be able to provide!

-Lyla Fischer

peter.m...@gmail.com

Apr 28, 2019, 10:26:57 AM
to sqlalchemy
Hi Lyla,

I notice Head.id is defined as String:

>     id = Column(String, primary_key=True)

While TasselThread.head_id is defined as Integer:

>     head_id = Column(Integer, ForeignKey('head.id'), nullable=False)

Could it be that this causes the merge to not recognize the existing instance, so that a new one is created?

James Fennell

Apr 28, 2019, 10:51:42 AM
to sqlalchemy
It seems to be related to the cascades happening recursively. The merge cascade goes from the tassel thread to the head, and then again down from the head to the tassel thread - which is kind of strange, I would expect the cascades to only visit each node in the object graph at most once. The second cascade is when the second tassel thread is created.

There are a couple of ways that I can get this to work:
  • Instead of setting head when creating the tassel thread, set head_id. This avoids the first cascade being triggered.
  • Add cascade=None on the relationship in Head to avoid the second cascade being triggered. However, the merge operation on the tassel thread then emits a warning that the "add operation along 'Head.tassel_threads' will not proceed".
  • For your second merge, instead of merging in the tassel_thread, merge in the head again.
Btw, I'm using SQLAlchemy 1.2.16 and the thing you're seeing is still reproducible there.
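For reference, the first workaround above can be sketched as a standalone script. The names mirror the original post but this is a hypothetical reconstruction, written against the SQLAlchemy 1.4+ API, and with head_id typed as String to match Head.id (per Peter's observation); assigning the foreign key column directly means no backref cascade ever fires:

```python
# Sketch of workaround 1: set head_id, not head, so the new
# TasselThread never enters the session via a backref cascade.
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Head(Base):
    __tablename__ = "head"
    id = Column(String, primary_key=True)
    tassel_threads = relationship("TasselThread", back_populates="head")

class TasselThread(Base):
    __tablename__ = "tassel_thread"
    id = Column(Integer, primary_key=True)
    head_id = Column(String, ForeignKey("head.id"), nullable=False)
    head = relationship("Head", back_populates="tassel_threads")

engine = create_engine("sqlite://")  # in-memory; no stale file to delete
Base.metadata.create_all(engine)
session = Session(engine)

my_head = session.merge(Head(id="foobar"))
session.commit()

# Setting head_id rather than head keeps the new object out of the
# session until we merge it explicitly, so only one row is inserted.
session.merge(TasselThread(head_id=my_head.id))
session.commit()

print(session.query(TasselThread).count())  # prints 1
```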

Mike Bayer

Apr 28, 2019, 12:26:59 PM
to sqlal...@googlegroups.com
Hi there,

I appreciate everyone chiming in to look at this!

However, it's something simple and has to do with one of the API
quirks that we have to decide if we want to keep long term.
Background is at
https://docs.sqlalchemy.org/en/13/orm/cascades.html#backref-cascade.

Basically, when you do this:

my_head = Head(id="foobar")
my_head = db_session.merge(my_head)
db_session.commit()

my_tassel_thread = TasselThread(head=my_head)

"my_tassel_thread" is now in the Session due to the backref cascade,
which I think I really might consider defaulting to False at some
point. It also has no primary key value yet, so when you merge()
it, it gets put in a second time, again with no primary key.

Another thing that makes this really bad on my part, is that if you
flush the session before the merge, then my_tassel_thread gets a new
primary key, then the merge is of itself and it works. This is bad
because it suggests merge() should be calling flush() automatically,
but I'm not sure that's a good idea in the bigger scheme of things.

Short answer, set up the relationships like:

tassel_threads = relationship("TasselThread", back_populates="head",
                              cascade_backrefs=False)

# ...

head = relationship("Head", back_populates="tassel_threads",
                    cascade_backrefs=False)


and then my_tassel_thread stays out of the Session.

Also:

> db_engine = create_engine('sqlite:///sample.db', convert_unicode=True)

don't use convert_unicode, it's deprecated, has no purpose in the
modern Python ecosystem, and is going away. SQLite in particular
is a fully Python unicode backend that's impossible to get a plain
string out of.
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
> ---
> You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Mike Bayer

Apr 28, 2019, 12:46:06 PM
to sqlal...@googlegroups.com
the next 1.3 release will include a warning for this case, whether or
not the primary key on the transient object is set; it does a double
insert in any case otherwise. It's just that when the primary key is
already present, the double insert fails on the primary key
constraint.

https://github.com/sqlalchemy/sqlalchemy/issues/4647

lyla...@gmail.com

Apr 28, 2019, 9:17:14 PM
to sqlalchemy
Thanks for the quick responses, and for adding the warning for others!

I tried adding the line as suggested:

    head = relationship("Head", backref="tassel_threads", cascade_backrefs=False)

but I keep getting the same error. I also tried cascade=None, as James suggested, but that seemed to break the relationships' ability to get at each other's primary keys or establish links with each other. I ended up solving the issue by removing some of the merges. That means I'm depending on the cascading to get everything into the database and linked up appropriately, and I might need to spend some time studying exactly how cascading works in order to reliably use the relationship feature in the future.
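(Editorially, a minimal sketch of that merge-free approach, with assumed standalone names in SQLAlchemy 1.4+ style: build the object graph first, then add() the head once and let the save-update cascade on the relationship carry the tassel thread along.)

```python
# Sketch: rely on the save-update cascade instead of merge().
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Head(Base):
    __tablename__ = "head"
    id = Column(String, primary_key=True)
    tassel_threads = relationship("TasselThread", back_populates="head")

class TasselThread(Base):
    __tablename__ = "tassel_thread"
    id = Column(Integer, primary_key=True)
    head_id = Column(String, ForeignKey("head.id"), nullable=False)
    head = relationship("Head", back_populates="tassel_threads")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)

# Build the graph in memory, then add the root once; the save-update
# cascade on Head.tassel_threads brings the thread into the session.
my_head = Head(id="foobar")
my_thread = TasselThread(head=my_head)
session.add(my_head)
session.commit()

print(session.query(TasselThread).count())  # prints 1
```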

-Lyla

Mike Bayer

Apr 28, 2019, 9:57:38 PM
to sqlal...@googlegroups.com
On Sun, Apr 28, 2019 at 9:17 PM <lyla...@gmail.com> wrote:
>
> Thanks for the quick responses, and for adding the warning for others!
>
> I tried adding the lines as suggested (
>
> head = relationship("Head", backref="tassel_threads", cascade_backrefs=False)
>
> , but I keep getting the same error.

with the specific test case you gave, adding cascade_backrefs to
*both* relationships as above resolves the assertion at the end. Is
this assertion "the error" you're referring towards? I've tested in
1.3 as well as 1.2. Is there a new test case you can provide?

Lyla Fischer

Apr 28, 2019, 10:20:12 PM
to sqlal...@googlegroups.com
I just tried it again, and it works. Sorry for the confusion. Knowing myself, I misspelled something when I tried it the first time.

-Lyla


James Fennell

Apr 28, 2019, 10:54:07 PM
to sqlal...@googlegroups.com
Thanks for the explanation Mike! Seeing it now, I actually think there's a decent reason to want the current backrefs:

My understanding is that with session.merge in SQLAlchemy it's possible to draw a very clean line between entities that are persisted (or about to be persisted on the next flush) and entities which will never be persisted. This is owing to the design choice whereby SQLAlchemy doesn't persist the entity you pass into the merge; instead, that one is left alone and a new entity is created.

With this in mind, there are two ways to see Lyla’s example.

One way: as soon as the tassel_thread was related to the persisted my_head (persisted because of the line my_head=session.merge(my_head)) then tassel_thread should be seen as in the session already. In this view, the merge is necessary and possibly error-prone, as here.

Another way: instead of assigning my_head=session.merge(my_head), keep the unpersisted head around with say persisted_head = session.merge(my_head). Then relating the new tassel_thread to my_head won’t add it to the session. To get a record into the DB, then do a session.merge on it - everything works correctly this way.


In both cases, there is the idea of a persisted object graph and a distinct unpersisted object graph. Once you relate a new entity to something in the persisted object graph, it becomes persistent.

Lyla Fischer

Apr 29, 2019, 7:22:35 AM
to sqlal...@googlegroups.com
If I can comment on what seems instinctive to someone who just recently read the documentation and made assumptions based off of it: my understanding was that the relationship abstractions were supposed to be about making foreign keys easier to keep track of, and about being able to go both ways on a relationship in an object-oriented way. The fact that there were any implications at all for persistence was surprising to me, and it took me a significant amount of time to understand that SQLAlchemy was trying to do some magic behind the scenes when it came to persistence. I support the idea of making cascade_backrefs=False by default, because it keeps separate objects separate, it makes me less nervous about eventual performance (a constant concern when there is magic I didn't intend happening in a library), and it's basically just working the way I expected things to work initially. I'm fine with dealing with errors that might come about from the implications of updating objects independently. That seems like part of the responsibility of dealing with persistence.

</my two cents>

-Lyla

Mike Bayer

Apr 29, 2019, 11:14:20 AM
to sqlal...@googlegroups.com
On Mon, Apr 29, 2019 at 7:22 AM Lyla Fischer <lyla...@gmail.com> wrote:
>
> If I can comment on what seems instinctive to someone who just recently read the documentation, and made assumptions based off of it: My understanding was that the relationship abstractions were supposed to be about making foreign keys easier to keep track of, and being able to go both ways on a relationship in a object-oriented way. The fact that there were any implications at all for persistence was surprising to me, and it took me a significant amount of time to understand that SQL Alchemy was trying to do some magic behind the scenes when it came to persistence. I support the idea of making cascade_backrefs=False by default, because it keeps the separate objects separate, it makes me less nervous about eventual performance (which is a constant concern when there is magic I didn't intend happening in a library), and it's basically just working the way that I expected things to work initially. I'm fine with dealing with errors that might come about from the implications of updating of objects independently. It seems like part of the responsibilities of dealing with persistence, ever.

I'm trying to gather the rationale that you are referring towards in
your comment, however I'm not able to work with the reason "magic I
didn't intend", because that reason doesn't actually say anything.
Every issue is about telling a library what your intent is, and the
library fulfilling that intent or not. But referring to the intents
you didn't expect as "magic" while the intents you *did* expect are
implicitly "not magic" only illustrates how much the library is doing
completely correctly, such that you aren't noticing it.

The thing that relationship() does is coordinating Python objects that
are linked together in terms of a foreign key relationship in a
database. Then there is the backref concept, which expands on the
relationship concept, to more fully emulate the behavior of relational
database foreign keys with Python objects. When working completely
normally, SQLAlchemy produces Python object behavior as follows:

object_a = ObjectA()
object_b = ObjectB()
object_a.some_collection.append(object_b)
assert object_b.parent is object_a

Someone working only with Python objects would probably call most of
the above interaction "magic": that ObjectA has a collection
automatically instantiated on it (this, incidentally, is also going to
be scaled back in SQLAlchemy 2.0), and that when I append an object to
this collection, the object now has a ".parent" that refers back to
that original object. The reason it has to do that in SQLAlchemy is
because we are emulating a relational database foreign key
relationship, where containment on one side implies association in the
other direction.
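(A runnable sketch of that two-way behavior, with hypothetical classes, using the backref form; no database connection is needed to see the attribute coordination.)

```python
# Demonstrates bidirectional attribute coordination via backref:
# appending to one side populates the other side in memory.
from sqlalchemy import Column, Integer, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class ObjectA(Base):
    __tablename__ = "object_a"
    id = Column(Integer, primary_key=True)
    # backref configures both ObjectA.some_collection and ObjectB.parent
    some_collection = relationship("ObjectB", backref="parent")

class ObjectB(Base):
    __tablename__ = "object_b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey("object_a.id"))

object_a = ObjectA()
object_b = ObjectB()
object_a.some_collection.append(object_b)

# Appending on one side set the other side automatically,
# with no SQL emitted yet.
assert object_b.parent is object_a
```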

Where SQLAlchemy works hard to acknowledge that the above behaviors
are not normal for Python (that is, are reasonably perceived as
"magic"), is that the user explicitly configures attributes named
"some_collection" and "parent" on their classes, and they even have
to point them at each other, and that the foreign key and
primary key columns that are involved in the persistence for the above
are also explicitly configured. This is to reduce the amount of
assumptions and implicit decisions SQLAlchemy has to make thereby
reducing the possibility of surprise.

Now to the issue of cascade_backrefs. This flag is in a tough spot,
because either way it leads to a situation that can be non-intuitive
(that is, surprising). Let's assume above we've turned it off, and
we do this:

object_a = ObjectA()
object_b = ObjectB()
session.add(object_b)
object_a.some_collection.append(object_b)
assert object_b.parent is object_a

What happens above when we persist object_b by calling
session.commit()? In fact the operation will fail, either silently
if the foreign key column is nullable, or explicitly if the column is
not nullable, because object_a will *not* be persisted and the foreign
key constraint on object_b cannot be satisfied.

The reason we get into this issue in the first place is that
SQLAlchemy also has an explicit concept of a Session, e.g. a database
transaction, where objects must be explicitly associated with that
transaction, and that this Session uses a unit of work pattern, which
is exactly the thing here that reduces the exposure to "persistence
implications" that you refer towards. If we were working like other
ORMs, we'd just say "object_b.save()", where it likely would raise a
constraint error because we didn't call "object_a.save()" first, which
in SQLAlchemy's view is a significant persistence detail that is
needlessly exposed. SQLAlchemy's very first version in fact
didn't have a Session, and all of object_a/ object_b would be
automatically associated with an implicit database connection, very
much the way active record .save() style ORMs work. That was also
too magical.

So the "cascade_backrefs" flag is kind of a "seam" in an otherwise
smooth surface. I would not characterize it as any more "magical"
than anything else, and instead the discussion should be about which
use case is less surprising.

Lyla Fischer

Apr 29, 2019, 12:04:11 PM
to sqlal...@googlegroups.com
Sorry about the language. I'll try to stick to "intuitive" and "surprising" in the future.

I guess that I am familiar enough with the concept that something needs to persist in the database before it can be referenced in the database that I was surprised when an object was already in the database even when I didn't tell it to persist.

This is especially true in the case of a one-to-many relationship, where the head of a tassel can exist without any tassel threads. It just happened to be the case that I added a tassel thread and... now it is in the session?? Quoting from a previous email, for easy reference:
"""
    my_head = Head(id="foobar")
    my_head = db_session.merge(my_head)
    db_session.commit()

    my_tassel_thread = TasselThread(head=my_head)

"my_tassel_thread" is now in the Session due to the backref cascade,
which I think I really might consider defaulting to False at some
point.
"""

What would happen if I were editing an existing TasselThread, instead of adding a new one? Would those changes be persisted as well, even though I didn't tell them to explicitly?

I get that you are doing stuff in the backend that involves database persistence in order to make the collection references work, but if it were implemented using only foreign keys, then the head would have no way of accessing its tassel threads short of doing a table scan for its own primary key in the foreign key spot. I get that's massively inefficient, and therefore not the way you are doing things, but I was thinking of the SQLAlchemy relationship as something that could well have been implemented without creating new things to persist in the database beyond what is in the bare minimum of common explanations of SQL-level one-to-many relationships, the kind often used in intro classes. Then the relationship just means that I don't have to do a table scan: the object just knows which TasselThreads are its own. That doesn't fit the common university model of how foreign-key-only one-to-many relationships work, but I didn't think about it too hard. I was just thinking "yes, of course you need to persist the head before persisting a tassel thread, because the tassel thread needs the id for its foreign key", while the reverse didn't occur to me.

-Lyla

Mike Bayer

Apr 29, 2019, 1:04:18 PM
to sqlal...@googlegroups.com
On Mon, Apr 29, 2019 at 12:04 PM Lyla Fischer <lyla...@gmail.com> wrote:
>
> Sorry about the language. I'll try to stick to "intuitive" and "surprising" in the future.
>
> I guess that I am familiar enough with the concept that something needs to persist in the database before it can be referenced in the database that I was surprised when an object was already in the database even when I didn't tell it to persist.
>
> This is especially true in the case of a one-to-many relationship, where the head of a tassel can exist without any tassel threads. It just happened to be the case that I added a tassel thread and... now it is in the session?? Quoting from a previous email, for easy reference:
> """
> my_head = Head(id="foobar")
> my_head = db_session.merge(my_head)
> db_session.commit()
>
> my_tassel_thread = TasselThread(head=my_head)
>
> "my_tassel_thread" is now in the Session due to the backref cascade,
> which I think I really might consider defaulting to False at some
> point.
> """
>
> What would happen if I were editing an existing TasselThread, instead of adding a new one? Would those changes be persisted as well, even though I didn't tell them to explicitly?

so the idea of "an existing TasselThread" in terms of the ORM means
you would have accessed it from the database, which means it would
have a primary key value and already be associated with that Session.
So you would expect that changes on this object are persisted. If
you expunged the object so that it became detached, then yes again the
cascade_backrefs thing gets in the way (or not, depending on how you
look at it) if you re-associate it with a persistent object.

>
> I get that you are doing stuff in the backend that involves database persistence in order to make the collection references work, but if it were implemented using only foreign keys, then the head would have no way of accessing it's tassel threads short of doing a table scan for it's own primary key in the foreign key spot. I get that's massively inefficient, and therefore not the way that you are doing things,

if you don't use the backref feature at all, then it actually does
work that way. You can wait for flushes to happen, then expire
attributes on things once they are persistent. This way of working
will expose the persistence mechanics a lot more explicitly and
inconveniently, though.

> but I was thinking of SQL Alchemy relationship as being something that very well could have been implemented without creating new things to persist in the database beyond what is in the bare minimum of common explanations of SQL-level one-to-many relationships, often used in intro classes.

what are the "new things to persist" you are referring towards?
Are you referring to the fact that in order to set
TasselThread.some_foreign_key to a value, you need a Head object?
That is also optional. You can set foreign key values directly as
well, and there is no issue with that, except that it does not
automatically refresh the related relationships; you'd need to expire
them directly or wait until everything is expired on commit. The FAQ
entry https://docs.sqlalchemy.org/en/13/faq/sessions.html#i-set-the-foo-id-attribute-on-my-instance-to-7-but-the-foo-attribute-is-still-none-shouldn-t-it-have-loaded-foo-with-id-7
discusses this in detail.
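(A small sketch of the FAQ's point, using this thread's model names as an assumed example against the SQLAlchemy 1.4+ API: setting the foreign key column does not populate the relationship attribute until the object is flushed and expired.)

```python
# Demonstrates the FAQ behavior: head_id is set, but .head stays None
# on the pending object; after commit (flush + expire), a lazy load
# fires on access and the relationship is populated.
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Head(Base):
    __tablename__ = "head"
    id = Column(String, primary_key=True)
    tassel_threads = relationship("TasselThread", back_populates="head")

class TasselThread(Base):
    __tablename__ = "tassel_thread"
    id = Column(Integer, primary_key=True)
    head_id = Column(String, ForeignKey("head.id"), nullable=False)
    head = relationship("Head", back_populates="tassel_threads")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)

session.add(Head(id="foobar"))
session.commit()

# Set the foreign key column directly, not the relationship.
t = TasselThread(head_id="foobar")
session.add(t)
head_before = t.head   # still None: no lazy load for a pending object
session.commit()       # flush, then expire everything
head_after = t.head    # now a lazy load fires using head_id

print(head_before, head_after.id)  # prints: None foobar
```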

There is also a related recipe
https://github.com/sqlalchemy/sqlalchemy/wiki/ExpireRelationshipOnFKChange
(the FAQ has an apparently broken link to it) which expands upon the
"set foreign key attributes" way of working so that the related
objects *are* automatically refreshed immediately. There is at
least one major production-level application that uses this recipe,
which is why it's so complicated because I helped them find every
possible edge case, however I don't recommend using this recipe
because it is complicated and works in opposition to the design of the
ORM. However, a lot of folks find it appealing because they are in
fact coming from the "SQL-level" view of things to which you refer and
prefer to stay at that level, yet they still want their relationships
to load automatically.

Lyla Fischer

Apr 29, 2019, 1:37:11 PM
to sqlal...@googlegroups.com
Ah. I think that I have been thinking of a session as a cache instead of a transaction, which is incorrect. (It's similar because you have to worry about persistence in the face of failures, but not-similar, because you're not worried about communication with other processes. That is, a commit is not "expose this data to other cores/machines". A commit is a checkpoint to rollback to.) I was thinking that adding something to the session was taking something from a local and putting it into a cache, and then committing it was taking it from a cache and putting it into the database. Totally wrong abstraction, sorry.

-Lyla

Mike Bayer

Apr 29, 2019, 1:55:23 PM
to sqlal...@googlegroups.com
On Mon, Apr 29, 2019 at 1:37 PM Lyla Fischer <lyla...@gmail.com> wrote:
>
> Ah. I think that I have been thinking of a session as a cache instead of a transaction, which is incorrect. (It's similar because you have to worry about persistence in the face of failures, but not-similar, because you're not worried about communication with other processes. That is, a commit is not "expose this data to other cores/machines". A commit is a checkpoint to rollback to.) I was thinking that adding something to the session was taking something from a local and putting it into a cache, and then committing it was taking it from a cache and putting it into the database. Totally wrong abstraction, sorry.

OK, this is more thinking than I usually do about this on a daily
basis (I don't think much about it anymore), but you might want to
watch my video at
https://www.sqlalchemy.org/library.html#thesqlalchemysessionindepth
where I went through an amount of effort I would never have time for
these days in order to show what the general idea of it is. Also
note the Session concept is mostly from Hibernate, including the API
for merge() as well as cascades, which I largely copied without
deeply considering it beyond the heavy thinking they did up front;
that PyCon talk came after several years of learning how the Session
was actually supposed to work, which itself required that I redesign
half the thing in version 0.5.

Lyla Fischer

Apr 29, 2019, 2:19:26 PM
to sqlal...@googlegroups.com
Awesome! Thanks!

-Lyla