Why does Session.merge only look at primary key and not all unique keys?


Russell Warren

Dec 15, 2010, 5:14:15 PM12/15/10
to sqlalchemy
Why does Session.merge only look at primary key and not all unique
keys?

Leaving aside some irritating DBMS restrictions on PKs and some
automatic indexing that tends to happen, the PK is not fundamentally
different from other unique keys, and I don't see why SQLA
distinguishes them from an integrity/relationship perspective. In
databases that already have frustrating, funky PK restrictions it is
tough to make merge() work the way it seems it should. For example,
in the code below this post, Sqlite requires the autoincrementing
field to be the PK, and you can't make it part of a composite key
with another field... with these restrictions I can't get merge() to
work the way "it should".

I was looking for a clean way in SQLAlchemy to do an "insert if not
exists" pattern, and merge() looked perfect, but I can't make it work
at the moment.

I'm also aware that in the sample code the 'name' field should really
just be the "primary" key and the problem goes away, but in the real
(larger) schema the linking id is needed in addition to the other
unique keys.

In addition to the docs, these existing threads are also very
relevant:

http://groups.google.com/group/sqlalchemy/browse_frm/thread/7483736b46d56943
http://groups.google.com/group/sqlalchemy/browse_thread/thread/79736ff7ef81d1b9/0b80b54dc45ecc28

To make the "insert if not exists" pattern work I'll likely/
reluctantly be doing the __new__ hack referred to in the latter thread
to achieve what I'm after in the end, but I really don't get why the
PK is getting special treatment.

Thanks,
Russ

Sample code:
####

from sqlalchemy import Column, Integer, String, MetaData, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session
import logging

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
DeclarativeBase = declarative_base()

#Configure some clean and indented SQL logging...
class SqlFormatter(logging.Formatter):
    def format(self, record):
        prefix = " SQL >> "
        record.msg = prefix + record.msg.replace("\n", "\n" + prefix)
        return logging.Formatter.format(self, record)
sqlaLogger = logging.getLogger('sqlalchemy.engine')
sqlaLogger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(SqlFormatter("%(message)s"))
sqlaLogger.addHandler(handler)

class MyStuff(DeclarativeBase):
    __tablename__ = 'mystuff'

    #Config below
    id = Column(Integer, primary_key = True, autoincrement = True)
    name = Column(String(100), nullable = False, unique = True)

    #Config below no good due to composite PK...
    #id = Column(Integer, primary_key = True, autoincrement = True)
    #name = Column(String(100), nullable = False, primary_key = True)

    #Config below doesn't give autoincrement...
    #id = Column(Integer, primary_key = False, autoincrement = True)
    #name = Column(String(100), nullable = False, primary_key = True)

    def __init__(self, Name):
        self.name = Name

DeclarativeBase.metadata.create_all(engine)
Session = sessionmaker(bind = engine)

print "Attempting 'foo' merge into empty DB..."
s1 = Session()
foo = s1.merge(MyStuff("foo"))
s1.commit()
s1.close()

print "Attempting 'foo' merge after it exists already..."
s2 = Session()
foo = s2.merge(MyStuff("foo"))
s2.commit()
s2.close()

Russell Warren

Dec 15, 2010, 5:25:13 PM12/15/10
to sqlalchemy
Whoops - my clipboard had an older, not-fully-cleaned version of the
sample code in it that I pasted. Sorry about the excess lines...
maybe the logging mod will be useful for some people, though?

The code is still correct, just not fully cleaned up. The first
Column config is the one that works around the Sqlite PK/
autoincrement restriction, but it doesn't work with merge(): merge()
doesn't pick up on the fact that 'foo' already exists, so it tries to
insert it again and throws an IntegrityError.

Michael Bayer

Dec 15, 2010, 5:38:09 PM12/15/10
to sqlal...@googlegroups.com

On Dec 15, 2010, at 5:14 PM, Russell Warren wrote:

> Why does Session.merge only look at primary key and not all unique
> keys?

Well the theory of operation regarding merge() is based on that of the identity map, which is linked to object/row identity. Consider that it also cascades along relationship paths. It would be a difficult operation to define if it had to choose among multiple ways to determine the "identity" of each object along the cascade chain.


>
> Leaving aside some irritating DBMS restrictions on PKs and some
> automatic indexing that tends to happen, the PK is not fundamentally
> different than other unique keys

It is fundamentally different in that a database row within a reasonable schema has only one "identity". The usage of surrogate primary keys perhaps pollutes this concept to some degree.
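
To put it concretely, the "identity" here is the identity map key, which is built only from the mapped class and the primary key value(s); a quick illustration against the MyStuff model from the first post (just a sketch, and the exact key format varies a bit between versions):

from sqlalchemy.orm.util import identity_key

#The identity map key is (class, primary key tuple); the unique 'name'
#column plays no part in it.
print identity_key(MyStuff, 1)   # roughly: (<class MyStuff>, (1,))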


> and I don't see why SQLA
> distinguishes them from an integrity/relationship perspective.

SQLA at the ORM level doesn't really know about any other attributes being "unique", and it would incur excessive complexity to implement that as built-in, where "complexity" here means the bookkeeping associated with storing, retrieving, and modifying items in the identity map would become a much more time-consuming affair (for some recent insight into my epic battle with time consumption, see http://techspot.zzzeek.org/2010/12/12/a-tale-of-three-profiles/ ). It would also introduce all the unintended side effects of doing so, such as what happens when two objects swap attribute values, etc. The database does a perfectly good job of maintaining UNIQUE constraints, so we leave that whole affair out of the Python side.


> In
> databases where it is already frustrating that they have funky PK
> restrictions it is tough to make merge() work the way it seems it
> should. For example, in the code below this post, Sqlite requires the
> autoincrementing field to be the PK, and you can't composite it with
> another field... with these restrictions I can't get merge() to work
> the way "it should".
> I was looking for a clean way in SQLAlchemy to do an "insert if not
> exists" pattern, and merge() looked perfect, but I can't make it work
> at the moment.

The generic "insert if not exists" pattern that is extensible to whatever attributes you want is at: http://www.sqlalchemy.org/trac/wiki/UsageRecipes/UniqueObject .
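
Stripped of the recipe's caching, the shape of that pattern is just "query by the natural key first, insert only when missing"; here is a minimal sketch (not the wiki recipe itself), reusing the MyStuff model from the first post with a hypothetical get_or_create_stuff() helper:

def get_or_create_stuff(session, name):
    #Look the row up by its natural key; only create it when missing.
    #Without a SAVEPOINT or retry, a concurrent insert between the query
    #and the flush can still raise IntegrityError.
    obj = session.query(MyStuff).filter_by(name=name).first()
    if obj is None:
        obj = MyStuff(name)
        session.add(obj)
    return obj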

>
> I'm also aware that in the sample code the 'name' field should really
> just be the "primary" key and the problem goes away,

mmm, the consensus I've noted for the past several years, as well as my own experience, is that we're better off with surrogate primary keys. SQLA does support natural primary keys fully, and note that foreign keys which reference natural primary keys are entirely valid. Mutation of these keys is supported naturally through ON UPDATE CASCADE and ON DELETE CASCADE. But I find myself almost never using them (well, actually I did a yearlong project a year ago that was all natural PKs, and it was really not worth it).


Russell Warren

Dec 15, 2010, 11:50:34 PM12/15/10
to sqlalchemy
>> Why does Session.merge only look at primary key and not all unique
>> keys?
>
> Well the theory of operation regarding merge() is based on that of
> the identity map, which is linked to object/row identity.
> Consider that it also cascades along relationship paths. It would
> be a difficult operation to define if it had to choose among
> multiple ways to determine the "identity" of each object along the
> cascade chain.

Ok. That certainly makes sense for following relationships in the
merge (and in general). But for the basic existence checking that is
required in the first step(s) of the merge, to figure out whether the
object being merged already exists or not, it does not seem
unreasonable for it to check all unique keys. i.e. in your docs you
say the first merge step is "It examines the primary key of the
instance". Couldn't it be "It checks any provided unique elements of
the instance"? From that point, normal/sensible identity map rules
could resume for the cascade. "Is a" checking is not the same as "is
related to" checking, which is clearly nuttier.

Although... you do mention later that the ORM level has no knowledge
of "unique" attributes, so perhaps this is also impossible? Is the
"unique = True" kwarg on the Column not kept anywhere? Is it just
used for table creation and then turfed?
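
For what it's worth, poking at the table metadata suggests the flag
is at least kept on the Column object itself; a quick sketch against
the MyStuff model from my first post (just my own inspection, not
anything merge() does):

#unique=True is stored on the Column and used to emit the UNIQUE
#constraint at create_all() time, but the ORM doesn't consult it when
#deciding identity.
unique_cols = [c.name for c in MyStuff.__table__.columns if c.unique]
print unique_cols   # ['name']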

>> Leaving aside some irritating DBMS restrictions on PKs and some
>> automatic indexing that tends to happen, the PK is not fundamentally
>> different than other unique keys
>
> It is fundamentally different in that a database row within a
> reasonable schema has only one "identity". The usage of surrogate
> primary keys perhaps pollutes this concept to some degree.

Ok again... but you also agree that the use of surrogate keys is
standard (and some say "necessary") practice. In the case of Sqlite
(as in my example), adding this surrogate key automatically makes the
schema "unreasonable" because you now need to have the primary key be
the meaningless surrogate key (single id column), and the natural key
ends up being relegated to just another unique index. Unfortunately,
the latter makes the object/table unusable with the otherwise-useful
session.merge() function. I don't recall the details, but I think
there may be a similar PostgreSQL limitation regarding autoincrements
as well.

> for some recent insight into my epic battle with time consumption,
> see http://techspot.zzzeek.org/2010/12/12/a-tale-of-three-profiles

Great link/post! Seeing successful optimizations is always good, and
the writeup is quite informative.

Regarding RunSnakeRun, I've used it as well and like it. However, I'm
a bit of a resolution junkie, and the box frames in RunSnakeRun's
visual output are limited, in that they end up squishing all the box
borders together. If you haven't given kcachegrind a shot yet for
viewing cProfile results, you might want to give it a trial run: the
visual output is a better representation of timing scale, and the
viewing is more powerful as well (if resolution is too weak of an
argument). Some tips to get it to work well for Python are here:
http://stackoverflow.com/questions/1896032/using-cprofile-results-with-kcachegrind

> The generic "insert if not exists" pattern that is extensible to
> whatever attributes you want is at:
> http://www.sqlalchemy.org/trac/wiki/UsageRecipes/UniqueObject

Thanks for the example... I would have been worried about how
fiddling with the __new__ constructor would interfere with query
loads, but that example shows how to make it work.

Michael Bayer

Dec 16, 2010, 10:25:25 AM12/16/10
to sqlal...@googlegroups.com
On Dec 15, 2010, at 11:50 PM, Russell Warren wrote:


> Ok again... but you also agree that the use of surrogate keys is
> standard (and some say "necessary") practice. In the case of Sqlite
> (as in my example), adding this surrogate key automatically makes the
> schema "unreasonable" because you now need to have the primary key be
> the meaningless surrogate key (single id column), and the natural key
> ends up being relegated to just another unique index. Unfortunately,
> the latter makes the object/table unusable with the otherwise-useful
> session.merge() function. I don't recall the details, but I think
> there may be a similar PostgreSQL limitation regarding autoincrements
> as well.

Well, that's only if you're trying to use merge() in some way where the objects you place in it ahead of time are "the same" based on unique keys but are not "the same" based on identity. That's not an invalid use case, but it's beyond the scope of what merge() and the identity map are designed to work with, hence the external recipes. With merge() you would typically pass in an object for the first time, get its .id back via flush or autoflush, then associate that .id with your unique key combination before continuing to merge() further objects. The UniqueObject recipe is doing mostly this.
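
Here's a rough sketch of that workflow, reusing the MyStuff model and Session factory from the first post (merge_by_name() and the id_by_name dict are just illustrative names for whatever cache your application keeps):

id_by_name = {}   #application-level cache: natural key -> surrogate PK

def merge_by_name(session, name):
    obj = MyStuff(name)
    if name in id_by_name:
        #The identity is already known, so merge() matches the existing row.
        obj.id = id_by_name[name]
    obj = session.merge(obj)
    session.flush()   #assigns obj.id on the first insert
    id_by_name[name] = obj.id
    return obj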

> Regarding RunSnakeRun, I've used it as well and like it. However, I'm
> a bit of a resolution junkie, and the box frames in RunSnakeRun's
> visual output are limited, in that they end up squishing all the box
> borders together. If you haven't given kcachegrind a shot yet for
> viewing cProfile results, you might want to give it a trial run: the
> visual output is a better representation of timing scale, and the
> viewing is more powerful as well (if resolution is too weak of an
> argument). Some tips to get it to work well for Python are here:
> http://stackoverflow.com/questions/1896032/using-cprofile-results-with-kcachegrind

If I understand correctly, I would imagine that the same issue exists with any display of that type... I double-click on a particular box to get a detailed view of everything inside it. kcachegrind seems to have the same kind of output in the lower left corner; if I can figure out how to run this on OSX I'll give it a shot, thanks.


