How to keep ORM Sessions in sync when using the SQL expression language as well?


Russ

Oct 18, 2011, 3:40:02 PM
to sqlal...@googlegroups.com
I often mix up the SQL expression language with the use of an ORM session, and it is great that SQLAlchemy more than supports this.

But... what are the recommended ways to keep the session in sync with what you do with the SQL expression stuff?

For example, with the ORM you can't really do a batch/bulk insert (that I know of), but you can mix in some SQL expression language to achieve it.  Here is a complete example that shows this for the standard Users/Addresses example:

In there I do a semi-bulk insert of addresses using both the "ORM way" and the "SQL Expression Language way".  With the former, the inserted addresses are available with their ids without any query apparently being emitted to get them.  With the latter (SQLEL), the ORM needed to emit a query to get the addresses (even when I don't ask for id), presumably to update its identity map.
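A minimal, self-contained sketch of the kind of setup being described (the mapping, column names, and the "joe"/"bob" data here are assumptions pieced together from the snippets quoted later in the thread, not the original pastebin):

from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    addresses = relationship("Address", backref="user")

class Address(Base):
    __tablename__ = "addresses"
    id = Column(Integer, primary_key=True)
    email = Column(String)
    FK_user_id = Column(Integer, ForeignKey("users.id"))
    def __init__(self, email):
        self.email = email

engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)
sess = sessionmaker(bind=engine)()

#The "ORM way": flush() emits one INSERT per address
joe = User(name="joe")
joe.addresses = [Address("joe%02d@example.com" % i) for i in range(5)]
sess.add(joe)
sess.flush()

#The "SQL Expression Language way": a single executemany() INSERT
bob = User(name="bob")
sess.add(bob)
sess.flush()
sess.execute(Address.__table__.insert(),
             [dict(FK_user_id=bob.id, email="bob%02d@example.com" % i)
              for i in range(5)])

#joe's address ids are populated with no extra SELECT; accessing bob.addresses
#emits a lazy-load SELECT, since the Session knows nothing about the rows
#inserted via execute()
print([a.id for a in joe.addresses])
print([a.id for a in bob.addresses])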

Although the latter behaviour isn't really surprising, can the query be avoided somehow?  Is there a way to manually update the session/identity-map with the info?

In general, is there a better way to do what I'm doing?  The example is obviously a simplified one... my actual use case is batch inserting thousands of records this way and I'd like to be able to keep the ORM and avoid the extra query if I could.

Russ

Oct 19, 2011, 2:41:23 PM
to sqlal...@googlegroups.com
Another wrinkle to this is that if I already have relationship data within the ORM, but then add records outside of the ORM with the expression language, I can't figure out how to reconcile this efficiently.

As a specific example, if I add the snippet below to my original example, you can see the ORM falls out of sync.

#Add some new emails to Joe outside the ORM...
initial_mail_count = len(joe.addresses)
new_mail_count = 2
new_emails = ["newjoe%0...@example.com" % i for i in range(new_mail_count)]
emailValues = [dict(FK_user_id = joe.id, email = addr) for addr in new_emails]
sess.execute(Address.__table__.insert(), emailValues)

#Joe's new emails are not in joe.addresses...
# - because the ORM has zero awareness of the direct execute call
assert len(joe.addresses) == initial_mail_count

#I can get them by expiring all of Joe...
# - is there a way to do this without expiring Joe?
# - Can the relation be expired directly?
# - Better would be to inform the ORM of new data (instead of expiring old
#   data), but I'm looking for workarounds.
sess.expire(joe)
assert len(joe.addresses) == (initial_mail_count + new_mail_count)

As per the code and comments, I can bring it back into sync by expiring the top level object, but this is clearly not very efficient since it results in a whole query for the top level object, as well as any additional eager loads that have been configured for that object.

Can a relation be expired directly?  I'm searching hard for how to do this and failing.  I'm also still very interested in whether there is some other efficient way to re-sync the ORM with transactions performed outside of it.

I have updated the pastebin sample code to include the above snippet as well:

Russ

Michael Bayer

Oct 19, 2011, 10:20:50 PM
to sqlal...@googlegroups.com
On Oct 18, 2011, at 3:40 PM, Russ wrote:

I often mix up the SQL expression language with the use of an ORM session, and it is great that SQLAlchemy more than supports this.

But... what are the recommended ways to keep the session in sync with what you do with the SQL expression stuff?

well pretty much being savvy about expiration is the primary approach.   The rows you affect via an execute(), if they've been loaded in the session, would need to be expired from memory.    Unless you can be coarse-grained about it and expire everything (as occurs normally after a commit()), or large chunks of things, you'd need to work out specific situations as needed.  Of course the best scenario is when you aren't intermingling ORM state with SQL-emitted UPDATE statements so much.  Query.update() and Query.delete() were introduced to help with this as they can apply the same filter() criteria to the objects within the Session, expiring those that match - they can evaluate simple SQL criteria in memory for this purpose.
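For instance, a rough sketch of a synchronized bulk UPDATE along those lines, assuming the Address mapping and the "joe" object from the example:

#update all of joe's addresses in a single UPDATE statement;
#synchronize_session='evaluate' applies the same criteria in Python to any
#matching Address objects already in the Session, so they stay in sync
sess.query(Address).\
    filter(Address.FK_user_id == joe.id).\
    update({"email": "dead@example.com"}, synchronize_session='evaluate')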



For example, with the ORM you can't really do a batch/bulk insert (that I know of), but you can mix in some SQL expression language to achieve it.  Here is a complete example that shows this for the standard Users/Addresses example:

so there, your options are:

1. expire bob.addresses ( see below, it appears you weren't aware of this)
2. use lazy="dynamic" on bob.addresses, so that SQL is emitted each time the relationship is accessed.
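A sketch of both options, using the names from the example:

#1. expire just the collection; the next access re-SELECTs it
sess.expire(bob, ['addresses'])

#2. or, in the mapping, configure the relationship as "dynamic" so that every
#   access emits fresh SQL (assumes a declarative-style mapping):
#addresses = relationship("Address", lazy="dynamic")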


Although the latter behaviour isn't really surprising, can the query be avoided somehow?  Is there a way to manually update the session/identity-map with the info?

You can put things straight into the identity map using Session.add(), assuming the object has a database identity.   Session.merge(obj, load=False) does it too and is normally used to merge in state from a cache into the Session where the target identities may already be present.

But to get identity on the object and have it be "clean" as far as pending changes you need to tinker.  The identity key part needs to be via instance_state(obj).key = object_mapper(obj).identity_key_from_instance(obj), so you can see there you're starting to dive into private APIs.   The "fixed" attributes with no history you can assign via set_committed_value: http://www.sqlalchemy.org/docs/orm/session.html#sqlalchemy.orm.attributes.set_committed_value , this is more of a public API as it's there to help people write custom loader logic for special cases.   There would be nothing controversial about making a more "public" api to assign identity to an object but we don't usually add things without a use case we see fit; otherwise people don't know why such a method is there and if/when they should be using it.   The use cases also drive how the API would look, as there are several ways a "put an object in the identity map" API could look here.  

You can see this approach is starting to tinker with things, it will work completely fine and the patterns there aren't changing anytime soon, but this is not the usual level that people work at.  The APIs in this regard are not as well suited (or widely tested) towards the use case of manipulating the Session directly to that fine a degree.  That said we already have "make_transient()" which is the opposite of "set_my_identity()", so, not too controversial as long as we don't confuse people with it.


In general, is there a better way to do what I'm doing?  The example is obviously a simplified one... my actual use case is batch inserting thousands of records this way and I'd like to be able to keep the ORM and avoid the extra query if I could.

Wondering though why I've never had this use case.  Batch install thousands of records with straight INSERT, great. But then you need all of those thousands in memory via an ORM state all at once ?   Why's that ?   Are you doing more "batch" work with all N-thousand items in the ORM sense ?    Usually if I write an installer, then it's done.   The ORM use case comes later, loads in just a handful of those records, then works with them.  If I need to batch insert and work with the full set of objects in an ORM sense simultaneously, I work with the ORM all the way through.    In modern SQLAlchemy this can be nearly as fast  - just assign the primary key value to your objects ahead of time so that no "post fetch of primary key" is needed, and the ORM will batch the INSERT statements just as you're doing with your example.   Tons and tons of work has gone into getting the ORM to be better at batch inserts, since I use it in this way myself quite a lot.

Another wrinkle to this is that if I already have relationship data within the ORM, but then add records outside of the ORM with the expression language, I can't figure out how to reconcile this efficiently.
As a specific example, if I add the snippet below to my original example, you can see the ORM falls out of sync.
#I can get them by expiring all of Joe...
# - is there a way to do this without expiring Joe?

the set_committed_value() above can do this, though I'd be careful of complexity here

# - Can the relation be expired directly?

absolutely, Session.expire accepts a second parameter that's a list of attribute names (for the things you're looking to do here, and since you have a very good sense of how the ORM does things, start digging into those API docs !   not sure if you looked around but it's here:   http://www.sqlalchemy.org/docs/orm/session.html#refreshing-expiring as well as docstrings at http://www.sqlalchemy.org/docs/orm/session.html#sqlalchemy.orm.session.Session.expire )
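e.g., with the snippet quoted above, instead of expiring all of joe:

sess.expire(joe, ['addresses'])
assert len(joe.addresses) == (initial_mail_count + new_mail_count)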

# - Better would be to inform the ORM of new data (instead of expiring old
#   data), but I'm looking for workarounds.

yeah it's not something I've dealt with much since I try to find another way to solve the problem without introducing a lot of brittle back and forth.   There's a larger problem here, which is: why do you really need to use direct SQL to emit inserts and updates within code that is complex enough to otherwise require the ORM?


Russ

Oct 20, 2011, 1:49:15 AM
to sqlal...@googlegroups.com
Thanks very much for the response... lots to chew on here.

well pretty much being savvy about expiration is the primary approach.   The rows you affect via an execute(), if they've been loaded in the session, would need to be expired from memory.

I understand this somewhat and had done that, but am still confused by one thing: in my pure ORM case (for "joe"), how did the identity map get a hold of the auto-increment ids when there was no apparent emitted SQL?  It seems there wasn't one done, although I suppose it could have been done behind the scenes and not logged?  Either that or the insert statements are returning the ids somehow... although I didn't see anything indicating that.  Whatever method the ORM uses to get the ids back from the INSERT statements it emits is what I was hoping to use to stuff the identity-map myself, without needing to expire anything and/or re-query.

Unless you can be coarse-grained about it and expire everything (as occurs normally after a commit()), or large chunks of things, you'd need to work out specific situations as needed.  Of course the best scenario is when you aren't intermingling ORM state with SQL-emitted UPDATE statements so much.  Query.update() and Query.delete() were introduced to help with this as they can apply the same filter() criteria to the objects within the Session, expiring those that match - they can evaluate simple SQL criteria in memory for this purpose.

I had been looking at Query.update and it looked promising and dug around for an insert equivalent, but there wasn't one so I ended up dropping down to the SQL expression language.
For example, with the ORM you can't really do a batch/bulk insert (that I know of), but you can mix in some SQL expression language to achieve it.  Here is a complete example that shows this for the standard Users/Addresses example:
so there, your options are:

1. expire bob.addresses ( see below, it appears you weren't aware of this)
2. use lazy="dynamic" on bob.addresses, so that SQL is emitted each time the relationship is accessed.

I'm not sure how those apply to doing a bulk insert with the ORM... I must be missing something?  "dynamic" isn't an option for me as I use the relation frequently... I just need to sync up after a few selected insert locations.  For the expiry of bob.addresses I don't know what to say... I had some sort of weird/idiotic mental block as I read and re-read the docs on expire and the docstring and still didn't get it for some reason.  I feel shame.  Thanks for the pointer to the obvious second argument. :(
Although the latter behaviour isn't really surprising, can the query be avoided somehow?  Is there a way to manually update the session/identity-map with the info?
You can put things straight into the identity map using Session.add(), assuming the object has a database identity.   Session.merge(obj, load=False) does it too and is normally used to merge in state from a cache into the Session where the target identities may already be present.

Thanks - I'm not sure I 100% get it.  I think it is simply that I can make an ORM object instance not associated with a session (I think you call this "detached"), directly assign the PK (if I know it), and then when I call Session.add() the ORM will trust me and use the provided PK for integrating the object into the identity-map.  Is that right?

If this is the case, I still have the unfortunate problem of not knowing the autoincrement PK values without a fresh query.  And before it comes up, I do actually need autoincrement PKs versus a more "natural" candidate key in almost all cases.  Reasons here include space/time efficient joining to many separate tables, as well as allowing renaming of the "natural" candidate keys without changing record identity (or the many FKs).  I'm continually evaluating whether I can ditch autoinc sequence PKs and it does not make sense in most cases.

But to get identity on the object and have it be "clean" as far as pending changes you need to tinker.  The identity key part needs to be via instance_state(obj).key = object_mapper(obj).identity_key_from_instance(obj), so you can see there you're starting to dive into private APIs.   The "fixed" attributes with no history you can assign via set_committed_value: http://www.sqlalchemy.org/docs/orm/session.html#sqlalchemy.orm.attributes.set_committed_value , this is more of a public API as it's there to help people write custom loader logic for special cases.   There would be nothing controversial about making a more "public" api to assign identity to an object but we don't usually add things without a use case we see fit; otherwise people don't know why such a method is there and if/when they should be using it.   The use cases also drive how the API would look, as there are several ways a "put an object in the identity map" API could look here.
 
You can see this approach is starting to tinker with things, it will work completely fine and the patterns there aren't changing anytime soon, but this is not the usual level that people work at.  The APIs in this regard are not as well suited (or widely tested) towards the use case of manipulating the Session directly to that fine a degree.  That said we already have "make_transient()" which is the opposite of "set_my_identity()", so, not too controversial as long as we don't confuse people with it.

set_committed_value looks very promising... I'll tinker with it.  I want to stay away from deeper APIs for fear of it being brittle, as you say (great adjective).  
In general, is there a better way to do what I'm doing?  The example is obviously a simplified one... my actual use case is batch inserting thousands of records this way and I'd like to be able to keep the ORM and avoid the extra query if I could.
Wondering though why I've never had this use case.  Batch install thousands of records with straight INSERT, great. But then you need all of those thousands in memory via an ORM state all at once ?   Why's that ?   Are you doing more "batch" work with all N-thousand items in the ORM sense ?    Usually if I write an installer, then it's done.   The ORM use case comes later, loads in just a handful of those records, then works with them.  If I need to batch insert and work with the full set of objects in an ORM sense simultaneously, I work with the ORM all the way through.    In modern SQLAlchemy this can be nearly as fast  - just assign the primary key value to your objects ahead of time so that no "post fetch of primary key" is needed, and the ORM will batch the INSERT statements just as you're doing with your example.   Tons and tons of work has gone into getting the ORM to be better at batch inserts, since I use it in this way myself quite a lot.

Those darn autoincrements foil me again here... I can't assign PKs ahead of time.

Regarding the use case and other comments on why I need to do it, I'll hum a few bars.  Basically I'm working on a relatively large application that has been built from the ground up to use SQLAlchemy and only SQLAlchemy to define and work with the database (although schema definition, ORM definitions, and business logic have been kept as separate as possible... while still using declarative, which may make you cringe).  The use of the ORM has made some particularly tricky areas a heck of a lot easier to develop than they would have been with direct SQL.  eg: when I have a complicated query to do that includes a complex (but nicely organized) spiderweb of a set of 10 tables, I can do this really easily with the ORM.  When I'm cruising the emitted SQL for optimizations I almost weep with joy that I didn't have to fabricate the queries in there by hand.

As to why there is a mix of ORM and emitted SQL, it is really only an optimization thing, and not a small one.  The large batch inserts are not an initial data setup, but are a result of the application requirements, and they occur frequently.  All business logic is initially coded using the ORM, since it is by far the easiest way to get it coded up (recall the weeping?).  The SQL expression language is only hacked in to address issues found during profiling, and so far this has only been a requirement for bulk inserts.  For example, a recent case had a particular function configuration taking ~8.5s to complete the operation.  Checking the emitted SQL showed huge floods of the ORM doing several different batch inserts one insert at a time.  Hacking those code segments to do batch inserts with the expression language in a way similar to my joe/bob example cut the query down to < 0.5s, even with inefficient identity-map refreshing.  Fiddling with some eager loading config will carve that further down due to some remaining selects that could be merged, but the main point was the resolution of the "whoah that is a heck of a lot of unnecessary inserts" problem.

Michael Bayer

Oct 20, 2011, 10:00:31 AM
to sqlal...@googlegroups.com
On Oct 20, 2011, at 1:49 AM, Russ wrote:


I understand this somewhat and had done that, but am still confused by one thing: in my pure ORM case (for "joe"), how did the identity map get a hold of the auto-increment ids when there was no apparent emitted SQL?  It seems there wasn't one done, although I suppose it could have been done behind the scenes and not logged?  

the examples you showed me on pastebin all seemed to involve the ORM going out to fetch rows, so when they are fetched via SELECT all the info is there.


I had been looking at Query.update and it looked promising and dug around for an insert equivalent, but there wasn't one so I ended up dropping down to the SQL expression language.

so really you're only concerned about a bulk INSERT, ok


1. expire bob.addresses ( see below, it appears you weren't aware of this)
2. use lazy="dynamic" on bob.addresses, so that SQL is emitted each time the relationship is accessed.

I'm not sure how those apply to doing a bulk insert with the ORM...

assuming you had "bob" and then did a mass-insert of Address objects, both of the above methods would ensure that the "addresses" collection on bob reflects the current data


Thanks - I'm not sure I 100% get it.  I think it is simply that I can make an ORM object instance not associated with a session (I think you call this "detached"), directly assign the PK (if I know it), and then when I call Session.add() the ORM will trust me and use the provided PK for integrating the object into the identity-map.  Is that right?

Correct, but I've never had anyone do this before.    I'm not yet sure I like advising someone to manipulate the Session directly in this way.   We need to be careful what we advise because in two weeks it will show up on tutorials everywhere then we get complaints that "SQLAlchemy is too hard" because everyone thinks they're supposed to put things straight into the identity map (exaggerating here, but only slightly).

If this is the case, I still have the unfortunate problem of not knowing the autoincrement PK values without a fresh query.  

well therein lies why the ORM most of the time can't use bulk INSERT calls in a backend-neutral way.   Your approach, which essentially is that of swapping in some optimized code where flush() would normally take place, is subject to the same rules of the road.  Assuming the backend is one that uses a "get the most recent ID" approach (which includes SQLite, MySQL and MSSQL), the Python DBAPI - and in fact every database client API I've ever worked with - only supports returning the "last inserted id" one INSERT at a time, that is, only via cursor.execute(), and not cursor.executemany()  (background on the DBAPI at http://www.python.org/dev/peps/pep-0249/).   On Postgresql, Oracle and Firebird, the underlying construct used to generate ids is the sequence - on those, you could run the sequence externally to the inserts for the full range of integer IDs first.
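A rough sketch of running the sequence externally on one of those backends (the sequence name and num_new_rows are placeholders; this assumes the default "tablename_columnname_seq" naming):

from sqlalchemy import Sequence

seq = Sequence("addresses_id_seq")
conn = sess.connection()
#one round trip per value; a later post in this thread batches this with generate_series()
pregen_ids = [conn.execute(seq) for _ in range(num_new_rows)]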



As to why there is a mix of ORM and emitted SQL, it is really only an optimization thing, and not a small one.  The large batch inserts are not an initial data setup, but are a result of the application requirements, and they occur frequently.  All business logic is initially coded using the ORM, since it is by far the easiest way to get it coded up (recall the weeping?).  The SQL expression language is only hacked in to address issues found during profiling, and so far this has only been a requirement for bulk inserts.  For example, a recent case had a particular function configuration taking ~8.5s to complete the operation.  Checking the emitted SQL showed huge floods of the ORM doing several different batch inserts one insert at a time.  Hacking those code segments to do batch inserts with the expression language in a way similar to my joe/bob example cut the query down to < 0.5s, even with inefficient identity-map refreshing.  Fiddling with some eager loading config will carve that further down due to some remaining selects that could be merged, but the main point was the resolution of the "whoah that is a heck of a lot of unnecessary inserts" problem.

so to do this, you'd need:

1. a way to make primary key ids fast, i.e. a sequence.   or not using autoinc if you're on MySQL, etc. 
2. a function that will copy those inserted rows into ORM objects, where you use set_committed_value() for those attributes you'd like loaded (must include at least the primary key identifying attributes) and state.key = instance_key(obj) (paraphrased) to set up the state, and use session.add().   At some point SQLA could grow a make_detached(obj) function.
2a:  Fine here's make_detached:

from sqlalchemy.orm.attributes import instance_state
def make_detached(obj):
    """Given a transient object with state, turn it into 'detached', that is, has an instance key
       based on its primary key identity.  obj must have primary key attributes populated."""
    state = instance_state(obj)
    state.key = state.mapper._identity_key_from_state(state)

   
3. really good practices to package this optimization cleanly and tidily away from everything else !
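A hypothetical usage sketch of the above (some_known_id is a placeholder for a primary key value obtained elsewhere; this is the "tinkering" level described above, not an endorsed pattern):

from sqlalchemy.orm.attributes import set_committed_value

addr = Address("newjoe@example.com")
#mark each attribute, including the primary key, as loaded with no pending history
set_committed_value(addr, 'id', some_known_id)
set_committed_value(addr, 'email', "newjoe@example.com")
set_committed_value(addr, 'FK_user_id', joe.id)
make_detached(addr)   #transient -> detached: identity key derived from the PK
sess.add(addr)        #detached -> persistent: straight into the identity map, no SQL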


Russell Warren

Oct 20, 2011, 4:49:54 PM
to sqlal...@googlegroups.com
I understand this somewhat and had done that, but am still confused by one thing: in my pure ORM case (for "joe"), how did the identity map get a hold of the auto-increment ids when there was no apparent emitted SQL?  It seems there wasn't one done, although I suppose it could have been done behind the scenes and not logged?  
the examples you showed me on pastebin all seemed to involve the ORM going out to fetch rows, so when they are fetched via SELECT all the info is there.

But there was no apparent SELECT!  I don't think I expressed myself properly... below is a dump of the SQL logging for my pastebin example (some prefix cruft removed):

BEGIN (implicit)
INSERT INTO users (name) VALUES (?)
('joe',)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO users (name) VALUES (?)
('bob',)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
(('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2))

And this is immediately followed by the "for addr in joe.addresses: print addr.id, addr" print loop, which prints:
1 <Address('jo...@example.com')>
2 <Address('jo...@example.com')>
3 <Address('jo...@example.com')>
4 <Address('jo...@example.com')>
5 <Address('jo...@example.com')>

Note that the ids are there and there wasn't a SELECT to be found anywhere in the SQL emission logs.  This is what leads me to believe that either A) there was one, it just didn't get logged, or B) they are somehow coming back on the insert.  If B, I'd like to get a hold of that for my bulk inserts.

so really you're only concerned about a bulk INSERT, ok

Yes, and this comes exclusively from actual measurements and observed query behavior (and so far the only real issue has been the INSERT flood when using the ORM alone).
Thanks - I'm not sure I 100% get it.  I think it is simply that I can make an ORM object instance not associated with a session (I think you call this "detached"), directly assign the PK (if I know it), and then when I call Session.add() the ORM will trust me and use the provided PK for integrating the object into the identity-map.  Is that right?
Correct, but I've never had anyone do this before.    I'm not yet sure I like advising someone to manipulate the Session directly in this way.   We need to be careful what we advise because in two weeks it will show up on tutorials everywhere then we get complaints that "SQLAlchemy is too hard" because everyone thinks they're supposed to put things straight into the identity map (exaggerating here, but only slightly).

No worries - I will steer very clear of anything that is not advised.  My initial objective was to find what *is* advised when cheating on the ORM and doing things (bulk inserts) directly, as well as trying to figure out some other things, like how autoinc ids are magically known without an apparent SELECT.  You are being very helpful on all counts.

well therein lies why the ORM most of the time can't use bulk INSERT calls in a backend-neutral way.   Your approach, which essentially is that of swapping in some optimized code where flush() would normally take place, is subject to the same rules of the road.  Assuming the backend is one that uses a "get the most recent ID" approach (which includes SQLite, MySQL and MSSQL), the Python DBAPI - and in fact every database client API I've ever worked with - only supports returning the "last inserted id" one INSERT at a time, that is, only via cursor.execute(), and not cursor.executemany()  (background on the DBAPI at http://www.python.org/dev/peps/pep-0249/).   On Postgresql, Oracle and Firebird, the underlying construct used to generate ids is the sequence - on those, you could run the sequence externally to the inserts for the full range of integer IDs first.

The main application is using PostgreSQL, but at the moment there is no Postgres specific code, and everything still works perfectly (and is tested regularly) on SQLite.  The latter being mostly for quick/easy debugging.

Snagging a bunch of ids from the sequence in advance is an interesting idea to avoid the "post fetch of primary key" you mentioned.  That will be blazing fast, versus reading back the ids after the inserts (assuming no magic id returns as per earlier).  The added bonus/appeal with that is that the ORM can work normally without having to resort to the trickery that merging non-ORM operations into the ORM requires.

so to do this, you'd need:

1. a way to make primary key ids fast, i.e. a sequence.   or not using autoinc if you're on MySQL, etc. 
2. a function that will copy those inserted rows into ORM objects, where you use set_committed_value() for those attributes you'd like loaded (must include at least the primary key identifying attributes) and state.key = instance_key(obj) (paraphrased) to set up the state, and use session.add().   At some point SQLA could grow a make_detached(obj) function.
2a:  Fine here's make_detached:

from sqlalchemy.orm.attributes import instance_state
def make_detached(obj):
    """Given a transient object with state, turn it into 'detached', that is, has an instance key
       based on its primary key identity.  obj must have primary key attributes populated."""
    state = instance_state(obj)
    state.key = state.mapper._identity_key_from_state(state)

Thanks!  Although after really chewing on all of this and trying out different things, I'm now content to stay in the ORM and keep away from the need for that make_detached function (see below).  Your pointing out of the batch insert behaviour when pre-assigning PKs with ORM inserts is the big win here I think.  I confirmed this batch insert ORM behaviour with the following simple snippet:

numEmails = 5
joeEmails = [Address("joe%0...@example.com" % i) for i in range(numEmails)]
#Now manually assign some known unique PK ids in advance...
# - this is Bad code and is only a cheat for this example
for i, mail in enumerate(joeEmails, 10000):
    mail.id = i
joe.addresses = joeEmails
sess.add(joe)
#flush and see that all addresses are bulk inserted with a single INSERT...
sess.flush()

So I never need to leave the ORM, and don't even need to do trickery like forcing unnatural transient->detached state transitions (although all of this has been extremely informative).

The only remaining thing is to make sure my manually assigned PKs are legit/unique.  With my case of PostgreSQL I can hard code the fetching of a block of ids from a sequence with:

def GetSequenceBlock(ORMSession, SeqName, BlockSize):
    q = "SELECT nextval('%s') FROM generate_series(1, %d)" % (SeqName, BlockSize)
    #fetch BlockSize new ids from the sequence in a single round trip
    return [v[0] for v in ORMSession.execute(q).fetchall()]

...
Going further, I've now successfully implemented this technique with PostgreSQL in a bunch of code, sticking with the ORM the whole way.  I've added this classmethod to the base class I use for my DeclarativeBase:

class MyOrmBase(object):
    @classmethod
    def GetIdSequenceBlock(cls, ORMSession, BlockSize):
        """Returns a list of BlockSize ids from the sequence for the current
        table's id. If no sequence can be found (dialect doesn't support it?)
        an empty list is returned.
        
        Assumptions:
         - the column with the sequence is named "id"
         - standard SQLAlchemy sequence naming convention
        
        Currently only supports PostgreSQL.
        
        """
        seqName = "%s_id_seq" % cls.__tablename__
        q = "SELECT nextval('%s') FROM generate_series(1, %d)" % \
            (seqName, BlockSize)
        try:
            ids = [v[0] for v in ORMSession.execute(q).fetchall()]
        except Exception as exc: #could be various exceptions
            ids = []
        return ids

And when I want to do a bulk insert using this I do something like:

mails = [Address("joe%0...@example.com" % i) for i in range(mail_count)]
pregen_ids = Address.GetIdSequenceBlock(sess, mail_count)
for i, id in enumerate(pregen_ids):
    mails[i].id = id
#make the session aware of the new objects (or append them to a relationship)
sess.add_all(mails)
#and this flush does it all in one INSERT...
# - for PostgreSQL, anyway
# - For SQLite it does an INSERT per row due to no PK pre-assign
sess.flush()

I'm quite happy with that, except for the fact that it breaks the seal and introduces PostgreSQL-specific code (raw SQL at that) into the application.

I really wish there was a way to do this with the stock autoincrementing PK in SQLite, but it seems I'll have to use a different method of pre-fetching ids there.  On the raw SQL point, I see that calling nextid = connection.execute(seq) will get the next value in the sequence for all relevant dialects, but there doesn't seem to be an equivalent way in the dialects to get a batch of them in one shot.

To the original question and subject header, in future if I come across any cases where I feel the need to drop to SQL expression language I now have some extra tools available to work with it.  I expect the #1 approach will simply be to expire any related attributes and let the ORM refresh as needed, as that seems like the simplest and least sketchy route.  If I'm desperately trying to avoid the extra selects at some point, I'll dip into something like make_detached(), but I agree with you that the simple existence of such a thing makes it more complicated when you don't know what you are doing.

Once again - thanks!

Russ

Michael Bayer

Oct 20, 2011, 4:58:18 PM
to sqlal...@googlegroups.com
The "joe" addresses are there quite simply because you assigned them on line 44:

joeEmails = [Address("joe%0...@example.com" % i) for i in range(numEmails)]
joe.addresses = joeEmails
sess.add(joe)
#flush and see an emitted insert statement per email...
sess.flush()

the comment regarding "NO SELECT !" is against the same "joe" object.   the "addresses" collection hasn't changed, no expiration has occurred either, therefore no SELECT is needed here:

# we can get all Joe's info without a query... BUT THERE WAS NEVER A SELECT
# EMITTED ANYWHERE ABOVE!
for addr in joe.addresses:
    print addr.id, addr
Thanks!  Although after really chewing on all of this and trying out different things, I'm now content to stay in the ORM and keep away from the need for that make_detached function (see below).  Your pointing out of the batch insert behaviour when pre-assigning PKs with ORM inserts is the big win here I think.  I confirmed this batch insert ORM behaviour with the following simple snippet:

numEmails = 5
joeEmails = [Address("joe%0...@example.com" % i) for i in range(numEmails)]
#Now manually assign some known unique PK ids in advance...
# - this is Bad code and is only a cheat for this example
for i, mail in enumerate(joeEmails, 10000):
    mail.id = i
joe.addresses = joeEmails
sess.add(joe)
#flush and see that all addresses are bulk inserted with a single INSERT...
sess.flush()

So I never need to leave the ORM, and don't even need to do trickery like forcing unnatural transient->detached state transitions (although all of this has been extremely informative).

wow what a win huh?   the ORM is still slower than the straight SQL due to the internal bookkeeping but glad this one is working OK.


Russell Warren

Oct 20, 2011, 5:38:03 PM
to sqlal...@googlegroups.com
The "joe" addresses are there quite simply because you assigned them on line 44:

The addresses are obviously there, yes.  The ids are also there, though.  With no apparent select.  That's the mystery to me.

Michael Bayer

Oct 20, 2011, 5:44:17 PM
to sqlal...@googlegroups.com
they were flushed !   the primary key is always fetched when the ORM does an INSERT.


On Oct 20, 2011, at 5:38 PM, Russell Warren wrote:

The "joe" addresses are there quite simply because you assigned them on line 44:

The addresses are obviously there, yes.  The ids are also there, though.  With no apparent select.  That's the mystery to me.


Russell Warren

Oct 21, 2011, 11:05:07 AM
to sqlal...@googlegroups.com
they were flushed !   the primary key is always fetched when the ORM does an INSERT.

I know it does happen, just not how.

It seems that my confusion was based entirely on a peculiarity of the sqlite dialect.  When I switch to Postgres it is perfectly clear that the ids come back on INSERT through the use of INSERT...RETURNING addresses.id.  Sqlite logging gives no such clarity about where they come from, so I presume it is a fundamental feature of sqlite's automatic ROWID (http://www.sqlite.org/autoinc.html).  ie: what I considered "magic" option B, although I don't see it in the sqlite3 docs either.

The prefetch of PKs for Postgres is great.  But when I investigate the other path and follow flush()'s lead of efficiently retrieving PK ids on INSERT (to achieve ORM integration with a minimal set of queries) I still hit two dead ends.  It turns out I can't do it in batch with PostgreSQL (outside of the great pre-fetch solution) or with SQLite.

With PostgreSQL, I can add a returning clause to the insert to try and get the ids back at the same time as the insert.  Unfortunately, this does not work since when you insert multiple rows, the returned list only has the first element populated and the rest are None:

new_mail_count = 3
new_emails = ["newjoe%0...@example.com" % i for i in range(new_mail_count)]
ins_values = [dict(FK_user_id = joe.id, email = addr) for addr in new_emails]
insert = Address.__table__.insert(returning = [Address.__table__.c.id,
                                               Address.__table__.c.email])
ret = sess.execute(insert, ins_values).fetchall()

and if you look at ret, it is:

[(13, u'newj...@example.com'), (None, None), (None, None)]

None?  I don't know if this is a misunderstanding by me, a limitation of RETURNING, or a bug.

For SQLite I'm back to the original problem of not knowing how to get the ids, and there is no "returning" clause there.

Argh.  I think I need to quit while I'm ahead.  The pre-fetch for Postgres solves a lot for me.  I still can't bulk insert (without post-query) in SQLite, but I'll live with that or switch to client-side PK generation or something.  The rest is mostly just me trying to understand all of this (with a dab of over-optimization) and my frustration will pass.

Michael Bayer

Oct 21, 2011, 11:24:18 AM
to sqlal...@googlegroups.com
On Oct 21, 2011, at 11:05 AM, Russell Warren wrote:

they were flushed !   the primary key is always fetched when the ORM does an INSERT.

I know it does happen, just not how.

It seems that my confusion was based entirely on a peculiarity of the sqlite dialect.  When I switch to Postgres it is perfectly clear that the ids come back on INSERT through the use of INSERT...RETURNING addresses.id.  Sqlite logging gives no such clarity about where they come from, so I presume it is a fundamental feature of sqlite's automatic ROWID (http://www.sqlite.org/autoinc.html).  ie: what I considered "magic" option B, although I don't see it in the sqlite3 docs either.

it's cursor.lastrowid:


this is part of the DBAPI.   It's not used with psycopg2 (psycopg2 returns the row's OID which is mostly useless).
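For illustration, the bare DBAPI behaviour the sqlite dialect leans on (plain sqlite3, no SQLAlchemy involved):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, email TEXT)")
cur = conn.execute("INSERT INTO addresses (email) VALUES (?)", ("joe@example.com",))
print(cur.lastrowid)   #the autogenerated id, available per execute()
cur = conn.executemany("INSERT INTO addresses (email) VALUES (?)",
                       [("a@example.com",), ("b@example.com",)])
print(cur.lastrowid)   #not reliably defined after executemany()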

But when I investigate the other path and follow flush()'s lead of efficiently retrieving PK ids on INSERT (to achieve ORM integration with a minimal set of queries) I still hit two dead ends.  It turns out I can't do it in batch with PostgreSQL (outside of the great pre-fetch solution) or with SQLite.

that's correct, I mentioned this in my original email


With PostgreSQL, I can add a returning clause to the insert to try and get the ids back at the same time as the insert.  Unfortunately, this does not work since when you insert multiple rows, the returned list only has the first element populated and the rest are None:

that's correct


None?  I don't know if this is a misunderstanding by me, a limitation of RETURNING, or a bug.

it's a limitation of PG's client library and/or psycopg2.    If I recall correctly I really tried to get an answer from them on this one and I think they were pretty much against it.


For SQLite I'm back to the original problem of not knowing how to get the ids, and there is no "returning" clause there.

you can get one at a time, and that's it.    the DBAPI does not support getting back autogen ids with executemany(), nor does RETURNING work reliably in that scenario.

Russell Warren

Oct 21, 2011, 11:37:23 AM
to sqlal...@googlegroups.com
Ok thanks.  I actually feel better knowing I'm up against underlying limitations.

Sorry for being a few steps behind you... I didn't initially understand your comments about why the ORM "can't use bulk INSERT calls in a backend-neutral way", but with hindsight they are perfectly clear.

it's cursor.lastrowid:

Aha!

All loose ends now wrapped up neatly.  Thanks again.
