I often mix the SQL expression language with use of an ORM session, and it is great that SQLAlchemy more than supports this. But... what are the recommended ways to keep the session in sync with what you do with the SQL expression stuff?
For example, with the ORM you can't really do a batch/bulk insert (that I know of), but you can mix in some SQL expression language to achieve it. Here is a complete example that shows this for the standard Users/Addresses example:
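It went roughly like this (a simplified sketch; the exact model definitions here are assumptions pieced together from the SQL log quoted later in the thread):

from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    addresses = relationship("Address", backref="user")
    def __init__(self, name):
        self.name = name

class Address(Base):
    __tablename__ = "addresses"
    id = Column(Integer, primary_key=True)
    email = Column(String)
    FK_user_id = Column(Integer, ForeignKey("users.id"))
    def __init__(self, email):
        self.email = email
    def __repr__(self):
        return "<Address('%s')>" % self.email

engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)
sess = sessionmaker(bind=engine)()

#pure ORM: joe and his addresses - one INSERT per address row
joe = User("joe")
joe.addresses = [Address("joe%02d@example.com" % i) for i in range(5)]
sess.add(joe)
sess.flush()

#mix in the expression language for bob: a single executemany INSERT
bob = User("bob")
sess.add(bob)
sess.flush()    #need bob.id for the FK values below
sess.execute(
    Address.__table__.insert(),
    [{"email": "bob%02d@example.com" % i, "FK_user_id": bob.id}
     for i in range(5)]
)

#joe's addresses (ids included) are already in the identity map...
for a in joe.addresses:
    print("%s %r" % (a.id, a))
#...but the session knows nothing of bob's rows; touching bob.addresses
#emits a fresh SELECT
print(len(bob.addresses))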
Although the latter behaviour isn't really surprising, can the query be avoided somehow? Is there a way to manually update the session/identity-map with the info?
In general, is there a better way to do what I'm doing? The example is obviously a simplified one... my actual use case is batch inserting thousands of records this way and I'd like to be able to keep the ORM and avoid the extra query if I could.
Another wrinkle to this is that if I already have relationship data within the ORM, but then add records outside of the ORM with the expression language, I can't figure out how to reconcile this efficiently.
As a specific example, if I add the snippet below to my original example, you can see the ORM falls out of sync.
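Roughly (continuing the sketch above; the extra "joe" addresses are just filler):

sess.execute(
    Address.__table__.insert(),
    [{"email": "joe_extra%02d@example.com" % i, "FK_user_id": joe.id}
     for i in range(3)]
)

#joe.addresses was already loaded above, so the new rows don't show up here
print(len(joe.addresses))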
#I can get them by expiring all of Joe...
# - is there a way to do this without expiring Joe?
# - Can the relation be expired directly?
# - Better would be to inform the ORM of new data (instead of expiring old
#   data), but I'm looking for workarounds.
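#(continuing the sketch above) expiring all of joe works:
sess.expire(joe)
print(len(joe.addresses))   #fresh SELECT, now includes the rows added via execute()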
Well, pretty much being savvy about expiration is the primary approach. The rows you affect via an execute(): if they've been loaded in the session, they'd need to be expired from memory.
Unless you can be coarse-grained about it and expire everything (as occurs normally after a commit()), or large chunks of things, you'd need to work out specific situations as needed. Of course the best scenario is when you aren't intermingling ORM state with SQL-emitted UPDATE statements so much. Query.update() and Query.delete() were introduced to help with this, as they can apply the same filter() criteria to the objects within the Session, expiring those that match - they can evaluate simple SQL criteria in memory for this purpose.
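For instance, a quick sketch against the User/Address classes from your example:

#emits an UPDATE, and also applies the change to any matching User objects
#already present in the Session (simple criteria can be evaluated in memory)
sess.query(User).filter(User.name == 'joe').update(
    {'name': 'joseph'}, synchronize_session='evaluate')

#same idea for deletes; with 'fetch' the matching rows are located with an
#extra SELECT, and matching objects in the Session are removed as well
sess.query(Address).filter(Address.email.like('%@example.com')).delete(
    synchronize_session='fetch')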
> For example, with the ORM you can't really do a batch/bulk insert (that I know of), but you can mix in some SQL expression language to achieve it. Here is a complete example that shows this for the standard Users/Addresses example:

So there, your options are:

1. expire bob.addresses (see below, it appears you weren't aware of this)
2. use lazy="dynamic" on bob.addresses, so that SQL is emitted each time the relationship is accessed.
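Roughly, with the names from your example:

#option 1 - expire just the collection; the next access emits a fresh SELECT
sess.expire(bob, ['addresses'])
print(len(bob.addresses))

#option 2 - configured at mapping time, so every access queries:
#    addresses = relationship("Address", lazy="dynamic")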
> Although the latter behaviour isn't really surprising, can the query be avoided somehow? Is there a way to manually update the session/identity-map with the info?

You can put things straight into the identity map using Session.add(), assuming the object has a database identity. Session.merge(obj, load=False) does it too and is normally used to merge state from a cache into the Session where the target identities may already be present.
But to get identity on the object, and have it be "clean" as far as pending changes go, you need to tinker. The identity key needs to be set via instance_state(obj).key = object_mapper(obj).identity_key_from_instance(obj), so you can see there you're starting to dive into private APIs. The "fixed" attributes with no history you can assign via set_committed_value: http://www.sqlalchemy.org/docs/orm/session.html#sqlalchemy.orm.attributes.set_committed_value - this is more of a public API, as it's there to help people write custom loader logic for special cases. There would be nothing controversial about making a more "public" API to assign identity to an object, but we don't usually add things without a use case we see as a good fit; otherwise people don't know why such a method is there and if/when they should be using it. The use cases also drive how the API would look, as there are several ways a "put an object in the identity map" API could look here.
You can see this approach is starting to tinker with things; it will work completely fine and the patterns there aren't changing anytime soon, but this is not the usual level that people work at. The APIs in this regard are not as well suited (or as widely tested) for the use case of manipulating the Session directly to that fine a degree. That said, we already have make_transient(), which is the opposite of a hypothetical "set_my_identity()", so it wouldn't be too controversial as long as we don't confuse people with it.
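Very roughly, the pieces named above fit together like this (an untested sketch, not a supported recipe; "addr" is just a stand-in object whose primary key attributes are already populated):

from sqlalchemy.orm import object_mapper
from sqlalchemy.orm.attributes import instance_state, set_committed_value

instance_state(addr).key = object_mapper(addr).identity_key_from_instance(addr)
set_committed_value(addr, 'email', 'known@example.com')   #loaded, no pending history
sess.add(addr)   #now sits in the identity map as a clean, persistent object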
> In general, is there a better way to do what I'm doing? The example is obviously a simplified one... my actual use case is batch inserting thousands of records this way and I'd like to be able to keep the ORM and avoid the extra query if I could.

Wondering though why I've never had this use case. Batch insert thousands of records with a straight INSERT, great. But then you need all of those thousands in memory via ORM state all at once? Why's that? Are you doing more "batch" work with all N-thousand items in the ORM sense? Usually if I write an installer, then it's done. The ORM use case comes later, loads in just a handful of those records, then works with them. If I need to batch insert and work with the full set of objects in an ORM sense simultaneously, I work with the ORM all the way through. In modern SQLAlchemy this can be nearly as fast - just assign the primary key value to your objects ahead of time so that no "post fetch of primary key" is needed, and the ORM will batch the INSERT statements just as you're doing in your example. Tons and tons of work has gone into getting the ORM to be better at batch inserts, since I use it this way myself quite a lot.
I understand this somewhat and had done that, but am still confused by one thing: in my pure ORM case (for "joe"), how did the identity map get a hold of the auto-increment ids when there was no apparent emitted SQL? It seems there wasn't one done, although I suppose it could have been done behind the scenes and not logged?
I had been looking at Query.update() and it looked promising, so I dug around for an insert equivalent, but there wasn't one, so I ended up dropping down to the SQL expression language.
> 1. expire bob.addresses (see below, it appears you weren't aware of this)
> 2. use lazy="dynamic" on bob.addresses, so that SQL is emitted each time the relationship is accessed.

I'm not sure how those apply to doing a bulk insert with the ORM...
Thanks - I'm not sure I 100% get it. I think it is simply that I can make an ORM object instance not associated with a session (I think you call this "detached"), directly assign the PK (if I know it), and then when I call Session.add() the ORM will trust me and use the provided PK for integrating the object into the identity-map. Is that right?
If this is the case, I still have the unfortunate problem of not knowing the autoincrement PK values without a fresh query.
As to why there is a mix of ORM and emitted SQL, it is really only an optimization thing, and not a small one. The large batch inserts are not an initial data setup, but are a result of the application requirements, and they occur frequently. All business logic is initially coded using the ORM, since it is by far the easiest way to get it coded up (recall the weeping?). The SQL expression language is only hacked in to address issues found during profiling, and so far this has only been a requirement for bulk inserts. For example, a recent case had a particular function configuration taking ~8.5s to complete the operation. Checking the emitted SQL showed huge floods of the ORM doing several different batch inserts one insert at a time. Hacking those code segments to do batch inserts with the expression language in a way similar to my joe/bob example cut the query down to < 0.5s, even with inefficient identity-map refreshing. Fiddling with some eager loading config will carve that further down due to some remaining selects that could be merged, but the main point was the resolution of the "whoah that is a heck of a lot of unnecessary inserts" problem.
> I understand this somewhat and had done that, but am still confused by one thing: in my pure ORM case (for "joe"), how did the identity map get a hold of the auto-increment ids when there was no apparent emitted SQL? It seems there wasn't one done, although I suppose it could have been done behind the scenes and not logged?

The examples you showed me on pastebin all seemed to involve the ORM going out to fetch rows, so when they are fetched via SELECT all the info is there.
BEGIN (implicit)
INSERT INTO users (name) VALUES (?)
('joe',)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
('jo...@example.com', 1)
INSERT INTO users (name) VALUES (?)
('bob',)
INSERT INTO addresses (email, "FK_user_id") VALUES (?, ?)
(('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2), ('bo...@example.com', 2))
1 <Address('jo...@example.com')>
2 <Address('jo...@example.com')>
3 <Address('jo...@example.com')>
4 <Address('jo...@example.com')>
5 <Address('jo...@example.com')>
so really you're only concerned about a bulk INSERT, ok
> Thanks - I'm not sure I 100% get it. I think it is simply that I can make an ORM object instance not associated with a session (I think you call this "detached"), directly assign the PK (if I know it), and then when I call Session.add() the ORM will trust me and use the provided PK for integrating the object into the identity-map. Is that right?

Correct, but I've never had anyone do this before. I'm not yet sure I like advising someone to manipulate the Session directly in this way. We need to be careful what we advise, because in two weeks it will show up in tutorials everywhere, and then we get complaints that "SQLAlchemy is too hard" because everyone thinks they're supposed to put things straight into the identity map (exaggerating here, but only slightly).
Well, therein lies why the ORM most of the time can't use bulk INSERT calls in a backend-neutral way. Your approach, which essentially swaps in some optimized code where flush() would normally take place, is subject to the same rules of the road. Assuming the backend is one that uses a "get the most recent ID" approach, which includes SQLite, MySQL and MSSQL, the Python DBAPI - and in fact every database client API I've ever worked with - only supports returning you the "last inserted id" one INSERT at a time, that is, only via cursor.execute(), and not cursor.executemany() (background on the DBAPI at http://www.python.org/dev/peps/pep-0249/). On PostgreSQL, Oracle and Firebird, the underlying construct used to generate ids is the sequence - on those, you could run the sequence externally to the inserts for the full range of integer IDs first.
So to do this, you'd need:

1. a way to make primary key ids fast, i.e. a sequence, or not using autoinc if you're on MySQL, etc.
2. a function that will copy those inserted rows into ORM objects, where you use set_committed_value() for those attributes you'd like loaded (must include at least the primary key identifying attributes) and state.key = instance_key(obj) (paraphrased) to set up the state, then use session.add(). At some point SQLA could grow a make_detached(obj) function.

2a: Fine, here's make_detached:

from sqlalchemy.orm.attributes import instance_state

def make_detached(obj):
    """Given a transient object with state, turn it into 'detached', that is,
    give it an instance key based on its primary key identity. obj must have
    its primary key attributes populated."""
    state = instance_state(obj)
    state.key = state.mapper._identity_key_from_state(state)
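Hypothetical usage, with the names from the example and a made-up pre-allocated id:

a = Address("extra@example.com")
a.id = 12345              #primary key populated ahead of time
make_detached(a)
sess.add(a)               #enters the identity map directly, no SQL emitted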
numEmails = 5
joeEmails = [Address("joe%0...@example.com" % i) for i in range(numEmails)]
#Now manually assign some known unique PK ids in advance...
# - this is Bad code and is only a cheat for this example
for i, mail in enumerate(joeEmails, 10000):
    mail.id = i
joe.addresses = joeEmails
sess.add(joe)
#flush and see that all addresses are bulk inserted with a single INSERT...
sess.flush()
class MyOrmBase(object):
    @classmethod
    def GetIdSequenceBlock(cls, ORMSession, BlockSize):
        """Returns a list of BlockSize ids from the sequence for the current
        table's id. If no sequence can be found (dialect doesn't support it?)
        an empty list is returned.

        Assumptions:
          - the column with the sequence is named "id"
          - standard SQLAlchemy sequence naming convention

        Currently only supports PostgreSQL.
        """
        seqName = "%s_id_seq" % cls.__tablename__
        q = "SELECT nextval('%s') FROM generate_series(1, %d)" % \
            (seqName, BlockSize)
        try:
            ids = [v[0] for v in ORMSession.execute(q).fetchall()]
        except Exception as exc:  #could be various exceptions
            ids = []
        return ids
mails = [Address("joe%0...@example.com" % i) for i in range(mail_count)]
pregen_ids = cls.GetIdSequenceBlock(sess, mail_count)
for i, id in enumerate(pregen_ids):
    mails[i].id = id
#and this flush does it all in one INSERT...
# - for PostgreSQL, anyway
# - For SQLite it does an INSERT per row due to no PK pre-assign
sess.flush()
Thanks! Although after really chewing on all of this and trying out different things, I'm now content to stay in the ORM and keep away from the need for that make_detached function (see below). Your pointing out of the batch insert behaviour when pre-assigning PKs with ORM inserts is the big win here, I think. I confirmed this batch insert ORM behaviour with the following simple snippet:

numEmails = 5
joeEmails = [Address("joe%0...@example.com" % i) for i in range(numEmails)]
#Now manually assign some known unique PK ids in advance...
# - this is Bad code and is only a cheat for this example
for i, mail in enumerate(joeEmails, 10000):
    mail.id = i
joe.addresses = joeEmails
sess.add(joe)
#flush and see that all addresses are bulk inserted with a single INSERT...
sess.flush()

So I never need to leave the ORM, and don't even need to do trickery like forcing unnatural transient->detached state transitions (although all of this has been extremely informative).
The "joe" addresses are there quite simply because you assigned them on line 44:
> The "joe" addresses are there quite simply because you assigned them on line 44:

The addresses are obviously there, yes. The ids are also there, though. With no apparent select. That's the mystery to me.
They were flushed! The primary key is always fetched when the ORM does an INSERT.
> They were flushed! The primary key is always fetched when the ORM does an INSERT.

I know it does happen, just not how. It seems that my confusion was based mostly on a peculiarity of the sqlite dialect. When I switch to Postgres it is perfectly clear that the ids come back on INSERT through the use of INSERT...RETURNING addresses.id. SQLite logging has no such clarity about where they come from, so I presume it is a fundamental feature of sqlite's automatic ROWID (http://www.sqlite.org/autoinc.html), i.e. what I considered as "magic" option B, although I don't see it in the sqlite3 docs either.
integration with a minimal set of queries) I still hit two dead ends. It turns out that I can't do it in batch with PostgreSQL (outside of the great pre-fetch solution) or with SQLite.
With PostgreSQL, I can add a returning clause to the insert to try and get the ids back at the same time as the insert. Unfortunately, this does not work since when you insert multiple rows, the returned list only has the first element populated and the rest are None:
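Roughly what I tried (simplified, using the same table as in my earlier example):

ins = Address.__table__.insert().returning(Address.__table__.c.id)
result = sess.execute(
    ins,
    [{"email": "x%02d@example.com" % i, "FK_user_id": bob.id}
     for i in range(5)]
)
print(result.fetchall())   #only the first id is populated; the rest are None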
None? I don't know if this is a misunderstanding by me, a limitation of RETURNING, or a bug.
For SQLite I'm back to the original problem of not knowing how to get the ids, and there is no "returning" clause there.
its cursor.lastrowid:
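e.g. at the raw DBAPI level (a small plain-sqlite3 illustration, outside of SQLAlchemy):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, email TEXT)")

cur = conn.cursor()
cur.execute("INSERT INTO addresses (email) VALUES (?)", ("a@example.com",))
print(cur.lastrowid)          #the id SQLite generated for that one row

#only defined for execute() - after executemany() it isn't useful, which is
#why a backend-neutral bulk INSERT can't hand you all the new ids back
cur.executemany("INSERT INTO addresses (email) VALUES (?)",
                [("b@example.com",), ("c@example.com",)])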