Thanks for the input; #17 is certainly something I'd like to look at.
For anyone not following along, that's the proposal to share the
select_related() cache among similar instances.
I should point out, though, that the goal of 1.0 is a stable API and
#17 doesn't affect that one way or the other. That is, it could easily
be added without breaking the API, and so falls more under the "nice
to have" than the "blocker" category. It's also quite a bit of code
that I don't fully understand, so I need some more time to fully
understand the implications.
Jacob
For you ;)
My blog runs on mod_python, with (at peak) maybe a half-dozen
processes running. Each of those processes might have its own copy or
even multiple copies of a given Entry object. And I don't care that
that happens; whether there are a bunch of copies or only one is
utterly irrelevant to that use case.
And it's irrelevant to a whole lot of use cases, really, because there
are only two specific situations where it does cause any problems
you'd be likely to notice:
1. You have staggering numbers of objects in memory, in which case
multiple copies hurt resource usage (and, to be perfectly honest,
using a lighter-weight data structure than an instance of Model
becomes a much better solution for this most of the time).
2. Your data fluctuates quickly and often, in which case you want to
know that the object you're displaying *here* just got changed by the
form over *there*.
While these are real use cases, they are not the majority of use
cases, and so identity mapping in the ORM is not necessarily a
must-have feature for 1.0. They are important for you, yes, and they
are important for a number of situations, yes, but we have much bigger
fish to fry over in the big belly ;)
--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
-- Ned Batchelder, http://nedbatchelder.com
Erm. You already *know* what the category is, why are you going back
and reading it over and over?
> This is a HUGE change in my mind, and ya, its
> geared at performance, but this really isnt what I'd call just an
> optimization when it's that large of a scale.
What you're asking for is to have Django stop, every single time you
do something that might run a DB query, and
1. Figure out which objects will be returned by the query, and
2. Whether the objects are already in memory.
There's a type of software that's very good at this, and it's called a
"relational database". For Django to do what you want, it would need
to implement at least a significant subset of the features of a
relational database, and so instead of querying against an optimized
RDBMS you'd be querying against Django. I don't see how that's going
to be a major performance boost, considering Django would have to do
it in interpreted Python while your actual RDBMS can do it in
hand-tuned C.
Because it's convenient to do it. This is the whole point: Django, not a
user, should "already know" that category and just reuse it. And a user
should just write readable code.
> What you're asking for is to have Django stop, every single time you
> do something that might run a DB query, and
>
> 1. Figure out which objects will be returned by the query, and
> 2. Whether the objects are already in memory.
I believe we can start by solving just the most common case: a local
per-request cache of objects that are accessed through foreign keys.
> There's a type of software that's very good at this, and it's called a
> "relational database".
This is a bad analogy. This way you could also disprove that we need any
caches at all :-).
Relational databases implement a mathematical model of information
*retrieval*. They usually don't care that a query returns mostly similar
data and expect a user to figure it out by data semantics.
> For Django to do what you want, it would need
> to implement at least a significant subset of the features of a
> relational database, and so instead of querying against an optimized
> RDBMS you'd be querying against Django.
Please don't overcomplicate. A single dict like
Model._instance_cache[id] that is cleared at the end of a request should
be enough.
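A per-request cache along those lines can be sketched in a few lines. The `Model` class, `get()` helper, and `clear_instance_cache()` hook below are illustrative stand-ins, not Django's actual API:

```python
import weakref

class Model:
    # Toy stand-in for a model class; not Django's real Model.
    # Maps pk -> live instance; weak refs so unreferenced instances can die.
    _instance_cache = weakref.WeakValueDictionary()

    def __init__(self, pk):
        self.pk = pk

    @classmethod
    def get(cls, pk):
        # Return the in-process instance for this pk, creating it on a miss.
        instance = cls._instance_cache.get(pk)
        if instance is None:
            instance = cls(pk)          # stands in for the real DB fetch
            cls._instance_cache[pk] = instance
        return instance

    @classmethod
    def clear_instance_cache(cls):
        # Hook to call from request-finished middleware or a signal.
        cls._instance_cache.clear()

a = Model.get(1)
b = Model.get(1)
assert a is b                  # same pk -> same in-memory object
Model.clear_instance_cache()   # "end of request"
assert Model.get(1) is not a   # fresh instance once the cache is cleared
```

The only policy decision is when to clear: per request keeps the cache small and avoids cross-request staleness.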
> I don't see how that's going
> to be a major performance boost
When we were working on performance in our deployed service most of the
initial gain was achieved by throwing select_related all over the place
because things that Ned mentioned were *everywhere*. But select_related
has its own problems: it eats memory and time creating loads of
identical instances (we all remember that creating an instance is not
exactly free because of signals).
Erm. No, now you're asking for Django to be psychic and guess what
you're going to do before you try to do it. This is also not feasible;
at some point a developer has to stop and pay some attention to what
he/she is writing, rather than assume that libraries/frameworks will
magically optimize everything.
> I believe we can start by solving just the most common case: a local
> per-request cache of objects that are accessed through foreign keys.
That's a bit more reasonable. Still not a 1.0 blocker, but more reasonable.
> This is a bad analogy. This way you could also disprove that we need any
> caches at all :-).
No, it's not a bad analogy; what David's asking for is to have Django
magically work out what objects his query would return, without
executing the query against the DB. That requires something
functionally equivalent to a relational data store and query engine.
> Please don't overcomplicate. A single dict like
> Model._instance_cache[id] that is cleared on the end of a request should
> be enough.
Note I was replying to David's request, not yours. The
overcomplication comes precisely from his request to have Django work
out what the query will do and avoid hitting the DB; "magically figure
out that you've already queried for the objects I'm querying for now"
requires an ability to, well, figure out what the query would do and
what it would return.
Ah... I admit I didn't read the original post completely. I was merely
echoing what I understood from Ned's letter. Sorry if I messed things up!
It's a horrible analogy, and this is speaking as the author of the
patch and personally discussing instance caching w/ david himself,
many, many times (9+ months of running w/it in addition).
Frankly, either folks are grossly misunderstanding the purpose of
this functionality (let alone the ORM's implementation), or there is a
fair bit of FUD floating about here. Going to hope it's the former,
thus laying out a lot of info here- may be elementary, but kindly
reread it since folks seem to be missing a lot here.
If you read the patch, or dig through the ORM implementation you'll
notice the unique key (good ole mr pk) that django itself so heavily
relies upon for fun things like updating and deletion, and
foreignkeys. Meaning that there is *no* query analysis, recreation of
the RDBMS's data store, nothing of the sort, to identify when a row
request results in a record already in memory. You just pull the pk
from the args/kwargs, and you check the weakref cache for a
preexisting instance in memory that has that unique id.
Period. No duplication of RDBMS functionality, no query analysis. If
you read the patch (specifically the __call__ modification to the
ModelBase class) this is dead obvious. Implying otherwise is just
plain wrong.
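The pk-keyed lookup described above can be sketched as a metaclass `__call__` override. The `ModelBase` and `Entry` names here mirror the description but are a simplified illustration, not the patch itself:

```python
import weakref

class ModelBase(type):
    # Metaclass sketch: key the cache on the unique pk pulled from the
    # constructor args -- no query analysis, just a dict lookup.
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._instance_cache = weakref.WeakValueDictionary()

    def __call__(cls, pk, *args, **kwargs):
        existing = cls._instance_cache.get(pk)
        if existing is not None:
            return existing                      # reuse: no new instance built
        instance = super().__call__(pk, *args, **kwargs)
        cls._instance_cache[pk] = instance
        return instance

class Entry(metaclass=ModelBase):
    def __init__(self, pk, title=""):
        self.pk = pk
        self.title = title

e1 = Entry(42, title="first")
e2 = Entry(42, title="second")   # same pk: the cached instance comes back
assert e1 is e2
assert e2.title == "first"       # the duplicate's __init__ never ran
```

Because the metaclass returns the cached object before `super().__call__` runs, the second construction never touches `__new__` or `__init__` at all.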
Technically also, if you're going to argue about "duplicating the
RDBMS into the ORM", you might want to take a hard look at django's
ORM now- specifically why it's triggering its own deletes across
tables for foreignkeys instead of using a cascade deletion- iow, that
claimed issue already is there (and if you're aware of the misc DB
implementations' limitations, there is a damn good reason to have that
in the ORM instead of limiting support to just the DBs that implement
it).
> > Please don't overcomplicate. A single dict like
> > Model._instance_cache[id] that is cleared on the end of a request should
> > be enough.
>
> Note I was replying to David's request, not yours. The
> overcomplication comes precisely from his request to have Django work
> out what the query will do and avoid hitting the DB; "magically figure
> out that you've already queried for the objects I'm querying for now"
> requires an ability to, well, figure out what the query would do and
> what it would return.
Specifically in the context of foreignkey accessing, what django
mainstream currently does is thus:
1) grab the foreign key id.
2) make a db request for a record from the ref'd table for that id.
3) instantiate an instance of that model w/ the data returned.
What *just* instance caching (ie, __new__ machinery, no other
modification) does, is thus:
1) grab the foreign key id.
2) make a db request for a record from the ref'd table for that id.
3) look for a preexisting instance in memory that has that unique id
already, return a ref to that, else instantiate a new instance
returning that.
What a sane implementation of that *should* do if they actually care
about performance:
1) grab the foreign key id
2) look for a preexisting instance in memory for that id. Short
circuit and return that instance if it exists.
3) no instance? Query the db for the required row data, instantiate,
return said instance.
Voila, that's how you wind up with select_related and foreignkey
accessing having a specific check to bypass hitting the db. At a
peak of 6 req/s, that load difference isn't going to be a
deal breaker- loading up 53k objects (my "spanish inquisition" usage
scenario) and accessing a ref on each, or easily clearing 300 req/s,
it becomes *very* noticeable and a significant flaw, enough so that
w/out it frankly it makes django something of a rather crappy choice
(yes you can bypass the orm, but you'll hit a point where you ask
yourself "why again am I using django if I'm bypassing most of it?").
Point being, third option is the sane one here.
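That third option can be sketched with an invented in-memory "database"; `FAKE_DB`, `Category`, and `get_related` are all illustrative names under that assumption, not Django's descriptors:

```python
import weakref

# Stands in for the referenced table; invented for illustration.
FAKE_DB = {1: {"name": "django"}}
QUERY_COUNT = 0

_cache = weakref.WeakValueDictionary()

class Category:
    def __init__(self, pk, name):
        self.pk = pk
        self.name = name

def get_related(fk_id):
    # 1) grab the fk id; 2) short-circuit on a cached instance; 3) else query.
    global QUERY_COUNT
    cached = _cache.get(fk_id)
    if cached is not None:
        return cached                  # no DB hit at all
    QUERY_COUNT += 1                   # simulate the SELECT
    row = FAKE_DB[fk_id]
    instance = Category(fk_id, row["name"])
    _cache[fk_id] = instance
    return instance

first = get_related(1)
second = get_related(1)
assert first is second
assert QUERY_COUNT == 1   # second access never touched the "database"
```

The short-circuit in step 2 is where the savings come from: at hundreds of requests per second, every cache hit is a query that never leaves the process.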
The commentary on performance/mem usage is one thing; what is also
being (imo, daftly) ignored in this discussion is the fact that with
this functionality in place it's a helluva lot harder to get data
staleness issues- simple example is a properly modularized codebase,
you have a codepath that reaches down and does some invalidation
checks, realizes it needs to update an instance. Does the update, new
data *must* be used for rendering the page, else you'll get weird
results.
What happens if the render code higher up already holds that data- do
you restructure all of your code to try and force the pull after the
*potential* update, and hope that no one blows their foot off via not
following that vague constraint?
Or do you treat unique DB rows as unique instances in memory, singular
(like the underlying db itself), thus avoiding data staleness issues?
Kind of a given which I go for- any api design that makes it easy to
blow your own foot off is significantly flawed from where I'm sitting.
The massive scaling benefits are a rather attractive side effect
in addition.
~brian
I'm simply responding to what's been brought up here; there seemed to be:
First, a desire to have Django magically detect inefficient code
(e.g., hitting a relationship descriptor when you already had, and
knew you had, the object it would return) and prevent the
inefficiency.
Second, something much more than a simple object cache; David's
request seemed as if it was asking for more than just, say, an
in-memory mapping of PKs to objects, or more aggressive caching of
relationships, but more an attempt to anticipate that an arbitrary
query would return objects with certain IDs and avoid hitting the
database to retrieve them.
For example, select_related() keeps coming up, and it appeared that
what was desired was some way to have select_related() realize that
somewhere down the chain of relationships in the eventual query would
be some results already in memory, and to magically rewrite the
eventual query or even optimize it away entirely to avoid ever
querying multiple times for the same object. This would indeed require
Django to implement something with the same functionality as a query
engine.
The example of deleting related objects is a dangerous red herring,
since that feature is in Django precisely because some supported
databases don't enforce foreign key constraints and so will silently
corrupt your data (by leaving dangling foreign keys that don't point
anywhere) if Django doesn't work out the necessary chain of objects to
delete. This is as simple as traversing known foreign keys, and is
nowhere near so complex as what's described in the preceding
paragraph.
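The traversal described above can be sketched with an invented relations map; this is a toy model of following declared foreign keys, not Django's actual delete machinery:

```python
# Invented schema: table -> tables holding a foreign key to it.
RELATIONS = {
    "author": ["entry"],
    "entry": ["comment"],
    "comment": [],
}

def collect_deletes(table, seen=None):
    # Depth-first walk of known foreign keys; no query engine required,
    # just the relationship metadata the ORM already holds.
    if seen is None:
        seen = []
    if table not in seen:
        seen.append(table)
        for dependent in RELATIONS[table]:
            collect_deletes(dependent, seen)
    return seen

# Deleting an author must also delete its entries and their comments.
assert collect_deletes("author") == ["author", "entry", "comment"]
```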
And again, while I agree that some form of more aggressive mapping of
results in memory would be a boost for some situations, I still am not
convinced that it's a 1.0 blocker; as Jacob has pointed out, if it's
added, it can and should be implemented with no backwards
incompatibilities, which means it could be added without breaking an
API-stability guarantee and hence could go in post-1.0.
Everybody's said their piece. There's really no new information and lots
of unnecessary finger-pointing in the recent posts. The beauty of email
is that the earlier bits are right there in the archives if we need to
reference them again, so more repetitions aren't going to add anything.
We already want to add a feature like this. It's nice to have and just
not our highest priority nice-to-have at the moment. There's a big
difference between "not now, later" and "not now, never".
Thanks,
Malcolm
--
The cost of feathers has risen; even down is up!
http://www.pointy-stick.com/blog/
There is no such desire; the desire is to avoid going to the db when
we already have the data *in process*, and to avoid having duplicate
copies of the same data in memory w/out reason.
> Second, something much more than a simple object cache; David's
> request seemed as if it was asking for more than just, say, an
> in-memory mapping of PKs to objects, or more aggressive caching of
> relationships, but more an attempt to anticipate that an arbitrary
> query would return objects with certain IDs and avoid hitting the
> database to retrieve them.
http://groups.google.com/group/django-developers/browse_thread/thread/818386d38ad3f9e6
David's verbatim request was for #17 to be included, just that,
nothing more, nothing less.
> For example, select_related() keeps coming up, and it appeared that
> what was desired was some way to have select_related() realize that
> somewhere down the chain of relationships in the eventual query would
> be some results already in memory, and to magically rewrite the
> eventual query or even optimize it away entirely to avoid ever
> querying multiple times for the same object. This would indeed require
> Django to implement something with the same functionality as a query
> engine.
Methinks you're reading into it way more than what has been
requested; again, see the thread's origins. There is also no 'magic'
involved in the raw request, so please don't dismiss it
via such terms.
> The example of deleting related objects is a dangerous red herring,
> since that feature is in Django precisely because some supported
> databases don't enforce foreign key constraints and so will silently
> corrupt your data (by leaving dangling foreign keys that don't point
> anywhere) if Django doesn't work out the necessary chain of objects to
> delete.
It's not a red herring; the point is that the ORM by nature has to
mirror chunks of the RDBMS itself, whether for compat or for avoiding
impedance. This is a natural thing, and attacking #17 via "it's
duplicating RDBMS functionality" applies a double standard.
> This is as simple as traversing known foreign keys, and is
> nowhere near so complex as what's described in the preceding
> paragraph.
Please look at the code in question. It's actually a helluva lot
fewer lines, and way more readable (even after accounting for the "the code
is dense" complaint). Seriously, take a look at delete and
delete_object, those are *not* fun functions to screw with and are
quite a bit more complex compared to the ~30 lines that compose core
instance caching support.
If there is a specific complexity complaint, please level it- I'm
more than interested in hearing actual *specifics*. At the very least
that would be confirmation people are actually looking at the patch ;)
> And again, while I agree that some form of more aggressive mapping of
> results in memory would be a boost for some situations, I still am not
> convinced that it's a 1.0 blocker; as Jacob has pointed out, if it's
> added, it can and should be implemented with no backwards
> incompatibilities, which means it could be added without breaking an
> API-stability guarantee and hence could go in post-1.0.
As per my previous email, ignore the performance gains. Y'all keep
missing the fundamental shift this induces- via making a model
instance unique in memory, that means that code that trashes that
instance is screwing it up for *all* ref holders. Current mainline,
if I have a codepath that makes a query then dirties up the resultant
instance, no problem- doesn't affect other holders of that data since
it's a separate ref.
With #17, it's the same ref, meaning code of that sort breaks w/ the
patch in use. That's *precisely* why it should be pre-1.0, while it's
not an API change it's a change to the rules of using said API
correctly.
Regarding Malcolm's request of "why don't we just let this sit for a
while", there's a reason #17 has multiple patches correcting for
bitrot- I posted the first public patch of this ~10 months ago. The
functionality is effectively complete, and has had heavy production
usage for the better part of a year. Hell, the patch was marked ready
for check in just shy of 7 months ago.
What remains? Frankly, what is there to discuss? Either it's
desired, or it's not- if it is, and it's complete [1], schlop the
sucker in or nail down exactly what is required to get it to that
stage. Especially considering the 1.0 comments from above...
I'm not trying to piss y'all off here, but it's pretty discouraging to
watch completed functionality bitrot w/out a given reason- delaying
this further w/out at least some form of decision is exactly that
scenario. Yes people are busy, but that cuts both ways- maintainer
and external contributor.
~brian
[1] The only remaining thing I'm aware of that would be needed pre
commit is to shift the instance cache to per thread storage. This
is a 5 minute mod, not much point in doing said mod if the ticket is
going to remain in limbo.
James Bennett wrote:
> For example, select_related() keeps coming up, and it appeared that
> what was desired was some way to have select_related() realize that
> somewhere down the chain of relationships in the eventual query would
> be some results already in memory, and to magically rewrite the
> eventual query
No, it's not what I'm (and, apparently, Brian) talking about. The SQL that
select_related() executes won't change at all. But model instantiation
will: it will reuse an already constructed instance.
A single .get() of a parent record (without select_related()) will also
use an already constructed instance when available, and also won't hit
the DB as a bonus.
This is it.
I understand this. I thought I'd pointed it out once already, but
again for sake of sanity...
The list of people I am disagreeing with in this thread:
* David
The list of people I am not disagreeing with in this thread:
* Everyone except David
Can we move past that?