Nay. :-)
Or, at least, not until you demonstrate how the same syntax is
interpreted in Postgres, SQLite, Oracle, any other SQL database, and
potentially on MongoDB, Cassandra, BigTable, or other noSQL datastore.
This was the biggest problem when I closed #11003 originally; it
remains the biggest problem. I'm not in favor of adding an ORM
extension that is only useful for MySQL's query optimizer. That
defeats the purpose of having a cross-database ORM syntax. If you are
at the point of optimizing your system and you discover that you have
a dataset and query that actually require database-specific fine
tuning, I don't see it as an unreasonable requirement that you crank
out some database-specific raw SQL.
Django's ORM isn't intended to be perfect for all situations. It's
intended to make the 80% case trivial so you can bootstrap a site
without having to worry about the little details. Once you've hit that
point and you're tuning performance, you're going to need to pay close
attention to the little details like exactly how your site uses SQL.
Attempting to find a way to express every possible SQL statement,
including backend-specific optimizations, into an ORM syntax is folly.
After all, we already have a syntax that is able to express every
possible SQL statement... it's called SQL :-)
That said -- I'm not fundamentally opposed to the idea of introducing
hinting tools to the ORM, as long as:
* they are demonstrably rich enough to get the job done,
* have an interpretation that is compatible with the cross-platform,
reusable-app message that underpins Django's design philosophy.
To that end, the 'with_hint()' proposal appeals to me a lot more than
the previous suggestion of Yet Another Extra Argument. However, the
patch as provided:
1) Doesn't contain tests or docs
2) Will break queries if the user includes the with_hint() clause on
any database other than MySQL
3) Even then, only supports the USE INDEX keyword, not FORCE INDEX or
IGNORE INDEX
4) Doesn't suggest how hinting strategies would be implemented on a
database other than MySQL
If you can come up with answers to these points, I might get
interested. 1 and 2 are fairly trivial; I can think of some obvious
answers for 3, but 4 is the big problem, and will require some serious
research and consideration.
I would also add that if this is a query extension that you really do
want, it's something that should be possible to add by writing a
custom database backend that extends Django's MySQL backend and
provides a custom Query base class -- and if this can't be done
relatively easily, then a refactoring of the sql.Query or SQLCompiler
interface to allow easier query rewriting by a backend *is* something
I could get behind very easily.
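As a rough illustration of the kind of rewriting such a backend could do, the compiler's generated SQL could be post-processed before execution. This is only a sketch; add_index_hint is a hypothetical helper, not Django API:

```python
# Hypothetical helper -- not Django API -- showing the kind of string-level
# rewrite a custom backend's compiler could apply to generated SQL.

def add_index_hint(sql, table, hint):
    """Insert a MySQL-style index hint directly after a table reference."""
    target = "FROM `%s`" % table
    return sql.replace(target, "%s %s" % (target, hint), 1)

sql = "SELECT `book`.`id` FROM `book` WHERE `book`.`title` = %s"
print(add_index_hint(sql, "book", "USE INDEX (title_idx)"))
# -> SELECT `book`.`id` FROM `book` USE INDEX (title_idx) WHERE `book`.`title` = %s
```

A real backend would hook this in at the compiler level rather than rewriting strings after the fact, but the shape of the transformation is the same.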
Yours,
Russ Magee %-)
A very big aye from here. For a very high traffic project backed by 20+ DB servers, not being able to hint inline was the #1 reason we had to write queries by hand for the public-facing portions, and it was a noticeable percentage of our total public queries (easily >10%).
George
> I'd be happy to use raw(); but then you lose len(), slicing,
> pagination, filter chaining, etc. My problem came about because admin
> change lists were unusably slow on large tables. With_hints allowed a
> simple monkey patch to admin, dropping 2s per page down to ~0.01s.
So the underlying problem is: how to rewrite a query (whether generated
by my code, someone else's code, or a mixture) so that it performs
better on my database. There will always be third party code which
generates queries that are really bad for performance, depending on your
project or your database. This also applies to tickets like
http://code.djangoproject.com/ticket/11604
This makes me think that what we need is a mechanism to do an equivalent
to your monkey patching that is easier and more general. Some kind of
"replace the auto-generated SQL with this SQL" method could be really
useful.
Could this be solved by writing your own cursor wrapper which checks the
SQL against a list and rewrites as necessary?
(You could set it up quickly like this:
http://chris-lamb.co.uk/2010/06/01/appending-request-url-sql-statements-django/
That still involves monkey patching, but it is possible to do the same
thing by writing a database backend).
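A minimal sketch of such a rewriting wrapper (CursorRewriter is a made-up name, not part of Django), including the debug reporting mentioned below:

```python
# Sketch of a rewriting cursor wrapper: check each SQL statement against a
# replacement table and swap in hand-tuned SQL. Hypothetical, not Django API.

class CursorRewriter(object):
    def __init__(self, cursor, replacements, debug=False):
        self.cursor = cursor
        self.replacements = replacements  # {generated_sql: tuned_sql}
        self.debug = debug

    def execute(self, sql, params=()):
        if sql in self.replacements:
            if self.debug:
                print("rewrote: %s" % sql)
            sql = self.replacements[sql]
        return self.cursor.execute(sql, params)

    def __getattr__(self, name):
        # Delegate everything else (fetchall, close, ...) to the real cursor.
        return getattr(self.cursor, name)
```

The wrapper would be installed by the custom backend (or monkey patch) so that every generated statement passes through `execute()`.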
The major drawback is fragility - doing a replacement at this level
could easily break, for example if you added a field to a model.
Adequate testing could catch this if your cursor wrapper had a debug
mode that could report what was replaced.
If we tried to do the replacement at a higher level, then you run into
the problem of how to recognise a query and replace it, which could be a
much more expensive operation if we are trying to compare objects of
type django.db.models.sql.query.Query, for example, and you also might
have the problem of not having full access to SQL statements.
However, if this method does work, it would be good to document how to
do it, and provide some better builtin hooks if necessary.
Luke
--
"We may not return the affection of those who like us, but we
always respect their good judgement." -- Libbie Fudim
Luke Plant || http://lukeplant.me.uk/
Yes; as implemented, __search is MySQL specific, and the API is
covered by our backwards compatibility policy, so we will continue to
support it. However, that doesn't mean it should be used as an example
or excuse for other database-specific features in the ORM.
The only explanation I can give for this particular feature is age.
The __search keyword was added over 4 years ago (Revision r3073,
ticket #573). Django was still in a period of rapid growth at that
time. Some of the features that were added at that time might not
survive if they were proposed today (either because of hindsight, or
because of changes in the perceived role of the ORM over time).
If a fulltext search operator were proposed today, it might still be
accepted -- after all, a query of text data using a rich full-text
search syntax is something that transcends SQL and could be considered
a primitive operation on object data. However, before I would accept
such a proposal, I would need to be convinced that the proposed syntax
would be supported on other backends. Postgres 8.3, for instance, has
fulltext search indexing capabilities. I would also need to be
convinced that the proposed syntax was rich enough to potentially
cover the range of features provided by MySQL. As it stands, we only
support queries "IN BOOLEAN MODE", not "IN NATURAL LANGUAGE MODE" or
"WITH QUERY EXPANSION".
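For reference, the MySQL backend implements the lookup as a fixed SQL template in its operators map; other backends simply have no 'search' entry, which is why the same query fails elsewhere:

```python
# The MySQL backend's operator template for the __search lookup:
mysql_search_template = 'MATCH (%s) AGAINST (%%s IN BOOLEAN MODE)'

# So a query like Entry.objects.filter(body__search='+django -jazz')
# compiles (roughly) to a WHERE clause of:
print(mysql_search_template % '`entry`.`body`')
# -> MATCH (`entry`.`body`) AGAINST (%s IN BOOLEAN MODE)
```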
Yours,
Russ Magee %-)
I would like to know how you're validating your assertion that MySQL
is the most used backend. It doesn't match my experience or
observation.
The fact that this is a MySQL-specific issue is perhaps the biggest
impediment to my enthusiasm about this specific feature (as proposed).
I've spent enough time in the past covering for the inadequacies of
MySQL. I don't particularly relish the thought of adding more time to
that particular ledger. :-)
> Adding with_hints()
> will only serve to encourage support from other backends too, where
> possible.
>
> Maybe we can make the with_hints() syntax more generic with both
> common and backend-specific kwargs --
>
> .with_hints(index='my_index') (string implies index on queryset base
> model)
> .with_hints(index={Model:'my_index', RelatedModel:'index2'})
> (dictionary allows defining index on a per-model basis)
>
> So the Oracle backend, for example, could implement --
>
> .with_hints(hash=True, index={Model:'my_index',
> RelatedModel:'index2'})
>
> On Oracle there are dozens of possible hints, so I'd say unsupported
> kwargs could simply be ignored. This would ensure seamless database
> portability.
Having an implementation ignore kwargs is the obvious approach to
providing this feature cross platform. The issue for me is what
exactly the kwargs should be, and how they would be interpreted. It
isn't clear to me how 'index="my_index"' maps to the application of
USE INDEX, IGNORE INDEX or FORCE INDEX in a query, or how the
dictionary syntax would map to the situation where you have two
appearances of a given table in a query.
Yours,
Russ Magee %-)
If you are write heavy, there's another issue that I imagine would
bring significant gains on a MyISAM setup of that proportion... and
possibly other databases to a lesser degree (the locking mechanisms
with MyISAM are different).
Basically, right now the ORM will UPDATE or INSERT based off of a
SELECT query it does when you hit save(). So even if the ORM should
already know it's an UPDATE it will run this logic and these queries.
There's "force_update=True" which I imagine would skip the SELECT, but
I was looking into it because the logic here has some other
side-effects as well.
One of the side effects is that if you change a PK and try to save,
you end up with two rows. I guess the PK probably shouldn't be
mutable, or we can fix that logic and possibly see some gains in
speed.
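A simplified sketch of the decision logic just described (not Django's actual save() code; choose_write is illustrative):

```python
# Simplified sketch -- not Django's actual save() code -- of the write path
# described above: save() does a SELECT first unless told not to.

def choose_write(pk_is_set, row_exists, force_insert=False, force_update=False):
    """Return which statement save() would issue.

    row_exists stands in for the result of the extra SELECT; with
    force_update=True that round-trip is never needed.
    """
    if force_update:
        return "UPDATE"
    if force_insert or not pk_is_set:
        return "INSERT"
    return "UPDATE" if row_exists else "INSERT"

# Change the PK of a loaded object and save: the SELECT on the new PK
# finds no row, so a second row is INSERTed -- the side effect noted above.
print(choose_write(pk_is_set=True, row_exists=False))  # -> INSERT
```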
I have some tests somewhere to try and illustrate how MyISAM locks,
and there might be a couple open tickets relating to side effects...
if there's any interest... kind of waiting until things settle with
the refactor, as I have some ideas I want to try out.
-k
Yes - this is a well known limitation of MyISAM tables. This is a
situation where you need to pick the right tool for the job; if you're
on a write heavy site, MyISAM probably isn't going to be a good
choice.
> Basically, right now the ORM will UPDATE or INSERT based off of a
> SELECT query it does when you hit save(). So even if the ORM should
> already know it's an UPDATE it will run this logic and these queries.
>
> There's "force_update=True" which I imagine would skip the SELECT, but
> I was looking into it because the logic here has some other
> side-effects as well.
>
> One of the side effects is that if you change a PK and try to save,
> you end up with two rows. I guess the PK probably shouldn't be
> mutable, or we can fix that logic and possibly see some gains in
> speed.
If you've got any ideas on how to optimize the object saving process,
we're happy to entertain them -- but keep in mind Django's backwards
compatibility policy. Creating new objects by manually setting the
primary key is a well established idiom. We can't eliminate this
behavior without breaking backwards compatibility.
Yours,
Russ Magee %-)
The problem with sharing a common keyword is that an index hint
specified for Oracle won't be useful (or even legal, necessarily) for
MySQL. If I'm going to ship my reusable app, and I know that one
particular query will perform badly under MySQL, it would be good to
be able to ship with whatever hinting strategy will help. However,
that same hint may not be appropriate for other backends, even if
those backends do support hinting.
Effectively, there's a namespace issue here; we need to be able to
specify "MySQL" hints independently of "Oracle" hints. As a further
complication -- as of Django 1.2, database backends are specified by a fully
qualified import path, so "mysql" isn't necessarily sufficient to identify a
backend (although I'm slightly more comfortable putting this sort of
collision in the category of "acceptable risk" if there's no obvious
way to avoid it).
As for the "+/-" syntax; I'm not wild about it. I'm not an Oracle
expert, but from a quick inspection it appears that there are dozens
of optimizations that could potentially be applied, and none of them
fall into the "USE/FORCE/IGNORE" triptych (many of them don't even
apply to indexes directly). Even on MySQL -- optimizations like "USE
INDEX x FOR ORDER BY" won't be covered by a simple syntax.
I'm not expecting that you will provide a patch that will implement
all possible hints, but I don't want to adopt a syntax that will cut
off future options. I want to have some confidence that it would be
*possible* to implement Oracle hinting or MySQL FOR ORDER BY hinting
if someone were so inclined.
> 2) The multiple table issue is edge case but worth supporting if it
> doesn't complicate the syntax too much. Rather than use models as keys
> we could use the filter/exclude name; eg {'relatedmodel__field':
> 'index'} would apply 'index' specifically to the 'relatedmodel__field'
> join.
Using the filter/exclude name sounds interesting, but skips the first
step of putting an index hint on the origin table. For example,
Book.objects.filter(author__name='foo') -- now put an index hint on
Book; now put one on Author. If Author and Book are in an m2m relation
-- put a hint on the m2m table.
I'd need to see a more complete set of worked examples to see how this
would pan out.
Yours,
Russ Magee %-)
The big problem I see is that hints are simply the wrong approach to
handling this issue, which I do see as an important one.
The SQL optimizer can't work out how you're going to handle the queryset
if all you mention is the filter(). SQL being a set-based language, the
optimizer won't know whether you wish to retrieve 0, 1 or 1 million
rows. In many cases it actively avoids using the index for what it
thinks will be larger retrievals.
The number of rows required from the queryset is an important part of
defining the access path. Hints are only required because we didn't
specify the queryset in sufficient detail for the optimizer to
understand what we wanted.
I would say the correct approach here is to use slicing, since this will
add a LIMIT clause and provide full usage information to the SQL
optimizer about the queryset.
Can I suggest an auto-slice parameter so that we can add LIMIT N onto
the bottom of every query by default, if a slice is not already defined.
That would work for all DBMS.
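A minimal sketch of that idea at the SQL level (auto_slice is hypothetical; a real implementation would operate on the queryset, not the SQL string):

```python
# Hypothetical sketch of the auto-slice idea: append LIMIT N to any query
# that does not already carry one.

def auto_slice(sql, n):
    if " LIMIT " in sql.upper():
        return sql  # caller already sliced; leave the query alone
    return "%s LIMIT %d" % (sql, n)

print(auto_slice("SELECT * FROM X WHERE TwoValues = 1", 10))
# -> SELECT * FROM X WHERE TwoValues = 1 LIMIT 10
```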
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
That's categorically untrue. One of the major functions of an
optimizer is to try to figure out the approximate result size so it
can better establish index vs. data cost.
> The number of rows required from the queryset is an important part of
> defining the access path. Hints are only required because we didn't
> specify the queryset in sufficient detail for the optimizer to
> understand what we wanted.
>
Uhh, querysets don't really generate SQL that's in any way different
from what a normal person would write.
> I would say the correct approach here is to use slicing, since this will
> add a LIMIT clause and provide full usage information to the SQL
> optimizer about the queryset.
>
> Can I suggest an auto-slice parameter so that we can add LIMIT N onto
> the bottom of every query by default, if a slice is not already defined.
> That would work for all DBMS.
>
An auto_slice parameter to what exactly?
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Training and Services
Alex
--
"I disapprove of what you say, but I will defend to the death your
right to say it." -- Voltaire
"The people's good is the highest law." -- Cicero
"Code can always be simpler than you think, but never as simple as you
want" -- Me
Perhaps I need to explain some more, since what I've said is correct.
The optimizer does work out the number of rows it thinks it will access;
whether you retrieve all of those rows is a different and important
issue.
For example, suppose we have a 1 million row table called X with a column
called TwoValues. TwoValues holds two values, value=1 and value=2, and
there are 999,999 rows with value=1. An index is built on TwoValues.
If I then issue this query:
SELECT * FROM X WHERE TwoValues = 1;
then the optimizer will deduce that I will access 999,999 rows out of a
million and its best strategy is to avoid using an index. If I issue the
same query for value 2 then it will retrieve 1 row and hence use the
index.
In most cases the application won't want to bring back all 999,999 rows,
though the optimizer doesn't know that unless we tell it. If we assume
that we actually only want 10 rows, then the query can be rewritten in
this form:
SELECT * FROM X WHERE TwoValues = 1 LIMIT 10;
which will use the index, demonstrating that the optimizer is
designed to offer an appropriate plan when presented with the full info.
Slicing provides the full information for the use case and a hint should
not be required to allow index use.
(The situation is more complex in the case of parameterised prepared
statements and in the case of stale statistics, though neither case is
important here).
True, I don't think it's a really nice API. Why is "index" a command
for Oracle? What happens if some other backend accepts something with
an "index"? Personally, I'd rather see this be an explicit namespace:
with_hints(mysql={...}, oracle={...})
That guarantees that there aren't any cross-database clashes with hint
names, only with database backend names. A backend can choose to look at
the full list of hints and determine which hint set it should use;
potentially, it could even look at another backend's hints and act on
them (although I don't know why this would be helpful, except perhaps
in the case of a custom MySQL backend that extends the base
capabilities).
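A sketch of how a backend could consume hints under that proposed (hypothetical) with_hints(mysql={...}, oracle={...}) API:

```python
# Sketch of backend-side hint consumption under the proposed, hypothetical
# namespaced with_hints() API. Each backend picks out its own hint set
# and ignores the rest, so unsupported hints are harmless.

def hints_for_backend(backend_name, hints):
    """Return the hint set addressed to this backend, if any."""
    return hints.get(backend_name, {})

hints = {
    'mysql': {'index': 'book_title_idx'},
    'oracle': {'hash': True, 'index': 'BOOK_IDX'},
}
print(hints_for_backend('mysql', hints))       # -> {'index': 'book_title_idx'}
print(hints_for_backend('postgresql', hints))  # -> {}
```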
> Next --
>
>> Using the filter/exclude name sounds interesting, but skips the first
>> step of putting an index hint on the origin table. For example,
>> Book.objects.filter(author__name='foo') -- now put an index hint on
>> Book; now put one on Author. If Author and Book are in an m2m relation
>> -- put a hint on the m2m table.
>
> How about --
>
> {'book': 'book_idx', 'author': 'author_idx', 'author__name':
> 'm2m_idx'}
>
> Or, if you want to cover absolutely every possibility --
>
> {'book': 'book_idx', 'author__name__author': 'author_idx',
> 'author__name': 'm2m_idx'}
>
> That is, require indexes on the right side of joins to be named
> explicitly, just in case Author is used in another join.
>
> Sounds complicated, but obviously edge case. Common usage will be
> simple, eg {'book':'book_idx'}.
>
> Let me know what you think.
What about when the query constructs multiple joins on a related
table? How does the hint get named and applied then? Is it a
consistent "a join between A and B will always get hint X"? Or do
different joins need different hints?
Yours,
Russ Magee %-)
How about a raw_hints() or something that expects you to insert
additional SQL code that won't affect the result so it can still
return a QuerySet that supports len(), slicing, etc...?
You mean extra() ;)
I don't follow your logic here.
If I'm writing a reusable app, I define some models, and I define the
views that use those models. Those views issue queries. Therefore, I
know both the tables that are involved, and the queries that will be
issued on those tables. If I'm really tuning things, I might even be
using custom SQL, shipped with my project, to create custom indices
for those tables.
Since I'm in control of the table and how the table is used, I should
also be in a position to provide any query plan hints that will assist
the known usage patterns.
If I'm distributing my application to many people, different people in
my audience will be using different databases. That puts me in a
position of (potentially) having to provide hints for multiple
databases as part of my shipped application.
Therefore, there is a namespace problem. I need MySQL hints to be
applied for users that have MySQL. I need Oracle hints to be applied
for users that have Oracle, ignoring the MySQL hints. I need Postgres
to ignore any hint that is applied.
On top of that, there's a possibility that a hint name (and value
format) for one database might clash with another. I don't know of any
specific examples of overlap, but it isn't hard to manufacture one
using generic hinting keywords like "index = foo".
> Any core implementation for supporting this
> should be minimalist and only provide a mechanism for passing some
> kind of data from the QuerySet to the SQLCompiler that each backend
> can optionally choose to do something with it.
>
> I would be thrilled to be able to do use
>
> with_hints('HASH GROUP', 'FAST 10')
>
> and have it set
>
> Query.query_hints = ['HASH GROUP', 'FAST 10']
>
> I'm currently patching in support so I can do something like this for
> django-mssql.
And that's the proposal on the table, except that you would need to
specify that this is a MSSQL hint, since the hint you're making isn't
of any use to a user of any database other than MSSQL.
Yours,
Russ Magee %-)
A reasonable objection. The problem is that a query is contained in
code; if I (as an end-user) discover that a particular query needs to
be hinted, I don't have an entry point.
Putting a with_hint() keyword into the query provides that entry
point, but then means that we need to be able to account for the case
where the same app is deployed under MySQL and Postgres. A namespace
is the best way I can think of to achieve this.
Of course, I'm open to any other suggestion for how we can separate
the location where the hint is required from the actual hint
definition.
>> And that's the proposal on the table, except that you would need to
>> specify that this is a MSSQL hint, since the hint you're making isn't
>> of any use to a user of any database other than MSSQL.
>
> I'm assuming the current proposal will allow for arbitrary namespaces
> that are defined simply by a backend willing to look for the specific
> name. If not, there would need to be discussion about which namespaces
> are defined by the core and how unsupported backends would register
> new namespaces.
Correct. The current proposal makes the namespace the last part of the
backend's import path (e.g., mysql for the django.db.backends.mysql
backend). This does have the potential for collisions if you write a
custom mysql backend in a different location. However, there isn't a
huge population of backends, so the practical likelihood of collision
is relatively small.
Yours,
Russ Magee %-)