Django, initial data and custom SQL

600 views
Skip to first unread message

Johan Bergström

unread,
Feb 10, 2009, 7:13:12 AM2/10/09
to Django developers
Hello,

I would like to suggest (patch will follow if someone concurs) that
custom sql is executed after Django has created indexes.

Django is (in my opinion) a bit optimistic regarding index creation,
and by looking at pg_stat_* output I see that at least a couple of
indexes on busy tables hasn't been used - ever. The obvious thing to
do is to drop these. Next step would be to sync this with my test and
staging environment so I can run tests against my data/application and
verify that something hasn't broken - and here's basically where it
gets interesting.

Since Django executes my custom SQL before creating indexes, it's
impossible to achieve something that hooks into initdb/syncdb. I know
that it is "good custom" to create indexes after inserting data – but
fixtures in Django is already executed after creating indexes, so that
can't be the reason.. So, without further ado, what I would like to
know is if there's a reason to why custom SQL is executed before index
creation.

Kind regards,
Johan bergström

ludvig.ericson

unread,
Feb 10, 2009, 10:51:38 AM2/10/09
to Django developers
On Feb 10, 1:13 pm, Johan Bergström <b...@bergstroem.nu> wrote:
> Since Django executes my custom SQL before creating indexes, it's
> impossible to achieve something that hooks into initdb/syncdb. I know
> that it is "good custom" to create indexes after inserting data – but
> fixtures in Django is already executed after creating indexes, so that
> can't be the reason.. So, without further ado, what I would like to
> know is if there's a reason to why custom SQL is executed before index
> creation.

Isn't this doable with initial SQL?
http://docs.djangoproject.com/en/dev/howto/initial-data/#initial-sql

Testing here with SQLite, it'd seem it runs the custom SQL at the very
last point, so you could actually add some ALTER TABLE statements, I
guess. Again, this is testing with SQLite, and SQLite doesn't do
indexing.

Maybe I misunderstood?

- Ludvig

Johan Bergström

unread,
Feb 10, 2009, 11:07:06 AM2/10/09
to Django developers
Hey,

On Feb 10, 4:51 pm, "ludvig.ericson" <ludvig.eric...@gmail.com> wrote:
> On Feb 10, 1:13 pm, Johan Bergström <b...@bergstroem.nu> wrote:
>
> > Since Django executes my custom SQL before creating indexes, it's
> > impossible to achieve something that hooks into initdb/syncdb. I know
> > that it is "good custom" to create indexes after inserting data – but
> > fixtures in Django is already executed after creating indexes, so that
> > can't be the reason.. So, without further ado, what I would like to
> > know is if there's a reason to why custom SQL is executed before index
> > creation.
>
> Isn't this doable with initial SQL?http://docs.djangoproject.com/en/dev/howto/initial-data/#initial-sql
>
> Testing here with SQLite, it'd seem it runs the custom SQL at the very
> last point, so you could actually add some ALTER TABLE statements, I
> guess. Again, this is testing with SQLite, and SQLite doesn't do
> indexing.

Actually it doesn't. I think you just did a reset/sqlall instead of
sync/initdb:

# cat settings.py | grep DATABASE_E
DATABASE_ENGINE = "sqlite3"

# python manage.py syncdb
<snip>
Creating table testapp_message
Creating table testapp_avatar
Installing custom SQL for testapp.Message model
Failed to install custom SQL for testapp.Message model: no such index:
testapp_message_avatar_id
Installing index testapp.Message model
<snip>
Installing json fixture 'initial_data' from '<snip>/fixtures'.

As you most likely can tell from above, sql/message.sql contains a
"drop index ..." operation.

(nitpick: SQlite has indexes - you could of course argue their
effectiveness :-)


>
> Maybe I misunderstood?
>

Perhaps I should've been more verbose :-) Thanks for your input
though.

> - Ludvig

Regards,
Johan Bergström

Johan Bergström

unread,
Feb 11, 2009, 9:28:37 AM2/11/09
to Django developers
I took the liberty of creating a ticket with attached patch at:
http://code.djangoproject.com/ticket/10236

Thanks,
Johan Bergström

Ludvig Ericson

unread,
Feb 12, 2009, 12:54:17 AM2/12/09
to django-d...@googlegroups.com
Feb 11, Johan Bergström:

> I took the liberty of creating a ticket with attached patch at:
> http://code.djangoproject.com/ticket/10236


I fail to see how "it has consequences for existing code", as Russell
put it.

I did discuss this with Bergström, and we came to the conclusion that
it won't actually break much code, if any.

The only case which it could break, AFAICT, is if custom SQL manages
to depend on the absence of indexes. I guess that could break code
that violates indexing constraints, which are applied later, maybe? I
don't know.

- Ludvig

Russell Keith-Magee

unread,
Feb 12, 2009, 1:48:09 AM2/12/09
to django-d...@googlegroups.com
On Thu, Feb 12, 2009 at 2:54 PM, Ludvig Ericson
<ludvig....@gmail.com> wrote:
>
> Feb 11, Johan Bergström:
>> I took the liberty of creating a ticket with attached patch at:
>> http://code.djangoproject.com/ticket/10236
>
> I fail to see how "it has consequences for existing code", as Russell
> put it.

It has consequences because you are proposing to change the order in
which indexes and custom SQL are applied. Any code that depends on the
existing order will be affected.

> I did discuss this with Bergström, and we came to the conclusion that
> it won't actually break much code, if any.

If it has the potential to break _any_ code, it is an unacceptable
change. The Django core developers have stated that we will maintain
backwards compatibility of the v1.0 interface, so any change with even
the _potential_ to affect backwards compatibility will need to be
checked very carefully.

However, the larger point is that you don't get to make that decision
about whether a change is acceptable or not. You can discuss a change
and make a recommendation, but ultimately the decision needs to be
made by a core developer. The Design Decision Required triage state
exists for exactly this reason.

> The only case which it could break, AFAICT, is if custom SQL manages
> to depend on the absence of indexes. I guess that could break code
> that violates indexing constraints, which are applied later, maybe? I
> don't know.

This is what needs to be confirmed, and given that it is a non-trivial
change, your decision needs to be confirmed by a core developer. I am
willing to be convinced, but you will need to prove to me that there
is no backwards compatibility problem here. The discussion in this
thread so far hasn't done that.

To answer the original question (why is it done in this order) -
mostly historical reasons. Originally, Django didn't have fixtures, so
initial data were all loaded from initial SQL. It's generally faster
to insert data and then add indicies, hence the order. When Django
added fixtures, the order wasn't changed.

Yours,
Russ Magee %-)

Ludvig Ericson

unread,
Feb 12, 2009, 2:49:47 AM2/12/09
to django-d...@googlegroups.com
On Feb 12, 2009, at 07:48, Russell Keith-Magee wrote:
> On Thu, Feb 12, 2009 at 2:54 PM, Ludvig Ericson
> <ludvig....@gmail.com> wrote:
>>
>> I fail to see how "it has consequences for existing code", as Russell
>> put it.
>
> It has consequences because you are proposing to change the order in
> which indexes and custom SQL are applied. Any code that depends on the
> existing order will be affected.

Just want to note that *I'm* not proposing any changes.

>> I did discuss this with Bergström, and we came to the conclusion that
>> it won't actually break much code, if any.
>
> If it has the potential to break _any_ code, it is an unacceptable
> change. The Django core developers have stated that we will maintain
> backwards compatibility of the v1.0 interface, so any change with even
> the _potential_ to affect backwards compatibility will need to be
> checked very carefully.
>
> However, the larger point is that you don't get to make that decision
> about whether a change is acceptable or not. You can discuss a change
> and make a recommendation, but ultimately the decision needs to be
> made by a core developer. The Design Decision Required triage state
> exists for exactly this reason.

I don't think it's "the larger point." I misunderstood the triaging
system, and incorrectly changed it. Bennett reverted. I apologized.
End of that.

What I meant by "ready for checking" was that the patch applies
cleanly, runs, and does what it is intended to do. And obviously that
was mistake, and again: I realize this now.

>> The only case which it could break, AFAICT, is if custom SQL manages
>> to depend on the absence of indexes. I guess that could break code
>> that violates indexing constraints, which are applied later, maybe? I
>> don't know.
>
> This is what needs to be confirmed, and given that it is a non-trivial
> change, your decision needs to be confirmed by a core developer. I am
> willing to be convinced, but you will need to prove to me that there
> is no backwards compatibility problem here. The discussion in this
> thread so far hasn't done that.

I realize that my opinions and decisions play a very small role in the
development of Django, but I try to express them so that core
developers may read them, and either think "what a retard", or "fair
point."

I guess the question really is: is this custom SQL execution order
change really the right fix for the issue? I can't say I'm convinced
that it is, given your next paragraph. I would suggest the need for
signaling that a custom SQL file should execute post-indices.

One solution, which is entirely backwards-compatible, would be to say
that "if you want your custom SQL to run after indices creation, name
files '*.post.sql,'" or something like that. That has its own issues,
but you get the general idea.

> To answer the original question (why is it done in this order) -
> mostly historical reasons. Originally, Django didn't have fixtures, so
> initial data were all loaded from initial SQL. It's generally faster
> to insert data and then add indicies, hence the order. When Django
> added fixtures, the order wasn't changed.

I see, so the custom SQL machinery grew out of being a crude version
of fixtures then.

I think the cause of confusion here is that Bergström and I both were
expecting it to be for more than loading data -- I want to create a
view in it (hence my interest), and he wants to drop some excessive
indices.

After pondering some more, I realized the docs actually say "initial
SQL data", and so I'm not sure if this change is actually a good idea
or not.

- Ludvig

Johan Bergström

unread,
Feb 12, 2009, 3:21:22 AM2/12/09
to Django developers
Hello folks,
I also realize that the docs says "initial data" but since fixtures
nowadays fills this gap, I think the documentation example (insert
into...) is misleading in that it recommends yet another way to insert
into your database. It is also misleading in that it says that
"there's a hook" when there's actually two (see
http://code.djangoproject.com/browser/django/trunk/django/core/management/commands/syncdb.py#L98
and forward)

As I stated in the ticket (and Russell above), databases generally
want to throw indexes at the bottom of the equation, which now both
custom SQL and fixtures is at. Changing this most likely will break
some poor guy's code (which _I'm_ perfectly OK with, since they in
that case most likely should update it anyway) but understand the Way
of Django and fully respect that. This patch is the result.

Just for reference, this is what I think should be done for an optimal
workflow:
* create table
* create m2m
* run custom data
* run fixtures
* run indexes
* emit "database done" signal

This way, you can still insert your favorite SQL with custom data, use
fixtures for data, then for advanced users - remove indexes with their
hooks.

I know this (kind of) retracts my above patch - but you generally want
to throw in functions before inserting data (think pgmemcached), and
the above behavior is way too untested (at my place, anyway) to be
sent as a real patch.

>
> - Ludvig

Thanks for your time,
Johan

Russell Keith-Magee

unread,
Feb 12, 2009, 5:53:28 AM2/12/09
to django-d...@googlegroups.com
On Thu, Feb 12, 2009 at 4:49 PM, Ludvig Ericson

<ludvig....@gmail.com> wrote:
>
> On Feb 12, 2009, at 07:48, Russell Keith-Magee wrote:
>> On Thu, Feb 12, 2009 at 2:54 PM, Ludvig Ericson
>> <ludvig....@gmail.com> wrote:
>>>
>>> I fail to see how "it has consequences for existing code", as Russell
>>> put it.
>>
>> It has consequences because you are proposing to change the order in
>> which indexes and custom SQL are applied. Any code that depends on the
>> existing order will be affected.
>
> Just want to note that *I'm* not proposing any changes.

So noted :-)

> I don't think it's "the larger point." I misunderstood the triaging
> system, and incorrectly changed it. Bennett reverted. I apologized.
> End of that.

No problems.

> I realize that my opinions and decisions play a very small role in the
> development of Django, but I try to express them so that core
> developers may read them, and either think "what a retard", or "fair
> point."

This is exactly what we want (and need) to happen. We (the core devs)
rely upon community members such as yourself to have these
discussions, taking some of the load off of us so we can concentrate
on having smaller discussions where many of the details have already
been sorted out, making decisions and committing code. In this case,
there was a misunderstanding about the best way to communicate the
result of your discussions, but we've resolved that problem.

> One solution, which is entirely backwards-compatible, would be to say
> that "if you want your custom SQL to run after indices creation, name
> files '*.post.sql,'" or something like that. That has its own issues,
> but you get the general idea.

This was my immediate thought, and yes, it would be backwards
compatible. My immediate counter-reaction is that you would require
several configuration points (post table creation, post index
creation, post data insertion), which means a lot of
difficult-to-explain configuration points.

>> To answer the original question (why is it done in this order) -
>> mostly historical reasons. Originally, Django didn't have fixtures, so
>> initial data were all loaded from initial SQL. It's generally faster
>> to insert data and then add indicies, hence the order. When Django
>> added fixtures, the order wasn't changed.
>
> I see, so the custom SQL machinery grew out of being a crude version
> of fixtures then.

Correct.

> I think the cause of confusion here is that Bergström and I both were
> expecting it to be for more than loading data -- I want to create a
> view in it (hence my interest), and he wants to drop some excessive
> indices.

To be clear - I fully acknowledge your use case. There are many
occasions where you need to invoke custom SQL, and the decision to
execute that SQL pre or post index installation is significant. I
would very much like to be able to support this use case - we just
need to find the right way to do it.

> After pondering some more, I realized the docs actually say "initial
> SQL data", and so I'm not sure if this change is actually a good idea
> or not.

There shouldn't be anywhere talking about 'initial SQL data' - I
thought I purged all of those references. If you find any such
references, they should be referring to "custom SQL", or similar
phrasing.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Feb 12, 2009, 5:54:13 AM2/12/09
to django-d...@googlegroups.com
On Thu, Feb 12, 2009 at 5:21 PM, Johan Bergström <bu...@bergstroem.nu> wrote:
>
>> After pondering some more, I realized the docs actually say "initial
>> SQL data", and so I'm not sure if this change is actually a good idea
>> or not.
>
> I also realize that the docs says "initial data" but since fixtures
> nowadays fills this gap, I think the documentation example (insert
> into...) is misleading in that it recommends yet another way to insert
> into your database. It is also misleading in that it says that
> "there's a hook" when there's actually two (see
> http://code.djangoproject.com/browser/django/trunk/django/core/management/commands/syncdb.py#L98
> and forward)

As I noted in my message to Ludvig, any reference to initial SQL
should be removed (and I thought they already had been). SQL hooks
_can_ be used for inserting data, but that's not the preferred option.

> As I stated in the ticket (and Russell above), databases generally
> want to throw indexes at the bottom of the equation, which now both
> custom SQL and fixtures is at. Changing this most likely will break
> some poor guy's code (which _I'm_ perfectly OK with, since they in
> that case most likely should update it anyway) but understand the Way
> of Django and fully respect that. This patch is the result.

> Just for reference, this is what I think should be done for an optimal
> workflow:
> * create table
> * create m2m
> * run custom data
> * run fixtures
> * run indexes
> * emit "database done" signal

I agree that if we were starting with a clean slate, this would be
close to the right way to do things (and, if isn't that far from the
way things are done right now). My only issue would be with moving the
signal to the end of the process. A "no, really, I'm finished" signal
was requested some time ago:

http://code.djangoproject.com/ticket/7561

Interestingly, removing superfluous indicies was one of the use cases
provided in support of this change. At the time it was rejected
because the specific proposal was backwards compatible (and was made
during the v1.0 push). The post_syncdb hook is commonly used for data
insertion, and the current location is the right place for that, but
having a second signal is a definite possibility.

Moving the fixture loading to pre-indexing is also a possibility - I
can't think of a reason this would cause backwards-incompatibility
problems (and for large fixtures, it should be marginally faster).

Yours,
Russ Magee %-)

Ludvig Ericson

unread,
Feb 12, 2009, 10:02:05 AM2/12/09
to django-d...@googlegroups.com
On Feb 12, 2009, at 11:53, Russell Keith-Magee wrote:
> On Thu, Feb 12, 2009 at 4:49 PM, Ludvig Ericson
>> One solution, which is entirely backwards-compatible, would be to say
>> that "if you want your custom SQL to run after indices creation, name
>> files '*.post.sql,'" or something like that. That has its own issues,
>> but you get the general idea.
>
> This was my immediate thought, and yes, it would be backwards
> compatible. My immediate counter-reaction is that you would require
> several configuration points (post table creation, post index
> creation, post data insertion), which means a lot of
> difficult-to-explain configuration points.

Hm, yes... It's true it would get pretty convoluted quickly. The best
approach I can think of is making it flexible, something like (just
throwing this out there):

class M(models.Model):
...
class Meta:
custom_sql = {"drop_indices": ("post-indices", "pre-
inserts")}

Or something like that, and you could build a dependency tree out of
that, running the SQL at certain points in the code. It has other
issues (like being overkill), and it's not really a proposal I like.

Instead, I think what you say in your next e-mail is a better idea:
provide the developers with the tools (like signals) to make these
complicated usecases doable, with a dash of Python code. Then Django
could provide a simple function for saying "execute SQL in file X."

def drop_indices(sender=None, **kwargs):
load_sql(sender, "drop_indices.sql")

post_something.connect(drop_indices)

(Though this usecase would probably do better with inlining the SQL as
I doubt it would span that many lines.)

However, as noted, dropping indices isn't *that* uncommon. I think
it'd be a good idea to support dropping indices with the ease of
adding a file - or even specifying that in the model. (Like
`ForeignKey(no_index=True)`.)

>> I think the cause of confusion here is that Bergström and I both were
>> expecting it to be for more than loading data -- I want to create a
>> view in it (hence my interest), and he wants to drop some excessive
>> indices.
>
> To be clear - I fully acknowledge your use case. There are many
> occasions where you need to invoke custom SQL, and the decision to
> execute that SQL pre or post index installation is significant. I
> would very much like to be able to support this use case - we just
> need to find the right way to do it.
>
>> After pondering some more, I realized the docs actually say "initial
>> SQL data", and so I'm not sure if this change is actually a good idea
>> or not.
>
> There shouldn't be anywhere talking about 'initial SQL data' - I
> thought I purged all of those references. If you find any such
> references, they should be referring to "custom SQL", or similar
> phrasing.

As linked earlier:
http://docs.djangoproject.com/en/dev/howto/initial-data/#initial-sql

This whole subsection (and the section intro) speaks of it in terms of
initial SQL, and is within that context (initial data).

A quick googling reveals two files which still have this reference:
* http://docs.djangoproject.com/en/dev/howto/initial-data/
* http://docs.djangoproject.com/en/dev/ref/settings/

- Ludvig

Reply all
Reply to author
Forward
0 new messages