Optional operator index discussion

195 views
Skip to first unread message

Eric Floehr

unread,
Jul 18, 2012, 2:19:29 PM7/18/12
to django-d...@googlegroups.com
I'd like to open up a discussion on the possibilities of having a way to optionally specifying not to create operator indexes on CharField's when db_index=True.  Based on the consensus from this discussion, I'll open up a ticket and if it is within my abilities, generate a patch.

For background, ticket #12234 (https://code.djangoproject.com/ticket/12234) resulted in the creation of a second index for all CharField's and TextField's when db_index=True to enable LIKE queries to work as they should.

However, there are many use cases where a CharField index is needed but adding a varchar_pattern_ops index (in PostgreSQL) results in a performance and storage space hit.  In my case, The storage space difference was 550GB with varchar_pattern_ops indices and 300GB without.  I don't have an exact statistic on the drop in insert speed, but it was noticeable.

In my case, these varchar fields are of small width, generally 1 to 4 characters, and indexing is important on the complete field.  However, it will never be the case that LIKE will be used to query for partial matches, so LIKE query speed isn't an issue, and an operator index is a performance/storage hit that isn't justified.

I am working around the problem now with a custom Field class, but it seems to me that this is a feature that others may benefit from and wanted to solicit feedback and ideas for if it should be an option, and if so, what form it should take.

Thanks much,
Eric

Russell Keith-Magee

unread,
Jul 18, 2012, 9:00:34 PM7/18/12
to django-d...@googlegroups.com
Broadly speaking -- yes, sounds interesting; the trick will be coming
up with an API that *isn't* PostgreSQL specific.

Off the top of my head, I would look at this problem as the problem of
configuring the types of index that are to be created. db_index is
currently treated as a boolean yes/no; if you say yes, it creates an
index (or, in the case of PostgresSQL, indices); if you say no, it
doesn't. To me, what you're talking about is turning a True/False
option into something that can be explicitly configured (with some
sort of fallback so that "True" is the default index collection).

This would need to be functionally driven -- e.g., "I want to add an
index that allows for partial matches", or "I want to add a
case-insensitive index" -- not exposing literal database syntax or
options. On some databases, some of these "functions" would be no-ops,
or subsumed by other index functions (so MySQL, for example, will only
create one index on CharFields).

There is an analogous ticket around dealing with adding
composite/multicolumn indices:

https://code.djangoproject.com/ticket/5805

I don't know if there's any potential for overlap when it comes to the
API here, but it might be worth exploring.

Yours,
Russ Magee %-)

Eric Floehr

unread,
Aug 12, 2012, 11:50:08 AM8/12/12
to django-d...@googlegroups.com
Hi Russ,

Sorry for the delayed response. Just finished up chairing PyOhio and got through my work backlog.

Broadly speaking -- yes, sounds interesting; the trick will be coming
up with an API that *isn't* PostgreSQL specific.

Definitely, and one that doesn't complicate things for the 99% case.

 
Off the top of my head, I would look at this problem as the problem of
configuring the types of index that are to be created. db_index is
currently treated as a boolean yes/no; if you say yes, it creates an
index (or, in the case of PostgresSQL, indices); if you say no, it
doesn't. To me, what you're talking about is turning a True/False
option into something that can be explicitly configured (with some
sort of fallback so that "True" is the default index collection).

That was my initial thought, but there is something elegant about db_index=True.  If it turns into more than that, it's become more complex for the 99% of the time you just want db_index=True.  The only time (today) that db_index would not be a boolean is if:

1. You are using a CharField
2. You are using PostgreSQL
3. You don't want the varchar_pattern_ops index created.

That's a really small subset.  Certainly there are probably going to be in the future other indexes potentially added to various types, but it seems to me it's always going to be a corner case.  It's a corner case I run into a lot, but I'm not egocentric enough to apply that to a general case :-).


This would need to be functionally driven -- e.g., "I want to add an
index that allows for partial matches", or "I want to add a
case-insensitive index" -- not exposing literal database syntax or
options. On some databases, some of these "functions" would be no-ops,
or subsumed by other index functions (so MySQL, for example, will only
create one index on CharFields).

There is a mechanism today to add custom indexes on fields that is *database-specific*, and that's custom SQL via the sql/<modelname>.sql structure.

The nice thing about your functionality-driven approach is it takes a *db-agnostic* approach to index creation.  None of the ORM's I've ever dealt with do anything more with index creation than the simple index, the foreign key index, or the unique/unique together index/constraint.  Doing something like this would certainly be novel, but that likely also means a lot of work. It would also require a lot of design thought, as to my knowledge there isn't an ORM that has already blazed this trail.


There is an analogous ticket around dealing with adding
composite/multicolumn indices:

https://code.djangoproject.com/ticket/5805

I don't know if there's any potential for overlap when it comes to the
API here, but it might be worth exploring.

That is definitely interesting, and is a nice feature.  I'm not sure how to apply it to this situation without making db_index=True more complex.  The issue here is that Django is creating an index that is beyond the normal column index for CharFields in PostgreSQL and there isn't an in-Django way to override that behavior.  The extra index is getting created down in the Django PostgreSQL driver layer based on the column type.  Kind of like how you can create custom Model managers today.  Right now however, there is no easy way to extend or override the database driver layer.

In summary, I would say these are the possible approaches in rough order of complexity:

1. Don't specify db_index=True for CharFields where you don't want varchar_pattern_ops index created but want a regular index createrd.  Instead, create the regular index in sql/<modelname>.sql code.  This doesn't require any change to Django code.

2. Have some mechanism in Python code (via custom classes/overrides) that would allow one to override the default Django PostgreSQL layer creating the extra index.  With a custom "index creator" (not within Django, but written by the user, like a custom Model manager) could fine tune what/how indexes get created and even possibly change the create index statements to be specific to different databases.  In that case, you could use db_index=True, could make decisions based on db type, etc. which you can't with sql/<modelname>.sql code, and everything remains in Python code.

3. Create additional options to the Column's db_index parameter to allow for fine tuning of index creation.  Right now, the *only* fine tuning that makes sense (i.e. the only time more than one index is created when db_index=True) is PostgreSQL CharField's.

4. Extend the Django ORM to have an index creation capability in much the same manner as tables and columns are created.  This would provide a db-agnostic way to create indexes that have specific properties (case insensitive, multi-column, partial, etc.) and would generate CREATE INDEX statements that are specific to the underlying db (or not create if the db doesn't support).  This would be functionality not common in ORMs today, and Django would be a trail-blazer in this area.


Are there any I'm missing?  What seems like the most viable direction to take?

Thanks!
Eric

Reply all
Reply to author
Forward
0 new messages