Proposal on Custom Indexes - Google Summer of Code

494 views
Skip to first unread message

akki

unread,
Mar 8, 2016, 9:46:02 AM3/8/16
to Django developers (Contributions to Django itself)
Hi

My name is Akshesh Doshi (akki). I am a student at Indian Institute Of Technology, Roorkee (IITR). I have been contributing to Django for quite some time now and my experience has been really great till now. I found the community to be very welcoming and have learnt a lot in this period of time.

With this spirit I would like to work on the idea of "Custom Indexes"  and extend it as my proposal for Google Summer of Code 2016.

I have started preparing my proposal and here is the initial draft of it. I would like to hear your thoughts regarding this. Also I wanted to discuss some points mentioned below. The timeline still needs work as I am still digging into the code of expressions to see how we can use them in FunctionalIndex. Any pointers/thoughts on that would be appreciated.

Key points:
  - Introduction to classed based indexes.
  - Support for custom classes to db_index.
  - Introduction of Meta.indexes.
  - Allowing fields to specify their own index type.
  - Support for a indexes/constraints API.
  - Extend expressions into indexes.
  - Bring spatial indexes under the hood.

Points I would like to discuss:
 1) Would it be right to create a IndexTogether class (subclass of Index) and internally translate Meta.index_together to it ?
      This would allow something like -
          class Meta:
              indexes = [IndexTogether(['field1', 'field2'])]
    I think this would let us keep a better track of any indexes (including those by `index_together`) that are being created. `Meta.index_together` would be internally translated to Meta.indexes. This might also be followed by deprecation of `index_together` if we want.

 2) Handling Unique constraints via indexes.
      For handling of constraints, I was thinking of creating a UniqueIndex class which would handle any unique constraint on any column and allow other options like to make it deferrable.
      Right now fields with ``unique=True`` apply the unique constraints by using both the UNIQUE constraint in the CREATE TABLE statement and by creating an index for it. For example, in this model, multiple (repeating) indexes are being generated for the `name` field (one by UNIQUE constraint, 2 others manually, on postgresql). This takes more space and is also not good performancewise. This situation can also be mitigated by keeping a track of all unique constraints at only one place.
      So I was thinking of bringing the unique constraint totally under indexes. Maybe some `models.UniqueIndex()` could be used in meta.indexes to add constraints ? Any thoughts on it ?


I had also prepared a DEP (based on an earlier DEP by Marc Tamlyn), which is now a subset of this proposal.

Regards
Akshesh Doshi
(akki)

Josh Smeaton

unread,
Mar 8, 2016, 9:20:16 PM3/8/16
to Django developers (Contributions to Django itself)
Hi Akshesh,

The proposal looks really good so far. I haven't gone through it in depth though, but I'll set aside some time to try and do so. A few comments:

- Using the schema editor rather than sqlcompiler is the right choice. 

- The earlier part of your proposal seems to focus on allowing postgres users to define more complex indexes that they previously haven't been able to use. Oracle is then highlighted also for allowing functional indexes. I think one of the biggest "why's" of this project is to allow users to define custom indexes. It doesn't matter what kind of index. They get access to the stuff that creates them. That should be highlighted before all else.

- Should the type of index (BTree, Spatial, etc) be an index type in and of itself, or should it be a property of a specific index? My knowledge here is limited, so I'm not sure if there are certain constraints on which kind of indexes can use spatial/btree/partial storage types. If certain index types are very heavily restricted, then it probably makes sense to highlight them (as you've done) as a top level class rather than an attribute of indexes in general. If most indexes can be arranged with different storage types though, it'd probably make more sense to expose that functionality as an attribute of all indexes.

- "Index() subclasses will also have a supports(backend) method which would return True or False depending on whether the backend supports that index type." Backends (operations.py) already supports check_expression_support(self, expression). You can bake index support into these methods from within each backend. Especially helpful for third party backends, as they won't need to monkey patch the index classes. The only issue I can see with reusing check_expression_support is the overhead of all other expressions being checked. I don't think that's a big concern, but it's worth being conscious of.

- "Index classes will take 3 arguments fields, model and name - the later two being optional.". Index classes should accept a list of expressions, not just field names. This will allow trivial support of functional indexes without a specific functional index class. You'll also benefit from existing expression subclasses which process field names as F() expressions. F() expressions may have to be changed (or a new concept introduced) which can resolve the field, but doesn't try to do any joining.

fullname_index=Index(ToLower('first_name'), ToLower('last_name')) -> create index <table_name>_fullname_index on <table_name>(Lower(first_name), Lower(last_name));

Once the index is added to a model, the model can inject itself into the index, or the schema editor can use the originating model to get data it needs. No need to pass the model into the index type I think. Regarding the name, I'm unsure if I prefer a name=Index() or an Index(name='') approach. It'd be good to see that choice in your proposal with a couple of pros and cons.


Ok, onto the questions you actually asked.

1) I don't think it's necessary to have an IndexTogether type. Index('field_a', 'field_b') should be sufficient I think. If it's not sufficient, call out why. I'm also in favour of translating any index_together definitions into appropriate Index types. Then SchemaEditor only needs to work with Index expressions and things remain simpler at the boundary between model definition and migrations.

2) Again I think it's a good idea to internally create index classes based on older (unique=True) definitions. Unique may not have to have it's own index type as it could (maybe) be a property of an Index. Again, I'll leave that choice up to you, but it'd be good to see the pros/cons briefly discussed.

Again, really good proposal!

Anssi Kääriäinen

unread,
Mar 9, 2016, 2:14:43 AM3/9/16
to django-d...@googlegroups.com
If the CREATE INDEX part of the generated SQL isn't hard-coded
somewhere outside the Index class, the exact same mechanism can be
used to run other DDL commands, like ALTER TABLE <foobar> ADD
CONSTRAINT check_type_choices CHECK (type IN ('foo', 'bar'));

Of course, this is a great thing to have. But I wonder if we can do
something to prevent pushing non-indexes to the indexes attribute of
Meta. Having Meta.indexes = [CheckTypeConstraint(foo__in=['foo',
'bar'])] doesn't read correctly.

- Anssi
> --
> You received this message because you are subscribed to the Google Groups
> "Django developers (Contributions to Django itself)" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-develop...@googlegroups.com.
> To post to this group, send email to django-d...@googlegroups.com.
> Visit this group at https://groups.google.com/group/django-developers.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-developers/c2245bd3-b05d-435c-b8df-bbe652a5f63b%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.

akki

unread,
Mar 9, 2016, 4:16:10 PM3/9/16
to Django developers (Contributions to Django itself)

 Hi Josh and Anssi,

 Thanks for the feedback. Much appreciated.



- The earlier part of your proposal seems to focus on allowing postgres users to define more complex indexes that they previously haven't been able to use. Oracle is then highlighted also for allowing functional indexes. I think one of the biggest "why's" of this project is to allow users to define custom indexes. It doesn't matter what kind of index. They get access to the stuff that creates them. That should be highlighted before all else.

Looks like with all the new ideas that could be incorporated into this proposal, the main point of motivation for this concept got a bit hazy in the proposal. I'll rectify that.
 

- Should the type of index (BTree, Spatial, etc) be an index type in and of itself, or should it be a property of a specific index? My knowledge here is limited, so I'm not sure if there are certain constraints on which kind of indexes can use spatial/btree/partial storage types. If certain index types are very heavily restricted, then it probably makes sense to highlight them (as you've done) as a top level class rather than an attribute of indexes in general. If most indexes can be arranged with different storage types though, it'd probably make more sense to expose that functionality as an attribute of all indexes.

Seeing the syntax for creating indexes on various types of databases, I think it would be possible to make a generic Index class (type of index {Spatial, Unique} and method {BTree, Hash} being properties of this class). There are no major restrictions (the biggest I can think of being multi-column indexing cannot use Hash or R-tree) on usage of methods by different indexes.
But I don't want the user to be hardcoding strings like "hash" in the kwargs. Maybe something like Index(['field1', 'field2'], method=models.HashIndex) should be good. In that case Hash, BTree won't be subclasses of Index.
 

- "Index() subclasses will also have a supports(backend) method which would return True or False depending on whether the backend supports that index type." Backends (operations.py) already supports check_expression_support(self, expression). You can bake index support into these methods from within each backend. Especially helpful for third party backends, as they won't need to monkey patch the index classes. The only issue I can see with reusing check_expression_support is the overhead of all other expressions being checked. I don't think that's a big concern, but it's worth being conscious of.

I remember shaib had once raised this concern for third-party backends somewhere, don't know how I missed addressing this issue in the proposal.
Anyways, I see and I agree that we can use check_expression_support for that. Also regarding the overhead issue mentioned, I think we can make a more generic function like check_support (afterall, check_expression_support has also evolved from check_aggregate_support) or maybe implement some other way. But I don't that would be a big trouble.
 

- "Index classes will take 3 arguments fields, model and name - the later two being optional.". Index classes should accept a list of expressions, not just field names. This will allow trivial support of functional indexes without a specific functional index class. You'll also benefit from existing expression subclasses which process field names as F() expressions. F() expressions may have to be changed (or a new concept introduced) which can resolve the field, but doesn't try to do any joining.

I had left this part solely to functional indexes, because I think it would be better to keep the basic class simple otherwise making even the basic Index class work would require some fiddling with expressions.
One thing I can do is tweak Index class back again once I am successfully able to make functional indexes work. Maybe I should reword this point in my proposal to make it clear.
Perhaps I would be able to answer this more clearly, when I have a concrete vision (or at least a trail) on how to use F() expressions in Index class.
 

Once the index is added to a model, the model can inject itself into the index, or the schema editor can use the originating model to get data it needs. No need to pass the model into the index type I think. Regarding the name, I'm unsure if I prefer a name=Index() or an Index(name='') approach. It'd be good to see that choice in your proposal with a couple of pros and cons.

Well, since we are defining these indexes in a list (to be stored in Meta.indexes) I would go with the Index(name='') approach.
 

1) I don't think it's necessary to have an IndexTogether type. Index('field_a', 'field_b') should be sufficient I think. If it's not sufficient, call out why. I'm also in favour of translating any index_together definitions into appropriate Index types. Then SchemaEditor only needs to work with Index expressions and things remain simpler at the boundary between model definition and migrations.

My major concern was about the syntax. Would Index(['f1', 'f2']) mean two indexes (for each field) or one multi-column index. Here we can do something like Index(['f1', 'f2', ['f3', 'f4']]), each element of the list corresponding to a new index.


Of course, this is a great thing to have. But I wonder if we can do
something to prevent pushing non-indexes to the indexes attribute of
Meta. Having Meta.indexes = [CheckTypeConstraint(foo__in=['foo',
'bar'])] doesn't read correctly.

Perhaps it would be best not to mix constraints and indexes. I would like to know your thoughts on stopping to create indexes explicitly for unique constraints because all database do that automatically because that only gives a performance degradation as far as I know.

Gavin Wahl

unread,
Mar 9, 2016, 11:02:28 PM3/9/16
to Django developers (Contributions to Django itself)
Would you be able to support partial indexes as well?

Markus Holtermann

unread,
Mar 10, 2016, 1:17:17 AM3/10/16
to Django developers (Contributions to Django itself)
Hi Akshesh,

thank you for your proposal! Sounds like a good plan.


On Thursday, March 10, 2016 at 8:16:10 AM UTC+11, akki wrote:

Once the index is added to a model, the model can inject itself into the index, or the schema editor can use the originating model to get data it needs. No need to pass the model into the index type I think. Regarding the name, I'm unsure if I prefer a name=Index() or an Index(name='') approach. It'd be good to see that choice in your proposal with a couple of pros and cons.

Well, since we are defining these indexes in a list (to be stored in Meta.indexes) I would go with the Index(name='') approach.

I'd prefer the Index(name='') approach

class MyModel(models.Model):
    field1 = models.CharField()
    field2 
= models.CharField()

    class Meta:
        indexes = [
            Index('field1', 'field2', name='some_index_across_two_columns'),
            'field2',  # Would be the same as Index('field2'), an index on this one column -- name derived from field name: field2_index
            Index(ToLower('field1')),  # A functional index on field1
 -- name derived from field name and functions: field1_tolower_index
        ]

That way you should also be able to check for duplicate index names.
 
1) I don't think it's necessary to have an IndexTogether type. Index('field_a', 'field_b') should be sufficient I think. If it's not sufficient, call out why. I'm also in favour of translating any index_together definitions into appropriate Index types. Then SchemaEditor only needs to work with Index expressions and things remain simpler at the boundary between model definition and migrations.

My major concern was about the syntax. Would Index(['f1', 'f2']) mean two indexes (for each field) or one multi-column index. Here we can do something like Index(['f1', 'f2', ['f3', 'f4']]), each element of the list corresponding to a new index.
Index(...) would create one index from my perspective. Thus Index('field_a', 'field_b') would do the same as index_together=(('field_a', 'field_b'),) these days.
Index(['f1', 'f2', ['f3', 'f4']]) looks way to confusing to me. That would probably be better Index('f1'), Index('f2'), Index('f3', 'f4')

Since we're talking about index_together, please keep unique_together in mind ;)

Cheers

/Markus

Anssi Kääriäinen

unread,
Mar 10, 2016, 3:57:54 AM3/10/16
to django-d...@googlegroups.com
On Wed, Mar 9, 2016 at 11:16 PM, akki <akshes...@gmail.com> wrote:
>>
>> Of course, this is a great thing to have. But I wonder if we can do
>> something to prevent pushing non-indexes to the indexes attribute of
>> Meta. Having Meta.indexes = [CheckTypeConstraint(foo__in=['foo',
>> 'bar'])] doesn't read correctly.
>
>
> Perhaps it would be best not to mix constraints and indexes. I would like to
> know your thoughts on stopping to create indexes explicitly for unique
> constraints because all database do that automatically because that only
> gives a performance degradation as far as I know.

The point is that if the Index class generates the full SQL for the
index creation, then the Index classes can be used to run arbitrary
changes to the database. Having the ability to run arbitrary SQL is
great, but this will lead to users having all sorts of non-indexes in
the indexes meta attribute. I'd like to avoid that if possible.

- Anssi

akki

unread,
Mar 10, 2016, 12:51:43 PM3/10/16
to Django developers (Contributions to Django itself)

Would you be able to support partial indexes as well?
I am not sure if I would be able to squash it in my GSOC timeline, but yes, I will definitely incorporate it (afterwards, if not during my GSOC). It should not be very hard to implement once I am able to make functional indexes work.

            Index('field1', 'field2', name='some_index_across_two_columns'),
I think you meant ['field1', 'field2'] here.



Index(...) would create one index from my perspective. Thus Index('field_a', 'field_b') would do the same as index_together=(('field_a', 'field_b'),) these days.
Index(['f1', 'f2', ['f3', 'f4']]) looks way to confusing to me. That would probably be better Index('f1'), Index('f2'), Index('f3', 'f4')
Yes, I had thought more about it after my last reply and reached the same conclusion. It would best if the indexes looked like -
           Index(['f1', 'f2']) # multi-column indexing.
           
Index(['f1'], name='name_idx', type=models.Hash) # single-column hash type indexing.
There's no need to make it more complicated.



Since we're talking about index_together, please keep unique_together in mind ;)
unique_together is a whole another independent thing. It is the `ALTER TABLE <table_name> ADD CONSTRAINT` thing and is not required to be mixed with the `CREATE INDEX` things, IMO, unless I am missing something.

Moritz Sichert

unread,
Mar 12, 2016, 1:21:37 PM3/12/16
to django-d...@googlegroups.com
What about calling the attribute something like "constraints" similar to how it's done in SQLAlchemy [1]? For now the attribute can just contain a list of Index() instances but that would also lay grounds for supporting check constraints and other related table level options.

Moritz

[1] http://docs.sqlalchemy.org/en/latest/core/constraints.html

akki

unread,
Mar 13, 2016, 4:24:44 AM3/13/16
to Django developers (Contributions to Django itself)
 
What about calling the attribute something like "constraints" similar to how it's done in SQLAlchemy [1]? For now the attribute can just contain a list of Index() instances but that would also lay grounds for supporting check constraints and other related table level options.
Since we are just allowing instances of Index class and it's subclasses right now, I don't think there is much gain in naming it "constraints". It might even confuse the user.

Anssi Kääriäinen

unread,
Mar 14, 2016, 2:52:08 AM3/14/16
to django-d...@googlegroups.com
I was assuming that the Index class would be responsible for
generating the whole SQL. Seems like this isn't the case.

I assumed something like this would happen when a new index is added:
1. User adds a new BTreeIndex to a model
2. The makemigrations command detects a change and creates a
migration file with AddIndex operation
3. The migrate command calls BTreeIndex.database_forwards, the
database_forwards method generates the needed SQL ('CREATE INDEX
foobar ON thetable(thecol)')
4. The migration framework runs the generated SQL

This way would let the BTreeIndex class generate the full SQL.

If I am not mistaken, the plan is to generate the SQL mostly in the
migration operations, that is the user can't alter the full SQL by
creating custom Index classes. So, #3 above would be replaced by the
AddIndex operation detecting changes in the BTreeIndex class and
generating the needed SQL. The main point is that the AddIndex
operation isn't under user control, so parts of the generated SQL is
essentially hard-coded into Django.

Giving users full control over the generated SQL would allow contrib
or 3rd party apps to create fully custom index classes for their
needs. The expressions/transforms/lookups approach has shown this is a
powerful approach. As an example, it is unlikely we will add BRIN
index support to Django core, but it could be added to
django.contrib.postgres.

It is worth considering if we could allow users to have full control
of the generated SQL. It might be a bit too much for one GSoC program,
so I'm fine with a decision to use the hard-coded approach.

- Anssi
> --
> You received this message because you are subscribed to the Google Groups
> "Django developers (Contributions to Django itself)" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-develop...@googlegroups.com.
> To post to this group, send email to django-d...@googlegroups.com.
> Visit this group at https://groups.google.com/group/django-developers.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-developers/090de3a1-63da-4897-9bb6-a7997a9b9610%40googlegroups.com.

Anssi Kääriäinen

unread,
Mar 14, 2016, 3:23:30 AM3/14/16
to django-d...@googlegroups.com
How about

class MyModel(models.Model):
field1 = models.CharField()
field2 = models.CharField()

field1_idx = models.Index('field1')
field2_lower_idx = models.Index(ToLower('field2'))

The main advantage is that the same syntax can be used for custom
indexes, constraints, composite fields and so on. Internally we could
of course have the same representation for indexes as before.

- Anssi

akki

unread,
Mar 14, 2016, 3:35:36 AM3/14/16
to Django developers (Contributions to Django itself)
Hi

I have updated my proposal and completed the timeline. I saw how expressions work and now have a clear idea of how they would be used in indexes - I have added a short description of the same in my proposal.

Please have a look. I would love to hear any suggestions for further improvement. If there are any questions, I will be happy to answer.

Regards
Akshesh Doshi
(akki)

akki

unread,
Mar 14, 2016, 3:45:09 AM3/14/16
to Django developers (Contributions to Django itself)
Hi Annsi

Yes, the AddIndex operation would detect the changes and the user would not have full control over the generated SQL.

I don't know if it would be a good idea to mix index definitions with fields while creating models. Users have traditionally been defining fields that way and indexes feel more like a "Meta" property to me. Constraints can also be added to Meta (maybe merged together with indexes) whenever they arrive, but I don't how soon they will be included.

Regards
Akshesh Doshi
(akki)

Michal Petrucha

unread,
Mar 14, 2016, 4:44:35 AM3/14/16
to django-d...@googlegroups.com
On Mon, Mar 14, 2016 at 12:45:09AM -0700, akki wrote:
> Hi Annsi
>
> Yes, the AddIndex operation would detect the changes and the user would not
> have full control over the generated SQL.
>
> I don't know if it would be a good idea to mix index definitions with
> fields while creating models. Users have traditionally been defining fields
> that way and indexes feel more like a "Meta" property to me. Constraints
> can also be added to Meta (maybe merged together with indexes) whenever
> they arrive, but I don't how soon they will be included.

The way I see it, the question is: how well can we maintain this
distinction (column-backed fields in the model class itself, vs.
non-field constructs in Meta). Right now the boundary between the two
is more or less clear, with a few exceptions (ManyToManyField and
GenericForeignKey, for instance). Once we grow support for
CompositeFields, that will be another construct which looks kind of
like a limited field from one angle, but actually resembles non-field
constructs in a SQL table, too. If I ever manage to dust off and push
through my ForeignKey refactor, even ForeignKey would become this kind
of virtual construct -- just a named SQL constraint with some
Python-level behavior tacked on top of it.

To be honest, personally, I kind of like SQLAlchemy's approach, which
simply puts all table-level constructs (columns, constraints, indexes)
into model classes at the same level. It does not leave room for this
kind of subjective bikeshedding.

Cheers,

Michal
signature.asc

Marc Tamlyn

unread,
Mar 14, 2016, 5:38:57 AM3/14/16
to django-d...@googlegroups.com
I like this a lot. Even we we then keep the "Meta" options as they are forever, they then become shortcuts to what you could define explicitly.

In any case, we would need some careful checks that (for example) you can't have identical indexes defined in multiple ways with the same names.

--
You received this message because you are subscribed to the Google Groups "Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.

akki

unread,
Mar 14, 2016, 6:15:34 AM3/14/16
to Django developers (Contributions to Django itself)

you can't have identical indexes defined in multiple ways with the same names.
 
Yes, I will be checking for the duplications at the autodetector level itself. We need to take care that no two indexes are for the same fields, and none of them have a clashing name. On such an occurrence, we can raise an error like "Multiple indexes have been set for 'field1' " or merge them ourself internally (if they mean exactly the same).

Tim Graham

unread,
Mar 22, 2016, 8:55:32 AM3/22/16
to Django developers (Contributions to Django itself)
Instead of aiming for one huge patch to be merged at the end of the summer, I'd suggest to think about whether you can break up the work in chunks that can be merged incrementally. This should decrease the risk that comes with a huge patch going stale and makes code review much easier.

It's helpful if you can add your other commitments for the summer such as holidays, vacations, etc.

akki

unread,
Mar 23, 2016, 3:42:06 AM3/23/16
to Django developers (Contributions to Django itself)
Hi Tim

Thanks for the feedback.


Instead of aiming for one huge patch to be merged at the end of the summer, I'd suggest to think about whether you can break up the work in chunks that can be merged incrementally. This should decrease the risk that comes with a huge patch going stale and makes code review much easier.
Yes, I see how that would be useful in the long run. Although I had made the timeline keeping in mind that the work is done in a progressive way but it would definitely be good to see that this is clearly (and more explicitly) reflected in the proposal as well. Will update my proposal ASAP to incorporate this change.


It's helpful if you can add your other commitments for the summer such as holidays, vacations, etc.
To be honest, there aren't any commitments I have for these summers. I would be having my breaks from college and am pretty much free to work on this.
I'll to see if there's anything else worth mentioning in the proposal.

akki

unread,
Mar 23, 2016, 3:50:59 PM3/23/16
to Django developers (Contributions to Django itself)
Hi

I had updated my proposal to include the various points that were discussed earlier in this thread. I have now also incorporated the changes that were recently suggested by Tim.

Thanks
Message has been deleted

Shai Berger

unread,
Mar 26, 2016, 5:25:03 AM3/26/16
to django-d...@googlegroups.com
That is a bit too strict; sometimes it does make sense to provide more than
one kind of index on one field. As a trivial example, you may want to support
case-sensitive uniqueness as well as case-insensitive search (say, for user-
names).

akki

unread,
Mar 26, 2016, 10:52:32 AM3/26/16
to Django developers (Contributions to Django itself)

That is a bit too strict; sometimes it does make sense to provide more than
one kind of index on one field. As a trivial example, you may want to support
case-sensitive uniqueness as well as case-insensitive search (say, for user-
names).
Ohh yes definitely, I meant no two identical indexes on the same field. Sorry for the wrong choice of words.
Reply all
Reply to author
Forward
0 new messages