Some thoughts about improving migration squashing

254 views
Skip to first unread message

rap...@makeleaps.com

unread,
Feb 15, 2017, 7:22:11 AM2/15/17
to Django developers (Contributions to Django itself)
Hello everyone,

   I spent a couple hours last night trying to improve the migration squasher optimizer (migrations were taking almost 15 minutes in CI). I came up with a couple ideas for anyone interested in improvements:
 
 1- Having an interactive mode for squashing would be interested. Currently, when squashing migrations, I do the following: 
  • Generate an initial squash
  • Edit it (namely, move around operations to get more optimizations to work)
  • remove the "replaces" tag, then rerun migration squashing to "re-optimize"
  • repeat until I get something I like, then add the original "replaces" tag
   It would be cool if instead, the process were (with a flag):
  • Generate an initial squash, but have the process wait for confirmation to "commit" this squash as final (though writing out the file)
  • Edit the file, and tell the process to try re-optimizing with the same file (getting around the "no-squash of squashes" rule)
  • Potentially, allow us to also step back
 For example, the "squashmigrations" command output could look like:

generated 0001_squashed_mig.py
optmize migration[yN]? 
<user inputs y>
regenerated 0001_squashed_mig.py
( 20 operations -> 10 operations)
optimize migration[Ynr]? 
<user inputs y>
regenerated 0001_squashed_mig.py
( No change in operation count)
optimize migration[Ynr]? 
<user inputs r>
rolled back to previous version
optimize migration[Ynr]? 
<user inputs n>
Saved migration


A simpler version of this command would simply be to add an "optimizemigration" command that just reads in a single migration and optimizes the operations, without touching any of the squashiness. 


 2- The reducer might be a bit too pessimistic

 Currently, the optimizer lets "reduce" operations (that take 2 operations and return 0,1, or 2 operations, or None if nothing can be change) do whatever they want. Because of that, if you have [A, B, C ,D] and B depends on A, you can't reduce A and C because the reduction might remove A.

 In reality there are two kinds of reduction operations that we could be taking into account:

   - reducing "left". if you have [A, B, C, D] (for example, A is a CreateModel , C is an AddField for the same model),  and you can reduce A and C into just A' (A with C), giving [A', B, D], then it doesn't matter that B depends on A. 

The thing that matters is if C depends on B (for example, C adds a foreign key to a model created in B). This is actually already encoded in the CreateMode + AddField reduction, but is perhaps a more general case.

In a sense, reducing A and C "to the left" means that we're bringing A and C closer together only by moving C. This is a major part of the potential reductions that the current optimizer is missing.

   - reducing right. If you have [A, B, C, D] (for example, A is a CreateModel, C is a RemoveModel for the same model), and you can reduce A and C into just C' (C with A), giving [B, C', D], then it does matter that B depends on A. C can't depend on B (assuming causality holds in our universe)

This is the current mechanism, essentially. If B depends on A, then you can't move A past B. 

 Removing both operations is a special case of reducing right (You can make C' into a no-op).

I had monkeypatched a special case of  reducing left (taking CreateModel, AddField of different models and swapping them . For example [CreateModel(A), CreateModel(B), AddField(A.foo)] -> [CreateModel(A), AddField(A.foo), CreateModel(B)]) and got decent results, but I think making the optimization code express these two concepts separately would catch even more of the optimizations I saw that the optimizer didn't.


I hope some of this is useful 
 
  Raphael

rap...@makeleaps.com

unread,
Feb 15, 2017, 8:00:54 AM2/15/17
to Django developers (Contributions to Django itself)
I ended up having some time today, so wrote up a management command for the first suggestion!

I called it "optizemigration"


>>> ./manage.py optimizemigration appname 0001_squashed
 
# snipped django startup noise
 
Optimized from 9 operations to 4 operations

Optimized migration /Users/rtpg/proj/projname/projname/appname/migrations/0001_squashed_20170215.py


This reads in the migration file, runs the migration optimizer, and then outputs to the same file. Writing it has paid off almost immediately for me.

Those who are interested can take a look here.

How much testing/coverage requirements are there for management commands like these?


Tim Graham

unread,
Feb 15, 2017, 9:19:06 AM2/15/17
to Django developers (Contributions to Django itself)
Hi Raphael,

It looks like a similar idea was proposed in https://groups.google.com/d/topic/django-developers/C1L-NhyQYG4/discussion. I don't think a ticket was ever created, so you can do that.

100% test coverage is required. Why would we accept untested code? ;-)

Florian Apolloner

unread,
Feb 15, 2017, 9:46:01 AM2/15/17
to Django developers (Contributions to Django itself)
Fwiw I think by default it could/should try to optimize all migrations of an app, manually specifying the migration name should be optional.

Markus Holtermann

unread,
Feb 15, 2017, 9:51:57 AM2/15/17
to Django developers (Contributions to Django itself)
Thanks Raphael, that's a pretty good write up!

You're essentially speaking about 2 things here, in my opinion:

1. Adding a new feature for interactive squash
2. Improving the MigrationOptimizer

I certainly see a point for 2. Not sure how much for 1. Anyway, your reasoning for 2 sounds great! I'd be more than happy if you want to get this into Django! Can you create a respective ticket and drop your explanation in there, please :)

Cheers,

/Markus

Markus Holtermann

unread,
Feb 15, 2017, 9:53:52 AM2/15/17
to Django developers (Contributions to Django itself)
What might be interesting to look into when squashing all migrations in one app would be to assume no migrations would exist. That could then result in only 2 migrations which could run through the optimizer (as opposed to let's say 20 migrations with many more operations).

/Markus

rap...@makeleaps.com

unread,
Feb 15, 2017, 11:58:48 AM2/15/17
to Django developers (Contributions to Django itself)
Markus: Now that I've written a command to optimize a single migration file, I think that it's sufficient for the "squash, edit, optimize" workflow that I was doing before. It's more about offering people to get their squashing done well until our optimizer becomes omniscient. 

Florian: Having the command run on all migrations should be straightforward, I'll look into that!

Tim: It's not untested! I tested it locally by running it ;) I'll write up a couple test cases and some documentation.


I've posted one ticket for the management command to pass a selected migration through the optimizer, and another ticket for improvements to the optimizer itself.

Andrew Godwin

unread,
Feb 15, 2017, 12:00:32 PM2/15/17
to Django developers (Contributions to Django itself)

You're essentially speaking about 2 things here, in my opinion:

1. Adding a new feature for interactive squash
2. Improving the MigrationOptimizer

I certainly see a point for 2. Not sure how much for 1. Anyway, your reasoning for 2 sounds great! I'd be more than happy if you want to get this into Django! Can you create a respective ticket and drop your explanation in there, please :)


I am also definitely in support of 2), and I can see 1) being useful for migration sets that have a lot of operations the optimiser won't touch by default (e.g. SQL), but it's probably more work to be less generally useful.

Andrew
 

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/bcc73cef-e6f8-4aaf-b5a9-4592895b3b2e%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

charettes

unread,
Feb 15, 2017, 12:07:54 PM2/15/17
to Django developers (Contributions to Django itself)
Hi Raphael,

I've been working on making the optimizer a bit smarter recently and came to the
same conclusion as you concerning the "left" and "right" optimizations.

This should be possible to solve by allowing `Operation.reduce()` to return the
full list of operations it replaces by appending `in_betwen` before or after the
operations it combines to perform "left" or "right" optimizations.

In the mean time I have a solution that makes the optimizer perform a bit better
that I'd be glad to get your feedback on[1]. One thing bugging me was that
`RemoveField` operations didn't have any references to definition of the field
they were removing, making optimization "through" them unsafe as they could be
operating on a related field.

Cheers,
Simon

[1] https://github.com/django/django/pull/7999/files

rap...@makeleaps.com

unread,
Feb 16, 2017, 11:19:01 AM2/16/17
to Django developers (Contributions to Django itself)
Hey Simon, 

 I looked through your PR and added a couple comments. The main thing is I think we can actually ignore the field context on "RemoveField", if only because the executor doesn't need it. Even though the field might be pointing to a related model, that doesn't prevent being optimized through. 

 This is hard to explain, but intuitively, each "RemoveField" is paired with an "AddField" or "CreateModel" that does depend on the related model. So if we have a potentially dangerous optimization, those initial operations will "protect" the causal order, not "RemoveField".

 Raphael

Markus Holtermann

unread,
Feb 16, 2017, 12:15:47 PM2/16/17
to django-d...@googlegroups.com
I'm not sure if it's related or not wo what you're investigating,
RemoveField cannot "just" optimized through, as you might have another
AddField operation afterwards adding another field with the same name.

/Markus

On Thu, Feb 16, 2017 at 08:19:01AM -0800, rap...@makeleaps.com wrote:
>Hey Simon,
>
> I looked through your PR and added a couple comments. The main thing is I
>think we can actually ignore the field context on "RemoveField", if only
>because the executor doesn't need it. Even though the field might be
>pointing to a related model, that doesn't prevent being optimized through.
>
> This is hard to explain, but intuitively, each "RemoveField" is paired
>with an "AddField" or "CreateModel" that *does *depend on the related
>model. So if we have a potentially dangerous optimization, those initial
>operations will "protect" the causal order, not "RemoveField".
>
> Raphael
>
>--
>You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
>To post to this group, send email to django-d...@googlegroups.com.
>To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/9dfdcec6-b98c-44f2-86af-99aaa8857cc9%40googlegroups.com.
signature.asc

rap...@makeleaps.com

unread,
Feb 16, 2017, 6:25:16 PM2/16/17
to Django developers (Contributions to Django itself)
When you have AddField('A', 'foo', ForeignKey('B')), this operation references A and foo, but also references B. 

RemoveField('A', 'foo') also references A and foo, but does it reference B? if it does, then it' s hard to have optimizations that pass through this, because this field could be referencing any model (theoretically).

But if we assert that RemoveField doesn't refer to any models referenced to by its field, then our optimizer can take a couple more liberties.

Raphael

charettes

unread,
Feb 16, 2017, 10:28:05 PM2/16/17
to Django developers (Contributions to Django itself)
> RemoveField('A', 'foo') also references A and foo, but does it reference B? if it does, then it' s hard to have optimizations that pass through this, because this field could be referencing any model (theoretically).

I think we all agree on that.


> But if we assert that RemoveField doesn't refer to any models referenced to by its field, then our optimizer can take a couple more liberties.

Do you have suggestion on how we can assert that it's the case? The only way I could come up with was to make sure RemoveField has a reference to the field it's removing. e.g. It would be generated in the form `RemoveField('A', 'foo', ForeignKey('B'))`.

Simon

Gaschignard, Raphael

unread,
Feb 17, 2017, 1:09:34 AM2/17/17
to django-d...@googlegroups.com
Hi Simon,

   I think it's a bit more general than that. Why does the `RemoveField` exist? Because somewhere, an `AddField`-esque operation exists before it, right?

  Let's say we have m, m' as models.

  Let -F be a RemoveField(m, 'foo', ForeignKey(m') operation. We also have two operations +M', a CreateModel(m'), and -M', a RemoveModel(m') operation. Because we have a removeField operation, we also have +M, a "CreateModel(m), somewhere

 Let's assume that we have an operations list like:   [ ...(1)..., +M', ...(2)..., -F, ...(3)..., -M']. How do we know that we can reduce +M' and -M' together "through" -F?

Because we have -F, we have +M somewhere. because m has a field for m', either: 
   1 - The field is within the initial +M operation. Because the definition depends on M', it must be in group (2).
   2 - The field is not within the initial +M operation. Because the definition of the field depends on M', we need an AddField operation (or moral equivalent) +F in group (2)

So, if we have -F in between +M' and -M', we will also have either a +M which depends on M' or a +F which depends on M' within (2).

So if the -F is between the two, there will also be another operation that will be present between the two that expresses the same dependencies. So the -F operation itself can avoid expressing its "RemoveField" indirect dependency to m' because another operation will do it for them.

----

It's hard to generalize this across everything because there are, after all, arbitrary migrations. But considering that RemoveField will only be reduced with a "moral equivalent" to AddField, I think we can expand this reasoning across everything to say that RemoveField will be sufficiently protected by the AddField operation's location (which will have to be after +M').


Raphael


--
You received this message because you are subscribed to a topic in the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/django-developers/YMbYXiZgrF0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.

Gaschignard, Raphael

unread,
Feb 17, 2017, 1:14:58 AM2/17/17
to django-d...@googlegroups.com
To clarify on my previous post, if we're in the first case, then the +M and -F operations can be optimized in one path to remove the dependency, and then the +M' and -M' operation can be optimized through.

In the second case (with an AddField operation), the +F and -F operations will cancel each other out (since they will be "in between" +M' and -M') and then the optimization can happen.

A bit more holistically, since M's field depends on M' in the code (which is where these operations come from, after all), in order to remove M', you would first need to remove M's field, so you're almost guaranteed that a RemoveField will be "in between" any of its dependency's creations/removals.

If we have the field information in the RemoveField, we could check it. But if we do not make an assumption on (no field information) RemoveFields, it blocks a lot of possible optimizations. We could just do that (start generating RemoveField with field info), though it would not allow for older migrations to get optimized. I'd be good with doing either/both. 

Raphael

charettes

unread,
Feb 18, 2017, 1:35:26 AM2/18/17
to Django developers (Contributions to Django itself)
Hello Raphael,

Thanks for your detailed explanation! You clearly expressed why it's safe to
optimize through RemoveField operations and helped me lift any doubt about
what was a wrong assumption[1].

I gave your two passes optimization strategy a try and I believe I managed to
implement it correctly[2].

The plan is to modify the optimization algorithm to start by trying to perform
a "right" (as right of the migration in between) reduction and attempt a "left"
one if a subsequent operation can be reduced with. The "left" reduction will
only be performed if all the operations in between can optimize through (their
reduce() method returns True) the operation we're trying to perform a reduction
with.

Cheers,
Simon

[1] https://github.com/django/django/pull/7999/commits/6d0b740cbc9ae038f9eef95dac3057dccb283e6d
[2] https://github.com/django/django/pull/7999
To unsubscribe from this group and all its topics, send an email to django-develop...@googlegroups.com.

To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.

Markus Holtermann

unread,
Feb 20, 2017, 5:49:38 AM2/20/17
to django-d...@googlegroups.com
On Thu, Feb 16, 2017 at 03:25:16PM -0800, rap...@makeleaps.com wrote:
>When you have AddField('A', 'foo', ForeignKey('B')), this operation
>references A and foo, but also references B.

correct.

>RemoveField('A', 'foo') also references A and foo, but does it reference B?
>if it does, then it' s hard to have optimizations that pass through this,
>because this field could be referencing any model (theoretically).

No, that field can not reference any model. It reference exactly one
model (that even holds for FK to abstract models as fields from abstract
models are inlined in the concrete models in migrations). However,
RemoveField doesn't have the information to "just" figure out the
referenced model. RemoveField would need to look into the from_state's
apps and actually even look into the actual field that's referred.
ForeignKeys have an implicit to_field attribute, so, having

AddField('A', 'foo', ForeignKey('B', to_field='bar'))

a

RemoveField('A', 'foo')

references exactly one field on one particular model. Not more and not
less. The issue here is that RemoveField needs to take that information
from the state and not from one of its attributes.

/Markus
>> email to django-develop...@googlegroups.com <javascript:>.
>> >To post to this group, send email to django-d...@googlegroups.com
>> <javascript:>.
>> >Visit this group at https://groups.google.com/group/django-developers.
>> >To view this discussion on the web visit
>> https://groups.google.com/d/msgid/django-developers/9dfdcec6-b98c-44f2-86af-99aaa8857cc9%40googlegroups.com.
>>
>> >For more options, visit https://groups.google.com/d/optout.
>>
>>
>
>--
>You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
>To post to this group, send email to django-d...@googlegroups.com.
>Visit this group at https://groups.google.com/group/django-developers.
>To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/7515c909-a015-451d-bdaa-f040e6322166%40googlegroups.com.
signature.asc

Patryk Zawadzki

unread,
Feb 20, 2017, 7:03:37 AM2/20/17
to Django developers (Contributions to Django itself)
W dniu poniedziałek, 20 lutego 2017 11:49:38 UTC+1 użytkownik Markus Holtermann napisał:
On Thu, Feb 16, 2017 at 03:25:16PM -0800, rap...@makeleaps.com wrote:
>RemoveField('A', 'foo') also references A and foo, but does it reference B?
>if it does, then it' s hard to have optimizations that pass through this,
>because this field could be referencing any model (theoretically).

No, that field can not reference any model. It reference exactly one
model (that even holds for FK to abstract models as fields from abstract
models are inlined in the concrete models in migrations). However,
RemoveField doesn't have the information to "just" figure out the
referenced model. RemoveField would need to look into the from_state's
apps and actually even look into the actual field that's referred.
ForeignKeys have an implicit to_field attribute, so, having

  AddField('A', 'foo', ForeignKey('B', to_field='bar'))

a

  RemoveField('A', 'foo')

references exactly one field on one particular model. Not more and not
less. The issue here is that RemoveField needs to take that information
from the state and not from one of its attributes.

Technically it references _some_ model named "B" that was created not sooner than the current migration's explicit dependencies. It may be the model you saw when you created that migration or it may be some other model. You can tell your migration that it has to be executed no sooner than after another migration is complete but there is no way to say "but before model B is modified any further".
Reply all
Reply to author
Forward
0 new messages