Hello everyone,
I spent a couple hours last night trying to improve the migration squasher optimizer (migrations were taking almost 15 minutes in CI). I came up with a couple ideas for anyone interested in improvements:
1- Having an interactive mode for squashing would be interested. Currently, when squashing migrations, I do the following:
- Generate an initial squash
- Edit it (namely, move around operations to get more optimizations to work)
- remove the "replaces" tag, then rerun migration squashing to "re-optimize"
- repeat until I get something I like, then add the original "replaces" tag
It would be cool if instead, the process were (with a flag):
- Generate an initial squash, but have the process wait for confirmation to "commit" this squash as final (though writing out the file)
- Edit the file, and tell the process to try re-optimizing with the same file (getting around the "no-squash of squashes" rule)
- Potentially, allow us to also step back
For example, the "squashmigrations" command output could look like:
generated 0001_squashed_mig.py
optmize migration[yN]?
<user inputs y>
regenerated 0001_squashed_mig.py
( 20 operations -> 10 operations)
optimize migration[Ynr]?
<user inputs y>
regenerated 0001_squashed_mig.py
( No change in operation count)
optimize migration[Ynr]?
<user inputs r>
rolled back to previous version
optimize migration[Ynr]?
<user inputs n>
Saved migration
A simpler version of this command would simply be to add an "optimizemigration" command that just reads in a single migration and optimizes the operations, without touching any of the squashiness.
2- The reducer might be a bit too pessimistic
Currently, the optimizer lets "reduce" operations (that take 2 operations and return 0,1, or 2 operations, or None if nothing can be change) do whatever they want. Because of that, if you have [A, B, C ,D] and B depends on A, you can't reduce A and C because the reduction might remove A.
In reality there are two kinds of reduction operations that we could be taking into account:
- reducing "left". if you have [A, B, C, D] (for example, A is a CreateModel , C is an AddField for the same model), and you can reduce A and C into just A' (A with C), giving [A', B, D], then it doesn't matter that B depends on A.
The thing that matters is if C depends on B (for example, C adds a foreign key to a model created in B). This is actually already encoded in the CreateMode + AddField reduction, but is perhaps a more general case.
In a sense, reducing A and C "to the left" means that we're bringing A and C closer together only by moving C. This is a major part of the potential reductions that the current optimizer is missing.
- reducing right. If you have [A, B, C, D] (for example, A is a CreateModel, C is a RemoveModel for the same model), and you can reduce A and C into just C' (C with A), giving [B, C', D], then it does matter that B depends on A. C can't depend on B (assuming causality holds in our universe)
This is the current mechanism, essentially. If B depends on A, then you can't move A past B.
Removing both operations is a special case of reducing right (You can make C' into a no-op).
I had monkeypatched a special case of reducing left (taking CreateModel, AddField of different models and swapping them . For example [CreateModel(A), CreateModel(B), AddField(A.foo)] -> [CreateModel(A), AddField(A.foo), CreateModel(B)]) and got decent results, but I think making the optimization code express these two concepts separately would catch even more of the optimizations I saw that the optimizer didn't.
I hope some of this is useful
Raphael