Rethinking migrations

Patryk Zawadzki

Nov 5, 2016, 11:53:49 AM

to Django developers (Contributions to Django itself)

Greetings, Jazz Guitarists,

I've briefly talked about this with Markus and he mentioned that the subject was already brought up by Tyson Clugg but I think it deserves a proper discussion here.

I'm typing this from the comfort of Django: Under the Hood sprints so please excuse poor grammar and the somewhat chaotic explanations that follow. I'm very tired and English is not my mother tongue. This is not a DEP but merely a stream of consciousness I'd love to get some feedback on.

Here are some of the problems we face when dealing with migrations:

1. Dependency resolution that turns the migration dependency graph into an ordered list happens every time you try to create or execute a migration. If you have several hundred migrations it becomes quite slow. I'm talking multiple minutes kind of slow. As you can imagine working with multiple branches or perfecting your migrations quickly becomes a tedious task.

2. Dependency resolution is only stable as long as the migration set is frozen. Sometimes introducing a new migration is enough to break existing migrations by causing them to execute in a slightly different order. We often have to backtrack and edit existing migrations and enforce a strict resolution order by introducing arbitrary dependencies.

3. Removing an app from a project is a nightmare. You can't migrate to zero state unless the app is still there. There is no way to add "revert all migrations for app X" to the migration graph; it's something you need to run manually. There is no clean way to remove an app that was ever referenced in a relation. We were forced to do all kinds of hacks to get around this. Sometimes it's necessary to create an empty eggshell app with the same name, copy all migrations there, then add the necessary data migrations and finally migrations that remove all the models, indices, procedures etc. Sometimes people just leave a dead application in INSTALLED_APPS to not have to deal with this.

4. Squashing migrations is wonky at best. If you create a model in one migration, alter one of its fields in another and then finally drop the model sometime later, the squashed migration will have Django try to execute the alter first and complain about the table not being there. Also the only reason we need to squash migrations is to prevent problem 1 above from becoming exponentially worse. If migrations were only as slow as the underlying SQL commands, we'd likely never squash them.

5. There's no simple way to roll back all the migrations introduced after a particular point in time which is very useful when working with multiple feature branches. In my current project dropping the database means having to reimport over 200 MB of data snapshots. Switching branches requires me to look at branch diffs to determine which migrations to revert.

6. Conflict detection and resolution (makemigrations --merge) is a make-believe solution. It just trains people to execute the command without investigating whether their migration history still makes sense.
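
To make problems 1 and 2 concrete, here is a minimal sketch of turning a dependency graph into an ordered list. It uses the standard library's `graphlib` rather than Django's own graph code, and the app and migration names are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical migration graph: each key maps to the migrations it depends on.
dependencies = {
    ("shop", "0001_initial"): set(),
    ("shop", "0002_add_price"): {("shop", "0001_initial")},
    ("blog", "0001_initial"): set(),
    ("blog", "0002_link_product"): {
        ("blog", "0001_initial"),
        ("shop", "0001_initial"),
    },
}


def linearize(graph):
    """Return one valid execution order for the dependency graph."""
    return list(TopologicalSorter(graph).static_order())


def respects_dependencies(order, graph):
    """Check that every migration comes after all of its dependencies."""
    position = {node: index for index, node in enumerate(order)}
    return all(
        position[dep] < position[node]
        for node, deps in graph.items()
        for dep in deps
    )


plan = linearize(dependencies)

# Adding one migration means solving the whole graph again. The result is
# valid, but nothing here pins the relative order of migrations that do not
# depend on each other.
dependencies[("shop", "0003_drop_price")] = {("shop", "0002_add_price")}
new_plan = linearize(dependencies)
```

Any order that passes `respects_dependencies` is acceptable to the solver, which is why unrelated migrations can change places when the graph changes.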

Some of these I need to dig deeper into and probably file proper tickets. For example I have an idea on how to fix 4 but it would make 1 even slower.

I took some time to get a good long look at what other ORMs are doing. The graph-based dependency solving approach is rather uncommon. Most systems treat migrations as part of the project rather than the packages it uses.

Possible solution (or "how I'd build it today if there was no existing code in Django core"):

a. Make migrations part of the project and not individual apps. This takes care of problem 3 above.

b. Prefix individual migration files with a UTC timestamp (20161105151023_add_foo) to provide a strict sorting order. This removes the depsolving requirement and takes care of 1 and 2. By eliminating those it makes 4 kind of obsolete as squashing migrations would become pointless.

c. Have reusable apps provide migration templates that Django then copies to my project when "makemigrations" is run.

d. Maintain a separate directory for each database connection.

e. Execute all migrations in alphabetical order (which means by timestamp first). When an unapplied migration is followed by an applied one, ask whether to attempt to just apply it or if the user wants to first unapply migrations that came after it. To me this would work better than 6.

f. Migrating to a timestamp solves 5.
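
A rough sketch of how (b), (e) and (f) could fit together, assuming migrations are plain file names with a 14-digit UTC prefix. The function names are hypothetical and nothing here exists in Django.

```python
from datetime import datetime, timezone


def migration_name(slug, now=None):
    """Build a name such as 20161105151023_add_foo from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d%H%M%S}_{slug}"


def find_out_of_order(all_names, applied):
    """Proposal (e): unapplied migrations that sort before an applied one."""
    ordered = sorted(all_names)
    applied_in_order = [name for name in ordered if name in applied]
    if not applied_in_order:
        return []
    last_applied = applied_in_order[-1]
    return [
        name for name in ordered
        if name not in applied and name < last_applied
    ]


def to_unapply(applied, target_timestamp):
    """Proposal (f): applied migrations newer than the target, newest first."""
    return sorted(
        (name for name in applied if name[:14] > target_timestamp),
        reverse=True,
    )


names = [
    "20161105151023_add_foo",
    "20161101090000_initial",
    "20161103120000_add_bar",
]
applied = {"20161101090000_initial", "20161105151023_add_foo"}

late_arrivals = find_out_of_order(names, applied)
rollback = to_unapply(applied, "20161102000000")
```

Here `late_arrivals` holds the migration that arrived from another branch with an older timestamp, which is the case where the user would be asked what to do.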

Of course we do have migration support in core and it's not compatible with most of the above list. Any ideas? I think serializing the dependency solver state and reusing it between runs could be a pretty low hanging fruit (like "npm shrinkwrap" or yarn's lock file).

Andrew Godwin

Nov 5, 2016, 12:30:15 PM

to Django developers (Contributions to Django itself)

Hello! I have opinions about this :)

> Possible solution (or "how I'd build it today if there was no existing code in Django core"):
>
> a. Make migrations part of the project and not individual apps. This takes care of problem 3 above.

It also means it's impossible for apps to ship migrations and define how to upgrade from version to version. I realise that (c) below is part of a proposed solution to this, but how do you propose to match up what's already been run in the database without having names match (and then you just have app migrations by another name)?

> b. Prefix individual migration files with a UTC timestamp (20161105151023_add_foo) to provide a strict sorting order. This removes the depsolving requirement and takes care of 1 and 2. By eliminating those it makes 4 kind of obsolete as squashing migrations would become pointless.

Unfortunately this does not help all the time as computers' clocks aren't necessarily right or in sync, so it would merely be an approximation and you'd still get the occasional clash.

> c. Have reusable apps provide migration templates that Django then copies to my project when "makemigrations" is run.

Would these be lined up with their own timestamp in the single serial migration timeline? Would you have to make sure any of these templates from any app update was copied across and put in the order before you used the new columns?

> d. Maintain a separate directory for each database connection.

This I think might be a good idea, though I'd like to see a more generalised idea of "migration sets" and you then say which alias uses which set (so you can share sets among more than one connection).
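
As a sketch only, "migration sets" could be expressed as two mappings. Both setting names below are hypothetical and do not exist in Django; the aliases and paths are placeholders.

```python
# Hypothetical configuration: neither setting name exists in Django.
MIGRATION_SETS = {
    "main": "migrations/main",
    "analytics": "migrations/analytics",
}

# Which database alias uses which set. Two aliases may share one set.
DATABASE_MIGRATION_SETS = {
    "default": "main",
    "replica_writer": "main",
    "warehouse": "analytics",
}


def migration_dir_for(alias):
    """Resolve a database alias to the directory holding its migration set."""
    return MIGRATION_SETS[DATABASE_MIGRATION_SETS[alias]]
```

With this shape, sharing a set among connections needs no copied files, only a second alias pointing at the same set name.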

> e. Execute all migrations in alphabetical order (which means by timestamp first). When an unapplied migration is followed by an applied one, ask whether to attempt to just apply it or if the user wants to first unapply migrations that came after it. To me this would work better than 6.

This is basically what South used to do, and it worked reasonably well in either being successful or exploding enough that people noticed. Given that you're proposing per-project migrations, however, people are going to run into this almost constantly, as they will clash significantly more than per-app ones.

> Of course we do have migration support in core and it's not compatible with most of the above list. Any ideas? I think serializing the dependency solver state and reusing it between runs could be a pretty low hanging fruit (like "npm shrinkwrap" or yarn's lock file).

I think not only could the dependency solver state be serialised but that it would be a replacement for the datetimes-on-filename proposal in that you could easily pull out a previously-serialised order from disk and then work out what the new ones do.
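
A minimal sketch of what reusing a serialised order might look like, assuming the order is stored as a JSON list of migration names. The helper names and the lock format are assumptions, and `graphlib` stands in for Django's own solver.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dump_lock(order):
    """Serialise a resolved order so later runs can reuse it."""
    return json.dumps(order, indent=2)


def load_lock(text):
    """Read a previously serialised order."""
    return json.loads(text)


def extend_locked_order(locked, graph):
    """Keep the locked order as-is and append migrations it does not know.

    New migrations are appended in dependency order. This assumes that no
    already locked migration was edited to depend on a new one.
    """
    known = [name for name in locked if name in graph]
    seen = set(known)
    fresh = [
        name for name in TopologicalSorter(graph).static_order()
        if name not in seen
    ]
    return known + fresh


graph = {
    "shop.0001_initial": [],
    "blog.0001_initial": [],
    "blog.0002_link_product": ["blog.0001_initial", "shop.0001_initial"],
}
lock_text = dump_lock(list(TopologicalSorter(graph).static_order()))

# Later, a new migration shows up; the old order is reused untouched.
graph["shop.0002_add_price"] = ["shop.0001_initial"]
locked = load_lock(lock_text)
order = extend_locked_order(locked, graph)
```

Because the locked prefix never changes, adding a migration cannot reshuffle the ones that were already resolved.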

I am generally not keen on the idea of per-project migrations, though - it makes what's in the database a property of the project, not the app, and that's not how Django has worked traditionally. I think an effort to get a more reliable, exposed global ordering of those individual app migrations would go a long way towards the end goal without having to have migration templates, upgrade instructions, and way more collisions between branches.

At the end of the day, though, there's a reason I made the schema editing separate from the migration runners - you can re-use all the nasty work in the schema editing interface and just replace the other part. This huge change is the sort of thing I'd want to see working and proven before we considered changing core, preferably as a third-party app, but of course I'd like to talk through potential smaller changes first, rather than throwing out the entire system.

Andrew

Shai Berger

Nov 5, 2016, 1:40:24 PM

to django-d...@googlegroups.com

Hi,

On Saturday 05 November 2016 17:53:49 Patryk Zawadzki wrote:
>
> I'm typing this from the comfort of Django: Under the Hood sprints so
> please excuse poor grammar and the somewhat chaotic explanations that
> follow. I'm very tired and English is not my mother tongue. This is not a
> DEP but merely a stream of consciousness I'd love to get some feedback on.
>

I am dealing with some similar issues, but I've reached very different
conclusions. In much the same spirit, this is not very orderly.

> Here are some of the problems we face when dealing with migrations:
>
> 1. Dependency resolution that turns the migration dependency graph into an
> ordered list happens every time you try to create or execute a migration.
> If you have several hundred migrations it becomes quite slow. I'm talking
> multiple minutes kind of slow. As you can imagine working with multiple
> branches or perfecting your migrations quickly becomes a tedious task.
>

I've known this to happen, indeed.

> 2. Dependency resolution is only stable as long as the migration set is
> frozen. Sometimes introducing a new migration is enough to break existing
> migrations by causing them to execute in a slightly different order. We
> often have to backtrack and edit existing migrations and enforce a strict
> resolution order by introducing arbitrary dependencies.
>

So, you say you really have implicit dependencies between migrations --
dependencies in substance, which aren't recorded as dependencies. This seems
to indicate that you have a lot of manually-written migrations (data
migrations?), since the automatically-written ones do include relevant
dependencies. This seems odd -- it sounds like you're doing something out of
the ordinary.

This would also explain some of your bad experience with squashing -- indeed,
if you have many data migrations, squashing can become much less effective.

> 3. Removing an app from a project is a nightmare. You can't migrate to zero
> state unless the app is still there. There is no way to add "revert all
> migrations for app X" to the migration graph, it's something you need to
> run manually. There is no clean way to remove an app that was ever
> referenced in a relation. We were forced to do all kinds of hacks to get
> around this. Sometimes it's necessary to create an empty eggshell app with
> the same name and copy all migrations there then add necessary data
> migrations and finally migrations that remove all the models, indices,
> procedures etc. Sometimes people just leave a dead application in
> INSTALLED_APPS to not have to deal with this.

Clear out (maybe even remove) models.py and type "makemigrations", and you get
a migration that deletes everything. The answer to getting rid of the
historical migrations is squashing, but of course you first need squashing to
work properly.

>
> 4. Squashing migrations is wonky at best. If you create a model in one
> migration, alter one of its fields in another and then finally drop the
> model sometime later, the squashed migration will have Django try to
> execute the alter first and complain about the table not being there. Also
> the only reason we need to squash migrations is to prevent problem 1 above
> from becoming exponentially worse. If migrations were only as slow as the
> underlying SQL commands, we'd likely never squash them.
>

If that's so, it's a bug you should report; it's also an issue you can work
around by editing the migration to remove the redundant operation. There are
issues with squashing, to be sure, but I don't think this is one of the
serious ones.

> 5. There's no simple way to roll back all the migrations introduced after a
> particular point in time which is very useful when working with multiple
> feature branches. In my current project dropping the database means having
> to reimport over 200 MB of data snapshots. Switching branches requires me
> to look at branch diffs to determine which migrations to revert.
>

Yes, this is a real issue, with one modification -- I'd much rather have a good
way to migrate to a point-in-version-history than to a point-in-time.

This is even more than a development issue -- I've encountered a use-case for
doing something like this in production: If I want to be able to export an
object represented by a model (or set of models), by serializing it and saving
the serialized version; and then I'd want to import it back in after the app
has progressed -- if I'd want generic support for that, I'd need a way to
migrate a database to the point where the object was exported, import it, and
then roll the database forward to the "present".

> 6. Conflict detection and resolution (makemigrations --merge) is a make-believe
> solution. It just trains people to execute the command without
> investigating whether their migration history still makes sense.
>

It could be smarter, assuming it understood the content of migrations. We
could probably improve it to a point where, for most cases, it would either
know to merge automatically or know that there really is a conflict. This would
probably not help you if you have a lot of RunPython's in your migrations.

>
> Some of these I need to dig deeper into and probably file proper tickets.
> For example I have an idea on how to fix 4 but it would make 1 even slower.
>
> I took some time to get a good long look at what other ORMs are doing. The
> graph-based dependency solving approach is rather uncommon. Most systems
> treat migrations as part of the project rather than the packages it uses.
>
>
> Possible solution (or "how I'd build it today if there was no existing code
> in Django core"):
>
> a. Make migrations part of the project and not individual apps. This takes
> care of problem 3 above.
>

So, there'd be no reason to link a migration to a specific app; quite the
contrary, it would become much more logical to have one migration include
operations for many apps. That could make the process of making an app
reusable while developing it in a project quite painful.

> b. Prefix individual migration files with a UTC timestamp
> (20161105151023_add_foo) to provide a strict sorting order. This removes
> the depsolving requirement and takes care of 1 and 2. By eliminating those
> it makes 4 kind of obsolete as squashing migrations would become pointless.
>

4: No, on large databases, squashing migrations is not pointless.

1&2: Strict order has its issues: Currently, if I find a problem with the last
migration of app A, I roll it back, fix it, and roll forward. With strict
order, I would have to roll back the project, not the app.
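
The difference can be sketched as two rollback plans over the same applied list. The names and graph are hypothetical; this only illustrates the argument and is not how Django computes its plan.

```python
def rollback_plan_linear(applied, target):
    """Strict global order: everything applied after `target` is undone too."""
    index = applied.index(target)
    return list(reversed(applied[index:]))


def rollback_plan_graph(applied, target, graph):
    """Graph order: only `target` and whatever depends on it are undone."""
    affected = {target}
    changed = True
    while changed:
        changed = False
        for name, deps in graph.items():
            if name not in affected and affected & deps:
                affected.add(name)
                changed = True
    return [name for name in reversed(applied) if name in affected]


applied = ["a.0001", "b.0001", "a.0002", "b.0002"]
independent = {
    "a.0001": set(),
    "b.0001": set(),
    "a.0002": {"a.0001"},
    "b.0002": {"b.0001"},
}
# Same migrations, but now b.0002 depends on a.0002.
coupled = dict(independent, **{"b.0002": {"b.0001", "a.0002"}})

linear = rollback_plan_linear(applied, "a.0002")
per_app = rollback_plan_graph(applied, "a.0002", independent)
per_app_coupled = rollback_plan_graph(applied, "a.0002", coupled)
```

With independent apps only `a.0002` is undone under the graph model, while the strict order also undoes `b.0002`. Once `b.0002` depends on `a.0002`, both plans are the same.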

> c. Have reusable apps provide migration templates that Django then copies
> to my project when "makemigrations" is run.
>

I'd like to see some more details about how this works; they would need to
include the development process of reusable apps.

> d. Maintain a separate directory for each database connection.
>

Seems wrong as a blanket idea -- really depends on how the databases are used.
I wouldn't want to find myself maintaining copies of migrations which are
supposed to run on more than one database.

> e. Execute all migrations in alphabetical order (which means by timestamp
> first). When an unapplied migration is followed by an applied one, ask
> whether to attempt to just apply it or if the user wants to first unapply
> migrations that came after it. To me this would work better than 6.
>

This sounds like a good way to create data losses.

> f. Migrating to a timestamp solves 5.
>

Not really. Not with a team, since the timestamps will indicate not the real
logical order, but the order of development. You'd need empty "tag" migrations
to set points you want to migrate to...

My 2 cents,
Shai.

Patryk Zawadzki

Nov 5, 2016, 1:58:10 PM

to Django developers (Contributions to Django itself)

On Saturday, November 5, 2016 at 5:30:15 PM UTC+1, Andrew Godwin wrote:

> Hello! I have opinions about this :)
>
>> Possible solution (or "how I'd build it today if there was no existing code in Django core"):
>>
>> a. Make migrations part of the project and not individual apps. This takes care of problem 3 above.
>
> It also means it's impossible for apps to ship migrations and define how to upgrade from version to version. I realise that (c) below is part of a proposed solution to this, but how do you propose to match up what's already been run in the database without having names match (and then you just have app migrations by another name)?

I would actually insist on keeping the names intact. It means that adding an external dependency could inject a migration with a date from the previous year but I think that's not a problem as it's guaranteed not to conflict with any other migrations.

>> b. Prefix individual migration files with a UTC timestamp (20161105151023_add_foo) to provide a strict sorting order. This removes the depsolving requirement and takes care of 1 and 2. By eliminating those it makes 4 kind of obsolete as squashing migrations would become pointless.
>
> Unfortunately this does not help all the time as computers' clocks aren't necessarily right or in sync, so it would merely be an approximation and you'd still get the occasional clash.

You only need your own migrations to be ordered so you can safely assume the previous one to be applied before the one you're writing at the moment. For two unrelated changes the order pretty much does not matter.

>> c. Have reusable apps provide migration templates that Django then copies to my project when "makemigrations" is run.
>
> Would these be lined up with their own timestamp in the single serial migration timeline? Would you have to make sure any of these templates from any app update was copied across and put in the order before you used the new columns?

I'd say use proper timestamps. This way two apps can depend on each other and the migrations will still get run in proper order.

>> d. Maintain a separate directory for each database connection.
>
> This I think might be a good idea, though I'd like to see a more generalised idea of "migration sets" and you then say which alias uses which set (so you can share sets among more than one connection)

Agreed.

>> e. Execute all migrations in alphabetical order (which means by timestamp first). When an unapplied migration is followed by an applied one, ask whether to attempt to just apply it or if the user wants to first unapply migrations that came after it. To me this would work better than 6.
>
> This is basically what South used to do, and it worked reasonably well in either being successful or exploding enough that people noticed. Given that you're proposing per-project migrations, however, people are going to run into this almost constantly, as they will clash significantly more than per-app ones.

South was not perfect but I'd say the current solution is not better, it's just different. Some of my projects use a lot of long-running feature branches so I have an application where every other migration is a merge migration with accepted default values. We do try to make migrations backwards-compatible where needed but I don't think it's a common scenario to add conflicting changes on two feature branches. Most of our conflicts can be described as department A added a field they needed while department B added a data migration to fix a denormalized field.

>> Of course we do have migration support in core and it's not compatible with most of the above list. Any ideas? I think serializing the dependency solver state and reusing it between runs could be a pretty low hanging fruit (like "npm shrinkwrap" or yarn's lock file).
>
> I think not only could the dependency solver state be serialised but that it would be a replacement for the datetimes-on-filename proposal in that you could easily pull out a previously-serialised order from disk and then work out what the new ones do.
>
> I am generally not keen on the idea of per-project migrations, though - it makes what's in the database a property of the project, not the app, and that's not how Django has worked traditionally. I think an effort to get a more reliable, exposed global ordering of those individual app migrations would go a long way towards the end goal without having to have migration templates, upgrade instructions, and way more collisions between branches.

I do believe that the database is my property and I'd much rather see the project code hold reign over its structure. Some problems simply cannot be solved by submitting an upstream patch (project-specific or backend-specific indexes come to mind).

> At the end of the day, though, there's a reason I made the schema editing separate from the migration runners - you can re-use all the nasty work in the schema editing interface and just replace the other part. This huge change is the sort of thing I'd want to see working and proven before we considered changing core, preferably as a third-party app, but of course I'd like to talk through potential smaller changes first, rather than throwing out the entire system.

Thank you for your comments. Another topic I'd like to touch on is making migrations easier to apply without touching all of Django's machinery. In many environments (dockerized apps, multi-server auto-scaled groups) we simply can't run migrations as part of the deployment step. What you then have to do is run some of the migrations (all added fields have to be nullable etc.), make sure no regressions happened and then start deploying code that depends on the introduced changes. Once you're done you can run the second part of the migrations, the ones that add constraints. Making migrations part of the project would make this process easier and in theory could be done out-of-tree (assuming migrations themselves don't import stuff from outside of Django).
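
The two-step rollout described above could be sketched as a split of operations into a "before deploy" and an "after deploy" phase. The operation kinds are hypothetical labels for this example, not Django operation classes.

```python
# Hypothetical operation kinds; they only loosely mirror schema changes.
EXPAND = {"create_table", "add_nullable_column", "add_index"}
CONTRACT = {"set_not_null", "add_constraint", "drop_column", "drop_table"}


def split_phases(operations):
    """Split operations into those safe before and after the code deploy."""
    unknown = [op for op in operations if op["kind"] not in EXPAND | CONTRACT]
    if unknown:
        raise ValueError(f"cannot classify: {unknown}")
    before_deploy = [op for op in operations if op["kind"] in EXPAND]
    after_deploy = [op for op in operations if op["kind"] in CONTRACT]
    return before_deploy, after_deploy


operations = [
    {"kind": "add_nullable_column", "table": "orders", "column": "currency"},
    {"kind": "set_not_null", "table": "orders", "column": "currency"},
    {"kind": "add_index", "table": "orders", "column": "currency"},
]

before_deploy, after_deploy = split_phases(operations)
```

Old code keeps working after the first phase because it never has to supply the new column; the constraint is only enforced once every server runs the new code.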

I'm sure Carl could share some of Heroku's stories about migrating databases in large deployments.

Cheers,

Aymeric Augustin

Nov 5, 2016, 2:57:38 PM

to django-d...@googlegroups.com

Hello,

My solution is to throw away and remake all migrations on a regular basis. Then I `TRUNCATE TABLE django_migrations` and `django-admin migrate --fake`. Obviously this isn’t a great solution.
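
What that reset does to the bookkeeping can be sketched with an in-memory SQLite database. The table is created by hand here with the same column names Django uses, `DELETE` stands in for `TRUNCATE` (which SQLite lacks), and `record_applied` is a hypothetical stand-in for what `migrate --fake` records.

```python
import sqlite3
from datetime import datetime, timezone

connection = sqlite3.connect(":memory:")
# Hand-made copy of the bookkeeping table, for illustration only.
connection.execute(
    "CREATE TABLE django_migrations ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "app TEXT NOT NULL, name TEXT NOT NULL, applied TEXT NOT NULL)"
)


def record_applied(app, name):
    """Insert a bookkeeping row without touching any application table."""
    stamp = datetime.now(timezone.utc).isoformat()
    connection.execute(
        "INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)",
        (app, name, stamp),
    )


# History accumulated over time by a hypothetical "shop" app.
for name in ("0001_initial", "0002_add_price", "0003_drop_price"):
    record_applied("shop", name)

# Reset: forget the old history, then mark the regenerated initial
# migration as applied without running it.
connection.execute("DELETE FROM django_migrations")
record_applied("shop", "0001_initial")

rows = connection.execute(
    "SELECT app, name FROM django_migrations ORDER BY id"
).fetchall()
```

The schema itself is never touched, which is why anything the regenerated migrations fail to describe has to be caught by a schema diff.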

Since I work mostly on small projects with just a couple developers on staff, doing this every few months suffices to keep the run time below one minute (which is still quite annoying).

There’s a risk of losing important, manually generated migrations, typically those that create indexes. I diff the schema before / after with apgdiff to avoid such problems.

This is quite doable but outside the comfort zone of many developers: my clients prefer to have me do it even though I documented the steps in detail.

So… yeah, it would be nice to have something more practical, even if it requires trading off some purity in the design of migrations.

--
Aymeric.

Patryk Zawadzki

Nov 5, 2016, 6:47:41 PM

to Django developers (Contributions to Django itself)

On Saturday, November 5, 2016 at 6:40:24 PM UTC+1, Shai Berger wrote:

>> 2. Dependency resolution is only stable as long as the migration set is
>> frozen. Sometimes introducing a new migration is enough to break existing
>> migrations by causing them to execute in a slightly different order. We
>> often have to backtrack and edit existing migrations and enforce a strict
>> resolution order by introducing arbitrary dependencies.
>
> So, you say you really have implicit dependencies between migrations --
> dependencies in substance, which aren't recorded as dependencies. This seems
> to indicate that you have a lot of manually-written migrations (data
> migrations?), since the automatically-written ones do include relevant
> dependencies. This seems odd -- it sounds like you're doing something out of
> the ordinary.
>
> This would also explain some of your bad experience with squashing -- indeed,
> if you have many data migrations, squashing can become much less effective.

Let's not jump to conclusions. Django only supports predicate dependencies. You can say "not earlier than after these are applied" but that does not mean "immediately after they are applied". Sometimes Django tries to run the migration much later. If you have your models scattered across a large number of applications (we use apps to gateway entire classes of related features), a late migration sometimes tries to reference a column in another model that has long since been removed by a more recently added migration in its respective app.
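
A tiny sketch of that situation, with made-up names and brute-force enumeration instead of Django's solver. Assume `orders.0001_initial` reads a column that `users.0002_drop_nickname` removes, and that nobody declared a dependency saying so.

```python
from itertools import permutations

# Each migration maps to the migrations it must not run before.
graph = {
    "users.0001_initial": set(),
    "users.0002_drop_nickname": {"users.0001_initial"},
    "orders.0001_initial": {"users.0001_initial"},
}


def valid_orders(graph):
    """Enumerate every linear order that satisfies the declared dependencies."""
    result = []
    for candidate in permutations(sorted(graph)):
        position = {name: index for index, name in enumerate(candidate)}
        if all(
            position[dep] < position[name]
            for name, deps in graph.items()
            for dep in deps
        ):
            result.append(candidate)
    return result


orders = valid_orders(graph)
# Orders in which the column is already gone when orders.0001_initial runs.
broken = [
    order for order in orders
    if order.index("users.0002_drop_nickname")
    < order.index("orders.0001_initial")
]
```

Both orders satisfy "not earlier than `users.0001_initial`", yet one of them runs the `orders` migration after the column it needs has been dropped.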

>> 3. Removing an app from a project is a nightmare. You can't migrate to zero
>> state unless the app is still there. There is no way to add "revert all
>> migrations for app X" to the migration graph, it's something you need to
>> run manually. There is no clean way to remove an app that was ever
>> referenced in a relation. We were forced to do all kinds of hacks to get
>> around this. Sometimes it's necessary to create an empty eggshell app with
>> the same name and copy all migrations there then add necessary data
>> migrations and finally migrations that remove all the models, indices,
>> procedures etc. Sometimes people just leave a dead application in
>> INSTALLED_APPS to not have to deal with this.
>
> Clear out (maybe even remove) models.py and type "makemigrations", and you get
> a migration that deletes everything. The answer to getting rid of the
> historical migrations is squashing, but of course you first need squashing to
> work properly.

I cannot clear out anything from an app that came from PyPI. That's why I mentioned creating fake empty apps that are just containers for their migration history. Squashing does nothing to help with that if you have another application reference any of those models. Squashing only helps you have fewer migrations. If the migrations were always in the correct order, the migration engine could collapse them automatically at execution time.
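
Collapsing at execution time might look something like the sketch below. Operations are plain tuples invented for this example, and the sketch assumes each model name is created at most once and that no other model refers to the short-lived one.

```python
def collapse(operations):
    """Drop every operation on a model that is created and later deleted."""
    created = set()
    short_lived = set()
    for kind, model, *_ in operations:
        if kind == "create_model":
            created.add(model)
        elif kind == "delete_model" and model in created:
            short_lived.add(model)
    return [op for op in operations if op[1] not in short_lived]


operations = [
    ("create_model", "Coupon"),
    ("alter_field", "Coupon", "code"),
    ("add_field", "Order", "note"),
    ("delete_model", "Coupon"),
]

collapsed = collapse(operations)
```

The create, alter and delete of `Coupon` cancel out, so only the unrelated `Order` change is left to execute. This relies on the list already being in the right order.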

>> 4. Squashing migrations is wonky at best. If you create a model in one
>> migration, alter one of its fields in another and then finally drop the
>> model sometime later, the squashed migration will have Django try to
>> execute the alter first and complain about the table not being there. Also
>> the only reason we need to squash migrations is to prevent problem 1 above
>> from becoming exponentially worse. If migrations were only as slow as the
>> underlying SQL commands, we'd likely never squash them.
>
> If that's so, it's a bug you should report; it's also an issue you can work
> around by editing the migration to remove the redundant operation. There are
> issues with squashing, to be sure, but I don't think this is one of the
> serious ones.

It's a bug that I will report at some point but I mostly encounter it in environments where I can't afford the time needed to properly debug.

>> 6. Conflict detection and resolution (makemigrations --merge) is a make-believe
>> solution. It just trains people to execute the command without
>> investigating whether their migration history still makes sense.
>
> It could be smarter, assuming it understood the content of migrations. We
> could probably improve it to a point where, for most cases, it would either
> know to merge automatically or know that there really is a conflict. This would
> probably not help you if you have a lot of RunPython's in your migrations.

Depends on the project. I really don't care about the framework trying to reason about the results of a git merge. Django does not have enough understanding of the code and the version control history to do the job of the person responsible for the merge. Getting stuff right 60% of the time is not reliable enough to depend on it but is reliable enough to make some people lazy.

>> Some of these I need to dig deeper into and probably file proper tickets.
>> For example I have an idea on how to fix 4 but it would make 1 even slower.
>>
>> I took some time to get a good long look at what other ORMs are doing. The
>> graph-based dependency solving approach is rather uncommon. Most systems
>> treat migrations as part of the project rather than the packages it uses.
>>
>>
>> Possible solution (or "how I'd build it today if there was no existing code
>> in Django core"):
>>
>> a. Make migrations part of the project and not individual apps. This takes
>> care of problem 3 above.
>
> So, there'd be no reason to link a migration to a specific app; quite the
> contrary, it would become much more logical to have one migration include
> operations for many apps. That could make the process of making an app
> reusable while developing it in a project quite painful.

It's already possible to do whatever you like in a migration. They're only limited to modifying stuff within an app by convention. You can also indent Python code using three tabs and two spaces but it does not automatically make it a good idea :)

>> b. Prefix individual migration files with a UTC timestamp
>> (20161105151023_add_foo) to provide a strict sorting order. This removes
>> the depsolving requirement and takes care of 1 and 2. By eliminating those
>> it makes 4 kind of obsolete as squashing migrations would become pointless.
>
> 4: No, on large databases, squashing migrations is not pointless.
>
> 1&2: Strict order has its issues: Currently, if I find a problem with the last
> migration of app A, I roll it back, fix it, and roll forward. With strict
> order, I would have to roll back the project, not the app.

If anything depends on app A then you're already in the exact same situation.

>> c. Have reusable apps provide migration templates that Django then copies
>> to my project when "makemigrations" is run.
>
> I'd like to see some more details about how this works; they would need to
> include the development process of reusable apps.

See my other responses for some ideas. Again, I am not proposing a particular solution with readily available code and a DEP to support it.

>> d. Maintain a separate directory for each database connection.
>
> Seems wrong as a blanket idea -- really depends on how the databases are used.
> I wouldn't want to find myself maintaining copies of migrations which are
> supposed to run on more than one database.

See other responses for how that could be solved with changesets.

>> e. Execute all migrations in alphabetical order (which means by timestamp
>> first). When an unapplied migration is followed by an applied one, ask
>> whether to attempt to just apply it or if the user wants to first unapply
>> migrations that came after it. To me this would work better than 6.
>
> This sounds like a good way to create data losses.

That's a vague statement at best.

>> f. Migrating to a timestamp solves 5.
>
> Not really. Not with a team, since the timestamps will indicate not the real
> logical order, but the order of development. You'd need empty "tag" migrations
> to set points you want to migrate to...

I don't think it's a problem that spans multiple computers. You only switch feature branches in your own local checkout.

Cheers,

Patryk Zawadzki

Nov 5, 2016, 7:01:38 PM

to Django developers (Contributions to Django itself)

On Saturday, November 5, 2016 at 7:57:38 PM UTC+1, Aymeric Augustin wrote:

My solution is to throw away and remake all migrations on a regular basis. Then I `TRUNCATE TABLE django_migrations` and `django-admin migrate --fake`. Obviously this isn’t a great solution.

Since I work mostly on small projects with just a couple developers on staff, doing this every few months suffices to keep the run time below one minute (which is still quite annoying).

There’s a risk of losing important, manually generated migrations, typically those that create indexes. I diff the schema before / after with apgdiff to avoid such problems.

That's the main problem we're facing. I'm currently leading a project that predates dinosaurs and when it was switched from South to Django, all the data migrations were just carried over from the old code. They are holy gifts from the elder gods and are rich with eldritch symbols. Nobody wants to have to copy and paste them every month or so when we decide to redo all of the migrations.

There's also the problem of having many long-running (weeks to months) feature branches that make it hard to find a point in time where all migrations can be safely discarded.

I can also imagine it's much harder to redo initial migrations in projects where two-way relations exist between certain applications.

Cheers,

Marten Kenbeek

Nov 5, 2016, 7:32:04 PM11/5/16

to Django developers (Contributions to Django itself)

On Saturday, November 5, 2016 at 4:53:49 PM UTC+1, Patryk Zawadzki wrote:

1. Dependency resolution that turns the migration dependency graph into an ordered list happens every time you try to create or execute a migration. If you have several hundred migrations it becomes quite slow. I'm talking multiple minutes kind of slow. As you can imagine working with multiple branches or perfecting your migrations quickly becomes a tedious task.

Did the dependency resolution actually come up in benchmarks/profiles as a bottleneck? When I optimized and benchmarked the dependency graph code, it had no trouble ordering ~1000 randomly generated migrations with lots of inter-app dependencies in less than a second. I'd be surprised if this had any significant impact on the overall performance of migrations.

An easy way to test this is the `showmigrations` command, which only generates the graph, without any model state changes or model rendering taking place. It does some other things, but nothing that should take on the order of minutes.
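
For a rough feel of how cheap the ordering step alone can be, here is a small stand-alone sketch that orders 1000 randomly generated "migrations" with cross-app dependencies. It uses the standard library's `graphlib.TopologicalSorter`, not Django's own graph code, and the graph shape (20 apps, 50 migrations each, 2 extra dependencies per migration) is an arbitrary assumption, so treat the timing as an illustration rather than a benchmark of Django.

```python
import random
import time
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def random_migration_graph(n_apps=20, per_app=50, cross_deps=2, seed=0):
    """Build {migration: set(dependencies)} for n_apps * per_app migrations."""
    rng = random.Random(seed)
    graph = {}
    created = []  # migrations created so far, in creation order
    for number in range(per_app):
        for app in range(n_apps):
            node = (f"app{app}", number)
            deps = set()
            if number > 0:
                # Each migration depends on the previous one in the same app.
                deps.add((f"app{app}", number - 1))
            if created:
                # Extra dependencies only point at earlier migrations,
                # which keeps the graph acyclic.
                deps.update(rng.sample(created, min(cross_deps, len(created))))
            graph[node] = deps
            created.append(node)
    return graph


graph = random_migration_graph()
start = time.perf_counter()
order = list(TopologicalSorter(graph).static_order())
elapsed = time.perf_counter() - start
print(f"ordered {len(order)} migrations in {elapsed:.4f}s")
```

On a real project the equivalent measurement is simply timing `showmigrations`, as suggested above.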

charettes

Nov 5, 2016, 7:58:49 PM11/5/16

to Django developers (Contributions to Django itself)

I have to agree with Marten.

From my experience, what really slows down the migrate and makemigrations
command is the rendering of model states into concrete model classes. This
is something I concluded from my work on adding the plan object to pre_migrate
and post_migrate signals.

As soon as an operation accesses state.apps the rendering kicks in which
triggers the dynamic creation of multiple model classes and the computation
of reverse relationships. There are mechanisms in place to prevent the whole
project model classes from being rendered again when a model state is
altered but if the operation is performed on a model referenced by many
others the relationship chain might force a large number of them to be
rendered again causing massive slow downs.
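
A toy model of that effect, to show why the cost depends on which model is touched. The `ToyState` class and the model names are hypothetical and far simpler than Django's real project state; the sketch only counts how many models would need re-rendering when one of them is altered.

```python
# Toy model of the behaviour described above; ToyState is hypothetical
# and much simpler than Django's real project state.

class ToyState:
    def __init__(self, relations):
        # relations: {model: set of models it points to, e.g. via ForeignKey}
        self.neighbours = {}
        for model, targets in relations.items():
            self.neighbours.setdefault(model, set())
            for target in targets:
                # Record both directions: forward and reverse relationships.
                self.neighbours[model].add(target)
                self.neighbours.setdefault(target, set()).add(model)
        self.render_calls = 0

    def related_chain(self, model):
        """Every model reachable from `model` through relations."""
        seen, stack = set(), [model]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.neighbours[current])
        return seen

    def alter(self, model):
        """Altering a model re-renders it and its whole relationship chain."""
        stale = self.related_chain(model)
        self.render_calls += len(stale)
        return stale


state = ToyState({
    "Order": {"User", "Product"},
    "Review": {"User", "Product"},
    "Product": {"Category"},
    "Tag": set(),  # not related to anything
})

print(sorted(state.alter("Tag")))   # an isolated model: only itself
print(sorted(state.alter("User")))  # a widely referenced model: the whole chain
```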

Markus Holtermann has been working on teaching the migration framework
how to perform database operations without relying on state.apps which should
solve the remaining performance issues of the migrate command. In the case
of makemigrations the last remaining issue in the master branch should be solved
by no longer relying on state.apps in RenameModel.state_forwards[1].

Patryk, many improvements landed in 1.9 and 1.10 to speed up the commands
dealing with migrations. Are you still seeing the same slowdown on these versions?

Simon

[1] https://github.com/django/django/pull/7468

Patryk Zawadzki

Nov 6, 2016, 2:46:18 AM11/6/16

to Django developers (Contributions to Django itself)

On Sun, Nov 6, 2016 at 00:58, charettes <chare...@gmail.com> wrote:

I have to agree with Marten.

From my experience, what really slows down the migrate and makemigrations
command is the rendering of model states into concrete model classes. This
is something I concluded from my work on adding the plan object to pre_migrate
and post_migrate signals.

Yes, rendering model states is very slow but in our case ordering them is also taking quite some time.

I assume that with a linear chain of migrations we'd only have to render model states when detecting database changes (makemigrations) and when executing RunPython code?

As soon as an operation accesses state.apps the rendering kicks in which
triggers the dynamic creation of multiple model classes and the computation
of reverse relationships. There are mechanisms in place to prevent the whole
project model classes from being rendered again when a model state is
altered but if the operation is performed on a model referenced by many
others the relationship chain might force a large number of them to be
rendered again causing massive slow downs.

Markus Holtermann has been working on teaching the migration framework
how to perform database operations without relying on state.apps which should
solve the remaining performance issues of the migrate command. In the case
of makemigrations the last remaining issue in the master branch should be solved
by no longer relying on state.apps in RenameModel.state_forwards[1].

Patryk, many improvements landed in 1.9 and 1.10 to speed up the commands
dealing with migrations. Are you still seeing the same slowdown on these versions?

I run a mix of 1.9 and 1.10 and am aware of the recent optimisations as I helped with some of them during previous DUTHs.

Simon

[1] https://github.com/django/django/pull/7468



charettes

Nov 6, 2016, 4:15:32 AM11/6/16

to Django developers (Contributions to Django itself), pat...@room-303.com

> I assume that with a linear chain of migrations we'd only have to render model states when detecting database changes (makemigrations) and when executing RunPython code?

Right, but I think it should be possible to prevent the rendering of model states
in the autodetector. I'm planning to sprint on this today. There's not much we
can do for the RunPython case unless we manage to make model rendering
lazy. For example, apps.get_model('app.Foo') could lazily render the Foo model
and all its forward and reverse relationships. This seems hard to do but it could be
worth investigating once we've dealt with all the low-hanging fruit.
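
A minimal sketch of what such lazy rendering could look like, assuming model states are reduced to a mapping of labels to their forward relations. `LazyApps` is a hypothetical class, not something that exists in Django, and the `type()` call is only a stand-in for building a real model class.

```python
# Hypothetical sketch: LazyApps is not a Django class.

class LazyApps:
    def __init__(self, model_states):
        # model_states: {"app.Model": ["app.Other", ...]} listing forward relations.
        self.model_states = model_states
        self.rendered = {}

    def _reverse(self, label):
        """Labels of the models that point at `label`."""
        return [src for src, targets in self.model_states.items() if label in targets]

    def get_model(self, label):
        """Render `label` and everything related to it, but nothing else."""
        pending = [label]
        while pending:
            current = pending.pop()
            if current in self.rendered:
                continue
            # Stand-in for building a real model class from a model state.
            self.rendered[current] = type(current.split(".")[1], (), {})
            pending.extend(self.model_states.get(current, ()))  # forward relations
            pending.extend(self._reverse(current))              # reverse relations
        return self.rendered[label]


apps = LazyApps({
    "shop.Order": ["auth.User"],
    "auth.User": [],
    "blog.Post": [],
    "blog.Comment": ["blog.Post"],
})

Order = apps.get_model("shop.Order")
print(sorted(apps.rendered))  # the blog models have not been rendered
```

The hard part in practice is presumably everything the toy leaves out: inheritance, swappable models, and relations that are only discovered while rendering.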

> I run a mix of 1.9 and 1.10 and am aware of the recent optimisations as I helped with some of them during previous DUTHs.

Great, thanks for that!

Patryk Zawadzki

Nov 8, 2016, 7:35:02 AM11/8/16

to charettes, Django developers (Contributions to Django itself)

I've just hit another problem related to custom fields.

Currently migrations contain information about "rich" fields. If you use a custom field type, the migration code imports your field type from its Python module. This is highly problematic if the code moves, or if you later stop using that field type and want to remove the dependency.

I am currently in the process of rewriting some of my existing migrations by hand to replace all instances of a custom field type with the type it actually uses for storage. This will eventually allow me to drop the dependency but it's not very nice.

Another problem is that for many custom field types makemigrations detects changes made to arguments that do not affect the database in any way (as they are returned by deconstruction).

If we could ever break backwards compatibility, I'd suggest having field deconstruction only return the column type (and necessary arguments) it wants the schema editor to create. This would prevent the migrations from having external dependencies (which is a major win in itself).

I'd also consider having apps.get_model() just use introspection to read the schema and return transient models with default field types for each underlying column type (so a custom JSONField would become a regular boring TextField inside migration code). This would save us tons of "rendering model states" time for the relatively small cost of having to cast certain columns to your preferred Python types inside a couple of data migrations.
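
To illustrate the point about arguments that do not affect the database, here is a toy sketch. The classes are stand-ins, not Django fields; they only mimic the shape of `Field.deconstruct()`, which returns `(name, path, args, kwargs)`. The `encoder` argument and the `myapp.fields` path are invented for the example, and `field_changed` is only a rough approximation of how a change detector compares deconstructed fields.

```python
# Toy stand-ins, not Django classes: they only mimic the shape of
# Field.deconstruct(), which returns (name, path, args, kwargs).

class ToyJSONField:
    def __init__(self, name, encoder="default", null=False):
        self.name = name
        self.encoder = encoder  # affects Python-side serialisation only
        self.null = null        # affects the column definition

    def deconstruct(self):
        kwargs = {"null": self.null, "encoder": self.encoder}
        path = f"myapp.fields.{type(self).__name__}"  # hypothetical module path
        return self.name, path, [], kwargs


class QuietJSONField(ToyJSONField):
    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        kwargs.pop("encoder")  # leave out what the schema does not depend on
        return name, path, args, kwargs


def field_changed(old, new):
    """Rough approximation of change detection: compare deconstructed forms."""
    return old.deconstruct()[1:] != new.deconstruct()[1:]


# Changing `encoder` looks like a schema change for the first field only.
print(field_changed(ToyJSONField("data"), ToyJSONField("data", encoder="pretty")))
print(field_changed(QuietJSONField("data"), QuietJSONField("data", encoder="pretty")))
# Changing `null` is still detected for the second field.
print(field_changed(QuietJSONField("data"), QuietJSONField("data", null=True)))
```

Note that the migration would still import the field from its module path either way, so this only addresses the spurious changes, not the dependency itself.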

Cheers,

Andrew Godwin

Nov 8, 2016, 8:48:59 AM11/8/16

to Django developers (Contributions to Django itself), charettes

On Tue, Nov 8, 2016 at 12:34 PM, Patryk Zawadzki <pat...@room-303.com> wrote:

I've just hit another problem related to custom fields.

Currently migrations contain information about "rich" fields. If you use a custom field type, the migration code imports your field type from its Python module. This is highly problematic if the code moves, or if you later stop using that field type and want to remove the dependency.

I am currently in the process of rewriting some of my existing migrations by hand to replace all instances of a custom field type with the type it actually uses for storage. This will eventually allow me to drop the dependency but it's not very nice.

This was a hard choice to make - I was obviously aware of the risks here, but eventually chose the current system given that it's far easier to reset the migrations and start over in the Django system than it was in South, and that removing code is generally rarer than adding it in.

Another problem is that for many custom field types makemigrations detects changes made to arguments that do not affect the database in any way (as they are returned by deconstruction).

This has to be done unless fields came with a list of keyword arguments that were "known safe", and all subclasses of those fields also implemented that method (in case you e.g. subclassed CharField and made the `choices` kwarg actually use a MySQL ENUM).

If we could ever break backwards compatibility, I'd suggest having field deconstruction only return the column type (and necessary arguments) it wants the schema editor to create. This would prevent the migrations from having external dependencies (which is a major win in itself).

That's not possible if you want to keep the migrations database-agnostic, as the type of a column varies based on the backend (and sometimes other things). If you want a system that is fixed to an exact database, at some point it might be better to just use SQL.

(There is totally room for generating migrations as raw SQL and still having them work in the current system, which would also get around the field problem you describe)

I'd also consider having apps.get_model() just use introspection to read the schema and return transient models with default field types for each underlying column type (so a custom JSONField would become a regular boring TextField inside migration code). This would save us tons of "rendering model states" time for the relatively small cost of having to cast certain columns to your preferred Python types inside a couple of data migrations.

This runs into issues when the schema you read does not give you enough information - e.g. some field types (especially geospatial ones) are more than just a column, there can also be a sequence, some indexes, constraints etc. involved.

I wrote a more advanced introspection backend as part of the migrations work, but you'd need to extend it even more and improve upon features like foreign key implication before it would be possible to do this.