Fwd: Discussion related to ticket #26822 (new migrations, --keepdb and --parallel option)

323 views
Skip to first unread message

Romain Garrigues

unread,
Jul 4, 2016, 10:35:11 PM7/4/16
to django-d...@googlegroups.com
Hi everyone,

I have raised few days ago an issue (https://code.djangoproject.com/ticket/26822) I encountered when using --keepdb and --parallel option.
I have proposed a patch (https://github.com/django/django/pull/6884), and after some exchanged with Aymeric Augustin, I have been encouraged to post to this list.

The context is the following:
- I run the first time the tests on my Django project (1.9.7) with --keepdb and --parallel option.
- All databases are kept, everything is working as expected.
- I add a new field in a model, generate a new migration and add some tests related to that field.
- I run again my tests with --keepdb and --parallel option: they are now failing because this new field has not been created in cloned databases.

After some investigation, I have seen that, in case of keepdb context, in django/db/backends/*/creation.py, if the cloned databases already exist, we don't touch them, which leads to this new field not created in cloned ones.

I have proposed in the PR to rebuild the cloned databases, even with keepdb option, to be sure that we always have the cloned databases with the latest migration state.

The problem with this method is that it will increase test database initialization time, as we will now systematically copy all cloned databases, even with --keepdb option (except the default one).

We could have just documented this limitation, but I don't think that my situation is a really rare edge case in terms of process, so I was suggesting to add a new option to be able to reset the cloned databases if needed (let's name it --parallel-clone-reset).

I don't really like the idea of adding a new option, as it impacts the test runner, the clone_test_db function signature, ... but I have not found a better idea to at the same time keep the performances with --keepdb and --parallel, and handle these newly added migrations to a project.

To summarize my proposal, this option (--parallel-clone-reset, or any other name) should be used only if you are using --keepdb and --parallel options at the same time, and when you have added a new migration between 2 test run.

I'm not sure my explanation was clear enough, but I would really like to have your feedbacks about it.

Regards,

Romain Garrigues.






Andreas Pelme

unread,
Jul 5, 2016, 2:49:32 AM7/5/16
to django-d...@googlegroups.com
Hello,

> On 5 juli 2016, at 00:22, Romain Garrigues <romain.ga...@gmail.com> wrote:
> After some investigation, I have seen that, in case of keepdb context, in django/db/backends/*/creation.py, if the cloned databases already exist, we don't touch them, which leads to this new field not created in cloned ones.
>
> I have proposed in the PR to rebuild the cloned databases, even with keepdb option, to be sure that we always have the cloned databases with the latest migration state.
>
> The problem with this method is that it will increase test database initialization time, as we will now systematically copy all cloned databases, even with --keepdb option (except the default one).

We’ve been doing similar things in pytest-django (with the --reuse-db option and pytest-xdist) and faces similar problems. Currently you have to force-recreate the databases and then all processes will run migrations and it is very slow.

I’ve been playing around with a solution to this: In my own project I create a template database and call it `test_myproject_<sha1 of all migration files>`. Whenever a migration file changes (an existing file or a new migration file) - a new database is created and all clones are recreated.

Currently this lives as a hack in my own code base, but I would like to explore this further and it could be a way forward. Here is my scripts that calculates the hash and creates the databases:
https://gist.github.com/pelme/4b3dac475cd6b1dec4fd67d25d2e7cdc
https://gist.github.com/pelme/4a3ad3a62b6244068ff63736342f9509

This method could be refined: It is not necessary to create a database with a new name every time migrations change. I.e. we could create a private table with a single row that contains the hash.

This approach hashes only the migration files directly involved in migrations, if you are using a 3rd-party library that’s imported, that will not trigger a new migration run.

As an end user of this the experience is quite nice: You only experience the migration/cloning slowness whenever migrations actually changed, otherwise everything is fast. You don’t have to remember any special command line options.

Cheers,
Andreas

Aymeric Augustin

unread,
Jul 5, 2016, 3:00:43 AM7/5/16
to django-d...@googlegroups.com
Hello,

I’ll try to clarify what I said in the PR below..

The main reason for the `--parallel` option was to make Django’s own test suite faster. A full run went down from ~8m to ~1m30 when I committed that patch, which really helps the development cycle on invasive patches.

Since Django’s own test runner bypasses migrations, whenever you make changes to a model in Django’s test suite, you need a run without `--keepdb`. So the problem you’re describing doesn’t exist for the primary use case.

Now let’s talk about models and migrations in users’ projects, which is the logical next step and the use case you’re trying to improve.

Note that the `--parallel` option is experimental and often non-functional in this context: as soon as two tests hit a resource other than the database — say, the cache — they can stomp upon one another.

> On 05 Jul 2016, at 00:22, Romain Garrigues <romain.ga...@gmail.com> wrote:
>
> We could have just documented this limitation, but I don't think that my situation is a really rare edge case in terms of process, so I was suggesting to add a new option to be able to reset the cloned databases if needed (let's name it --parallel-clone-reset).

When I make changes to models, usually I keep removing and recreating a single migration, which is incompatible with using the `--keepdb` option. Whenever I make changes, I run without `--keepdb` once.

> I don't really like the idea of adding a new option, as it impacts the test runner, the clone_test_db function signature, ... but I have not found a better idea to at the same time keep the performances with --keepdb and --parallel, and handle these newly added migrations to a project.

I’m not a fan of a new option either…

> To summarize my proposal, this option (--parallel-clone-reset, or any other name) should be used only if you are using --keepdb and --parallel options at the same time, and when you have added a new migration between 2 test run.

IIRC this will more than double the run time of Django’s own test suite on MySQL: it will increase from ~2m to ~4m (give or take 30s) because cloning databases is slow on MySQL.

I’m quoting all these figures from memory and I may mix them up. As I said on the ticket it would be useful to redo the benchmark on a first run and subsequent run of `./runtests.py` on PostgreSQL and MySQL.

You could argue that it’s best to degrade the experience of a few Django contributors (original use case, Django’s test suite) for the benefits of the wider community (new use case, projects’ test suites). However the original use case is known to work and I don’t believe that the new use case works well enough in general, at least not without some engineering to isolate tests from one another. For this reason I’m not convinced by this argument.

I hope this clarifies the context of the trade-off we’re discussing.

Best regards,

--
Aymeric.

Markus Holtermann

unread,
Jul 5, 2016, 7:48:38 AM7/5/16
to django-d...@googlegroups.com
Hi,

it might be a shot in the dark, but can't we check if Django's
testrunner applied new migrations in which case we drop the cloned
databases and recreate them. If all migrations already existed we keep
the clones the way they are?

/Markus
>--
>You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
>To post to this group, send email to django-d...@googlegroups.com.
>Visit this group at https://groups.google.com/group/django-developers.
>To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/BD5FB0D4-C9E1-4681-A5A1-CCD6D6BB84B7%40polytechnique.org.
>For more options, visit https://groups.google.com/d/optout.
signature.asc

Romain Garrigues

unread,
Jul 5, 2016, 1:54:27 PM7/5/16
to Django developers (Contributions to Django itself)

Markus, I like the idea, which is definitely better than my idea of new option to recreate it manually when we know we have to.

I can try to investigate a bit if you think that could lead to something that makes sense.
Two ideas I have in mind after a quick look at migrate command line code:
1/ Extract the code related to the "plan" (created by executor.migration_plan(targets) function) to be also used somewhere else (in clone_test_db for example...)
2/ Make "migrate" command return a sort of report (number of migrations applied, ...) of what happened during a "migrate" call, that could then be used in db.backends.base.creation.BaseDatabaseCreation.create_test_db and passed to the connection.creation.clone_test_db loop block, moving from "django.test.runner.setup_databases" to "db.backends.base.creation.BaseDatabaseCreation.create_test_db" (as cloned databases have a link with the state of the default one, it can justify this move).

This parallel option is really great and coming with some environment constraints, as you said Aymeric, but for big projects, the gain is so impressive that I will do all I can to help on that!

I have the benchmark in my todo list, do you think it makes sense to update the current PR with one of these 2 propositions explained above?

Romain.

Romain Garrigues

unread,
Sep 21, 2016, 11:24:32 AM9/21/16
to Django developers (Contributions to Django itself)
Any feedback about my 2 proposals, to know if it is worth to spend time to propose a patch for one of those?
Reply all
Reply to author
Forward
0 new messages