At first glance, using `manage.py dumpdata` and `loaddata` together seems
a great way to make a full copy of an existing django installation (e.g.
for migrating to a different server, or getting a local copy of your
production data, etc.). Documentation suggests this should be possible.
An obvious way would be to do `dumpdata` on one system, followed by
`flush` and `loaddata` on the other system.
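In commands, that would be something like this (a sketch; the fixture
filename is just a placeholder):
{{{
# On the source system: dump all applications to a JSON fixture
python manage.py dumpdata > dump.json

# On the target system: empty all tables, then load the fixture
python manage.py flush --no-input
python manage.py loaddata dump.json
}}}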
However, when you try it, you run into duplicate-key errors in the
contenttypes and similar tables, things like:
{{{
MySQLdb._exceptions.IntegrityError: (1062, "Duplicate entry 'someapp-
somemodel' for key 'django_content_type_app_label_model_76bd3d3b_uniq'")
}}}
What seems to happen is that `flush`
([https://docs.djangoproject.com/en/dev/ref/django-admin/#flush as
documented]) flushes all tables and then reruns the "post-synchronization
handlers", which re-create content types and, I think, permissions and
maybe other things as well. Since `dumpdata` does dump these tables, this
creates a conflict.
Currently, I think you can work around this by:
- Making and importing a full database dump outside of Django (e.g. using
mysqldump). This is a good way to guarantee a really identical copy
(though there might be timezone issues with e.g. MySQL), but it is often
less convenient and does not work across database types (e.g. dumping a
remote MySQL database into a local sqlite database).
- Using natural keys when dumping. The
[https://docs.djangoproject.com/en/dev/ref/django-admin/#dumpdata
documentation for `dumpdata --natural-foreign`] suggests using natural
keys when contenttypes and permissions are involved. I believe this works
because the natural foreign keys allow any references to these tables to
be re-associated with the auto-created rows in the target database. In
addition, and I think the documentation does not make this explicit, you
would also need to exclude the contenttypes, permissions and any other
auto-created models from the `dumpdata`, or also add `--natural-primary`,
which I believe makes `loaddata` overwrite existing data based on the
natural primary key rather than adding new rows (see the sketch after
this list). [[BR]]
Having to manually exclude models is quite cumbersome for a quick dump-
and-load cycle. Also, if the dumped database somehow contained *fewer*
contenttypes, permissions, etc. than the auto-created ones, the newly
loaded database would still contain the extra ones. More generally, the
loaded database is not an identical copy of the original one.[[BR]]
I also had some issues with this approach, due to circular references
in my natural keys, but I think this has since been fixed in git.
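For reference, the natural-keys variant would look roughly like this (a
sketch; which apps need excluding depends on your project):
{{{
# Dump with natural keys, excluding auto-created models
python manage.py dumpdata --natural-foreign --natural-primary \
    --exclude contenttypes --exclude auth.permission > dump.json

# On the target system
python manage.py loaddata dump.json
}}}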
I wonder if we can make this process easier somehow?
One solution that springs to mind is to add a `flush --no-handlers` option
(or something like that) to prevent running the "post-synchronization
handlers". This would (or at least should) result in empty tables for all
tables that are dumped by `dumpdata` (I think this means all tables end up
empty, except for the migrations table). Then doing a `dumpdata`, `flush
--no-handlers` and `loaddata` could, I think, produce an exact copy of the
database, including matching primary keys.
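The full cycle could then become something like this (hypothetical, since
the `--no-handlers` option does not exist today):
{{{
# On the source system
python manage.py dumpdata > dump.json

# On the target system: flush without re-running the handlers that
# re-create contenttypes etc. (--no-handlers is the proposed option)
python manage.py flush --no-input --no-handlers
python manage.py loaddata dump.json
}}}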
Or are there any other existing ways to make this easier that I missed
and/or could be (better) documented?
--
Ticket URL: <https://code.djangoproject.com/ticket/31869>
Comment (by Matthijs Kooijman):
Just realized that I forgot to list one option, which I think is the most
practical way to do this right now:
- Instead of using `manage.py flush`, use a database-specific management
tool (e.g. phpmyadmin) to empty/truncate all tables *except* for
migrations. Then just use `dumpdata` and `loaddata`, which should get you
an identical import, including identical primary keys. This works, but it
still requires external tools, so it is harder to do quickly and/or
automate; a `manage.py`-only solution would certainly be nicer (see the
sketch below).
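With MySQL, for example, the truncation step could look roughly like this
(a sketch; credentials and the table list are placeholders that depend on
your apps):
{{{
# Disable FK checks, then empty every table except django_migrations
mysql -u myuser -p mydb -e "
  SET FOREIGN_KEY_CHECKS = 0;
  TRUNCATE TABLE django_content_type;
  TRUNCATE TABLE auth_permission;
  -- ...repeat for every other table except django_migrations...
  SET FOREIGN_KEY_CHECKS = 1;"
}}}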
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:1>
* status: new => closed
* resolution: => wontfix
Comment:
> At first glance, using `manage.py dumpdata` and `loaddata` together
seems a great way to make a full copy of an existing django installation
(e.g. for migrating to a different server, or getting a local copy of your
production data, etc.). Documentation suggests this should be possible.
I don't see such suggestions in the docs. Given that `dumpdata` isn't
meant as a backup tool (databases have their own tools for that), I think
we should not add more options to make this easier or document a more or
less feasible scenario.
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:2>
Comment (by Matthijs Kooijman):
> I don't see such suggestions in docs.
This is not explicit, but IMHO implied. There's a command to make a dump
of the data for all applications, and there's a command that can load such
dumps, which suggests to me that I can use this to copy all my data into
another django instance. If not, maybe this should be made explicit in the
docs.
> Given that dumpdata isn't meant as a backup tool (databases have their
own tool for that)
I was not talking about backups here (I agree that database-specific tools
are better for this, though I could also see `dumpdata` as a second-best
option), but about copying data from one instance to another. But that's
just a matter of definition; I'm assuming that you would also consider
that `dumpdata` is not meant for such data copying?
If so, then what is `dumpdata` meant for exactly? Is it only for creating
test fixtures? Then why does it have a "dump all applications" option
and/or why does it include contenttypes and other autogenerated data at
all (even by default)?
And even if this copying is not the intention of `dumpdata`, is there any
reason its intended use case cannot be expanded to include this? It
seems to me that it would be easy to do so, but maybe there are
fundamental issues or things that prevent this from working reliably? If
so, maybe it would be better to document that fact (maybe with some
examples) to make this more explicit and correct user expectations. If you
can point out things here, I'd be happy to submit a documentation PR for
this.
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:3>
Comment (by felixxm):
> I was not talking about backups here (I agree that database-specific
tools are better for this, though I could also see `dumpdata` as a second-
best option), but about copying data from one instance to another. But
that's just a matter of definition; I'm assuming that you would also
consider that `dumpdata` is not meant for such data copying?
Sure, you can use `dumpdata` to create a copy of data, but you're talking
about a specific and advanced scenario (a combination of 3 management
commands) for overwriting an existing and non-empty instance. IMO, this is
not something that we want/should support or document.
You've already mentioned in the ticket description that it's possible if
you exclude Django apps that use the `post_migrate` signal and use natural
keys in `dumpdata`; however, I don't think we should document this because
`dumpdata` isn't meant as a backup/clone tool.
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:4>
Comment (by Matthijs Kooijman):
> Sure, you can use `dumpdata` to create a copy of data, but you're
talking about a specific and advanced scenario (a combination of 3
management commands) for overwriting an existing and non-empty instance.
IMO, this is not something that we want/should support or document.
I'm not married to the use of three management commands; if just
`dumpdata` and `loaddata` could achieve the same, that would also be
perfect (and using these two commands together does not seem like an
advanced scenario to me, really).
As for overwriting an existing instance, I believe the issue would be no
different when starting from a non-existent database. You would then have
to run `migrate` to create the table structure, which also runs the same
handlers that create e.g. contenttypes.
One more thing I had wanted to emphasize in my previous comment: using
external database tools is fine for backup and restore, but when moving
data between database types, making SQL-based dumps is pointless AFAICS,
and some higher-level tool like `dumpdata` would be needed. If `dumpdata`
is not the tool for this, is there an alternative? Or is changing the type
of your database and/or moving data between instances with different
databases simply not supported by Django?
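For illustration, assuming a settings file that defines both the
production database and a local sqlite copy under the aliases `default`
and `local` (the alias names here are made up), the cross-database copy
could look roughly like this:
{{{
# Dump from the MySQL/Postgres database (the default alias)...
python manage.py dumpdata --natural-foreign \
    --exclude contenttypes --exclude auth.permission > dump.json

# ...and load into the sqlite database (after migrating it first)
python manage.py migrate --database local
python manage.py loaddata --database local dump.json
}}}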
This does not seem like a weird or unconventional use case to me. One
particular instance of this is wanting to import some data from a
production server (typically MySQL or Postgres) into a local development
copy to reproduce some problem without having to debug on production. Of
course, you can have a staging copy of your application for this, or set
up MySQL/PostgreSQL locally, which certainly makes sense for big and
professional deployments, but for small-scale applications this is a lot
of extra work just for this purpose.
Apologies if I come across as stubborn, but I feel that just saying "this
is unsupported" is insufficient in this case. Maybe this is extra annoying
because I'm not the first one to suggest this and it is a recurring
discussion for you, but if so, that would be even more of an indication
that this use case should be better documented (even if just as not
supported).
Again, I would consider filing a documentation PR to clarify that this
use case is not supported, but I'm still grasping a little at what is
and is not the intended use case for `dumpdata` and `loaddata`. What
defines the line between supported and not supported?
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:5>
Comment (by felixxm):
> Apologies if I come across as stubborn, but I feel that just saying
"this is unsupported" is insufficient in this case. Maybe this is extra
annoying because I'm not the first one to suggest this and it is a
recurring discussion for you, but if so, that would be even more of an
indication that this use case should be better documented (even if just
as not supported).
We cannot document each unsupported scenario. It's also
[https://docs.djangoproject.com/en/3.1/topics/serialization/#natural-keys
documented] that ''"You should never include automatically generated
objects in a fixture or other serialized data."''
> This does not seem like a weird or unconventional use case to me.
Moving data between different databases is always hard, because you can
encounter multiple issues, e.g. inconsistent or non-existent data types,
and much more (that's why we have many feature flags). Adding support for
overwriting an existing and non-empty instance with `loaddata` and
`dumpdata` is really complicated, and IMO we shouldn't officially support
it.
> What defines the line between supported and not supported?
If something is not tested and documented, then it's not supported.
You can start a discussion on DevelopersMailingList if you don't agree.
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:6>
Comment (by Carlton Gibson):
Hi Matthijs — I think this is an interesting topic. A write-up of exactly
what's required to move between DBs might be a good starting point for a
discussion. (Who knows whether we'll support it in the end or not, but
it's something everyone has run into a few times, I'd guess...)
--
Ticket URL: <https://code.djangoproject.com/ticket/31869#comment:7>