Why does dumpdata, flush, loaddata cycle not result in same db content?

Phil Mocek

Apr 15, 2009, 10:43:08 PM

to django...@googlegroups.com

I have a simple project using the flatpages app. If I use
manage.py to run dumpdata (with no command line arguments,
redirecting output to a file), then flush, then loaddata (with no
command line arguments, redirecting input from a file), my sqlite3
database file is different than a backup copy I made of it, and
when I run the app, it seems to be reset to an empty-database
condition. Why is this?

The dump file appears to contain the content of my database.

The Django docs [1] say:

dumpdata

django-admin.py dumpdata <appname appname ...>

Outputs to standard output all data in the database associated
with the named application(s).

If no application name is provided, all installed applications
will be dumped.

The output of dumpdata can be used as input for loaddata.

[1]: <http://docs.djangoproject.com/en/1.0/ref/django-admin/>

In the root directory of my project, I'm running:

./manage.py dumpdata >data.json
./manage.py flush
# (answer yes to flush, no to create superuser)
./manage.py loaddata <data.json

I'm using:

* Ubuntu GNU/Linux 8.10
* Django 1.0-final-SVN-unknown (from the 1.0.1ubuntu1 package)
* Python 2.5.2
* sqlite 3.5.9

--
Phil Mocek

Russell Keith-Magee

Apr 15, 2009, 11:03:33 PM

to django...@googlegroups.com

On Thu, Apr 16, 2009 at 10:43 AM, Phil Mocek
<pmocek-list-...@mocek.org> wrote:
>
> I have a simple project using the flatpages app. If I use
> manage.py to run dumpdata (with no command line arguments,
> redirecting output to a file), then flush, then loaddata (with no
> command line arguments, redirecting input from a file),

You were going well till this point.

Loaddata doesn't take input from stdin - it loads files that are
specified on the command line. Your database is empty because you have
specified no fixtures to load.

The sequence needs to go something like:

$ ./manage.py dumpdata > mydata.json
$ ./manage.py flush
... follow the prompts
$ ./manage.py loaddata mydata.json

> The output of dumpdata can be used as input for loaddata.

This is correct, but in a different way than you have interpreted. The
output of dumpdata is in a format that loaddata can use - but that
doesn't mean it accepts input from stdin.

Yours,
Russ Magee %-)

Phil Mocek

Apr 15, 2009, 11:30:15 PM

to django...@googlegroups.com

On Thu, Apr 16, 2009 at 11:03:33AM +0800, Russell Keith-Magee wrote:
> Loaddata doesn't take input from stdin - it loads files that are
> specified on the command line.

Thanks, Russell.

I should have read the documentation for the loaddata command more
closely, but this is quite counter-intuitive. Can someone tell me
what the rationale for this syntax is?

The built-in usage help shows that the filename argument (called a
fixture for reasons that I have not yet researched) is mandatory:

Usage: manage.py loaddata [options] fixture [fixture ...]

However, manage.py does not report an error when a fixture is not
specified:

$ ./manage.py loaddata
$ echo $?
0

Only when I bump the verbosity up from 0 to 2 is an error
reported, and the return code still indicates success:

$ ./manage.py loaddata --verbosity=1 ; echo $?
0
$ ./manage.py loaddata --verbosity=2 ; echo $?
No fixtures found.
0

This is pretty bad unless there's a preferred method of loading
data in a non-interactive manner.
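
For non-interactive use, the behaviour a caller would normally expect is a usage error and a non-zero exit status when a mandatory argument is missing. A minimal sketch of that expectation, using Python's standard argparse module rather than Django's own option handling (the `loaddata-sketch` program name and the `run` helper are hypothetical):

```python
import argparse


def build_parser():
    # Hypothetical stand-in for a loaddata-style command line; not Django's code.
    parser = argparse.ArgumentParser(prog="loaddata-sketch")
    # nargs="+" makes at least one fixture argument mandatory.
    parser.add_argument("fixture", nargs="+", help="one or more fixture labels")
    return parser


def run(argv):
    """Return a shell-style exit status: 0 on success, non-zero on bad usage."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse prints a usage message to stderr and exits with status 2.
        return exc.code
    return 0
```

With this shape, a wrapper script can rely on the exit status alone instead of raising the verbosity and reading the output.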

Possibly-relevant tickets found with a search of the Django Trac
(for "manage.py loaddata") include:

#6724 (loaddata w/ errors displays no warnings or error in output)
closed as duplicate of #4499 (integrity error silently failing with postgres and loaddata)

#4431 (manage.py loaddata should have better error reporting)
closed as fixed with r6936

#4371 (fixture loading fails silently in testcases)
closed as fixed with r7595

#10200 (loaddata command does not raise CommandError on errors)
new

--
Phil Mocek

Russell Keith-Magee

Apr 16, 2009, 4:04:29 AM

to django...@googlegroups.com

On Thu, Apr 16, 2009 at 11:30 AM, Phil Mocek
<pmocek-list-...@mocek.org> wrote:
>
> On Thu, Apr 16, 2009 at 11:03:33AM +0800, Russell Keith-Magee wrote:
>> Loaddata doesn't take input from stdin - it loads files that are
>> specified on the command line.
>
> Thanks, Russell.
>
> I should have read the documentation for the loaddata command more
> closely, but this is quite counter-intuitive. Can someone tell me
> what the rationale for this syntax is?

In my original response, I oversimplified a little. The rationale with
loaddata is that you're not actually specifying a filename. What
you're specifying is a label, and Django will discover multiple
fixtures using that label. By happy coincidence, a label can be a
filename, but filenames are a subset of all possible labels.

The classic example for loaddata is the initial data fixture. When you
run syncdb, Django loads the 'initial_data' fixture - that is, any
fixture, in any of the supported locations, in any supported format,
with the label "initial_data". This means every application in your
project can potentially provide an initial data fixture; some in XML,
some in JSON, some in YAML, and they will all be loaded automatically.

initial_data is a special case because of the relationship with
syncdb, but the same rules apply to any fixture - if you put a
collection of fixtures called 'bootstrap' around your project, you can
'loaddata bootstrap' and they will all be loaded.
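
The label idea can be sketched as an expansion from one label into many candidate files. This is only an illustration under assumptions: the function name, the directory list, and the format tuple are hypothetical, and Django's real discovery code is not reproduced here.

```python
import os

# Assumed format list for illustration; the thread mentions XML, JSON and YAML.
FORMATS = ("json", "xml", "yaml")


def candidate_fixture_paths(label, fixture_dirs, formats=FORMATS):
    """List every path a label could refer to, in search order."""
    name, ext = os.path.splitext(label)
    if ext.lstrip(".") in formats:
        # "foobar.json": the label already names a single format.
        wanted = [ext.lstrip(".")]
    else:
        # "foobar": any supported format is a match.
        name, wanted = label, list(formats)
    return [
        os.path.join(directory, "%s.%s" % (name, fmt))
        for directory in fixture_dirs
        for fmt in wanted
    ]


# One label, two hypothetical app directories, three formats: six candidates.
for path in candidate_fixture_paths("bootstrap", ["app1/fixtures", "app2/fixtures"]):
    print(path)
```

A label that happens to be a filename, such as "mydata.json", simply narrows the candidates to one format per directory.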

The problem with loading a fixture from stdin is that you have no idea
what format that fixture is. Currently, fixture format is detected
from the filename extension; if fixture data comes from stdin, we
would need to either:
1) Use file magic to work out what type of input was being provided
2) Add a --format option so the input format could be explicitly specified.

Neither of these two options particularly appeals to me, but I am open
to being convinced otherwise.
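
To make option 2 concrete, here is a rough sketch of how a format might be resolved when the data may have no filename at all. Everything here is hypothetical (the `resolve_format` name, the `explicit_format` parameter, the format list); loaddata as discussed in this thread has no such option.

```python
import os

# Assumed set of serialization formats, for illustration only.
KNOWN_FORMATS = ("json", "xml", "yaml")


def resolve_format(label=None, explicit_format=None):
    """Pick a format from an explicit option or a label's extension.

    Raises ValueError when neither source gives a usable answer, which is
    the situation for data arriving on stdin without a format option.
    """
    if explicit_format is not None:
        if explicit_format not in KNOWN_FORMATS:
            raise ValueError("unknown format: %r" % explicit_format)
        return explicit_format
    if label is not None:
        ext = os.path.splitext(label)[1].lstrip(".")
        if ext in KNOWN_FORMATS:
            return ext
    raise ValueError("cannot determine fixture format")
```

The sketch deliberately lets the explicit option win over the extension; whether that is the right precedence is exactly the kind of question that would need settling.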

> The built-in usage help shows that the filename argument (called a
> fixture for reasons that I have not yet researched) is mandatory:
>
> Usage: manage.py loaddata [options] fixture [fixture ...]
>
> However, manage.py does not report an error when a fixture is not
> specified:
>
> $ ./manage.py loaddata
> $ echo $?
> 0

The fact that no error message is raised for the 'you didn't provide
any arguments' case looks like a bug to me. This should be logged
as a new ticket so it isn't forgotten.

Yours,
Russ Magee %-)

Phil Mocek

Apr 17, 2009, 2:33:35 AM

to django...@googlegroups.com

On Thu, Apr 16, 2009 at 04:04:29PM +0800, Russell Keith-Magee wrote:

> On Thu, Apr 16, 2009 at 11:30 AM, Phil Mocek wrote:
> > The built-in usage help shows that the filename argument (called a
> > fixture for reasons that I have not yet researched) is mandatory:
> >
> > Usage: manage.py loaddata [options] fixture [fixture ...]
> >
> > However, manage.py does not report an error when a fixture is not
> > specified:
> >
> > $ ./manage.py loaddata
> > $ echo $?
> > 0
>
> The fact that no error message is raised for the 'you didn't provide
> any arguments' case looks like a bug to me. This should be logged
> as a new ticket so it isn't forgotten.

Done. I opened #10849 (management loaddata: bad syntax not reported,
results in successful return code) [1] and associated it with component
"django-admin.py".

Also, I previously made note of this behavior in #10200 (loaddata
command does not raise CommandError on errors) [2].

References:

[1]: <http://code.djangoproject.com/ticket/10849>
[2]: <http://code.djangoproject.com/ticket/10200#comment:5>

--
Phil Mocek

Phil Mocek

Apr 17, 2009, 3:56:31 AM

to django...@googlegroups.com

On Thu, Apr 16, 2009 at 04:04:29PM +0800, Russell Keith-Magee wrote:

> The problem with loading a fixture from stdin is that you have no idea
> what format that fixture is. Currently, fixture format is detected
> from the filename extension

It's impossible to detect anything about the content of a file by
examining its filename. If loaddata makes any assumptions about fixture
format based on the filename of a fixture, then I would consider this to
be a bug.

However, due to long-standing conventions, the filename suffix
("extension" in MS-DOS parlance) often provides a clue about the format
of the data within. These conventions allow loaddata to make a
reasonable guess about the content of a file based on its name. That
may be a useful practice, but it's still just a guess.

> if fixture data [were to come from] stdin, we would need to either:
>
> 1) Use file magic to work out what type of input was being provided
> 2) Add a --format option so the input format could be explicitly
> specified.
>
> Neither of these two options particularly appeals to me, but I am open
> to being convinced otherwise.

Disclaimer: I'm new to Django, and I don't yet have my head around what
a fixture is. I'm very familiar with using command line utilities to
read character streams. My discussion of this topic is offered in the
context of using loaddata with fixtures that happen to be (or happen to
come from?) files or from standard input.

Whether input is read from standard input or from a file whose name was
passed on the command line, loaddata will not know what is in the file
until the file is read. It can predict what the format of the data will
be based on the name of a file. It could make what is likely to be a
better prediction about the format of the data if the caller explicitly
stated on the command line what the format is intended to be. In either
case, loaddata would be engaging in risky behavior if it used the data
without some verification that it is of a particular format.

A --format option would achieve parity with the syntax of loaddata's
counterpart, dumpdata, and would allow the caller to explicitly state
what loaddata should expect. This would provide more information to
loaddata about the format of the data than a filename provides, because
the best the filename can do is imply something about the format of the
data. Validly-formatted data still wouldn't be guaranteed, but the
program would then know what the caller intends for the format to be,
and that is more information than it would have were it provided with
nothing but the stream of data and a filename.

If it's reasonable for the program to predict data format based on
filename, then it would seem to be reasonable in the absence of a
filename to predict the data format based on whatever the default format
is documented to be. This would again provide parity with the dumpdata
command, which prints data in its default format, JSON, when a format is
not explicitly selected by the caller. In either case, it would be best
for some sort of internal check to occur before loaddata does anything
with the stream of data.

My inclination would be for it to adopt the POSIX convention of reading
from a particular file if one is specified on the command line, from
standard input if the filename specified on the command line is "-",
and from standard input if no file is specified on the command line.
Thus, the rules would be:

    if filename specified on cmdline:
        if that filename is "-":
            read from stdin
        else:
            read from file with name specified
    else:
        read from stdin

Note that this means anything on standard input is ignored if the caller
provides a filename. This maintains backwards-compatibility with the
current behavior.
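
The rules above can be sketched in a few lines of Python. This is an illustration only, not a patch against loaddata: the `choose_input` helper is hypothetical, it treats its argument as a plain filename rather than a fixture label, and it looks only at the first name given.

```python
import io
import sys


def choose_input(filenames, stdin=None):
    """Return a readable text stream chosen by the rules above.

    Hypothetical helper for illustration; it only looks at the first name.
    """
    stdin = sys.stdin if stdin is None else stdin
    if filenames and filenames[0] != "-":
        # A real filename wins; anything waiting on stdin is ignored.
        return open(filenames[0], "r")
    # Either "-" was given or no name at all: fall back to standard input.
    return stdin


# Usage with a stand-in for standard input:
fake_stdin = io.StringIO('[{"pk": 1}]')
stream = choose_input(["-"], stdin=fake_stdin)
print(stream.read())
```

Passing the stdin stream in as a parameter keeps the source-selection step testable on its own, separate from whatever later decides how to parse the data.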

Once the program decides where to get input, it can move on to deciding
what to do with that input based on all the information available to it.
These seem to be very different tasks, and I think it's important to
maintain a distinction between them.

For an example of the usefulness of using loaddata with standard input,
consider the act of loading the database of a Django-based Web
application with the content of the database of an instance of the same
application running elsewhere:

ssh remotehost /path/to/manage.py dumpdata | /path/to/local/manage.py loaddata

--
Phil Mocek

Russell Keith-Magee

Apr 17, 2009, 7:58:35 AM

to django...@googlegroups.com

On Fri, Apr 17, 2009 at 3:56 PM, Phil Mocek
<pmocek-list-...@mocek.org> wrote:
>
> However, due to long-standing conventions, the filename suffix
> ("extension" in MS-DOS parlance) often provides a clue about the format
> of the data within. These conventions allow loaddata to make a
> reasonable guess about the content of a file based on its name. That
> may be a useful practice, but it's still just a guess.

This is true. However, I'm not in the habit of labelling my XML files
.json, so in circumstances like this one, the extension can be a very
useful clue. It only becomes problematic when you have extensions like
.doc, which could mean a pure text document, an RTF document, an MS
Word document (which is itself any number of possible formats), or
anything else that someone has decided to call a "doc". This isn't a
problem that exists for fixture formats, so we're safe.

I would also repeat that we're not talking about filenames here -
we're talking about fixture labels. When you call 'loaddata
foobar.json', you're not saying "load the file foobar.json". You're
saying "find all the JSON fixtures called foobar, looking in
app1/fixtures, app2/fixtures, FIXTURE_DIRS, and the current working
directory". Filenames are the degenerate case of the fixture labeling
system.

> Whether input is read from standard input or from a file whose name was
> passed on the command line, loaddata will not know what is in the file
> until the file is read.

This is patently untrue, unless you are applying an exceptionally
gnostic interpretation of the word "know". Django has a very large
test suite, with a large number of test fixtures. On top of that, I
have a huge number of fixtures in test cases for personal and work
projects, and I know many other people with similar fixture
collections. "Knowing" the format of those fixtures has not yet proven
to be a problem. XML fixtures are named .xml. JSON fixtures are named
.json. This is neither confusing, nor surprising, nor problematic.

> In either
> case, loaddata would be engaging in risky behavior if it used the data
> without some verification that it is of a particular format.

Well, if you pass json data in and request parsing as XML, you're
going to find the parser choking pretty quickly. Fixture loading all
happens in transactions, and the parsers are pretty strict, so if you
start seeing errors, the database won't get anything loaded.
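
The strictness point is easy to demonstrate with the Python standard library parsers. This sketch does not use Django's deserializers and says nothing about transactions; it only shows that data in one format is rejected outright when parsed as the other. The `parses_as` helper and the sample documents are made up for the illustration.

```python
import json
import xml.etree.ElementTree as ET


def parses_as(data, fmt):
    """Return True if data parses in the given format, False otherwise."""
    try:
        if fmt == "json":
            json.loads(data)
        elif fmt == "xml":
            ET.fromstring(data)
        else:
            raise ValueError("unsupported format: %r" % fmt)
    except (json.JSONDecodeError, ET.ParseError):
        # The parser rejected the input; nothing usable was produced.
        return False
    return True


json_doc = '[{"pk": 1, "model": "flatpages.flatpage", "fields": {}}]'
xml_doc = "<objects></objects>"

print(parses_as(json_doc, "json"), parses_as(json_doc, "xml"))
print(parses_as(xml_doc, "xml"), parses_as(xml_doc, "json"))
```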

> A --format option would achieve parity with the syntax of loaddata's
> counterpart, dumpdata, and would allow the caller to explicitly state
> what loaddata should expect. This would provide more information to
> loaddata about the format of the data than a filename provides, because
> the best the filename can do is imply something about the format of the
> data. Validly-formatted data still wouldn't be guaranteed, but the
> program would then know what the caller intends for the format to be,
> and that is more information than it would have were it provided with
> nothing but the stream of data and a filename.

I agree that there would be parity with dumpdata, and I can see how
--format could be useful in the context of a stdin input mode for
loaddata. It would certainly be more reliable than trying to invent
magic file format detection methods.

However, I flat out reject the idea that:

loaddata --format=xml foobar.xml

provides better formatting guarantees than:

loaddata foobar.xml

in any real fixture loading situation. I also reject the idea that
using a command line argument allows the parser to know what format
the caller "intends" any better than the original author of a fixture
does when they name their fixture, knowing the way it will be used by
Django.

> If it's reasonable for the program to predict data format based on
> filename, then it would seem to be reasonable in the absence of a
> filename to predict the data format based on whatever the default format
> is documented to be. This would again provide parity with the dumpdata
> command, which prints data in its default format, JSON, when a format is
> not explicitly selected by the caller. In either case, it would be best
> for some sort of internal check to occur before loaddata does anything
> with the stream of data.

There is the start of an idea here, but there are several significant
edge cases. To pick a few easy targets: what happens in the following
cases? Why is your proposed behaviour _not_ surprising when compared
with the existing behaviour without the --format option?

loaddata --format=json foobar.xml

loaddata --format=json foobar
(when there is a foobar.xml fixture available)

loaddata --format=json foobar.json whizbang.xml

> My inclination would be for it to adopt the POSIX convention of reading
> from a particular file if one is specified on the command line, from
> standard input if the filename specified on the command line is "-",
> and from standard input if no file is specified on the command line.
> Thus, the rules would be:
>
>     if filename specified on cmdline:
>         if that filename is "-":
>             read from stdin
>         else:
>             read from file with name specified
>     else:
>         read from stdin
>
> Note that this means anything on standard input is ignored if the caller
> provides a filename. This maintains backwards-compatibility with the
> current behavior.

Again - the user isn't providing a filename, they're providing a
label. However, I don't see anything fundamentally wrong with this
idea.

> Once the program decides where to get input, it can move on to deciding
> what to do with that input based on all the information available to it.
> These seem to be very different tasks, and I think it's important to
> maintain a distinction between them.

To be clear, my problems with taking input from stdin are entirely
linked to the second of these two problems. I don't have any
particular objection to taking loaddata input from stdin per se. My
objections lie with how the format of this data will be determined.
Using --format is one approach, but we need to be very clear how the
--format directive is interpreted for the existing use cases (in
particular, during unspecified format fixture discovery, such as
initial_data).

> For an example of the usefulness of using loaddata with standard input,
> consider the act of loading the database of a Django-based Web
> application with the content of the database of an instance of the same
> application running elsewhere:
>
> ssh remotehost /path/to/manage.py dumpdata | /path/to/local/manage.py loaddata

I have no difficulty seeing _how_ you would plug these pipes together.
I have a slight difficulty imagining _why_ you would want to. That
isn't to say your idea is bad - it's just not something I've found
myself in a position of thinking "oh, I wish I could do that".

Yours,
Russ Magee %-)
