In its current form, dumpdata returns one big JSON object, which
loaddata has to read into memory and parse in full before it can start
importing. By writing each row as a separate JSON object and emitting
one JSON object per line, loaddata could read the file line by line
(e.g. by iterating over the file object) and keep memory usage low.
Unfortunately, this change is probably backwards incompatible, although
it might be possible for the loaddata command to inspect the start of
the file to detect which structure it is dealing with. If that's not
possible, I reckon it's best to add a new flag to enable this feature.
--
Ticket URL: <https://code.djangoproject.com/ticket/22259>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* stage: Unreviewed => Someday/Maybe
* needs_tests: => 0
* needs_docs: => 0
Comment:
Overall the idea makes sense, but we cannot accept this ticket without a
plan to address the backwards incompatibility. Remember that dumpdata
output may be read by tools other than Django.
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:1>
Comment (by anubhav9042):
As mentioned in #22251, this is implemented in #5423. What else is
required?
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:2>
Comment (by Gwildor):
Replying to [comment:2 anubhav9042]:
> As mentioned in #22251, this is implemented in #5423. What else is
> required?
What I'm talking about is an entirely different way of representing the
data. As it is now, one big JSON object is created containing all the
model instances, like so:
{{{
[
    {
        "model": "Foo",
        "pk": 1,
        "fields": {***}
    },
    {
        "model": "Foo",
        "pk": 2,
        "fields": {***}
    }
]
}}}
What I'm getting at is changing this so that one JSON object is created
and streamed to the output per model instance, resulting in output along
these lines:
{{{
{"model": "Foo", "pk": 1, "fields": {***}}
{"model": "Foo", "pk": 2, "fields": {***}}
{"model": "Bar", "pk": 1, "fields": {***}}
}}}
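A minimal sketch of what the dump side could look like under this scheme (`dump_rows` is a hypothetical helper, not existing dumpdata code; real rows would come from the ORM serializers, and the fields are elided here as in the examples above):

```python
import io
import json

def dump_rows(rows, stream):
    """Write each row as one JSON object per line (JSON Lines style)."""
    for row in rows:
        stream.write(json.dumps(row))
        stream.write("\n")

# Example rows mirroring the structure above, with empty fields.
rows = [
    {"model": "Foo", "pk": 1, "fields": {}},
    {"model": "Foo", "pk": 2, "fields": {}},
]

buf = io.StringIO()
dump_rows(rows, buf)
```

Because each row is flushed as soon as it is serialized, the dump never holds more than one row's JSON in memory, and a crash mid-dump still leaves the completed lines behind.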
This has the big advantage, in my opinion, that the script that loads
the data back does not have to read everything into memory before it can
start processing.
At the moment, you have to do something like this (the read itself can
of course be buffered, but the parse cannot):
{{{
#!python
with open(args[0], 'rb') as f:
    # The entire file must be read and parsed up front.
    data = json.loads(f.read())
    for row in data:
        process(row)
}}}
But with the proposed per-line output, you can do this:
{{{
#!python
with open(args[0], 'rb') as f:
    # Iterating over the file object yields one line at a time,
    # so only a single row is in memory at once.
    for line in f:
        row = json.loads(line)
        process(row)
}}}
This way, you don't have to parse one big JSON object.
Another, much smaller advantage is that if the dumpdata command crashes,
you would still have partial output to use for testing or development
purposes while you rerun the command. As it is now, I believe you are
left with nothing (I have never seen even an incomplete JSON object
printed to the terminal, just the error and nothing more). Although this
is a small and arguable advantage, I believe it would improve the user
friendliness of the command.
I think the best way to progress (if we want this) is to add a flag to
the dumpdata command that enables this behaviour, and to support it in
the loaddata command by making a reasonable guess based on reading the
first line, falling back to the old behaviour when there is no
certainty. I fear that other tools will have to either do the same to
support both formats, or support only one of them (probably the current
format, unless the new format turns out to be very popular).
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:3>
* cc: anubhav9042@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:4>
Comment (by claudep):
See also:
- https://en.wikipedia.org/wiki/Line_Delimited_JSON
- http://jsonlines.org/
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:5>
Comment (by Charlie Denton):
Testing this today, and I've been having success with this library for
dumping-to/loading-from one-row-per-line JSON:
https://github.com/superisaac/django-mljson-serializer
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:6>
* cc: Ian Foote (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:7>
* status: new => closed
* resolution: => duplicate
Comment:
I think we can close this as a duplicate of #22259 since you can now use
the `jsonl` format.
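For reference, the `jsonl` serialization format is selected the same way as the other formats; a sketch of typical invocations (the output filename is a placeholder):

```shell
# Dump one JSON object per line instead of one big array.
python manage.py dumpdata --format=jsonl --output=db.jsonl

# loaddata picks the serializer based on the fixture's file extension.
python manage.py loaddata db.jsonl
```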
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:8>