In its current form, dumpdata returns one big JSON object, which
loaddata has to read into memory and parse in full before it can start
importing. By writing each row as a separate JSON object and emitting
one JSON object per line, loaddata could read the file line by line
(e.g. by iterating over the file object) and keep memory usage low.
Unfortunately, this change is probably backwards incompatible, although
it might be possible for the loaddata command to inspect the start of
the file to detect which structure it is dealing with. If that's not
possible, I reckon it's best to add a new flag to enable this feature.
--
Ticket URL: <https://code.djangoproject.com/ticket/22259>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* stage: Unreviewed => Someday/Maybe
* needs_tests: => 0
* needs_docs: => 0
Comment:
Overall the idea makes sense, but we cannot accept this ticket without a
plan to address the backwards incompatibility. Remember that dumpdata
output may be read by tools other than Django.
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:1>
Comment (by anubhav9042):
As mentioned in #22251, this is implemented in #5423. What else is
required?
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:2>
Comment (by Gwildor):
Replying to [comment:2 anubhav9042]:
> As mentioned in #22251, this is implemented in #5423. What else is
> required?
What I'm talking about is an entirely different way of representing the
data. As it is now, one big JSON object is created containing all the
model instances, like so:
{{{
[
    {
        "model": "Foo",
        "pk": 1,
        "fields": {***}
    },
    {
        "model": "Foo",
        "pk": 2,
        "fields": {***}
    }
]
}}}
What I'm getting at is changing this so that one JSON object is created
and streamed to the output per model instance, resulting in output along
these lines:
{{{
{"model": "Foo", "pk": 1, "fields": {***}}
{"model": "Foo", "pk": 2, "fields": {***}}
{"model": "Bar", "pk": 1, "fields": {***}}
}}}
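A minimal sketch of what the dump side could look like under this scheme (`dump_rows` is a hypothetical helper, not existing dumpdata code; real rows would come from the ORM serializers, and the fields are elided here as in the examples above):

```python
import io
import json

def dump_rows(rows, stream):
    """Write each row as one JSON object per line (JSON Lines style)."""
    for row in rows:
        stream.write(json.dumps(row))
        stream.write("\n")

# Example rows mirroring the structure above, with empty fields.
rows = [
    {"model": "Foo", "pk": 1, "fields": {}},
    {"model": "Foo", "pk": 2, "fields": {}},
]

buf = io.StringIO()
dump_rows(rows, buf)
```

Because each row is flushed as soon as it is serialized, the dump never holds more than one row's JSON in memory, and a crash mid-dump still leaves the completed lines behind.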
This has the big advantage, in my opinion, that the script that loads
the data back does not have to read everything into memory before it can
start processing.
At the moment, you have to do something like this (the read itself can
of course be buffered, but the parse cannot):
{{{
#!python
with open(args[0], 'rb') as f:
    # The entire file must be read and parsed up front.
    data = json.loads(f.read())
    for row in data:
        process(row)
}}}
But with the proposed per-line output, you can do this:
{{{
#!python
with open(args[0], 'rb') as f:
    # Iterating over the file object yields one line at a time,
    # so only a single row is in memory at once.
    for line in f:
        row = json.loads(line)
        process(row)
}}}
This way, you don't have to parse one big JSON object.
Another, much smaller advantage is that if the dumpdata command crashes,
you would still have partial output to use for testing or development
purposes while you rerun the command. As it is now, I believe you are
left with nothing (I have never seen even an incomplete JSON object
printed to the terminal, just the error and nothing more). Although this
is a small and arguable advantage, I believe it would improve the user
friendliness of the command.
I think the best way to progress (if we want this) is to add a flag to
the dumpdata command that enables this behaviour, and to support it in
the loaddata command by making a reasonable guess based on reading the
first line, falling back to the old behaviour when there is no
certainty. I fear that other tools will have to either do the same to
support both formats, or support only one of them (probably the current
format, unless the new format turns out to be very popular).
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:3>
* cc: anubhav9042@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:4>
Comment (by claudep):
See also:
- https://en.wikipedia.org/wiki/Line_Delimited_JSON
- http://jsonlines.org/
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:5>
Comment (by Charlie Denton):
Testing this today, and I've been having success with this library for
dumping-to/loading-from one-row-per-line JSON:
https://github.com/superisaac/django-mljson-serializer
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:6>
* cc: Ian Foote (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:7>
* status: new => closed
* resolution: => duplicate
Comment:
I think we can close this as a duplicate of #22259 since you can now use
the `jsonl` format.
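For reference, the `jsonl` serialization format is selected the same way as the other formats; a sketch of typical invocations (the output filename is a placeholder):

```shell
# Dump one JSON object per line instead of one big array.
python manage.py dumpdata --format=jsonl --output=db.jsonl

# loaddata picks the serializer based on the fixture's file extension.
python manage.py loaddata db.jsonl
```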
--
Ticket URL: <https://code.djangoproject.com/ticket/22259#comment:8>