Fixture loading using bulk_create

300 views
Skip to first unread message

Jonas H.

unread,
Sep 10, 2011, 6:55:27 AM9/10/11
to django-developers
I started hacking the loaddata command to make use of the shiny new
`bulk_create` code -- however, it seems that fixture loading (at least
in its current incarnation) is incompatible to bulk inserts, for this
reasons:

1. It's possible to have model objects overridden by fixtures.
e.g. in modeltests/fixtures/fixtures/fixture1.json, there's a
`fixtures.article` with pk=3, and in fixture2.json there's another
pk=3 article which overrides that from fixture1.

Of course this is not compatible with forced inserts.

2. pre_save/post_save signals are not sent.
It seems like it's not documented anywhere that signals are sent
during fixture loading but I think that for backwards compatibility
reasons this behaviour should not be changed.

Any ideas on these issues?

Jonas

Anssi Kääriäinen

unread,
Sep 10, 2011, 11:19:34 AM9/10/11
to Django developers
Why the signals aren't sent in bulk_insert, is it because PK is not
necessarily available for post_save signal? Otherwise one could send
all pre_save signals in one batch before insert and for post_save just
after. Cascading deletes do something like that IIRC. If this is
impossible to do in bulk_insert, the signal sending could still be
done manually in fixture loading. PK should be available there
directly from the fixtures, right?

For the overridden pk problem you could do batches of say 100 inserts
at a time, first do "select pk from tbl where pk in (list of pks in
batch)", then use update for the returned pks and insert the rest.

- Anssi

Alex Gaynor

unread,
Sep 10, 2011, 1:43:34 PM9/10/11
to django-d...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.


I looked into using bulk_create for fixtures back at DjangoCon.eu, basically it's a mess because fixtures are not necessarily creating data, they can also be overwriting existing ones.

Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero

Anssi Kääriäinen

unread,
Sep 11, 2011, 6:15:27 PM9/11/11
to Django developers
I created a POC patch that does bulk loading of fixtures. The work can
be found from github: https://github.com/akaariai/django/tree/fixture_loading.
Be warned, it really is POC :)

The results are somewhat depressing for Django tests. The reason is
that there just isn't that much fixture loading going on. Some
statistics numbers I gathered using the patch:

total_objects: 11200
average batch size: 2.51069012179
selects done: 4291 # Need to do a select to see which objects are
already in the DB
updates done: 1756 # Those that were are updated in the regular way
raw inserts done: 159 # Can't use bulk_insert for inherited models, so
use save(force_insert=True) instead
batch_inserts done: 9277 # Amount of objects inserted using
bulk_create
batch_count: 3695

So, before patch there would have been 22400 queries (each save is a
select + update or insert), after patch there were 4291 + 1756 + 159 +
3695 = 9901 queries.

The speed of running the tests was about 3% faster using sqlite3.

Things look a bit better when loading a dump of 10000 objects into db,
for sqlite3 I got 4-5x better timing, for PostgreSQL I got 3-4x better
timing for that test. If the DB would be non-local the result would be
even better.

All tests are passed on SQLite3, except two tests in modeltests/
fixtures which I suspect are QuerySet ordering related. There was one
model that was missing default ordering, and it caused some breakages
due to different order of results. The remaining broken tests are a
bit complicated to debug. The broken tests are two full dumps of the
DB, one in XML and one in JSON format. The dumps are then compared
character for character to expected output. The dumps are somewhat
large and I am somewhat tired, I think there is a row in different
order somewhere in there...

I am currently running the PostgreSQL tests, all I know is the
PostgreSQL tests take a long time :)

The main issues with the patch are:
- I expect it to break when given models with lots of columns in big
batches: too many sql parameters for backend.
- Signals are sent, but they are sent in batches.
- Generally more complicated object saving. The old way was very
easy to understand: each object is saved when read from the fixture.
Now that isn't true any more.
- Same model, same pk multiple times in one fixture does not work.
Django doesn't create fixtures like this.
- I might be missing something obvious in the patch, fixture loading
is a new area to me.

The feature could be useful if there are users loading big fixture
files regularly. Otherwise it complicates fixture loading for little
gain.

I am not going to create a ticket for this one. I have too many
tickets in POC+DDN state already.

- Anssi

Jonas H.

unread,
Sep 12, 2011, 8:38:42 AM9/12/11
to django-d...@googlegroups.com
On 09/12/2011 12:15 AM, Anssi Kääriäinen wrote:
> The feature could be useful if there are users loading big fixture
> files regularly. Otherwise it complicates fixture loading for little
> gain.

Maybe we could simply add an option to the loaddata command -- so that
if someone really needs tons of fixtures for their tests it's possible
to profit from bulk insertions by manually invoking loaddata from their
test code. And the implementation is quite simple:

http://paste.pocoo.org/show/474602/ (doesn't cover all edge-cases yet)

I did some benchmarking with this code and it speeds up fixture loading
*a lot*: http://www.chartgo.com/get.do?id=bdfe6af778 (chunksize=0 does
not use `bulk_create` but `save`, and the speedups seen for chunksize=1
is because `bulk_create` is used, thus avoiding `save` overhead)

Jonas

Anssi Kääriäinen

unread,
Sep 12, 2011, 9:41:51 AM9/12/11
to Django developers


On Sep 12, 3:38 pm, "Jonas H." <jo...@lophus.org> wrote:
> On 09/12/2011 12:15 AM, Anssi Kääriäinen wrote:
>
> > The feature could be useful if there are users loading big fixture
> > files regularly. Otherwise it complicates fixture loading for little
> > gain.
>
> Maybe we could simply add an option to the loaddata command -- so that
> if someone really needs tons of fixtures for their tests it's possible
> to profit from bulk insertions by manually invoking loaddata from their
> test code. And the implementation is quite simple:
>
> http://paste.pocoo.org/show/474602/(doesn't cover all edge-cases yet)
>
> I did some benchmarking with this code and it speeds up fixture loading
> *a lot*:http://www.chartgo.com/get.do?id=bdfe6af778(chunksize=0 does
> not use `bulk_create` but `save`, and the speedups seen for chunksize=1
> is because `bulk_create` is used, thus avoiding `save` overhead)
>
> Jonas

I like this idea much better than trying to hack loaddata to use
bulk_create while maintaining compatibility with the current code.

The hard limitations would be as follows:
- There must not be any updates.

Then there are limitations which could be lifted later on:

- No natural keys (or the targets of the natural keys must exists in
the DB). I think this could be lifted later on - the dumped objects
are ordered in a way that natural keys do not form circles - just save
the objects in the same order and resolve the natural keys when saving
- not when deserializing.

- Inherited models must be saved using the normal way. This could be
lifted: make bulk_create insert inherited objects if they have PK set
trusting that the user will insert the base objects in the same
transaction, or that they are already present. That is, create a
similar "raw" mode for bulk create that exists for Model.save_base.

- Objects with M2M data are saved the normal way. This could be
improved later on, so that m2m data would also be bulk saved along
with the objects.

- All objects must be loaded into memory: this is easy to lift, just
flush the collected objects once per N objects. I am not sure of this,
but you probably can flush the collected objects also once you find
out a new class - the objects are serialized class at a time.

- Signals aren't sent at all. It is easy to batch send the signals
if wanted.

My version of the patch solves all those cases in a way compatible
with the current implementation. The biggest difference to your
version is that my version can be used when running tests - but the
speed difference for Django's test suite is somewhere around 2-3%. The
cost is some added complexity, and one select per batch to see which
PKs are already in the DB and which ones not. So it seems there is not
much point for the added complexity.

The most difficult problem is that my patch _will_ break some users
fixture loading due to the SQL length / parameter amount limitations
of different backends. This is hard to solve cleanly. For example
SQLite3 seems to have a 999 parameter limitation, so that you can save
333 three field models, 99 ten field models or just 10 hundred field
models. If you have a bulk_create flag, then the backwards
incompatibility is not a problem.

So, in summary, it seems having a bulk_create flag is the only way
forward.

- Anssi

Anssi Kääriäinen

unread,
Sep 12, 2011, 3:15:03 PM9/12/11
to Django developers
A cleaned up version of my patch can be found at http://paste.pocoo.org/show/474838/.
Test fixes aren't included in the paste, so modeltests/fixtures test
will not pass due to different ordering of the results. Github
contains the latest version including some test-fixing:
https://github.com/akaariai/django/tree/fixture_loading

I hope the patch will help in solving the natural keys problem. The
biggest problem is that the patch doesn't handle maximum amount of SQL
parameters. Give it 100 serialized instances of a model containing 10
fields and it will fail.

- Anssi

On Sep 12, 4:41 pm, Anssi Kääriäinen <anssi.kaariai...@thl.fi> wrote:
> On Sep 12, 3:38 pm, "Jonas H." <jo...@lophus.org> wrote:
>
>
>
>
>
>
>
>
>
> > On 09/12/2011 12:15 AM, Anssi Kääriäinen wrote:
>
> > > The feature could be useful if there are users loading big fixture
> > > files regularly. Otherwise it complicates fixture loading for little
> > > gain.
>
> > Maybe we could simply add an option to the loaddata command -- so that
> > if someone really needs tons of fixtures for their tests it's possible
> > to profit from bulk insertions by manually invoking loaddata from their
> > test code. And the implementation is quite simple:
>
> >http://paste.pocoo.org/show/474602/(doesn'tcover all edge-cases yet)
Reply all
Reply to author
Forward
0 new messages