There is, IMO, a big caveat with bulk_create(): it does not set the
primary keys of the objects it creates.
In my current project, using bulk_create would improve performance, but as
I often need the ids of the newly created objects, I have to save them one
by one.
I think it would be perfect if bulk_create started setting the primary keys
of the objects in the given sequence. At least PostgreSQL supports it
(the "RETURNING id" clause gives us the set of ids).
--
Ticket URL: <https://code.djangoproject.com/ticket/19527>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
Comment:
Out of the core DBs, PostgreSQL and Oracle support RETURNING, and it is
easy to deduce the values when using SQLite, as there are no concurrent
transactions.
When using MySQL, it seems the best approach is to manually allocate a
chunk from the sequence, assign the values to the objs and then save them.
An alternative is to not support this option at all when using MySQL.
It might make sense to alter the bulk_create() definition to "create
the given objects in the fastest way available". This could mean the
fastest way is to issue a separate query for each object. For app
developers this would be handy: one could rely on bulk_create() in every
mass-creation situation. It might not be faster than a for loop with
.save(), but it would always be available.
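The SQLite case above can be sketched with the stdlib `sqlite3` module alone (a standalone illustration, not Django code; the table and data are invented): with no concurrent writers, the ids of a bulk insert are the contiguous range ending at `last_insert_rowid()`.

```python
import sqlite3

# Standalone sketch of the "deduce the ids" idea for SQLite: without
# concurrent transactions, N rows inserted in one statement occupy the
# contiguous id range ending at last_insert_rowid().
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

rows = [("a",), ("b",), ("c",)]
with conn:  # one transaction, so the allocated range stays contiguous
    conn.executemany("INSERT INTO person (name) VALUES (?)", rows)
    last_id = conn.execute("SELECT last_insert_rowid()").fetchone()[0]

ids = list(range(last_id - len(rows) + 1, last_id + 1))
print(ids)  # → [1, 2, 3]
```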
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:1>
Comment (by Tuttle):
Thank you. In this case, when:
* not all DBs support PK returning in a clean way,
* PK assignment takes time, making it less than the fastest way,
* PK assignment is not always required by the developer,
then, IMHO, an option bulk_create(..., set_primary_keys=False) makes the
most sense:
1. bulk_create() would always use the fastest multiple-tuple creation,
2. the developer would be allowed to use set_primary_keys=True only when
the underlying DB supports it in ways the Django devs have agreed to
implement (not too hacky ways).
I suppose it could be as hacky as what is apparently already implemented
for some DBs in save().
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:2>
* stage: Unreviewed => Design decision needed
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:3>
* cc: charette.s@… (added)
* version: 1.4 => master
Comment:
I'd like to work on a patch for this if it gets accepted, since it's a
step toward bulk creation of inherited models.
IMHO the `set_primary_keys=False` option also makes the most sense.
We could just raise a `NotImplementedError` if `set_primary_keys=True and
not connection.features.can_return_id_from_insert`, just like what
`distinct` does when fields are specified on an unsupported backend.
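That guard could look something like the following sketch; the `Features` class here is a stand-in for Django's `connection.features`, so everything except the flag name is illustrative rather than real Django API:

```python
# Sketch of the proposed guard. `Features` mocks Django's
# connection.features; only the attribute name mirrors the real one.
class Features:
    can_return_id_from_insert = False  # e.g. MySQL at the time

def bulk_create(objs, set_primary_keys=False, features=None):
    features = features or Features()
    if set_primary_keys and not features.can_return_id_from_insert:
        raise NotImplementedError(
            "set_primary_keys=True is not supported on this backend"
        )
    return objs  # the actual INSERT is elided in this sketch

try:
    bulk_create([object()], set_primary_keys=True)
    raised = False
except NotImplementedError:
    raised = True
print(raised)  # → True
```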
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:4>
Comment (by akaariai):
I keep thinking that what we *really* want is some form of bulk_save().
This should work even if the models are modified instead of created. The
definition would be something along the lines of "make sure the given
instances (which must be of the same class) are saved, preferably in the
fastest way possible". Primary keys are returned. Signals are sent (maybe
with an opt-out of "send_signals=False"). This could mean that the
implementation for some backends is to loop over the instances and save
each one separately. If there is no need to send signals, some backends
could implement the operation in a single query (at least PostgreSQL 9.1+
with writeable CTEs, Oracle with merge(?)).
Still, this doesn't mean a better bulk_create wouldn't be useful. But even
here I think we want a form of bulk_create where signals are sent and
primary keys are set. Why? That form of bulk_create() is much more useful
for 3rd party apps and even for django-core.
The user wants to create the models in the fastest way possible. If
bulk_create() isn't possible for some model, what is the user going to do?
Well, they need to loop over the models and save() them separately. So,
IMO, we could do that as well, directly in bulk_create. If the definition
is "create the models with bulk queries; if the backend doesn't support
this, raise an error", it means many 3rd party apps can't use bulk_create
at all. They would need to loop over the models just in case the backend
doesn't support the operation, losing the potential performance gain on
backends that could do the operation faster.
A discussion on django-developers might be useful. The big question is how
we want to define bulk_create. Is it "create the models using bulk
queries, otherwise fail" or is it "create the models as fast as possible"?
My idea is that we want some way to "create the models as fast as
possible, send_signals=True/False, set_primary_keys=True/False". It might
mean the fastest way possible is to loop over the models, saving each with
force_insert=True.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:5>
* cc: gwahl@… (added)
Comment:
Whether it's implemented as bulk_save() or bulk_create(), I love it.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:6>
Comment (by irodrigo17@…):
Either {{{bulk_create}}} with {{{set_primary_keys}}} or {{{bulk_save}}}
sounds really good; I'd love to have either of these features.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:7>
Comment (by campos.ddc):
Yes please!
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:8>
* stage: Design decision needed => Accepted
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:9>
Comment (by Alex):
A few notes:
* I'm not sure why the schema stuff is in this patch?
* `for i in range(len(ids))` would be better written with `izip` (found in
six I believe)
* Needs tests!
* Docs should be updated to reflect this.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:10>
* needs_better_patch: 0 => 1
* needs_tests: 0 => 1
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:11>
* cc: w2lkm2n@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:12>
* cc: mszamot@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:13>
* cc: AkosLadanyi (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:14>
Comment (by acrefoot@…):
Alex et al.:
I've added a test, used zip, and amended the list of caveats on
bulk_create in the docs. I'd like to get this merged into master, so I'd
like someone with Oracle to run the bulk_create tests and confirm whether
it works.
Let me know if there's anything else I can do to help get this patch
approved.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:15>
* keywords: => oracle
* cc: shai (added)
Comment:
This won't work on Oracle, because Oracle has a different implementation
of RETURNING; where on PostgreSQL, INSERT RETURNING creates records that
can be fetched, Oracle requires the statement to provide output variables,
and those variables are filled in already in the statement execution.
Work is being done now to support more arbitrary RETURNING clauses, in
#21454; I'll try to make sure the Oracle fixes there also fit this use
case.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:16>
Comment (by acrefoot@…):
I'm not sure what you mean by
> Oracle requires the statement to provide output variables, and those
> variables are filled in already in the statement execution.
It seems like `return_insert_id` is used the same way for Oracle and
Postgres in `django/db/models/sql/compiler.py`.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:17>
Comment (by shai):
Look at your `fetch_returned_insert_ids()` function in
`BaseDatabaseOperations`; that function is obviously wrong for Oracle, and
other needed code is also (though less obviously) missing.
It's true that `return_insert_id` is used the same way, but it does
something different; where for Postgres it just adds the clause, for
Oracle it also binds an output variable. That difference also requires
different handling later, when fetching the value (for Postgres you need a
`fetch()`, for Oracle you just read the bound variable).
For bulk, life on Oracle gets harder -- since you're now returning an
array of values, you need to prepare an array variable to accept it; you
cannot use the same `return_insert_id` for the bulk and single cases. In
particular, for the bulk case, you need to pass in the number of values to
be returned.
Hope this makes things clearer,
Shai.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:18>
* cc: acrefoot@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:19>
* cc: benth (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:20>
Comment (by the_drow):
Can we please release the PostgreSQL support first, since it should work?
There's an effort to improve native PostgreSQL support, and in my opinion
this should be part of it in 1.8.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:21>
* cc: dev@… (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:22>
* keywords: oracle => oracle QuerySet.bulk_create
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:23>
Comment (by acrefoot):
Hi again! This week we're finalizing the work to open-source an app that
depends on this patch.
What needs to happen to get this into mainline as soon as possible?
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:24>
Comment (by timgraham):
Sending a pull request with an updated patch would be a good first step.
Also giving it a "self-review" (or having a colleague review it) using the
[https://docs.djangoproject.com/en/dev/internals/contributing/writing-code
/submitting-patches/#patch-review-checklist PatchReviewChecklist] would
be great.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:25>
* has_patch: 0 => 1
* needs_tests: 1 => 0
Comment:
A [https://github.com/django/django/pull/5166 pull request] to add support
for this on PostgreSQL needs a rebase.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:26>
* needs_better_patch: 1 => 0
Comment:
Two different updated pull requests:
[https://github.com/django/django/pull/5917 #5917] and
[https://github.com/django/django/pull/5936 #5936]
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:27>
* needs_better_patch: 0 => 1
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:28>
* needs_better_patch: 1 => 0
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:29>
Comment (by Tim Graham <timograham@…>):
In [changeset:"04240b23658f8935bbfebacccc23b5e47a1d6c22" 04240b23]:
{{{
#!CommitTicketReference repository=""
revision="04240b23658f8935bbfebacccc23b5e47a1d6c22"
Refs #19527 -- Allowed QuerySet.bulk_create() to set the primary key of
its objects.
PostgreSQL support only.
Thanks Vladislav Manchev and alesasnouski for working on the patch.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:30>
Comment (by Tuttle):
I guess there should be
+.. versionchanged:: 1.10
instead of
+.. versionchanged:: 1.0
in the patch.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:31>
Comment (by Tim Graham <timograham@…>):
In [changeset:"1d17bb4f7d78f6c3df9dedc0710acc5bdb300693" 1d17bb4]:
{{{
#!CommitTicketReference repository=""
revision="1d17bb4f7d78f6c3df9dedc0710acc5bdb300693"
Refs #19527 -- Fixed typo in docs/ref/models/querysets.txt.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:32>
* has_patch: 1 => 0
Comment:
(leaving the ticket open for the Oracle implementation)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:33>
Comment (by akaariai):
We should also consider whether we want an implementation for MySQL and
SQLite. On SQLite there is no concurrency when inserting into a given
table, so we can deduce the inserted pk values from the last pk value in
the table after the insert. On MySQL we might need to pre-allocate the
ids.
A safe-but-slow implementation is to resort to one-at-a-time inserts when
primary keys are requested.
There is a lot of work in getting a bulk_create that sets primary keys on
all backends, but it would allow using bulk_create in cases where we
can't now.
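The pre-allocation idea floated for MySQL could be sketched like this; the in-memory allocator stands in for however a backend would actually reserve a block of auto-increment values, so none of this is real Django or MySQL API:

```python
import itertools

# Hypothetical id allocator: in reality a backend would reserve a block
# of auto-increment values; a simple counter stands in for that here.
_allocator = itertools.count(1)

def allocate_ids(n):
    """Reserve n consecutive ids up front."""
    return [next(_allocator) for _ in range(n)]

class Obj:
    def __init__(self, name):
        self.name = name
        self.pk = None

objs = [Obj("a"), Obj("b")]
for obj, pk in zip(objs, allocate_ids(len(objs))):
    obj.pk = pk  # assign before the INSERT, then insert with explicit pks

print([o.pk for o in objs])  # → [1, 2]
```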
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:34>
Comment (by Tuttle):
Replying to [comment:34 akaariai]:
> A safe-but-slow implementation is to resort to one item at time insert
> if primary keys are requested.
If I may, I'd prefer to trust that users know their situation and its
limitations. No magic fallbacks. I love Django for doing what it says.
IMHO, once the user demands a bulk insert, a fast bulk insert should
always be done, and when PKs are requested on an unsupporting backend, I'd
expect a not-implemented error with an explanation and a recommendation to
be raised. That gets caught during development, right?
Personally, I don't mind at all if, in these non-essential cases, the
framework transparently admits that not all backends are the same.
Many thanks to everyone involved so far! I'm really looking forward to the
feature!
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:35>
Comment (by akaariai):
I see your point.
The problem is that 3rd party apps are forced to skip bulk_create on all
backends, because the app might be used on a backend that doesn't support
fetching primary keys. Thus users of PostgreSQL pay for limitations in
MySQL.
The solution might be a method like maybe_bulk_create(): insert the data
as fast as possible, with bulk_create() if the backend supports returning
primary keys, and with a loop and save() otherwise.
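A minimal sketch of that dispatch, with the backend capability and the two insert paths mocked as plain callables; the name `maybe_bulk_create` and everything else here are hypothetical, not Django API:

```python
# maybe_bulk_create() sketch: take the bulk path when the backend can
# return primary keys, otherwise fall back to per-object saves.
def maybe_bulk_create(objs, can_return_ids, bulk_insert, save_one):
    if can_return_ids:
        bulk_insert(objs)          # one query; backend fills in the pks
    else:
        for obj in objs:           # safe fallback: one INSERT per object
            save_one(obj)

calls = []
maybe_bulk_create(
    ["a", "b"],
    can_return_ids=False,
    bulk_insert=lambda objs: calls.append(("bulk", list(objs))),
    save_one=lambda obj: calls.append(("save", obj)),
)
print(calls)  # → [('save', 'a'), ('save', 'b')]
```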
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:36>
Comment (by Tuttle):
Thank you for the explanation.
I tend to see this backend-compatibility feature as configuring the
existing call, so personally I see adding a new API call as merely
complicating things further. Many newbies are already frowning upon these
parts of the ORM, I guess.
So why not allow both: the compatibility requirements of 3rd party apps,
as well as an efficient modern feature for those who can afford it and
want to avoid one-by-one inserts?
I'd propose something like this, ''so the behavior stays the same'':
{{{
#!python
def bulk_create(..., allow_fallback=True):
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:37>
Comment (by akaariai):
Yes, that API is completely fine for me.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:38>
Comment (by timgraham):
The `chain.from_iterable()` addition broke Oracle:
{{{
File "/home/tim/code/django/django/db/models/sql/compiler.py", line
1038, in as_sql
return [(" ".join(result), tuple(chain.from_iterable(params)))]
TypeError: 'Oracle_datetime' object is not iterable
}}}
`params` is something like: `['admin', u'0001_initial',
Oracle_datetime(2016, 3, 4, 12, 1, 9, 38631),
<django.db.backends.oracle.utils.InsertIdVar object at 0x7fb0343a9710>]`.
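The failure can be reproduced without Oracle: `chain.from_iterable()` expects an iterable of iterables, so a non-iterable element in a flat params list (here an int stands in for the `Oracle_datetime` object) raises the same `TypeError`.

```python
from itertools import chain

# Flat params list: the int stands in for the non-iterable
# Oracle_datetime value from the traceback above.
params = ["admin", "0001_initial", 42]
try:
    tuple(chain.from_iterable(params))  # flattens the strings, then fails
    error = None
except TypeError as exc:
    error = str(exc)

print(error)  # → 'int' object is not iterable
```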
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:39>
Comment (by Tim Graham <timograham@…>):
In [changeset:"359be4460e89e325fb2969b1f231c9507d1c116b" 359be446]:
{{{
#!CommitTicketReference repository=""
revision="359be4460e89e325fb2969b1f231c9507d1c116b"
Refs #19527 -- Fixed SQL compiler regression causing Oracle failure.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:40>
Comment (by Tuttle):
Isn't the feature part of the just-released Django 1.10, allowing this
ticket to be closed?
But mainly: a BIG thanks to everyone involved! :-)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:41>
Comment (by timgraham):
Per comment:33, "(leaving the ticket open for the Oracle implementation)".
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:42>
Comment (by Tuttle):
I'm sorry, Tim.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:43>
Comment (by Malik A. Rumi):
With the implementation of this new feature, shouldn't the language of
https://docs.djangoproject.com/en/1.10/ref/models/instances/#explicitly-
specifying-auto-primary-key-values **also** be changed, in addition to the
changes made to
https://docs.djangoproject.com/en/1.10/ref/models/querysets/ ?
I'd offer to do it via a pull request, but I'm not sure I know enough
about how all this is implemented to write it correctly. In fact, if I'm
wrong about the need to update the language of
https://docs.djangoproject.com/en/1.10/ref/models/instances/#explicitly-
specifying-auto-primary-key-values, that would be ''proof'' that I don't
understand! ;-)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:44>
* cc: felixxm (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:45>
Comment (by Mariusz Felisiak):
It's also supported on SQLite 3.35+ (see
93ed71e05802a47774b52503cdc3442686d686c1) and MariaDB 10.5+ (see
98abc0c90e0eb7fe3a71cfa360962cf59605f1d3).
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:46>
Comment (by Mariusz Felisiak):
#34011 was a duplicate, with an implementation idea for MySQL.
--
Ticket URL: <https://code.djangoproject.com/ticket/19527#comment:47>