how to avoid pk collisions at loaddata time

3,317 views
Skip to first unread message

Phlip

unread,
Dec 8, 2009, 7:28:21 PM12/8/09
to Django users
Djangoids:

Does the data fixture support automatically generated primary keys?

We need to add a tiny stretch of new data to an existing website. A
fixture, such as new_data.json, would work great for both the new data
and for their new feature's unit tests. Then we would just run
loaddata on the server.

However, the fixture format seems to require hard-coded PKs. So we
should not risk overwriting (or skipping?) any preexisting live data.

I could code around this, either by looking up the current PKs and
then moving the new ones to miss them, or by writing a README.txt file
explaining how to add the same data thru the Admin. But this would
seem to be a systemic problem with fixtures, so I was curious if they
had a systemic solution.

--
Phlip
http://c2.com/cgi/wiki?MoreliaViridis

Russell Keith-Magee

unread,
Dec 8, 2009, 8:13:53 PM12/8/09
to django...@googlegroups.com
At present, no. There is no systemic solution to this. Fixtures
require that you define a primary key.

In the past, it has been proposed that we allow fixtures to omit the
primary key, which would result in the same behavior that Django has
in code - a missing primary key results in an SQL INSERT and a new
object instance. However, this introduces a new problem - how do you
reference an object that is contained in a fixture (e.g., a foreign
key reference in the fixture to an object that is also in the
fixture)? As always, suggestions are welcome.

Yours,
Russ Magee %-)

Phlip

unread,
Dec 9, 2009, 12:23:02 PM12/9/09
to Django users
On Dec 8, 5:13 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:

> fixture)? As always, suggestions are welcome.

The suggestion is

- template the json
- name each record
- the PK is the hash of the name
- link FKs by name, so they get the same hash

That way a reader can see an PK or FK, stick in its hash, and keep
going, without the need to read the other table and find the matching
FK or PK, respectively.

(the way >cough< Rails does it)

The problem now, at loaddata time in production, is the hashes still
might (one in a million chance) collide with a preexisting PK. And the
next problem is the hashes will bump their PK incrementors way up,
throwing away whole ranges of valid fictitious IDs, when the next
natural record inserts.

How about you add a argument to loaddata, such as --gentle, which
throws an error (and rolls everything back) if a new PK collides. Then
the new feature can wait while programmers fix their (hardcoded) PKs
and try again.

That argument would also work at test time, because the unit tests are
expected to always get all PKs right, so the --gentle would be always
turned on, and would be a kind of assertion.

--
Phlip
http://zeekland.zeroplayer.com/Uncle_Wiggilys_Travels/1

Russell Keith-Magee

unread,
Dec 9, 2009, 6:10:36 PM12/9/09
to django...@googlegroups.com
On Thu, Dec 10, 2009 at 1:23 AM, Phlip <phli...@gmail.com> wrote:
> On Dec 8, 5:13 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
>
>> fixture)? As always, suggestions are welcome.
>
> The suggestion is
>
>  - template the json

I'm not sure how this helps. We already have a 100% reliable method
for generating JSON - it's called SimpleJSON, and it's part of the
Python standard library.

>  - name each record

Where exactly does the name come from when data is being dumped programatically?

>  - the PK is the hash of the name
>  - link FKs by name, so they get the same hash

Ok; using some non-pk value for PK references is certainly one way to
handle this. There is an issue around how to resolve a hash into an
actual pk value, but that shouldn't be impossible.

The big issue is how to format a hash so that it can be differentiated
from a primary key value. To shortcut a couple of obvious (but wrong)
solutions:
* "just use hash values all the time" isn't an acceptable answer,
because we have backwards compatibility to consider
* "if it's an int, use pk, if it's a string, use hash" doesn't work,
because Django allows strings as primary keys. It isn't common, but it
is possible.

> That way a reader can see an PK or FK, stick in its hash, and keep
> going, without the need to read the other table and find the matching
> FK or PK, respectively.
>
> (the way >cough< Rails does it)
>
> The problem now, at loaddata time in production, is the hashes still
> might (one in a million chance) collide with a preexisting PK. And the
> next problem is the hashes will bump their PK incrementors way up,
> throwing away whole ranges of valid fictitious IDs, when the next
> natural record inserts.

Hash collisions aren't a huge concern to me. As long as whatever you
are hashing has sufficient entropy that collisions on *input* to the
hash aren't possible (or especially likely).

> How about you add a argument to loaddata, such as --gentle, which
> throws an error (and rolls everything back) if a new PK collides. Then
> the new feature can wait while programmers fix their (hardcoded) PKs
> and try again.
>
> That argument would also work at test time, because the unit tests are
> expected to always get all PKs right, so the --gentle would be always
> turned on, and would be a kind of assertion.

There's one other minor flaw in your proposal - the phrase "How about
you add...". If this is your itch, either *you* need to scratch it, or
*you* find someone else that will. :-) This is a volunteer project,
and code gets written by people that want it. It's worth sticking the
idea in Trac, but that doesn't mean that the feature will magically
get implemented.

Yours,
Russ Magee %-)

Phlip

unread,
Dec 9, 2009, 6:59:22 PM12/9/09
to Django users
On Dec 9, 3:10 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:

> Ok; using some non-pk value for PK references is certainly one way to
> handle this. There is an issue around how to resolve a hash into an
> actual pk value, but that shouldn't be impossible.

In Rails, a YAML (JSON) fixture like this...

norbert:
name: the nark

...expands into this...

norbert:
id: <%= hash('norbert') %>
name: the nark

...hence this:

norbert:
id: 39779393
name: the nark

Then the id stamps into the database as that record's pk.

> The big issue is how to format a hash so that it can be differentiated
> from a primary key value. To shortcut a couple of obvious (but wrong)
> solutions:
>   * "just use hash values all the time" isn't an acceptable answer,
> because we have backwards compatibility to consider
>  * "if it's an int, use pk, if it's a string, use hash" doesn't work,
> because Django allows strings as primary keys. It isn't common, but it
> is possible.

That's why I suggested adding the templating layer to the JSONs.

In general, Django "encourages" screwing with the Admin, then
extruding sample records, while RoR "encourages" writing very terse,
very templated YAML files as test code source.

> > The problem now, at loaddata time in production, is the hashes still
> > might (one in a million chance) collide with a preexisting PK. And the
> > next problem is the hashes will bump their PK incrementors way up,
> > throwing away whole ranges of valid fictitious IDs, when the next
> > natural record inserts.

> Hash collisions aren't a huge concern to me. As long as whatever you
> are hashing has sufficient entropy that collisions on *input* to the
> hash aren't possible (or especially likely).

But abandoning all those fictitious numbers, say between our highest
record of 204 and our hash of 39779393. The auto-incrementor will use
39779394 next, and so on. Then all of those numbers between 204 and
39779393 will feel bad, because they never got to index a record.

> > How about you add a argument to loaddata, such as --gentle, which
> > throws an error (and rolls everything back) if a new PK collides. Then
> > the new feature can wait while programmers fix their (hardcoded) PKs
> > and try again.
>
> > That argument would also work at test time, because the unit tests are
> > expected to always get all PKs right, so the --gentle would be always
> > turned on, and would be a kind of assertion.
>
> There's one other minor flaw in your proposal - the phrase "How about
> you add...". If this is your itch, either *you* need to scratch it, or
> *you* find someone else that will. :-) This is a volunteer project,

I apologize for triggering the "that's how open source works" lecture.

Russell Keith-Magee

unread,
Dec 9, 2009, 7:25:31 PM12/9/09
to django...@googlegroups.com
On Thu, Dec 10, 2009 at 7:59 AM, Phlip <phli...@gmail.com> wrote:
> On Dec 9, 3:10 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
>
>> Ok; using some non-pk value for PK references is certainly one way to
>> handle this. There is an issue around how to resolve a hash into an
>> actual pk value, but that shouldn't be impossible.
>
> In Rails, a YAML (JSON) fixture like this...
>
>   norbert:
>       name: the nark
>
> ...expands into this...
>
>   norbert:
>      id: <%= hash('norbert') %>
>      name: the nark
>
> ...hence this:
>
>  norbert:
>    id: 39779393
>    name: the nark
>
> Then the id stamps into the database as that record's pk.

Holy mother of what? Randomly inventing a PK and hoping you don't have
a collision? Ah, no. No. Not ever.

>> The big issue is how to format a hash so that it can be differentiated
>> from a primary key value. To shortcut a couple of obvious (but wrong)
>> solutions:
>>   * "just use hash values all the time" isn't an acceptable answer,
>> because we have backwards compatibility to consider
>>  * "if it's an int, use pk, if it's a string, use hash" doesn't work,
>> because Django allows strings as primary keys. It isn't common, but it
>> is possible.
>
> That's why I suggested adding the templating layer to the JSONs.
>
> In general, Django "encourages" screwing with the Admin, then
> extruding sample records, while RoR "encourages" writing very terse,
> very templated YAML files as test code source.

What rubbish. Django encourages nothing of the sort. Django provides
an admin tool. It's a useful tool, especially as - surprise surprise -
an administration interface. You *can* use it to generate data for
fixtures if you want to. I challenge you to find anywhere in the docs
that say the admin interface is *the* way to generate fixtures.

Personally, I never use the admin to build my test fixtures - I hand
write them. Django's fixture format is simple (at least, in JSON and
YAML it is - XML is verbose, but that's XML for you). When you're in
complete control of all the data (as you should be during testing),
hard coding primary keys isn't a problem.

On the subject of which - Why on earth do you even *need* pk-less
objects in a test fixture? If you're in control of all the data - as
you should be during testing - the original problem you describe of
avoiding PK collisions at run time doesn't really exist.

I accept that this problem exists for loading new data into the
database - i.e., "create 10 new people - here are their names and
addresses, but I don't have pks for them " - but for testing? Not as
far as I can see.

>> > The problem now, at loaddata time in production, is the hashes still
>> > might (one in a million chance) collide with a preexisting PK. And the
>> > next problem is the hashes will bump their PK incrementors way up,
>> > throwing away whole ranges of valid fictitious IDs, when the next
>> > natural record inserts.
>
>> Hash collisions aren't a huge concern to me. As long as whatever you
>> are hashing has sufficient entropy that collisions on *input* to the
>> hash aren't possible (or especially likely).
>
> But abandoning all those fictitious numbers, say between our highest
> record of 204 and our hash of 39779393. The auto-incrementor will use
> 39779394 next, and so on. Then all of those numbers between 204 and
> 39779393 will feel bad, because they never got to index a record.

We're on a completely different page here. I have no problem with
using a hash as an fixture-internal reference to an object until such
time as the object is assigned a real pk by the databse. Using a hash
of content as an actual primary key values is a completely different
matter (and, to reinforce my previous point - no, no, not ever).

Yours,
Russ Magee %-)

Phlip

unread,
Dec 10, 2009, 12:41:54 PM12/10/09
to Django users
On Dec 9, 4:25 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:

> > In general, Django "encourages" screwing with the Admin, then
> > extruding sample records, while RoR "encourages" writing very terse,
> > very templated YAML files as test code source.
>
> What rubbish.

Just a netiquette note - I stopped reading here.
Reply all
Reply to author
Forward
0 new messages