"The Case for the Surrogate Key", choosing Primary Keys

serge

unread,

Feb 4, 2002, 1:55:50 AM2/4/02

to

My remarks are based upon reading the following IMHO extremely interesting
article:

http://www.dbpd.com/vault/9805xtra.htm (Many Thanks to Clive Bolton for
sharing the link)

If we are to make a distinction between the logical key and the physical
key, and we read carefully Mike Lonigro's
points in his article, shouldn't we always use Primary keys that are
unrelated to the table's attributes.

I'm not an expert on this field, but his explanation seem very convincing to
me. Do others feel the same way, or
there are other things to consider when deciding what Best solution to use
when deciding a physical PK?

I am pretty much convinced to use surrogate autonumber keys as Primary Keys
on all my tables that have or
might have relations with other entities.

Am I right to do this? Are there any good reasons NOT to use surrogate keys?

If anyone is interested, this was my first post over a week ago on Data
Normalization:

http://groups.google.com/groups?hl=en&threadm=jX%2548.1993%249y2.468659%40ne
ws20.bellglobal.com&rnum=6&prev=/groups%3Fas_ugroup%3Dcomp.databases.ms-sqls
erver%26as_uauthors%3Dserge%26hl%3Den

Many thanks to the people who responded to my first post on Jan 25. After
reading a couple of chapters from 2 different Access books about First,
Second and Third Norms, the articles on www.sqlmag.com and this article
about "The Case for the Surrogate Key", I am eager to learn more.

What books would you suggest to further read about Database Design and
Relationships? I read many book titles suggested in my Access books, one
that is in my mind right now is: "Database Design for mere Mortals".

Since there seems to be quite a number of books on this subject, I am
curious to know which book you would recommend and why.

By the way, to my Jan 25 post, if someone is reading my original post, I
have the following scenario:

One Company can have many Products
Product NAMES are not unique, two different Companies can have the same
Product Name
Each Product can have MANY Versions

I will use four tables:
Customers with CustomerID as its PK (autonumber surrogate key)
Products with Product ID as its PK (autonumber surrogate key),
CustomerID as its Foreign Key
CustomersProducts with CustomerProduct as its PK (autonumber surrogate
key), CustomerID and ProductID as foreign keys
ProductVersions with VerID as PK (an autonumber surrogate key), Product
ID as foreign key and Version Number as an Integer
that holds the sequential version number for the productID

Am I correct in my design model here? Because i am a bit confused. First I
would like to find out if my logical model is correct before i figure out if
the physical model of this is simple?

Thank you very much

Torsten L

unread,

Feb 4, 2002, 4:44:15 AM2/4/02

to

"serge" <sergea@RemoveThis_ehmail.com> skrev i meddelandet
news:0gq78.7648$l37.6...@news20.bellglobal.com...

> If we are to make a distinction between the logical key and the physical
> key, and we read carefully Mike Lonigro's
> points in his article, shouldn't we always use Primary keys that are
> unrelated to the table's attributes.
>
> I'm not an expert on this field, but his explanation seem very convincing
to
> me. Do others feel the same way, or
> there are other things to consider when deciding what Best solution to use
> when deciding a physical PK?

When using Jet replication, I have found that logical primary keys actually
can be advantageous. In a database, I put a logical PK in the table with the
"top entity" of the database, tblProject, and surrogate PK in the other
tables. When the users had synch conflicts, the person maintaining the
replica set could read in plain text which project was involved, and ask the
responsable person what to do about it. The changing of primary keys, that
Mike Lonigro calls an "unpleasant option", is taken care of by the Cascading
updates function (after a confirmation msg box).

Torsten L

serge

unread,

Feb 4, 2002, 9:19:57 AM2/4/02

to

> When using Jet replication, I have found that logical primary keys
actually
> can be advantageous. In a database, I put a logical PK in the table with
the
> "top entity" of the database, tblProject, and surrogate PK in the other
> tables. When the users had synch conflicts, the person maintaining the
> replica set could read in plain text which project was involved, and ask
the
> responsable person what to do about it.

If I understand you correctly, you chose to use a logical PK for tblProject
because
tblProject is part of a Replication table?

So unless you use a logical text as your PK, if two people create the same
PK in text
on two different replicas, during the synch, this will result in a synch
conflict, thus allow
the users to merge the two records into one?

If we were to use a surrogate key here in this instance, then we'd basically
not have
any synch conflicts and end up having logical duplicates in the table.

At least, that's how I understand it.

>The changing of primary keys, that
> Mike Lonigro calls an "unpleasant option", is taken care of by the
Cascading
> updates function (after a confirmation msg box).

I don't think it is just a matter of "Cascading Updates". The way I
understood it, if you
use surrogate PK, you could modify your logical unique key attributes a lot
easier and not
worry about updating any of its Foreign Keys througout the database.

serge

unread,

Feb 4, 2002, 9:23:12 AM2/4/02

to

> One Company can have many Products
> Product NAMES are not unique, two different Companies can have the
same
> Product Name
> Each Product can have MANY Versions
>
> I will use four tables:
> Customers with CustomerID as its PK (autonumber surrogate key)
> Products with Product ID as its PK (autonumber surrogate key),
> CustomerID as its Foreign Key

I think this is a mistake. I don't think CustomerID will belong in the
Products table as a Foreign Key
since it is the "junction" table of CustomersProducts.

--CELKO--

unread,

Feb 4, 2002, 11:20:27 AM2/4/02

to

>> points in his article, shouldn't we always use Primary keys that
are
unrelated to the table's attributes. <<

Only if we wish NEVER to verify your data model against the reality it
is supposed to model.

The IDENTITY column is a hold over from the early programming language
which were <i>very<i> close to the hardware. For example, the fields
in a COBOL or FORTRAN program were assumed to be physically located in
main storage in the order they were declared in the program. This
meant that you could define a template that overlaid the same physical
space and read the representation in several different ways. In
COBOL, the command was REDEFINES, EQUIVALENCE in FORTRAN and a union
in 'C'.

From a logical viewpoint, this redefinition makes no sense at all. It
is confusing the numeral with the number the numeral represents. The
history of programming after this point in time has been to divorce
the logical and physical models completely.

The early SQLs were based on existing file systems. The data was kept
in physically contiguous disk pages, in physically contiguous rows,
made up of physically contiguous columns. In short, just like a deck
of punch cards or a magnetic tape.

But physically contiguous storage is only one way of building a
relational database and it is not always the best one. But aside from
that, the whole idea of a relational database is that user is not
supposed to know how things are stored at all, much less write code
that depends on the particular physical representation in a particular
release of a particular product.

One of the biggest errors is the IDENTITY column in the Sybase family
(SQL Server and Sybase). People actually program with this "feature"
and even use it as the primary key for the table! Now, let's go into
painful details as to why this thing is bad.

The practical considerations are that IDENTITY is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products. It also
has some very strange bugs in both Sybase and SQL Server; go to a
newsgroup and do a search.

But let's look at the logical problems. First try to create a table
with two columns and try to make them both IDENTITY columns. If you
cannot declare more than one column to be of a certain datatype, then
that thing is not a datatype at all, by definition.

Next, create a table with one column and make it an IDENTITY column.
Now try to insert, update and delete different numbers from it. If
you cannot insert, update and delete rows from a table, then it is not
a table by definition.

Finally create a simple table with one IDENTITY column and a few other
columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

to put a few rows into the table and notice that the IDENTITY column
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the
sequence continues from the highest number that has ever been used in
that column in that particular table.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick?
The answer has been to use whatever the physical order of the result
set happened to be. That non-relational phrase "physical order"
again.

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order. Can you explain from a logical model why
the same rows in the second query get different IDENTITY numbers? In
the relational model, they should be treated the same if all the
values of all the attributes are identical.

Think about trying to do replication on two databases that differ only
by an index, or by cache size or something that occasionally gives
them different execution plans for the same statements. Want to try
to maintain such a system?

There are better ways of creating identifiers.

>> a physical PK? <<

You are a programmer; you only decide on logical keys. If the system
wants to locate the data by indexes, bit vectors or hashing, that is
not your concern.

>> Am I right to do this? Are there any good reasons NOT to use
surrogate keys? <<

I make BIG distinctions among nautral keys, artifical keys and
surrogate keys; other writers tend to blur them. Be careful. You
will find a lot of OO people who want to fake a OID with GUID or
IDENTITY or ROWID -- it just does not work.

>> One Company can have many Products
Product NAMES are not unique, two different Companies can have
the same
Product Name
Each Product can have MANY Versions

I will use four tables:
Customers with CustomerID as its PK (autonumber surrogate key)
Products with Product ID as its PK (autonumber surrogate key),
CustomerID as its Foreign Key
CustomersProducts with CustomerProduct as its PK (autonumber
surrogate
key), CustomerID and ProductID as foreign keys
ProductVersions with VerID as PK (an autonumber surrogate key),
Product
ID as foreign key and Version Number as an Integer
that holds the sequential version number for the productID <<

1) Please post DDL so people do not have guess about your schema.

2) Version is a property of a product, not an enity in itself.

3) Since you did not tell us what the products are, I would assume
that they have a UPC or EAN code and I would use that natural key
instead a meaningless, error-prone, proprietary IDENTITY column.

CREATE TABLE Customers
(customer_id INTEGER NOT NULL PRIMARY KEY,
...);

CREATE TABLE Products
(upc DECIMAL(10,0) NOT NULL PRIMARY KEY,
...);

CREATE TABLE CustomersPurchases
(customer_id INTEGER NOT NULL
REFERENCES Customers(customer_id)
ON UPDATE CASCADE,
upc DECIMAL(10,0) NOT NULL
REFERENCES Products(upc)
ON UPDATE CASCADE,
...
PRIMARY KEY (customer_id, upc));

You missed the nautral key (customer_id, upc) because you thought that
the redundant IDENTITY column would work. It does not -- it lets you
insert non-existent product and non-existent customers and never stops
you.

Torsten L

unread,

Feb 4, 2002, 5:41:01 PM2/4/02

to

"serge" <sergea@RemoveThis_ehmail.com> skrev i meddelandet

news:pPw78.6291$Xw.8...@news20.bellglobal.com...

> > When using Jet replication, I have found that logical primary keys
> actually
> > can be advantageous. In a database, I put a logical PK in the table with
> the
> > "top entity" of the database, tblProject, and surrogate PK in the other
> > tables. When the users had synch conflicts, the person maintaining the
> > replica set could read in plain text which project was involved, and ask
> the
> > responsable person what to do about it.
>
> If I understand you correctly, you chose to use a logical PK for
tblProject
> because
> tblProject is part of a Replication table?

No, I do not think so. I have to admit that I was inconsistent, that's all.

> So unless you use a logical text as your PK, if two people create the same
> PK in text
> on two different replicas, during the synch, this will result in a synch
> conflict, thus allow
> the users to merge the two records into one?

I am sorry if my English is unclear or if I was too brief. You are
overinterpreting what I wrote. I'll try to explain: When using replication
(with Jet as database engine), conflicts sometimes occur when the same
record is updated in more than one replica. When you solve the conflicts, a
dialogue box will show you both versions and ask you which one to keep. I
simply meant, that the meaningful PK in tblProject gave meaningful foreign
keys in the child records in other tables. So when watching the info in the
conflict solver dialogue box, the user could tell which project the
conflicting record belonged to, and thus take appropriate measures.
Otherwise the user would have had to open tblProject, find the long integer
ID in the ID column and read the project name. This may sound trivial, but
it sure isn't to the user who nervously experiences conflict solving for the
first time, wondering what is meant by "conflict data" and "existing data".

I was just talking about the difference between reading "-12950332" compared
to "Project 1", in a situation where you can only see the data and not the
lookup values in other tables.

> >The changing of primary keys, that
> > Mike Lonigro calls an "unpleasant option", is taken care of by the
> Cascading
> > updates function (after a confirmation msg box).
>
> I don't think it is just a matter of "Cascading Updates". The way I
> understood it, if you
> use surrogate PK, you could modify your logical unique key attributes a
lot
> easier and not
> worry about updating any of its Foreign Keys througout the database.

I am talking about the Access-Jet feature "cascading updates". (I will
neither advocate nor argue against the use of cascading updates. I also have
an impression that some of the seniors in this newsgroup do not favor the
use of it.) However, in my example any changes in a logical PK in tblProject
would have been cascaded by Jet to the foreign keys in every instance in the
related tables. I cannot see how I could worry less about updating foreign
keys throughout the database.

Regards
Torsten L

Clive Bolton

unread,

Feb 4, 2002, 8:36:24 PM2/4/02

to

Again, Joe presents straw and effluent to make a point.

Comments interspersed.

"--CELKO--" <71062...@compuserve.com> wrote in message
news:c0d87ec0.02020...@posting.google.com...

> >> points in his article, shouldn't we always use Primary keys that
> are
> unrelated to the table's attributes. <<
>
> Only if we wish NEVER to verify your data model against the reality it
> is supposed to model.

Wrong.

The Primary Key (whatever it is) has two functions:
= Identify uniquely and permanently a row of data in a table for selection
and deletion. It is equivalent to pointing a finger at a line and saying
"that one".
= As a foreign key value in other tables to establish relationships.

The data in the row to which the pk refers is still there, and it is the
data that is used to verify against reality. (that is, an internal to
external comparison.)

> The IDENTITY column is a hold over from the early programming language

This is straw - set it up and knock it down. Children regularly use this
technique.

> which were <i>very<i> close to the hardware. For example, the fields
> in a COBOL or FORTRAN program were assumed to be physically located in
> main storage in the order they were declared in the program. This
> meant that you could define a template that overlaid the same physical
> space and read the representation in several different ways. In
> COBOL, the command was REDEFINES, EQUIVALENCE in FORTRAN and a union
> in 'C'.
>
> From a logical viewpoint, this redefinition makes no sense at all.

Agreed - but irrelevant. The original poster was not talking about a "record
number" at all. The actual value allocated is irrelevant, as long as it is
unique and permanent in that table.

> It is confusing the numeral with the number the numeral represents. The
> history of programming after this point in time has been to divorce
> the logical and physical models completely.
>
> The early SQLs were based on existing file systems. The data was kept
> in physically contiguous disk pages, in physically contiguous rows,
> made up of physically contiguous columns. In short, just like a deck
> of punch cards or a magnetic tape.
>
> But physically contiguous storage is only one way of building a
> relational database and it is not always the best one. But aside from
> that, the whole idea of a relational database is that user is not
> supposed to know how things are stored at all, much less write code
> that depends on the particular physical representation in a particular
> release of a particular product.

Mildly interesting, but irrelevant. The autonumber/identity has nothing to
do with how data is stored, and is not dependent upon it. It can be random,
GUID, anything - as long as it is unique and permanent in the table. As it
is structural and not data, it should not be exposed to users.

> One of the biggest errors is the IDENTITY column in the Sybase family
> (SQL Server and Sybase). People actually program with this "feature"
> and even use it as the primary key for the table! Now, let's go into
> painful details as to why this thing is bad.

I see the biggest problem is in trying to overlay Primary Key concepts onto
attributes - that is to give an attribute two (and sometimes conflicting)
meanings.

> The practical considerations are that IDENTITY is proprietary and
> non-portable,

As Identity has nothing to do with record number, the values are entirely
portable, as I have implemented many times. And with much greater certainty
of maintaining data integrity.

> so you know that you will have maintenance problems when
> you change releases or port your system to other products. It also
> has some very strange bugs in both Sybase and SQL Server; go to a
> newsgroup and do a search.

Note that we (comp.databases.ms-access) are frequently talking of Jet as the
default db engine.

> But let's look at the logical problems. First try to create a table
> with two columns and try to make them both IDENTITY columns. If you
> cannot declare more than one column to be of a certain datatype, then
> that thing is not a datatype at all, by definition.

Straw! Why would one want two alternative unique identifiers of one row? In
Jet and SS the data type is 32bit integer, and can certainly be used in
several columns. It is only one use of such data type that is being
specified. You may as well argue that only one datetime datatype field may
be used in a table because there should be only one column of DateCreated.

The Primary Key is not data, it is a row pointer. Nothing more. The data in
that row establishes the link to a specific instance of an entity. The
Primary Key is a structural element.

> Next, create a table with one column and make it an IDENTITY column.
> Now try to insert, update and delete different numbers from it. If
> you cannot insert, update and delete rows from a table, then it is not
> a table by definition.

How can one identify a row of data if there is no data? Another straw
argument based on a myopic view of what is involved.

> Finally create a simple table with one IDENTITY column and a few other
> columns. Use a few statements like
>
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
> INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
> INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
>
> to put a few rows into the table and notice that the IDENTITY column
> sequentially numbered them

Not necessarily so. Again based on a limited view.

in the order they were presented. If you
> delete a row, the gap in the sequence is not filled in and the
> sequence continues from the highest number that has ever been used in
> that column in that particular table.
>
> But now use a statement with a query expression in it, like this:
>
> INSERT INTO Foobar (a, b, c)
> SELECT x, y, z
> FROM Floob;
>
> Since a query result is a table, and a table is a set which has no
> ordering, what should the IDENTITY numbers be?

The Identity numbers should be unique and permanent - what the actual values
are is irrelevant.

> The entire, whole,
> completed set is presented to Foobar all at once, not a row at a time.
> There are (n!) ways to number (n) rows, so which one do you pick?

The numbering is not necessarily based on any concept of order, so which way
of generating the unique permanent values is irrelevant.

> The answer has been to use whatever the physical order of the result
> set happened to be. That non-relational phrase "physical order"
> again.

So why not stop using it (irrelevantly)?

> But it is actually worse than that. If the same query is executed
> again, but with new statistics or after an index has been dropped or
> added, the new execution plan could bring the result set back in a
> different physical order.

So what? Identity is not about order. That is a straw man. That new
statistics or indices may alter the order in which records are retrieved has
no relevance to the use of Identity values.

> Can you explain from a logical model why
> the same rows in the second query get different IDENTITY numbers? In
> the relational model, they should be treated the same if all the
> values of all the attributes are identical.

Well! All aspects of a database are at risk to stupidity. If the attributes
in the rows are identical AND the real-life items they refer to are the
same, then a double-insertion should be blocked. If the attributes are the
same and the real-life items are in fact different, then the rows should be
inserted and be given different ID's. This is an argument *for* surrogate
keys. Such occurences do arise, and must be handled in the system, but often
with such infrequency that to extend the model by adding further attributes
to the pk is quite unwarranted, and indeed intrusive into the business of
the organisation.

Real-life example from a 35,000 subscription list: Twin daughters each give
birth to a daughter on the same day. Each insists on calling their child by
their mother's name. The babies were born in different towns. It is simply
impractical to extend the pk to LN FN DOB POB, just to resolve one conflict.
Mutilating say surname to achieve differentiation is repugnant, and moves
away from matching reality. Whatever approach is used to store details of
the granddaughters, human intervention will be required to differentiate.

> Think about trying to do replication on two databases that differ only
> by an index, or by cache size or something that occasionally gives
> them different execution plans for the same statements. Want to try
> to maintain such a system?

There is no problem: replication must preserve both data AND relationships,
and this can be achieved.

> There are better ways of creating identifiers.

Does this mean that row identifiers are ok, and only one specific
implementation is being rejected? A row identifier is not an attribute of an
entity (ie not data).

> >> a physical PK? <<
>
> You are a programmer; you only decide on logical keys. If the system
> wants to locate the data by indexes, bit vectors or hashing, that is
> not your concern.
>
> >> Am I right to do this? Are there any good reasons NOT to use
> surrogate keys? <<
>
> I make BIG distinctions among nautral keys, artifical keys and
> surrogate keys; other writers tend to blur them.

It would be interesting to hear of those distinctions.

... clipped - too tiresome.

> 3) Since you did not tell us what the products are, I would assume
> that they have a UPC or EAN code and I would use that natural key
> instead a meaningless, error-prone, proprietary IDENTITY column.

Oh, are UPC or EAN codes meaningful? Error-free? Non-Proprietary?

Usage and printing errors; non-approved usage, non-coded product etc make
them useless as row ID's as there can be duplicates and nulls.
...

> You missed the nautral key (customer_id, upc) because you thought that
> the redundant IDENTITY column would work. It does not -- it lets you
> insert non-existent product and non-existent customers and never stops
> you.

Only if you ignore enforced relational integrity as a part of the relational
model. But that is equally true of "natural" pk's. That is, if any "natural
key" is ever guaranteed to be unique and permanent.

Clive

David W. Fenton

unread,

Feb 5, 2002, 12:45:37 AM2/5/02

to

torstenREM...@swipnetTHIS.se (Torsten L) wrote in
<PJs78.23131$l93.4...@newsb.telia.net>:

>When using Jet replication, I have found that logical primary keys
>actually can be advantageous. In a database, I put a logical PK in
>the table with the "top entity" of the database, tblProject, and
>surrogate PK in the other tables. When the users had synch
>conflicts, the person maintaining the replica set could read in
>plain text which project was involved, and ask the responsable
>person what to do about it. The changing of primary keys, that
>Mike Lonigro calls an "unpleasant option", is taken care of by the
>Cascading updates function (after a confirmation msg box).

I have done plenty of replication.

I would NEVER, EVER, EVER consider using real keys for PK.

NEVER.

There are way too many problems there.

If your users need to be able to resolve conflicts, then you need
to build your own conflict resolver that disambiguates the meaning
of the data for them.

Of course, I can't think of very many situations with replication
conflicts where I couldn't get to what I needed by joing the
Conflict table with the relevant data table, except with the
replica I was working in did not contain the relevant record.

Now, with replication *errors*, it was a different story, as those
are pretty difficult to resolve in Jet 3.5. Jet 4 has eliminated
the whole concept of replication errors by treating them as
conflicts, so perhaps that's what you're referring to. If so, then
you are really asking a lot of your users, as replication errors
are the result of mistreatment of your replica set (say by
promiscuously copying and deleting replics) or of design errors in
your underlying schema. It's hardly fair to expect end users to
resolve *those* kinds of problems.

I strongly doubt that's what you mean, and if so, I don't see the
problem, as conflict records should hardly ever lack sufficient
meaningful data *if* you use time/date stamps for your updates.
Lacking that, it can be kind of difficult.

But I would think that cascading updates would only *increase* the
number of conflicts, as you are adding a whole other set of fields
that can be updated, and, hence, conflict with each other.

No, I think replication is an exceedingly poor reason to utilize
meaningful PKs.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc

David W. Fenton

unread,

Feb 5, 2002, 12:50:22 AM2/5/02

to

sergea@RemoveThis_ehmail.com (serge) wrote in
<pPw78.6291$Xw.8...@news20.bellglobal.com>:

>If I understand you correctly, you chose to use a logical PK for
>tblProject because
>tblProject is part of a Replication table?
>
>So unless you use a logical text as your PK, if two people create
>the same PK in text
>on two different replicas, during the synch, this will result in a
>synch conflict, thus allow
>the users to merge the two records into one?

That's not the way replication works.

That would not be a conflict on the same record, but a collision
between adding the same PK in two different tables. Replication
tracks the real identity of a record, which includes the replica it
comes from and a Replication GUID in each record (and Jet 4 tracks
changes at the column level). So, if you create two records in two
different replicas that actually refer to the same entity and use a
meaningful PK, you will not get a conflict, but a schema error,
because you've got two completely different records with the same
PK.

Think of it this way:

If you add a record to a table with an AutoNumber PK, and then try
to append another record with a PK that's already been used, it
will be rejected, not because of a conflict in editing data, but
because there's a UNIQUE index on the field, and you can't add a
second record with the same value in that field.

*That's* what's happening, a PK collision.

With random AutoNumbers, in all my years of replication I haven't
seen a single such collision.

With meaningful PKs, you would have it happening all the time.

The downside to surrogate keys is that you have to do maintenance
to merge data for real-entity duplication.

David W. Fenton

unread,

Feb 5, 2002, 12:51:47 AM2/5/02

to

torstenREM...@swipnetTHIS.se (Torsten L) wrote in

<16E78.23256$l93.4...@newsb.telia.net>:

>I was just talking about the difference between reading "
>-12950332" compared to "Project 1", in a situation where you can
>only see the data and not the lookup values in other tables.

You really should write your own custom conflict resolver instead
of expecting your users to do that.

Your rationale is an extremely poor design decision, in my opinion.

David W. Fenton

unread,

Feb 5, 2002, 1:13:11 AM2/5/02

to

71062...@compuserve.com (--CELKO--) wrote in
<c0d87ec0.02020...@posting.google.com>:

>CREATE TABLE Customers
>(customer_id INTEGER NOT NULL PRIMARY KEY,
> ...);
>
>CREATE TABLE Products
>(upc DECIMAL(10,0) NOT NULL PRIMARY KEY,
> ...);
>
>CREATE TABLE CustomersPurchases
>(customer_id INTEGER NOT NULL
> REFERENCES Customers(customer_id)
> ON UPDATE CASCADE,
> upc DECIMAL(10,0) NOT NULL
> REFERENCES Products(upc)
> ON UPDATE CASCADE,
> ...
> PRIMARY KEY (customer_id, upc));
>
>You missed the nautral key (customer_id, upc) because you thought
>that the redundant IDENTITY column would work. It does not -- it
>lets you insert non-existent product and non-existent customers
>and never stops you.

This is the argument that is always made, and in cases where you
have:

1. an outside authority reliably assigning these "natural" keys
(UPCs), then, yes, you can "rely" on them, as long as you are
willing to rewrite your system any time the outside authority
decides to alter the way they generate their so-called "natural"
keys. Nor does it address problems that can come from circumstances
where vendors merge, and two customers become one. Many of us have
made lots of money rewriting systems that were built on the
assumption that something like a CustomerID and a UPC as
representations of real-world entities would be sufficient as a PK
right up to the point where the system changed in a way that made
that assumption obsolete.

2. the theoretical argument argument against identity fields (in
Jet, AutoNumber, which is really just a Long Integer with a
particular default value and when a PK, a UNIQUE non-null index) is
always made as if the users were going to assign some meaning to
that field (almost always the idea that they will interpret them as
having a natural order). For true surrogate key purposes, it really
matters not at all what the actual values are, since they have no
meaning except to serve as a substitute for real values that are
unsuitable as a PK (because the natural key must have more than one
null (because of missing or not-yet-known data), or because the
natural key can change, or because the natural key is a composite
that would be unwieldy to store as foreign key in other tables). It
doesn't matter if the values in these "identity" fields are
assigned randomly, because THEY HAVE NO MEANING, they only have
FUNCTION.

3. last of all, the record you are creating to store information
about a real-world entity must be able to be created for the
purposes the application in question needs it. If you design your
schema with a composite natural key where all fields that are part
of the key must have values in order for the record to be inserted,
then you will prevent the insertion of the records until such time
as the missing data is found, or (more likely), end users will
input fake data to permit the insertion. The record is not the
entity being modelled, and the logical representation must "degrade
gracefully" in terms of the way in which it is represented in the
schema. That is, in a perfect world, perfect theoretical schemas
would work well. In the real world, one must accomodate the lack of
data in some fashion in order to allow the database to be used.

I won't even go into the horrors of multi-column foreign keys or
join tables between tables with composite keys. In those instances,
the only practical solution is a surrogate key.

And I haven't even mentioned join performance.

(nice to see Mr. Celko in these parts!)

Torsten L

unread,

Feb 5, 2002, 6:39:35 AM2/5/02

to

> You really should write your own custom conflict resolver instead
> of expecting your users to do that.

> Snip

> Your rationale is an extremely poor design decision, in my opinion.

Thank you for your two replies. I think rationale means justification? Yes I
see your point. The shortcomings (related to specific user demands) of the
Access conflict resolver of course should not justify such central design
decisions as choice of primary keys. I repent and take back.

It has not occured to me before that I could write a custom conflict
resolver, I would have thought that to be a very complicated thing to do.
Thank you for pointing out an interesting possibility.

Torsten L

michael

unread,

Feb 5, 2002, 1:06:02 PM2/5/02

to

Clive,
Comments inlined.

> > Only if we wish NEVER to verify your data model against the reality it
> > is supposed to model.
>
> Wrong.

I think the point is that thinking of an Identity/Autonumber as *the* PK is
wrong. There should alwyas be a PK independent of it. The identity is useful
as a surrogate key for internal purposes, not a way of making records
unique.

...

> > The IDENTITY column is a hold over from the early programming language
>
> This is straw - set it up and knock it down. Children regularly use this
> technique.

I see the "row number" mentality a lot. It's a real problem. Leads to a lot
of procedural code. >

...

> Agreed - but irrelevant. The original poster was not talking about a
"record
> number" at all. The actual value allocated is irrelevant, as long as it is
> unique and permanent in that table.

"permanent": now hat is a problem with Identity: it isn't permanent. Try
replicating or merging data and you get to see just how vacuous and
impernanent the Identity value is.

> Mildly interesting, but irrelevant. The autonumber/identity has nothing to
> do with how data is stored, and is not dependent upon it. It can be
random,
> GUID, anything - as long as it is unique and permanent in the table. As it
> is structural and not data, it should not be exposed to users.

As you put it, it is "structural". It has no place in a model. records
should be unique even without it. If the Identity is all that distinguishes
two records and the user never sees it, then any list presented to the user
is going to have duplicate ntries and the uer will have no way of knowing
which is which. I have seen this happen...

>
> > One of the biggest errors is the IDENTITY column in the Sybase family
> > (SQL Server and Sybase). People actually program with this "feature"
> > and even use it as the primary key for the table! Now, let's go into
> > painful details as to why this thing is bad.

Hmmm... I think the problem lies withteh users. Given a clear awareness of
the limitations of Identity, it can make a useful surrogate key.

> I see the biggest problem is in trying to overlay Primary Key concepts
onto
> attributes - that is to give an attribute two (and sometimes conflicting)
> meanings.

I don't get that.

> > The practical considerations are that IDENTITY is proprietary and
> > non-portable,
>
> As Identity has nothing to do with record number, the values are entirely
> portable, as I have implemented many times. And with much greater
certainty
> of maintaining data integrity.

NONONONONO: I have seen people propose that we export the values from one db
to another with Identity values and rely only on the Identity values to
maintain integrity. The resulting mess doesn't bear thinking about.

> > so you know that you will have maintenance problems when
> > you change releases or port your system to other products. It also
> > has some very strange bugs in both Sybase and SQL Server; go to a
> > newsgroup and do a search.
>
> Note that we (comp.databases.ms-access) are frequently talking of Jet as
the
> default db engine.
>
> > But let's look at the logical problems. First try to create a table
> > with two columns and try to make them both IDENTITY columns. If you
> > cannot declare more than one column to be of a certain datatype, then
> > that thing is not a datatype at all, by definition.
>
> Straw! Why would one want two alternative unique identifiers of one row?
In
> Jet and SS the data type is 32bit integer, and can certainly be used in
> several columns. It is only one use of such data type that is being
> specified. You may as well argue that only one datetime datatype field may
> be used in a table because there should be only one column of DateCreated.

I think Joe's point is that Identity is not a first-class data type and
that is true. It may not matter much in practice since, as you say, one
never uses more than one.

> The Identity numbers should be unique and permanent - what the actual
values
> are is irrelevant.

You are correct that the actual values are irrelevant. You make Joe's point
for him: the values being irrelevant doens't quite gel with the way the
other columns work since their values are by definition not irrelevant, so,
again, there is something very non-standard about Identity.
The real problem is many developers without a theoretical DB background
write code that depends on specific Identity values (I have seen plenty of
it).

>
> > The entire, whole,
> > completed set is presented to Foobar all at once, not a row at a time.
> > There are (n!) ways to number (n) rows, so which one do you pick?
>
> The numbering is not necessarily based on any concept of order, so which
way
> of generating the unique permanent values is irrelevant.

Again, in reality many developers use it as such. That is a problem.

...

> Well! All aspects of a database are at risk to stupidity. If the
attributes
> in the rows are identical AND the real-life items they refer to are the
> same, then a double-insertion should be blocked. If the attributes are the
> same and the real-life items are in fact different, then the rows should
be
> inserted and be given different ID's.
> This is an argument *for* surrogate
> keys.

No! How can you then tell which is which?? If the end-user never sees your
identity values (as you said earlier) then how is she to know "John Doe"
from "John Doe"?
You are making the key mistake: you can't use Identity to make records
unique: they should already be unique without. Itf they are not then you
have a problem. This is a pet peeve of mine becuase I worked for while on a
project where the developers were doing. The result was that in the UI, you
cuold see lots of duplkicates and the user had no way of knowing wihc was
which. This was fine in development, but when the product goes out to
clients, they WILL get support calls complaining of errors in reports etc
becuase of this and the poor users will have to resort to nasty hacks like
"John Doe A" and "John Doe B"

>Such occurences do arise, and must be handled in the system, but often
> with such infrequency that to extend the model by adding further
attributes
> to the pk is quite unwarranted, and indeed intrusive into the business of
> the organisation.

I agree that expected frequency of duplicates is an use, but IMO the expectd
frequency should be signifiicantly less than 1 over the lifetime of the
product.

> Real-life example from a 35,000 subscription list: Twin daughters each
give
> birth to a daughter on the same day. Each insists on calling their child
by
> their mother's name. The babies were born in different towns. It is simply
> impractical to extend the pk to LN FN DOB POB, just to resolve one
conflict.

First name + surname + dob as PK is not enough in the UK where hundreds or
thousands of John Smiths are born annually. The town I live in has a pop of
aroun 250K. It has almost 200 J.Smith listed in the phone book. Say 1
phone/3 inhabitants (wild guess) would make it over 500 J/Smiths, and with
John being a popular name in these parts, there will be quite a few John
Smiths. BTW, Bith John and Smith are a lot more common in other parts of the
country. The likelihood of two sharing a birthday is way too great to rely
on FN+SN+DOB to be unique.

> Mutilating say surname to achieve differentiation is repugnant, and moves
> away from matching reality. Whatever approach is used to store details of
> the granddaughters, human intervention will be required to differentiate.

Preferably at developemtn time, e.g. by adding SSN or middle initial or
whatever; not by having the client call the help desk!

> > Think about trying to do replication on two databases that differ only
> > by an index, or by cache size or something that occasionally gives
> > them different execution plans for the same statements. Want to try
> > to maintain such a system?
>
> There is no problem: replication must preserve both data AND
relationships,
> and this can be achieved.

Not if you rely on Identity for uniqueness.

All that said, I do use Identity as a *surrogate* key for purely pragmatic
reasons. I insist that there should always be a unique constraint on the
table that does not involve the Identity column; that seems to be the only
way to stop some programmers from abusing it.

michael

Brendan Reynolds

unread,

Feb 5, 2002, 4:09:11 PM2/5/02

to

The situation that I encounter very frequently is that records can be
uniquely identified by a combination of fields, but that not all records can
be uniquely identified by any *one* combination of fields. I'll give a
simplified example, using Name, Address, and Phone Number - but please bear
in mind that this is a simplified example, I am aware that the combination
of these fields is not guaranteed to be unique (when my great uncle was
alive, he and my father shared the same name and address, and neither had a
phone! :-) Assume, if you will, that there are other fields involved as
well, we're just concentrating on these three for the sake of example.

Now assume I have the following records:

John Smith | 1 Some Street | <phone number is null>
John Smith | 2 Another Street | <phone number is null>

Here the combination of name and address serves to distinguish these two
records.

John Browne | <address is null> | 234 5678
John Browne | <address is null> | 345 6789

Here the combination of name and phone number serves to distinguish these
two records.

I can not use either of these two combinations as the primary key, as
neither combination alone serves to distinguish all records. And I can not
use the combination of all three fields as the primary key, because one or
move of the three fields can be Null, and a field that is part of a primary
key can not be Null.

It seems to me, then, that there can be situations where there is no
reliable natural key, and yet it is still possible to distinguish unique
records, either by sight or through code? That using a surrogate key *does
not* necessarily mean, as is so often argued, that we are not preventing
duplicates - merely that there is, unfortunately, and due to the reality of
the situation being modelled, no way to prevent them *at the level of the
database engine*?

The other side of the coin is that a natural key does not necessarily do any
more to prevent duplicates than a surrogate key. Using the same simplified
example, let's suppose for the sake of argument that all three fields, Name,
Address, and Phone, will always be known. Does this prevent the entry of
duplicates?

--
Brendan Reynolds
bren...@indigo.ie
If it's not plain text, without attachments,
I'm not reading it.
----------------------------------------
"michael" <monosodiumg@N_O_S_P_A_Mhotmail.com> wrote in message
news:z0V78.30348$oK1.275458@NewsReader...

Brendan Reynolds

unread,

Feb 5, 2002, 4:27:03 PM2/5/02

to

I'll repeat a real-life example I described recently - though not, if I
remember correctly, in this newsgroup. I developed an app last year to
record information that was delivered in pre-printed envelopes, each with a
unique number. Well, that was the theory.

At least one batch of envelopes was printed with duplicate numbers, at least
one batch was printed with no numbers, and in many cases the information
came in not in the pre-printed envelopes at all.

Boy was I glad I hadn't used that envelope number as the primary key! :-)

--
Brendan Reynolds
bren...@indigo.ie
If it's not plain text, without attachments,
I'm not reading it.
----------------------------------------

"David W. Fenton" <dXXXf...@bway.net> wrote in message
news:91AC9D3Bdfe...@news-server.nyc.rr.com...

Clive Bolton

unread,

Feb 5, 2002, 5:40:17 PM2/5/02

to

Well, thanks for the contribution - I have printed it out for thought. It's
our national holiday here, and we are off to a free openair concert on a
beautiful warm afternoon. Bye for now.

Clive

"michael" <monosodiumg@N_O_S_P_A_Mhotmail.com> wrote in message
news:z0V78.30348$oK1.275458@NewsReader...

michael

unread,

Feb 5, 2002, 6:52:46 PM2/5/02

to

> The situation that I encounter very frequently is that records can be
> uniquely identified by a combination of fields, but that not all records
can
> be uniquely identified by any *one* combination of fields.

In that case all records can be uniquely identified by the union of these
different sets of fields.

I have come across this too. It's a pain. However, while such a set of
column can't be used as PK in mssql, you can still declare a UNIQUE
constraint over them which effectively enforces their key status. For an
actual PK, you can then use a surrogate. When it comes to replication,
merging etc the uniqueness of each record is guaranteed and you can generate
your multi-column logical PK for join purposes in the knowledge that each
record is truly unique.

>no way to prevent them *at the level of the
> database engine*?

A UNIQUE constraint will do this.

> The other side of the coin is that a natural key does not necessarily do
any
> more to prevent duplicates

A key prevents duplicates by definition: it isn't a key if there are
duplicates.

> John Smith | 1 Some Street | 123 4567
> Jack Smith | 1 Some Street | 123 4567

Exactly what we seek to prevent. Two records like this is a disaster in the
making, even if there is a Identity key to distinguish them. The UI never
sees the Identity and so has no way of telling these apart.

These are not a problem in the current context. This should be prevented by
validation etc. The difference between "123 4567" and "1234 567" is a
difference between two strings. the telephone numbers is 1234567 (int) in
both cases.
The difference between "1 Some Street" and "1 Some St." is a validation or
canonicalization issue (IMO).

michael

"Brendan Reynolds" <bren...@indigo.ie> wrote in message
news:vRX78.33907$8s4.1...@news.indigo.ie...

David W. Fenton

unread,

Feb 5, 2002, 9:17:39 PM2/5/02

to

torstenREM...@swipnetTHIS.se (Torsten L) wrote in

<XvP78.16338$n4.26...@newsc.telia.net>:

The ADH explains how to do it. It's not all *that* difficult, but a
somewhat more than an afternoon's work.

Brendan Reynolds

unread,

Feb 6, 2002, 6:20:15 AM2/6/02

to

Michael,

First, this thread is being cross-posted to comp.databases.ms-access and
comp.databases.ms-sqlserver. I'm an Access developer without much
familiarity with SQL Server. The test I describe below uses a Jet database,
there may, for all I know, be differences in behaviour between Jet and SQL
Server here.

In Jet, a unique index on a combination of fields that allow Null is not
going to prevent duplicates. I created a Jet table with three text fields,
TestName, Address, Phone, created a unique index on the combination of the
three fields, and entered the following two records:

John Smith | Null | 123 4567
John Smith | Null | 123 4567

Second, if the purpose of the natural key is to prevent duplicates, then how
can you say that the fact that the natural key does not prevent duplicates
is 'not a problem in the current context'? Yes, of course we can prevent
duplicates by other means - just as we prevent duplicates by other means
when we use a surrogate key. But then what is the purpose of the natural
key?

--
Brendan Reynolds
bren...@indigo.ie
If it's not plain text, without attachments,
I'm not reading it.
----------------------------------------
"michael" <monosodiumg@N_O_S_P_A_Mhotmail.com> wrote in message

news:I5_78.17257$bP3.163173@NewsReader...

michael

unread,

Feb 6, 2002, 9:56:26 AM2/6/02

to

Brendan,

> can you say that the fact that the natural key does not prevent duplicates
> is 'not a problem in the current context'?

I didn't say that. I was referring to the example duplicate records you had
mentioned. They are duplicates only if you treat the phone number as a
string and allow "St" vs "Street". The solutions to these problems are IMO
validation or data scrubbing and not specific to keys and uniqueness.

>- just as we prevent duplicates by other means
> when we use a surrogate key.

A surrogate key doesn't prevent duplicate. Your records are duplicates
period. Arbitrarily adding a column with a meaningless unique value
certainly makes them unique, but at the expense of sweeping the problem
under the carpet. When someone says get me John Smith's details, whose
details do you supply (bear in mind the user doesn't have access to your
surrogate key)? My first ever DB job (late jurassic era ;-) was a tool
tracking DB in Access 1.x/2 and this was a major problem. The root problem
was that their serial number system was flawed. There was no way round that.
Adding an Autonumber would have meant users picking what they thought was a
particular tool and actually getting a different tool with the same name and
number. I explained to the CEO that there was a fundamental identification
problem that needed fixing and eventually it was. I scanned the tool list
for duplicates and the tool manager allocated new numbers for these. Note
that that has nothing to do with the DB, it was an organisational problem.

> In Jet, a unique index on a combination of fields that allow Null is not
> going to prevent duplicates.

...
I tried it in Access 2K and you are right. Freaky! Shocking in fact.

Quote from the SQL server documentation:
"No two rows in the table are allowed to have the same not null values for
the columns in a UNIQUE constraint."
I haven't access to the server at the moment so I can't check but it seems
to me the above statement explicitly prevents the example from working in
SQL Server. I will check it.

michael

"Brendan Reynolds" <bren...@indigo.ie> wrote in message

news:nj888.34046$8s4.1...@news.indigo.ie...

Brendan Reynolds

unread,

Feb 6, 2002, 2:57:04 PM2/6/02

to

Michael,

So relying on code to try to detect short forms of names and abbreviations
of parts of addresses is OK, but relying on code to look for existing
records where there is a match on one of several combinations of fields is
not OK? I just don't get the logic. In code I can check for any existing
record that matches on TestName and Address, or TestName and Phone, or
Address and Phone. Sure, it takes extra work, will impose some performance
hit, and won't work if the data is modified by some other app. But all of
those things apply every bit as much to detecting short forms and
abbreviations. Why is one acceptable and one not?

I never said that a surrogate key prevented duplicates. I said, and I quote:

"Yes, of course we can prevent duplicates by other means - just as we
prevent duplicates by other means when we use a surrogate key."

I can confirm that there *is* a difference in behaviour between Jet and SQL
Server. In SQL Server 7, a unique constraint on the three fields will not
allow these records:

John Smith | Null | 123 4567
John Smith | Null | 123 4567

It will, however, allow these records:

John Smith | Null | 123 4567

John Smith | 1 Some Street | 123 4567

This is probably preferable to the behaviour of Jet, but does it really
prevent duplicates? In this example, we really have no way of knowing if
these two records are duplicates or not. It seems to me that code will still
be required to alert the user to the existence of a *possible* duplicate. So
again the natural key has not solved our problem.

Note that I am not suggesting for a moment that a surrogate key would solve
it either. Of course it would not. But I don't see how the solution to the
problem is any different, regardless of whether we use a surrogate key or a
natural key. Either way, we still have to fall back on code and/or user
intervention to identify real-world duplicates.

Now if your argument is about the use of surrogate keys to 'sweep the
problem under the carpet' then we are in complete agreement on that point.
But that is not, in my opinion, an argument against the use of surrogate
keys. It is an argument against the *improper use* of surrogate keys.

--
Brendan Reynolds
bren...@indigo.ie
If it's not plain text, without attachments,
I'm not reading it.
----------------------------------------
"michael" <monosodiumg@N_O_S_P_A_Mhotmail.com> wrote in message

news:3lb88.31818$oK1.299895@NewsReader...

Bruce Loving

unread,

Feb 6, 2002, 2:57:44 PM2/6/02

to

ADH?

Clive Bolton

unread,

Feb 6, 2002, 4:33:00 PM2/6/02

to

Access X Developer's Handbook by Getz, Litwin and Gilbert, pub. Sybex.
1500 - 3000 pp.

Clive

"Bruce Loving" <bruce_...@uncg.edu> wrote in message
news:a3s1rn$7l2$1...@hypatia.uncg.edu...
> ADH?

michael

unread,

Feb 7, 2002, 2:39:31 PM2/7/02

to

Brendan,

> So relying on code to try to detect short forms of names and abbreviations
> of parts of addresses is OK, but relying on code to look for existing
> records where there is a match on one of several combinations of fields is
> not OK? I just don't get the logic.

I don't know what bit of my post you are referring to.

> In code I can check for any existing
> record that matches on TestName and Address, or TestName and Phone, or
> Address and Phone. Sure, it takes extra work, will impose some performance
> hit, and won't work if the data is modified by some other app. But all of
> those things apply every bit as much to detecting short forms and
> abbreviations. Why is one acceptable and one not?

With reference to these two records:
> John Smith | 1 Some St | 123 4567

> John Smith | 1 Some Street | 123 4567

In practice, I would try to prevent such records getting into the DB in the
first place. But I can tolerate them in the DB better than two identical
records. I can show them to the user and the user can say "hey, that's the
same guy" and call the manager or whatever and get it sorted. We agree it's
a mistake, but this is one that isn't always avoidable. From a modelling
point of view, these are in fact separate entities: you can tell them apart.

However the pair:

> John Smith | 1 Some Street | 123 4567
> John Smith | 1 Some Street | 123 4567

is different. These are indistinguishable and in *many* case will
erroneously be seen as one record because of GROUP BY clauses etc. Imagine
they are salesmen and you JOIN to the Sales table and GROUP BY salesman: you
could end up double-counting sales etc.
This kind of exact duplicate is totally and easily avoidable (via UNIQUE
constraint), without requiring any intervention on the user's part.

> This is probably preferable to the behaviour of Jet,

Not probably: unequivocally. It means UNIQUE only works on single columns
and on sets of columns where none of the columns allows NULL which makes it
pretty weak, not to say downright misleading. Thx for pointing out this
prolem!

>In this example, we really have no way of knowing if
> these two records are duplicates or not.

At least we know there are two records and we *can* tell them apart. This is
an error that will be spotted and can be fixed. Without a unique constraint,
we would be allowing exact duplicates and they are not so easy to spot or
fix and completely avoidable without user intervention.

>It seems to me that code will still
> be required to alert the user to the existence of a *possible* duplicate.

In exact duplicate can be prevented by UNIQUE (excpet in Access, but that's
deficiency specific to Access and not present as you verified in SQL Server,
nor I will wager, in any other serious commercial RDBMS).

>So again the natural key has not solved our problem.

While the natural key doesn't solve the "St" vs "Street" problem, it does at
least close the door on a whole class of completely obvious errors. The
Identity approach opens the door wide for exact duplicates. I don't like to
say surrogate because the problem isn't with surrogate keys in general but
with the specific practice of using Identity/Autonumber as surrogate key.

> Now if your argument is about the use of surrogate keys to 'sweep the
> problem under the carpet' then we are in complete agreement on that point.
> But that is not, in my opinion, an argument against the use of surrogate
> keys. It is an argument against the *improper use* of surrogate keys.

Agree completely. That's what I was trying to say too. So you will agree
that an Identity/Autonumber is only something you would use as PK if you
have already identified a natural key that could be used as PK and you want
to use the Identity/Autonumber as surrogate for it for performance reasons.

I do use Identity/Autonumber a lot, but not as a 'solution' to uniqueness
problems. Lots of times I see them used as a way to make the records unique,
which IMO completly misses the point.

michael

"Brendan Reynolds" <bren...@indigo.ie> wrote in message

news:TTf88.34174$8s4.1...@news.indigo.ie...

Brendan Reynolds

unread,

Feb 7, 2002, 3:52:45 PM2/7/02

to

Well, I wouldn't say that I agree completely - although perhaps I would if
I were using SQL Server. But if we're not in complete agreement, at least I
think we have managed to have an interesting and useful discussion without
degenerating to the level of 'religious war', which, judging from previous
threads on this subject, is no small accomplishment! :-) Thank you.

--CELKO--

unread,

Feb 7, 2002, 4:28:47 PM2/7/02

to

>> I'll repeat a real-life example I described recently - though not,
if I
remember correctly, in this newsgroup. I developed an app last year to
record information that was delivered in pre-printed envelopes, each
with a
unique number. Well, that was the theory.

At least one batch of envelopes was printed with duplicate numbers, at
least
one batch was printed with no numbers, and in many cases the
information
came in not in the pre-printed envelopes at all. <<

I love it! Why didn't anyone think they are not processing envelopes,
but questionaires? Better yet, did the reply forms fit into the
envelopes or not? I used to work in print shops 30 years ago and I
hated it when the bill and envelope did not match.

David W. Fenton

unread,

Feb 7, 2002, 6:42:25 PM2/7/02

to

monosodiumg@N_O_S_P_A_Mhotmail.com (michael) wrote in
<LAA88.28841$bP3.216530@NewsReader>:

>The
>Identity approach opens the door wide for exact duplicates. I
>don't like to say surrogate because the problem isn't with
>surrogate keys in general but with the specific practice of using
>Identity/Autonumber as surrogate key.

Can you explain what kind of surrogate key is preferable? One that
is derived from other fields?

Gad, you can't possibly mean that, as that would violate far more
basic normalization rules, and even if implemented with triggers,
could cause massive cascades of updates if one of the source fields
is changed.

The more meaningless the surrogate, the better.

I can't see how rolling your own random number surrogate key would
be any better than the available indentity/autonumber fields.

Surely I'm missing some major alternative here?

David W. Fenton

unread,

Feb 7, 2002, 6:43:38 PM2/7/02

to

monosodiumg@N_O_S_P_A_Mhotmail.com (michael) wrote in
<LAA88.28841$bP3.216530@NewsReader>:

>That's what I was trying to say too. So you will agree

>that an Identity/Autonumber is only something you would use as PK
>if you have already identified a natural key that could be used as
>PK and you want to use the Identity/Autonumber as surrogate for it
>for performance reasons.
>
>I do use Identity/Autonumber a lot, but not as a 'solution' to
>uniqueness problems. Lots of times I see them used as a way to
>make the records unique, which IMO completly misses the point.

Oh, good lord, no wonder we have a disagreement.

You have been ascribing stupidity to the users of the
identity/autonumber as surrogate key.

This is usually referred to as constructing a straw man.

michael

unread,

Feb 8, 2002, 5:09:15 AM2/8/02

to

David,

> You have been ascribing stupidity to the users of the
> identity/autonumber as surrogate key.

I merely said that I have observed developers doing this in real life many
times, and a few times in this news group. The term stupidity wasn't one I
used; I would have called it inexperience or blamed it on a lack of
knowledge or understanding, not on a lack of intellect.

michael

"David W. Fenton" <dXXXf...@bway.net> wrote in message

news:91AEBAB90df...@news-server.nyc.rr.com...

michael

unread,

Feb 8, 2002, 5:29:24 AM2/8/02

to

> >The
> >Identity approach opens the door wide for exact duplicates. I
> >don't like to say surrogate because the problem isn't with
> >surrogate keys in general but with the specific practice of using
> >Identity/Autonumber as surrogate key.

I was being careless in my language: I should have said primary key instead
of surrogate key. The problem is only when the Identity field is used a PK
in the absence of a natural PK, in which case the term surrogate is
innapropriate as there isn't a key for it to be a surrogate of.

> Can you explain what kind of surrogate key is preferable? One that
> is derived from other fields?

...
In practice, I see many collegues clustering on the PK. They also use
Identity as PK. The combination of both can produce a hotspot for insertion.
The problem goes away if one clusters on another index, or, if one is dead
set on clustering on the PK (as most developers I know are), then using a
hash over the other fields would avoid the hotspot problem.

> The more meaningless the surrogate, the better.

...

> I can't see how rolling your own random number surrogate key would
> be any better than the available indentity/autonumber fields.

...
A hash is more meaningless than a number taken from a monotonic increasing
sequence, so by
your criterion it would be better.

> I can't see how rolling your own random number surrogate key would
> be any better than the available indentity/autonumber fields.

On reflection I have admit the problem is not really Identity per se, but
combinations of practices involving Identity. If I have fixed the blame on
Identity, it is because I see so many of my colleagues misusing it for
practices such as ordering, record numbering, clustering and fixing
uniqueness problems in problematic models.

michael

"David W. Fenton" <dXXXf...@bway.net> wrote in message

news:91AEBD8D8df...@news-server.nyc.rr.com...

serge

unread,

Feb 8, 2002, 8:41:26 AM2/8/02

to

Thank you all very much for having contributed lenghtly on this subject.

I hope not to offend anyone by asking if it is possible to summarize the
most important
rules in short points for me - an unexperienced in this subject -, and
others who might be following
this thread carefully as I have.

Since there are some disagreements, is it possible to come to some type of
consensus or not?

Or should I basically choose what I think is Best for me at this time by
searching in this thread the
pros and cons of using an Identity (AutoNumber) column as a surrogate PK?

Again, Thank you all very much. You sure have provided many many points that
will take me a
while before I understand most of them.

Brendan Reynolds

unread,

Feb 8, 2002, 8:53:17 AM2/8/02

to

Well, in this case, in effect the envelope *was* the form. The contact
information was on the envelope itself. (These envelopes were not sent via
public post, they were filled in at petrol/gasoline stations and forwarded
by courier. It was for an oil company who were replacing paper gift tokens
with swipe cards.). But of course, even if there had been numbered forms
*in* the envelope, the same kind of printing errors could have happened with
the forms.

I'm told the data exported from that app eventually ends up on a mainframe
in Houston, Texas. The phrase 'Houston, we have a problem' became something
of a byword around here! :-)

--
Brendan Reynolds
bren...@indigo.ie
If it's not plain text, without attachments,
I'm not reading it.
----------------------------------------

"--CELKO--" <71062...@compuserve.com> wrote in message
news:c0d87ec0.02020...@posting.google.com...

David W. Fenton

unread,

Feb 9, 2002, 5:33:13 PM2/9/02

to

monosodiumg@N_O_S_P_A_Mhotmail.com (michael) wrote in
<eDN88.1675$HC4.1803@NewsReader>: [quoting me, without proper
attributions:]

>> The more meaningless the surrogate, the better.
>...
>> I can't see how rolling your own random number surrogate key
>> would be any better than the available indentity/autonumber
>> fields.
>...
>A hash is more meaningless than a number taken from a monotonic
>increasing sequence, so by
>your criterion it would be better.

And a random Autonumber does not meet those criteria?

>> I can't see how rolling your own random number surrogate key
>> would be any better than the available indentity/autonumber
>> fields.
>
>On reflection I have admit the problem is not really Identity per
>se, but combinations of practices involving Identity. If I have
>fixed the blame on Identity, it is because I see so many of my
>colleagues misusing it for practices such as ordering, record
>numbering, clustering and fixing uniqueness problems in
>problematic models.

In short, the whole thing is a straw man argument, blaming
indentity/autonumber surrogate keys for badly designed schemas. The
surrogate key is not the problem at all, and never has been.

I just wish people would argue what they *really* mean instead of
confusing the issue in situations like this.

David W. Fenton

unread,

Feb 9, 2002, 5:35:42 PM2/9/02

to

monosodiumg@N_O_S_P_A_Mhotmail.com (michael) wrote in
<lkN88.1634$HC4.1180@NewsReader>:

>David,
>
>> You have been ascribing stupidity to the users of the
>> identity/autonumber as surrogate key.
>I merely said that I have observed developers doing this in real
>life many times, and a few times in this news group.

*Which* newsgroup? You do realize the thread is cross-posted to two
newsgroups, right?

> . . . The term stupidity wasn't one I
>used; . . .

No, but the practices you describe exhibit stupidity, and they are
orthoganal to the surrogate key problem space.

> . . . I would have called it inexperience or blamed it on a lack

> of
>knowledge or understanding, not on a lack of intellect.

Well, I'll be blunt: anyone who uses surrogate keys for the purpose
of creating uniqueness is being stupid.

But the reason why this thread has been going on so long is because
you've been talking past your correspondents on the subject, since
no one has argued that this is a proper use of surrogate keys, even
while it's been (apparently) your chief objection.

michael

unread,

Feb 10, 2002, 9:13:26 AM2/10/02

to

> And a random Autonumber does not meet those criteria?

What is a random Autonumber? Certainly not Autonumber in Access nor Identity
in mssql.

> In short, the whole thing is a straw man argument, blaming
> indentity/autonumber surrogate keys for badly designed schemas. The
> surrogate key is not the problem at all, and never has been.

Correct. However identity/autonumber are a bit like pointer arithmetic in C:
useful but unneccesary and usually abused.

michael
"David W. Fenton" <dXXXf...@bway.net> wrote in message

news:91B0BF307df...@news-server.nyc.rr.com...

Douglas J. Steele

unread,

Feb 10, 2002, 10:40:44 AM2/10/02

to

"michael" <monosodiumg@N_O_S_P_A_Mhotmail.com> wrote in message

news:s6v98.5800$HC4.59505@NewsReader...

> > And a random Autonumber does not meet those criteria?
> What is a random Autonumber? Certainly not Autonumber in Access nor
Identity
> in mssql.

Look at the field properties next time you create an Autonumber field in
Access. The second one shown on the General tab is "New Values". Assuming
you selected Long Integer (the default) as the Field Size, New Values can be
set to either Increment (the default) or Random.

--
Doug Steele, Microsoft Access MVP
http://I.Am/DougSteele

Albert Marshall

unread,

Feb 10, 2002, 3:50:01 PM2/10/02

to

In article <s6v98.5800$HC4.59505@NewsReader>, michael <monosodiumg@N_O_S
_P_A_Mhotmail.com> writes

>> And a random Autonumber does not meet those criteria?
>What is a random Autonumber? Certainly not Autonumber in Access nor Identity
>in mssql.
>

I can't answer for MS SQL Server, but look at the New Values property of
an Access autonumber field. Your ignorance of something this basic does
you no favours in this argument.

>> In short, the whole thing is a straw man argument, blaming
>> indentity/autonumber surrogate keys for badly designed schemas. The
>> surrogate key is not the problem at all, and never has been.
>Correct. However identity/autonumber are a bit like pointer arithmetic in C:
>useful but unneccesary and usually abused.
>

Extremely useful and sometimes abused. I've had tables with six or seven
field natural keys and using these in relationships and joins gets
ridiculously unwieldy, not to mention causing unconscionable data
duplication across tables.
--
Albert Marshall
Database Developer
Marshall Le Botmel Ltd
01242 222017

David W. Fenton

unread,

Feb 10, 2002, 11:29:50 PM2/10/02

to

monosodiumg@N_O_S_P_A_Mhotmail.com (michael) wrote in
<s6v98.5800$HC4.59505@NewsReader>:

>> And a random Autonumber does not meet those criteria?
>What is a random Autonumber? Certainly not Autonumber in Access
>nor Identity in mssql.

As others have pointed out, you need to look more closely at
Access/Jet. Random AutoNumbers are required in replication, so
that's probably why I noticed them. The result would be pretty even
distribution of the values, insofar as VBA's random number
generator can manage.

In any event, most people simply don't understand what the Jet
AutoNumber really is. It's really only a special kind of default
value placed on a long integer field. That's why you can append
values to it that are out of sequence (i.e., not Max() + 1), as
long as they wouldn't violate whatever index you've got on the
field (in the case of a PK, of course, unique, non-null).

I don't know about SQL Server identity fields, but it seems to me
that the Jet AutoNumber with a random value answers all your
objections.

>> In short, the whole thing is a straw man argument, blaming
>> indentity/autonumber surrogate keys for badly designed schemas.
>> The surrogate key is not the problem at all, and never has been.
>Correct. However identity/autonumber are a bit like pointer
>arithmetic in C: useful but unneccesary and usually abused.

I see. Blame the tools for the errors of the users of the tools.

Clive Bolton

unread,

Feb 11, 2002, 5:05:41 AM2/11/02

to

Serge - you certainly started something!

I have been working on a small article to bring together various aspects of
this question. I hope to clean it up over the next 24 hours and email it to
selected people for evaluation and correction. If it is seen as useful, it
will be posted here under a new thread, possible called "Using Surrogate
Keys".

I suspect that most responders here do in fact use a mixture of NK and SK,
as determined by a bit of art.

It can all be made to work, with varying strengths and weaknesses.

Clive

"serge" <sergea@RemoveThis_ehmail.com> wrote in message
news:dAQ88.27475$Xw.33...@news20.bellglobal.com...

serge

unread,

Feb 11, 2002, 10:48:48 AM2/11/02

to

Thank you very much Clive.

If you do manage to put it together, that would definitely be great.

Good Luck

Tony Toews

unread,

Feb 11, 2002, 6:12:38 PM2/11/02

to

"Clive Bolton" <al...@remove.wave.co.nz> wrote:

>Serge - you certainly started something!

Yeah, we have this debate every six months or year or so in some newsgroup or other.

>I suspect that most responders here do in fact use a mixture of NK and SK,
>as determined by a bit of art.

<Rereading previous postings in this thread> Ah, a surrogate key is the autonumber
key. I always use autonumbers everywhere. I will create non duplicate indexes
where appropriate but never on names.

I also have no database theory or training of any kind so I'm not familiar with the
terminology.

Tony
--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can
read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at
http://www.granite.ab.ca/accsmstr.htm

Doug Miller

unread,

Feb 11, 2002, 6:58:37 PM2/11/02

to

"serge" <sergea@RemoveThis_ehmail.com> wrote in message news:<dAQ88.27475$Xw.33...@news20.bellglobal.com>...

> I hope not to offend anyone by asking if it is possible to summarize the

> most important
> rules in short points for me - an unexperienced in this subject -, and
> others who might be following
> this thread carefully as I have.

From the peanut gallery, I'll throw in a few comments.

1 - Every record in a table should be unique. What makes them unique
is sometimes defined by the business side of things, which means
things can get confusing and fuzzy and changing. (waddya mean john
changed his name to sally?)

2 - A perfectionist would normalize to the point that even city names
were in their own special little table. This gets confusing and
complicated in a big hurry. Perfectionists are a PITA IMO in the real
world.

3 - Sometimes a solution that works and makes sense for the most folks
is the easiest to use. The data guy's responsibility is to draw the
line and take the long picture, and demand certain things, while still
listening. A very fine line to walk.

Myself, I usually try to get the business folks to help define
"uniqueness", use artificial surrogate keys everywhere (cause business
folks change their minds lots, and it is easier to change unique key
constraints then table/programming relationships), and when that still
ends up with data that is suspect, create exception reports that I let
the business folks run themselves and let them deal with it. You can
get really creative with exception reports. "Find me all the records
where the first 3 chars of last name with spaces stripped, zip, and
last 4 digits of phone number are the same". If all that is true, it
very well might be a duplicate, but I sure ain't gonna zap them as a
DBA without the business folks approving it. And I've even been known
to put a date field in a table so that it only shows up periodically
on an exception report - a "user" can "clear" two dup entries and they
won't show up on the exception report next time.

> Or should I basically choose what I think is Best for me at this time by
> searching in this thread the
> pros and cons of using an Identity (AutoNumber) column as a surrogate PK?
>
> Again, Thank you all very much. You sure have provided many many points that
> will take me a
> while before I understand most of them.

Here's what I do. Build your tables and relationships in the logical
phase as if you were using a real unique primary key combining several
fields. But when it comes time to actually define them physically, use
surrogate keys, because stuff changes.

Just what works for me.

-Doug Miller

Clive Bolton

unread,

Feb 11, 2002, 9:46:13 PM2/11/02

to

There's another group?

Clive

"Tony Toews" <tto...@telusplanet.net> wrote in message

Gary Rockley

unread,

Feb 12, 2002, 3:06:54 AM2/12/02

to

Dammit Tony,

You weren't supposed to tell him!

gary

serge

unread,

Feb 12, 2002, 9:09:48 AM2/12/02

to

> From the peanut gallery, I'll throw in a few comments.

Your comments are appreciated.

> Here's what I do. Build your tables and relationships in the logical
> phase as if you were using a real unique primary key combining several
> fields. But when it comes time to actually define them physically, use
> surrogate keys, because stuff changes.

Yes, that's what I have started doing, I think it is a very good approach.

Thanks for your comments.