Surrogate or Natural Keys?

Retf

unread,

Apr 28, 2006, 1:27:10 PM4/28/06

to

Hi All,

I need know: what is best choice:
Surrogate or Natural Keys?

I need understand, what is best?

Thanks

Aaron Bertrand [SQL Server MVP]

unread,

Apr 28, 2006, 1:32:52 PM4/28/06

to

What on earth does "best" mean? This is like asking what is the best car or
what is the best toothpaste. What are your criteria?

http://www.aspfaq.com/2504

"Retf" <re...@terra.com.br> wrote in message
news:eh2UChua...@TK2MSFTNGP02.phx.gbl...

David Portas

unread,

Apr 28, 2006, 3:06:01 PM4/28/06

to

"Retf" <re...@terra.com.br> wrote in message
news:eh2UChua...@TK2MSFTNGP02.phx.gbl...

That's a meaningless question because they achieve different things. I
suggest you study a good book rather than ask questions like this in a
newsgroup.

--
David Portas, SQL Server MVP

Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.

SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--

--CELKO--

unread,

Apr 30, 2006, 10:32:05 AM4/30/06

to

You might want to learn the definitons first. I have a taxonomy of key
types. We have a lot of problems with terminology on this one, so let
me get that out of the way.

There is no such thing as a "universal, one-size-fits-all" key. Just
as no two sets of entities are the same, the attributes that make them
unique have to be found in the reality of the data. Here is my
classification of types of keys:

1) A natural key is a subset of attributes which occur in a table and
act as a unique identifier. They are seen by the user. You can go to
the external reality and verify them. you would also like to have some
validation rule. Example: UPC codes on consumer goods (read the
package barcode) and validate them with a check digit or a
manufacturer's website, geographical co-ordinates (get a GPS).

2) An artificial key is an extra attribute added to the table which is
seen by the user. It does not exist in the external reality, but can
be verified for syntax or check digits inside itself. It is up to the
DBA to maintain a trusted source for them. Example: the open codes in
the UPC scheme which a user can assign to his own stuff. The check
digits still work, but you have to verify them inside your own
enterprise.

If you have to construct a key yourself, it takes time to deisgn them,
to invent a validation rule, set up audit trails, etc.

3) An "exposed physical locator" is not based on attributes in the data
model and is exposed to user. There is no reasonable way to predict it
or verify it, since it usually comes from the physical state of the
hardware at the time of data insertion. The system obtains a value
thru some physical process in the storage hardware totally unrelated to
the logical data model. Example: IDENTITY columns, other proprietary,
non-relaitonal auto-numbering devices.

Technically, these are not really keys at all, since they are
attributes of the PHYSICAL storage and are not even part of the LOGICAL
data model. But they are handy for lazy, non-RDBMS programmers who
don't want to research or think! This is the worst way to program in
SQL.

4) A surrogate key is system generated to replace the actual key behind
the covers where the user never sees it. It is based on attributes in
the table. Example: Teradata hashing algorithms, indexes, pointer
chains, ADABASE numbers, etc.

The fact that you can never see it or use it for DELETE and UPDATE or
create it for INSERT is vital. When users can get to them, they will
screw up the data integrity by getting the real keys and these physical
locators out of synch. The system must maintain them.

** Notice that people get "exposed physical locator" and surrogate
mixed up; they are totally different concepts. **

An appeal to authority, with a quote from Dr. Codd: "..Database users
may cause the system to generate or delete a surrogate, but they have
no control over its value, nor is its value ever displayed to them
..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning. ACM
Transactions on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates. Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

The steps for finding a key are

1) Look for an industry standard and the trusted external source that
maintains and verifies it. I count this as a natural key, but you could
argue that it is artificial.

2) Look for a natural key in the attributes. Example: (longitude,
latitude) makes a good key for a geographical location. A GPS can be
used to verify it.

3) If you must design a new identifier, plan it carefully -- especially
if people will see and use it. You have to be able to verify it in
application programs, so you should have a regular expression, other
syntax rules, check digits. You have to be able to be verify in the
reality of the model or with a trusted source that you maintain.

Validation means the format is good -- "This could one of our invoice
numbers because it is 7 digits long, passes a Bull code check digit and
begins with { '01', '02', '07', '99'}"

Verification means that it references a real entity -- "This is a real
invoice because I can look it up in Accounts Payable and trace its
approval back to Cindy Lu Who on 2005-02-12."

This is hard work. I have a few chapters in SQL PROGRAMMING STYLE on
scales, measurements and how to design encoding schemes.

Tony Rogerson

unread,

Apr 30, 2006, 10:47:42 AM4/30/06

to

> ** Notice that people get "exposed physical locator" and surrogate
> mixed up; they are totally different concepts. **

Yes, you certainly do get confused over this one.

> An appeal to authority, with a quote from Dr. Codd: "..Database users
> may cause the system to generate or delete a surrogate, but they have
> no control over its value, nor is its value ever displayed to them
> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning. ACM
> Transactions on Database Systems, 4(4). pp. 397-434.

You've just described how people use IDENTITY as a surrogate correctly. The
column with the IDENTITY property can not be changed, users have no control
over its value.

If used as a surrogate it will never be displayed to the user, simply used
internally by BOTH the database and application for performance and
scalability.

>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means never
> used in queries, DRI or anything else that a user does.

That is YOUR opinion and quite frankly its one hell of a jump from what Codd
wrote.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1146407525.4...@v46g2000cwv.googlegroups.com...

David Portas

unread,

Apr 30, 2006, 11:20:48 AM4/30/06

to

--CELKO-- wrote:
> An appeal to authority, with a quote from Dr. Codd: "..Database users
> may cause the system to generate or delete a surrogate, but they have
> no control over its value, nor is its value ever displayed to them
> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning. ACM
> Transactions on Database Systems, 4(4). pp. 397-434.
>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means never
> used in queries, DRI or anything else that a user does.
>

However, in that same paper, Codd goes on to say that joins and other
relational operations ARE permitted on the surrogate keys and that "it
is now the surrogate that is the primary key and provides truly
permanent identification of each entity" so Codd's surrogates are most
definitely part of the logical model and are definitely NOT what you
say is "like an index".

It's difficult to see how to reconcile Codd's surrogate key idea with
the rest of his model. The consequences of admitting "special data"
that isn't represented as real values in relations seem too serious.
Perhaps that's why I don't know of anyone other than you who has
anything to say about Codd-style surrogates. The definition that I
think most people are more familiar with is the one given by Date (in
An Introduction to Database Systems) who says "Surrogate keys are keys
in the usual relational sense" and "Tuple IDs are usually concealed
from the user, while surrogates must not be (because of The Information
Principle)".

--CELKO--

unread,

Apr 30, 2006, 12:00:29 PM4/30/06

to

>> Codd goes on to say that joins and other relational operations ARE permitted on the surrogate keys and that "it is now the surrogate that is the primary key and provides truly permanent identification of each entity" <<

Surrogate means "something which acts in place of another", so of
course the relational operations can be performed on them. Would you
hire a sterile surrogate mother :)? However, the system gets to pick
the method. Consider a covering index. I do not have to read the base
table to do an EXISTS() or other operations. IThe index is my
surrogate. A stronger version in Sybase is the PK-FK "key joins" and
the Informix "pre-join" indexes. These produce pointer chains,. so I
can read one base table and get the rest of the data from the fact that
a relationship exists without ever seeing it.

>> It's difficult to see how to reconcile Codd's surrogate key idea with the rest of his model. The consequences of admitting "special data" that isn't represented as real values in relations seem too serious. <<

That is why they have to be like indexes, hashing, pointer chains or
other access methods. Hidden from the user and handled **completely**
by the system.

>> most people are more familiar with is the one given by Date <<

You have to watch out for Date versus Codd (both RM1 and RM2). They
disagree on NULLs and quite a few other things. Date's surrogate key
was what other people would call an artificial key at one point, then
he made some changes and I do not know what his position is now. I am
still trying to clean up "scalar" versus "atomic" in my own notes!
.

David Portas

unread,

Apr 30, 2006, 2:04:21 PM4/30/06

to

--CELKO-- wrote:
> >> Codd goes on to say that joins and other relational operations ARE permitted on the surrogate keys and that "it is now the surrogate that is the primary key and provides truly permanent identification of each entity" <<
>
> Surrogate means "something which acts in place of another", so of
> course the relational operations can be performed on them. Would you
> hire a sterile surrogate mother :)? However, the system gets to pick
> the method. Consider a covering index. I do not have to read the base
> table to do an EXISTS() or other operations. IThe index is my
> surrogate. A stronger version in Sybase is the PK-FK "key joins" and
> the Informix "pre-join" indexes. These produce pointer chains,. so I
> can read one base table and get the rest of the data from the fact that
> a relationship exists without ever seeing it.
>
> >> It's difficult to see how to reconcile Codd's surrogate key idea with the rest of his model. The consequences of admitting "special data" that isn't represented as real values in relations seem too serious. <<
>
> That is why they have to be like indexes, hashing, pointer chains or
> other access methods. Hidden from the user and handled **completely**
> by the system.
>

That is not Codd's position as I understand it. Where does he say that
another key is required? He says "Users [...] are no longer compelled
to invent a user-controlled key if they do not wish to". This idea is a
non-starter in my opinion. For example the projection consisting of the
surrogate key alone has no exposed key at all. How can such a relation
be completely hidden from the user? It cannot be, nor can it be a
proper relation as far as I can see. I have no problem with the idea of
indexes of course but in my view Codd's surrogates can only weaken RM.
They are hidden pointer values that do not belong in the logical model
where he put them.

> >> most people are more familiar with is the one given by Date <<
>
> You have to watch out for Date versus Codd (both RM1 and RM2). They
> disagree on NULLs and quite a few other things. Date's surrogate key
> was what other people would call an artificial key at one point, then
> he made some changes and I do not know what his position is now.

They do not agree because the field has moved on a great deal in 30
years. One of the things that got left behind was Codd's idea of
surrogates. Can you name any reference since Codd that refers to
surrogate keys that meet his definition? I can't.

--CELKO--

unread,

Apr 30, 2006, 5:05:30 PM4/30/06

to

>> That is not Codd's position as I understand it. Where does he say that another key is required? He says "Users [...] are no longer compelled to invent a user-controlled key if they do not wish to". <<

Not an exposed key, but a mechanism --

CREATE SURROGATE Foobar AS
CALL SomethingMagic (c1, c2, c3);

Then there is a thing in the SQL engine that will do something like

IF join(c1,c2.3) on (t1) and (t2)
THEN use-surrogates
ELSE do-regular-joins
END IF;

RDBMS was under attack from traditional files and the Bachman model for
performance. His purpose was to gvie foundations for keeping the
abstract model and the physical implementation separated, but to show
they need not conflict.

As an contrast, you really have a hard time with a physical
implementation of an infinite, unbounded set :)

Tony Rogerson

unread,

May 1, 2006, 4:25:22 AM5/1/06

to

> Not an exposed key, but a mechanism --
>
> CREATE SURROGATE Foobar AS
> CALL SomethingMagic (c1, c2, c3);
>
> Then there is a thing in the SQL engine that will do something like
>
> IF join(c1,c2.3) on (t1) and (t2)
> THEN use-surrogates
> ELSE do-regular-joins
> END IF;
>
> RDBMS was under attack from traditional files and the Bachman model for
> performance. His purpose was to gvie foundations for keeping the
> abstract model and the physical implementation separated, but to show
> they need not conflict.
>
> As an contrast, you really have a hard time with a physical
> implementation of an infinite, unbounded set :)
>

Do the three mainstream rdbms implement this (ms sql server, oracle, db2) ?

SQL Server doesn't, so what do you propose? Not use surrogate keys or role
your own?

People in the MS SQL Server world have been rolling their own for years
using the IDENTITY property in addition to any natural key, but correctly
using the surrogate as the foreign key rather than the natural.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"--CELKO--" <jcel...@earthlink.net> wrote in message

news:1146431130....@i39g2000cwa.googlegroups.com...

David Portas

unread,

May 1, 2006, 5:58:46 AM5/1/06

to

Tony Rogerson wrote:
>
> Do the three mainstream rdbms implement this (ms sql server, oracle, db2) ?
>
> SQL Server doesn't, so what do you propose? Not use surrogate keys or role
> your own?
>
> People in the MS SQL Server world have been rolling their own for years
> using the IDENTITY property in addition to any natural key, but correctly
> using the surrogate as the foreign key rather than the natural.
>

You are right of course. It is tragic that SQL Server can't do better.
It is absurd that we are forced to change the logical model in order to
implement a physical optimization but we have to because SQL Server is
extremely weak on Physical Data Independence. It isn't the only
database to suffer in that way but from my acquaintance with other
databases it seems like Microsoft has it worse than others in many
respects.

Joe is dodging my questions however. Codd and Date both agree that
surrogates are logical keys, not physical index structures.

Tony Rogerson

unread,

May 1, 2006, 10:43:10 AM5/1/06

to

Absolutely, I seem to remember a couple of discussions on this functionality
there are a number of pitfalls which to be honest I can't remember, it would
certainly make the query syntax look strange, what would the joins be? Would
you use the natural keys and the engine override underneath - bit of a can
of worms. The problem also runs into the application as well, how does the
application deal with a natural composite key - it can yes, but efficiently,
mmmmm not so sure......

> Joe is dodging my questions however. Codd and Date both agree that
> surrogates are logical keys, not physical index structures.

Its because he's wrong and he's too dam ignorant and arogant to conceede the
fact.

No doubt we will soon see some weird quote back to some distant scollar -
common when celko finds himself back into a corner unwilling to conceede
defeat.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"David Portas" <REMOVE_BEFORE_R...@acm.org> wrote in message
news:1146477526....@y43g2000cwc.googlegroups.com...