Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Identity key?

2 views
Skip to first unread message

DODO

unread,
Jan 5, 2001, 4:55:24 PM1/5/01
to
Hello,


I have a question regarding an observation made by Joe Celko in an article.
Quoting the article:
====================================================================
For example, this was posted on a newsgroup:

CREATE TABLE US_States
(state_id IDENTITY NOT NULL,
state_abbreviation CHAR(2) NOT NULL,
state_name VARCHAR(15) NOT NULL);

The idiot who did this assumed that the IDENTITY column was a key. Those
other two columns better be unique or this thing is not going to work. The
correct declaration should have been something like this:

CREATE TABLE US_States
(state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE);
======================================================================

Woulldn't be better to have the table like this?

CREATE TABLE US_States
(state_id IDENTITY,
state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE);

TIA,

DODO


Joe Celko

unread,
Jan 7, 2001, 4:49:12 PM1/7/01
to

>> I have a question regarding an observation made by Joe Celko in an
article... The correct declaration should have been something like this:

CREATE TABLE US_States
(state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE);
======================================================================

Wouldn't be better to have the table like this?

CREATE TABLE US_States
(state_id IDENTITY,
state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE); <<

No. What the hell does state_id mean in terms of a LOGICAL model? It
would just be the random order that the rows were PHYSICALLY inserted
into this table. Both state_code and state_name are UNIQUE, but you
need to pick one to be the PRIMARY KEY. Since state_code is the
shortest one and the one mostly likely to be used, I made it the
PRIMARY KEY.

PRIMARY KEY and NOT NULL UNIQUE are not quite equivalent in SQL. Both
define a column or set of columns as a key, but the PRIMARY KEY is the
default of a REFERENCES clause. And some products use special indexing
methods for the PRIMARY KEY.

--CELKO--
Joe Celko, SQL Guru & DBA at Trilogy
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc)
which can be cut and pasted into Query Analyzer is appreciated.


Sent via Deja.com
http://www.deja.com/

DODO

unread,
Jan 8, 2001, 10:32:12 AM1/8/01
to
Mr Celko,

Thanks for answering.
I was viewing state_id as a surrogate key. I've seen many schemas that use
them. And these(now I realised) were the real questions:

1. Do they(the surrogates) belong to the logical model?(From your answer I
infer that NO)
2. A standard which has a surrogate key for every table, declared Primary
Key and the "real" key declared as Alternate key is bad?(again from your
answer I infer that YES).
TIA,
DODO

"Joe Celko" <71062...@compuserve.com> wrote in message
news:93ao8n$5tc$1...@nnrp1.deja.com...

Jan Hidders

unread,
Jan 8, 2001, 11:36:55 AM1/8/01
to
Joe Celko wrote:
>
> >> I have a question regarding an observation made by Joe Celko in an
> article... The correct declaration should have been something like this:
>
> CREATE TABLE US_States
> (state_code CHAR(2) NOT NULL PRIMARY KEY,
> state_name VARCHAR(15) NOT NULL UNIQUE);
> ======================================================================
>
> Wouldn't be better to have the table like this?
>
> CREATE TABLE US_States
> (state_id IDENTITY,
> state_code CHAR(2) NOT NULL PRIMARY KEY,
> state_name VARCHAR(15) NOT NULL UNIQUE); <<
>
> No. What the hell does state_id mean in terms of a LOGICAL model?

I tend to disagree. The relational model is not really suited for
logical modelling anyway; that's why entity-relationship modelling was
invented. But if you are using it as such then the term PRIMARY KEY is
nonsense, because that describes an implementation aspect. So if your
position is that surrogate keys have no place in a logical model, then
primary keys should also be avoided.

--
Jan Hidders

Joe Celko

unread,
Jan 8, 2001, 2:14:04 PM1/8/01
to

>> I tend to disagree. The relational model is not really suited for
logical modelling anyway; that's why entity-relationship modelling was
invented. <<

The relational model deals with how the logical model is implemented,
then the SQL product has to worry about the physical storage and
location. But I want to see the logical model in my SQL. I want data
element names that follow the NCITS L8 committee rules. Etc.

>> But if you are using it as such then the term PRIMARY KEY is
nonsense, because that describes an implementation aspect. So if your
position is that surrogate keys have no place in a logical model, then
primary keys should also be avoided. <<

Unh? Dr. Codd would be VERY surprised to find out that his papers had
no mention of primary keys in them <g>. I agree with the basic idea of
the relational model that a table MUST have a key, or it is not a
table. Keys are a logical concept and always have been. I will grant
that the SQL PRIMARY KEY clause is a language requirement with certain
properites in addition to those of the primary key concept invented by
Dr. Codd.

A surrogate key is created by the database system for its own use and
it is never exposed to the user. Artifical keys are created by the
users and they are exposed. These two terms get confused. I don't
care about surrogate keys, any more than I would care about pointer
chains, indexes and all the other various access methods SQL products
use -- I never see them as a programmer. I would care about them if I
were the PHYSICAL database administrator and had to tune them, buy more
disk drives, etc.

What you want is a natural key, if one exists and is easy to record.
Why? Because you can verify the data in the database against the
reality it is supposed to model.

If you cannot get a natural key, then you want to create an artifical
key. It should have some properties that let you verify it against
reality it is supposed to model. I like check digits because I can
verify the syntax of the encoding in my application program. I like
standardized encoding systems, like UPC in the United States or EAN in
Europe. I can verify those codes with an outside agency.

But the most useless approach is a key that relates the row to a
particular physical location in a particular release of one particular
database product -- it has nothing to do with the data itself and
cannot be verified when the data changes location.

Whenever possible, I would like to avoid superkeys -- a key which
contains a proper subset that is itself a key. This leads to
normalization problems.

DODO

unread,
Jan 8, 2001, 2:45:37 PM1/8/01
to

> A surrogate key is created by the database system for its own use and
> it is never exposed to the user. Artifical keys are created by the
> users and they are exposed. These two terms get confused. I don't

Then in the example


CREATE TABLE US_States
(state_id IDENTITY,
state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE);

state_id would be an artificial key? And you think that is useless? What
happens if the state_code is changed by the standard comitee?


> What you want is a natural key, if one exists and is easy to record.

What is a natural key?


TIA,
DODO


David Cressey

unread,
Jan 8, 2001, 2:53:45 PM1/8/01
to
Joe,

You are using the term "surrogate key" differently than some other authors
do,
unless I'm misreading either you or them. I'm thinking of Ralph Kimball in
particular,
but not limited to him.

The way they use "surrogate key", the key would be of no interest to the
programmers
of the source databases, but they would be of interest to programmers
constructing
queries against, say, a data mart.

The reason is that surrogate keys that serve as primary keys to the
dimension tables
are referenced via foreign keys in the fact tables. Whoever programs the
join conditions
needs to be aware of the surrogate keys.

Joe Celko <71062...@compuserve.com> wrote in message

<93d3hh$2kr$1...@nnrp1.deja.com>...

Joe Celko

unread,
Jan 8, 2001, 5:13:21 PM1/8/01
to

>> Then in the example
CREATE TABLE US_States
(state_id IDENTITY,
state_code CHAR(2) NOT NULL PRIMARY KEY,
state_name VARCHAR(15) NOT NULL UNIQUE);

state_id would be an artificial key? <<

The short answer is "yes, and it is redundant"; the real answer is that
the IDENTITY column makes it a totally non-relational nightmare. Real
SQL programmers use real keys and do not try to imitate 1950's magnetic
tape or punch card systems in a relational model. In the early days of
programming languages, we exposed a lot of the physical and hardware
implementation to the programmer. For example, COBOL and FORTRAN
depend on physically contiguous storage of data, which made a ROWID
(i.e. physical location reference) possible.

Later, we designed languages to independent of any physical and
hardware implementation. Thus, an INTEGER datatype did not have to be
eight binary bits in a two complement format. It just had to behave
like an integer in the program and the programmer did not have to worry
about how the hardware did its work -- or even know what the hardware
was.

SQL and other modern programming languages carry this idea further and
try to completely separate logical and physical implementations. The
idea is that SQL is based on sets (which have no ordering) and a
standard that defines its logical behavior. The behavior has nothing
to do with whether a product uses B-Tree indexes, bit vectors or
hashing; two complement or base ten arithmetic; whether the host
program calling the SQL is C or Cobol; etc.

The IDENTITY column in SQL Server is an attempt to return to those
thrilling days of yeasteryear, and the conceptual junk left over from
the days when people did not know much about relational databases.
What we knew was sequential file systems -- punch cards and magnetic
tapes. Even the disk file systems mimicked these systems, adding only
simple indexes.

Sequence was a basic way of looking at data. People thought in terms
of them at a primitive level. A sequence of bits make a byte, a
sequence of bytes make a field, a sequence of fields make a record and
sequence of records make a file. Very low level, very close to the
machinery.

Then along comes the relational model. It is based on sets; a set is a
completed whole, without any ordering to it. No sequences! Very
abstract! Programmers did not know how to cope, so the vendors exposed
the physical implementation and called these things "features" and
locked their products to particular architectures. I can go into
details on that problem, but let me say that when we went to the bar
after ANSI X3H2 meetings, the vendors griped about what they had to do
to these extensions to preserve them in the next platform, how they
could not scale up to data warehouse size databases, etc.

The IDENTITY column is one of these mistakes.

1) It is not part of the SQL-92 Standard and it is highly proprietary
to the Sybase family. It is not portable -- not quite the same thing
as proprietary, since you can often translate one SQL dialect into
another with a simple replacement (i.e. the % operator becomes the MOD
() function). So your code will not move over to a new database.

2) IDENTITY looks like a datatype, but it is not. Create a table with
one column in it and make it an IDENTITY column. Insert a number into
the table and see what happens. Try to set it to NULL. If you cannot
insert, update and delete all the columns, then this is not a table!

3) IDENTITY looks like a constraint, but it is not. Try to create a
table with two IDENTITY columns and it fails. If you cannot add it to
a column, then it is not a constraint. It is possible to write a a set
of constraints that prohibit data from ever being put in the table
(their predicate is always FALSE). It is possible to write a a set of
constraints that allow anything in the table (their predicate is always
TRUE). But no constraint can prohibit the creation of the table
itself -- that is a meta-constraint.

4) It is not relational. Consider this statement on a table, Foo,
which has an identity column. Assume the query returns more than one
row.

INSERT INTO Foo (x)
SELECT a FROM Bar;

You will get a result like this:

IDENTITY X
============
1 'a'
2 'b'
3 'c'

but if the query changed an index or was put on the physical disk data
page differently, you might have gotten:

IDENTITY X
============
1 'b'
2 'c'
3 'a'

Explain why one result is the logically correct choice for an
identifier and all other choices are not, without any reference to the
physical implementation. You cannot.

Instead of treating the query as a set, you are doing 1950's sequential
processing using the underlying sequential file system the Sybase
family started with.

5) If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they
model. The IDENTITY column should be a redundant key. The reason
IDENTITY columns are popular as keys is that they are easy to declare.
This is also the same reason that people build non-normalized databases
and put pennies in fuse boxes -- easy is not right.

6) It is a bitch to do calculations on IDENTITY column values. Well,
it was hard to do direct math on the sequential position of a record in
a 1950's punch card system and that it what the IDENTITY is mimicking.

7) There is no check digit in an IDENTITY columns value, so you have no
way of verifying it if you use it as a key.

8) If you use IDENTITY as a key, the values tend to cluster on physical
data pages because they are sequential. The result is that if the most
recent rows are the most likely to be accessed, there will be locking
contention for control of those physical data pages. What you really
wanted in a key is some spread of the rows over physical storage to
avoid having every user trying to get to the same page at the same
time.

9) The actual implementation of the IDENTITY column has been
problematic since Version 7.0. You can look up threads in the news
groups to get assorted tales of woe.

There are other ways of getting a unique identifier for a table. The
most portable method for getting a new identifier number which is not
in the set is something like this:

INSERT INTO Foobar (keycol, a, b, c...)
VALUES(COALESCE((SELECT MAX(keycol) FROM Foobar) +1, 0),
aa, bb, cc, ...);

The scalar subquery expression returns the current high value for the
key column, and then increments it. If there is no maximum value (i.e.
this is the first row to be inserted), then it returns zero.

Using this basic idea, you can replace the increment with a different
constant or a random number generator. You can also add code to create
a check digit.

Another method is to hash the columns that make up a compound key so
that you have single short column that can be reconstructed if you need
to verify it.

>> And you think that is useless? What happens if the state_code is

changed by the standard comittee? <<

Then you change the table to reflect the change in the external reality
being modeled. Otherwise, your mail will not be delivered. Keys can
change and the set of keys can grow, like any other attribute; the most
important property of a key is that they uniquely identify an entity in
the database. If the committee goes crazy and decides that we will
have two states with the same two-letter postal code or two states
named 'Texas', then we would be in trouble.

Some nice properties of a key are that they do not change over time,
they can be verified against the real world, they are short and easy to
understand, etc. But that is not what makes them a key.

>> What is a natural key? <<

An attribute in the entity that makes it unique. We don't invent it;
Mother Nature put it there. DNA and fingerprints are pretty good for
people. Longitude and latitude are pretty good for locations. Etc.

You have to do is rate the strength and verifiability of a key. If it
changes, it should be becuase the reality changed, not because you
moved the data inside the physical database from one table to another
in a different order.

Jan Hidders

unread,
Jan 8, 2001, 5:37:30 PM1/8/01
to
Joe Celko wrote:
>
> >> But if you are using it as such then the term PRIMARY KEY is
> nonsense, because that describes an implementation aspect. So if your
> position is that surrogate keys have no place in a logical model, then
> primary keys should also be avoided. <<
>
> Unh? Dr. Codd would be VERY surprised to find out that his papers had
> no mention of primary keys in them <g>. I agree with the basic idea of
> the relational model that a table MUST have a key, or it is not a
> table.

That is not the issue. I didn't say that *candidate* keys have no place
in the logical model. But appointing one of the candidate keys as *the*
key is in my opinion an implementation decision. Of course, it makes it
easier to decide which attributes to include as a foreign key if you
want to make a reference. But the fact that you have to make this
decision shows precisely why the relational model is sometimes not a
very good way to describe the logical data model. Even Codd more or
less acknowledges this as you can see in his article on the extended
relational model RM/T in 1979.

> What you want is a natural key, if one exists and is easy to record.
> Why? Because you can verify the data in the database against the
> reality it is supposed to model.

Oh, I completely agree with that. In fact, I often like to quote
Willard van Orman Quine (my favorite philosopher who sadly has died
recently) on that: "no entity without identity". That is a deep fact
and tells us that if you don't have any decent candidate key, you
probably do not really understand what the entity is that you
are talking about.

--
Jan Hidders

DODO

unread,
Jan 9, 2001, 10:44:33 AM1/9/01
to
Joe,

Thank you for the long answer.
I totally agree with the fact that IDENTITY keys are bad(at least the
implementation up to System 11 was miserable, once we had to reconstruct the
database because of the gaps that Sybase generated in the Identity columns-
the server went down, the server went up and astronomic numbers "happened"
in the columns). I can add that if you use them in a distributed database,
you loose their primary meaning of life, unicity.
I understand that you don't really disagree with using surrogate keys.
Personally, I'll stick with the idea of a surrogate key as primary key and
the natural(whichever seems more natural from candidate keys) as alternate.
This schema has 2 advantages: it makes wonders for joins and OO mappings and
provides an Enterprise wide identifier for every tuple in your databases.
And they are final.

DODO

PS: Sometime the natural key might not be in your best interest: I remember
two guys who put a time stamp as primary key and started inserting data
using the server clock as key generator. They were baffeled by the server
who complained that some data integrity violation occured. But they were
brave and solved the problem: they put wait(100ms) in the code.

"Joe Celko" <71062...@compuserve.com> wrote in message

news:93de1q$cjk$1...@nnrp1.deja.com...

Larry Coon

unread,
Jan 9, 2001, 1:45:05 PM1/9/01
to
Joe Celko wrote:

[snipped for brevity except:]

> Real
> SQL programmers use real keys and do not try to imitate 1950's magnetic
> tape or punch card systems in a relational model. In the early days of
> programming languages, we exposed a lot of the physical and hardware
> implementation to the programmer. For example, COBOL and FORTRAN
> depend on physically contiguous storage of data, which made a ROWID
> (i.e. physical location reference) possible.

[. . .]


> SQL and other modern programming languages carry this idea further and
> try to completely separate logical and physical implementations. The
> idea is that SQL is based on sets (which have no ordering)

[. . .]


> Sequence was a basic way of looking at data.

[. . .]


> Programmers did not know how to cope, so the vendors exposed
> the physical implementation and called these things "features" and
> locked their products to particular architectures.

[Disclaimer: I agree with much of what you say about identity columns.
However, I disagree with some of what you wrote, and my arguing of
those points should not be taken to mean I disagree with your entire
point.]

You are implying that identity columns are evil because there
is a connection between identity values and physical storage
and that ordinality is an essential property. I use identity
columns (where I feel they are justified) and never rely on
either of these things. The only thing I care about is that
they are UNIQUE. Yes, identities are numeric, and numbers are
ordinal, and that gives you sequence. But that only makes
identities bad if you rely on their ordinality.


> 2) IDENTITY looks like a datatype, but it is not. Create a table with
> one column in it and make it an IDENTITY column. Insert a number into
> the table and see what happens. Try to set it to NULL. If you cannot
> insert, update and delete all the columns, then this is not a table!

Can't set ANY primary key to null, so that's beside the point.
(And in Sybase you CAN insert & modify identity values by turning
on the "identity insert" option.)

> 3) IDENTITY looks like a constraint, but it is not. Try to create a
> table with two IDENTITY columns and it fails.

PRIMARY KEY looks like a constraint, but it is not. Try to create a
table with two PRIMARY KEY columns and it fails.


> 4) It is not relational. Consider this statement on a table, Foo,
> which has an identity column. Assume the query returns more than one
> row.
>
> INSERT INTO Foo (x)
> SELECT a FROM Bar;
>
> You will get a result like this:
>
> IDENTITY X
> ============
> 1 'a'
> 2 'b'
> 3 'c'
>
> but if the query changed an index or was put on the physical disk data
> page differently, you might have gotten:
>
> IDENTITY X
> ============
> 1 'b'
> 2 'c'
> 3 'a'
>
> Explain why one result is the logically correct choice for an
> identifier and all other choices are not, without any reference to the
> physical implementation. You cannot.

But it's the same problem even with your "scalar subquery"
alternative, isn't it? Either method depends on the order
in which the rows from Bar happen to be returned. But since
Bar is an unordered set, they could have come out in any
order, so necessarily the values assigned (by either an
identity or your alternative) are essentially random.

If you care about the difference between your two result sets
above, aren't you just relying on identity values for more
than they're supposed to be providing? Again, it doesn't
matter WHAT the values are, only that they're UNIQUE.


> 5) If you have designed your tables correctly, they will have a
> meaningful primary key derived from the nature of the entity they
> model. The IDENTITY column should be a redundant key.

I'll spare you the gory details, but there are entities in
our problem domain for which this is not true. At least,
it takes the entire tuple with a lot of columns to achieve
identity. And in my case, it's not a matter of understanding
the problem domain well enough, as another poster suggested.


> 6) It is a bitch to do calculations on IDENTITY column values. Well,
> it was hard to do direct math on the sequential position of a record in
> a 1950's punch card system and that it what the IDENTITY is mimicking.

Why would you want to do calculations on identity column values?
That's like saying statistics are evil simply because they can be
misused.


> 7) There is no check digit in an IDENTITY columns value, so you have no
> way of verifying it if you use it as a key.

What would you want to verify? If its only purpose is to provide
a unique value, that can be verified simply enough.


> 8) If you use IDENTITY as a key, the values tend to cluster on physical
> data pages because they are sequential. The result is that if the most
> recent rows are the most likely to be accessed, there will be locking
> contention for control of those physical data pages. What you really
> wanted in a key is some spread of the rows over physical storage to
> avoid having every user trying to get to the same page at the same
> time.

Agreed. This is a big problem with identity values.

> 9) The actual implementation of the IDENTITY column has been
> problematic since Version 7.0. You can look up threads in the news
> groups to get assorted tales of woe.

Doesn't make the concept wrong.


[Re: mutability of data in primary key columns]


> Then you change the table to reflect the change in the external reality
> being modeled. Otherwise, your mail will not be delivered. Keys can
> change and the set of keys can grow, like any other attribute; the most
> important property of a key is that they uniquely identify an entity in
> the database. If the committee goes crazy and decides that we will
> have two states with the same two-letter postal code or two states
> named 'Texas', then we would be in trouble.

This was one of the factors that went into our decision
of when to use identity columns. By separating business
entities from database mechanisms and relying on identities
as primary keys, it's much easier (I know, that doesn't
make it right) to change these values.

All that being said, we only went with identities where we
felt it was justified. Either a candidate key didn't exist
without taking the entire tuple; it was really inconvenient
(composite of three varchar(255)'s, for example); or the
values changed a lot.

And just so I don't disagree with everything you said, some
additional identity disadvantages I didn't see you mention:

-- Assuming you throw an index on the "natural" key anyway
(because that's how you normally look up information),
an identity column means there is one more index the
system has to incur the overhead of maintaining.

-- With natural primary keys there are also natural foreign
keys. If information you need from the parent table is
is part of the parent's primary key, it'll also be in
the child table's foreign key. This means a join of the
child table and parent table can be avoided. With
surrogate keys, you always have to do a join with the
parent table to get the information you need.


Larry Coon
University of California
la...@assist.org
and lmc...@home.com

Lucky Leavell [RIS]

unread,
Jan 9, 2001, 3:08:36 PM1/9/01
to Joe Celko
Hello Joe, its been a while ....

I recently began a new position in an Informix shop. Informix has a
"serial" data type which "stores a sequential integer assigned
automatically by the database..." which would seem to be sort of like the
"IDENTITY" data type under discussion here.? Almost every table here has a
serial column used to uniquely identify rows instead of a "real" primary
key which, in the states example, would be the state_code, right?

(They also make extensive use of the ROWID which is an entirely different
topic ...)

On Mon, 8 Jan 2001, Joe Celko wrote:

>
> >> Then in the example
> CREATE TABLE US_States
> (state_id IDENTITY,
> state_code CHAR(2) NOT NULL PRIMARY KEY,
> state_name VARCHAR(15) NOT NULL UNIQUE);
>
> state_id would be an artificial key? <<
>

Thank you,
Lucky

Lucky Leavell Phone: (800) 481-2393 or (812) 366-4066
UniXpress - Your Source for SCO FAX: (888) 231-9640 or (812) 366-3618
1560 Zoar Church Road NE Email: lu...@UniXpress.com
Corydon, IN 47112-7374 WWW Home Page: http://www.UniXpress.com

Jan Hidders

unread,
Jan 10, 2001, 5:37:58 AM1/10/01
to
Larry Coon wrote:
>
> I'll spare you the gory details, but there are entities in
> our problem domain for which this is not true. At least,
> it takes the entire tuple with a lot of columns to achieve
> identity. And in my case, it's not a matter of understanding
> the problem domain well enough, as another poster suggested.

Would that be me? All I said was that you have to have at least one
natural (candidate) key. It is not neccessarily a problem if that key
consists of the entire tuple.

--
Jan Hidders

Vaughan Powell

unread,
Jan 10, 2001, 6:01:29 AM1/10/01
to
In article <3A5B5C...@assist.org>,
la...@assist.org wrote:

Ah - at last the voice of pragmatism combined with good supporting
arguments.

The is one of Joe's quoted disadvantages of identity columns that can
be overcome:

>> 8) If you use IDENTITY as a key, the values tend to cluster on
physical
>> data pages because they are sequential. The result is that if the
most
>> recent rows are the most likely to be accessed, there will be locking
>> contention for control of those physical data pages. What you really
>> wanted in a key is some spread of the rows over physical storage to
>> avoid having every user trying to get to the same page at the same
>> time.

>Agreed. This is a big problem with identity values.

In SQL Server, specifically, this problem occurs only if the identity
column is created as a clustered index. It can be worthwhile creating
the natural key (or some other search column) as the clustered index
and the identity key as a non-clustered index - this prevents the above
problem on INSERT but can slow down retrieval where joins are made via
the primary key. Where large number of INSERTS are required it is
worth doing this, but where the volume of INSERTS is low then it is
not. It just one of those trade-offs we have to make when designing.

--
Vaughan Powell MCDBA, MCSD, MCSE
Data Architecture and Applications Design Manager
BuildOnline

Larry Coon

unread,
Jan 10, 2001, 1:13:01 PM1/10/01
to
Vaughan Powell wrote:

> In SQL Server, specifically, this problem occurs only if the identity
> column is created as a clustered index.

Don't know if it's different in SQL Server vs. Sybase, but
in Sybase, without a clustered index, the table acts as a
heap and all inserts go to the last data page by default,
causing a hot spot.

> It can be worthwhile creating
> the natural key (or some other search column) as the clustered index
> and the identity key as a non-clustered index - this prevents the above
> problem on INSERT but can slow down retrieval where joins are made via
> the primary key.

With that strategy, while the data rows are dispersed in the
table, a single index page probably contains the entries for
the surrogate keys currently being inserted, so the hot spot
simply moves to the last leaf-level page of the non-clustered
index.

Larry Coon

unread,
Jan 10, 2001, 1:16:47 PM1/10/01
to
Jan Hidders wrote:

> Would that be me? All I said was that you have to have at least one
> natural (candidate) key. It is not neccessarily a problem if that key
> consists of the entire tuple.

Not necessarily a usable solution, either. This
gets out of theory and into the specifics of the
Sybase implementation, but in Sybase you can't
index a column of the text datatype (varchar longer
than 255). Since a primary key has to have an
index, a text column therefore can't participate
in a primary key. We have cases where the entire
tuple is needed to provide identity, and the tuple
includes attributes which become text columns in
Sybase. In these cases, we had no choice but to
use surrogate keys.


Larry Coon
University of California
la...@assist.org
and lmc...@home.com

The NBA Salary Cap FAQ:
http://www.members.home.net/lmcoon/salarycap.htm

Joe Celko

unread,
Jan 10, 2001, 6:58:39 PM1/10/01
to

>> PS: Sometime the natural key might not be in your best interest: I
remember two guys who put a time stamp as primary key and started
inserting data using the server clock as key generator. They were
baffeled by the server who complained that some data integrity
violation occured. But they were brave and solved the problem: they put
wait(100ms) in the code. <<

My first guess was that they set their clock to Daylight Saving Time
and shot themselves in the foot...

0 new messages