Design: multiple columns for primary key

Lisa Pearlson

unread,

Dec 2, 2005, 6:37:22 AM12/2/05

to

Hi,

A very basic question:
Imagine I have a table with multiple fields, some strings, some integers
etc.
None of these fields are unique on their own, but all fields together is
unique per record and thus can be used as the primary key.

However, referring to a record based on all these columns is quite
cumbersome (and not efficient, or does indexing take care of this nearly
entirely compared to a single identity field?), so how is this situation
best solved in practice.

Do I use an ugly IDENTITY column just for 'convenience' or continue to use
all fields, resulting in huge queries that lose legibility, or is there
another way?

Lisa

Stijn Verrept

unread,

Dec 2, 2005, 6:39:55 AM12/2/05

to

Lisa Pearlson wrote:

> Do I use an ugly IDENTITY column just for 'convenience' or continue
> to use all fields, resulting in huge queries that lose legibility, or
> is there another way?

I would use "an ugle IDENTITY column" without doubt! Maybe others have
different opinions. It'll make your life a less easier. Otherwise if
you want to refer to that row in a foreign key you need to include all
the columns of the key which isn't really doable.

--

Kind regards,

Stijn Verrept.

Erland Sommarskog

unread,

Dec 2, 2005, 7:01:52 AM12/2/05

to

It doesn't have to be IDENTITY, you can roll your own as well. But judging
from the shallow description you give, it appears that this could be a
solution. But you should add UNIQUE constraint on the other columns as well
to ensure their uniqueness.

I like to point out that adding a surrogate key does not always make things
simpler. I had a table with a four-column key, and then I needed to add a
sub-table with two more keys. Since a six-column key sounded too much, I
added a surrogate key to the main table. Years later I had reason to
write code to maintain these tables. Turned out that the surrogate key
made this a whole lot more complex. So the next time I had revise those
tables, I removed the surrogate key. (I was also able to remove one the
columns in the four-column key, and one of the keys in the sub-table.)

--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

Brian Selzer

unread,

Dec 2, 2005, 7:17:15 AM12/2/05

to

Using the terms "field" and "record" will bring down the wrath of Celko!

I use IDENTITY for a number of reasons, and none of them are for
convenience:

(1) An IDENTITY value is immutable.
(2) Cascading updates are not required to maintain integrity.
(3) IDENTITY reduces redundancy.
(4) Joins generally perform better on IDENTITY values.
(5) IDENTITY reduces lock contention.

There are other reasons, and I could expound further on the merits of each
of the above.

It is not unusual in a logical database design to have tables with compound
primary keys; however, when it comes time to implement the design,
surrogates--including IDENTITY--make a lot more sense.

One more thing: if you do use an IDENTITY PRIMARY KEY, be sure to create a
UNIQUE constraint or index on the combination of columns that are together
unique per row. Alternate keys should be enforced by the database with a
UNIQUE constraint.

"Lisa Pearlson" <n...@spam.plz> wrote in message
news:%23hOXsSz...@TK2MSFTNGP11.phx.gbl...

Lisa Pearlson

unread,

Dec 2, 2005, 9:50:21 AM12/2/05

to

I understand 3, 4 and take 5 for granted (some internal DBMS matter I
presume), but could you elaborate a bit on 1 and 2?
1) immutable means you can't do UPDATE MyTable SET identcol=123 WHERE
identcol=456 ?
2) What does cascading have to do with it? (as I understand cascading, it's
like triggers where change in one record triggers changes in other
tables/records?)

"Brian Selzer" <br...@selzer-software.com> wrote in message
news:OIHDWpz9...@TK2MSFTNGP10.phx.gbl...

Erland Sommarskog

unread,

Dec 3, 2005, 8:21:25 AM12/3/05

to

Lisa Pearlson (n...@spam.plz) writes:
> I understand 3, 4 and take 5 for granted (some internal DBMS matter I
> presume), but could you elaborate a bit on 1 and 2?
> 1) immutable means you can't do UPDATE MyTable SET identcol=123 WHERE
> identcol=456 ?
> 2) What does cascading have to do with it? (as I understand cascading,
> it's like triggers where change in one record triggers changes in other
> tables/records?)

Actually, I think of the reasons that Brian listed, only 5 is really
applicable to IDENTITY columns, although 1 has a touch to it: if you
have an IDENTITY property on the key, you know that the value cannot
be updated, not even by mistake.

Points 2-4 applies to surrogate keys in general, no matter if they have
IDENTITY or not.

I guess what Brian means with cascading is that if you use entirely
natural keys these can change. Say that you get the idea to use the
stock symbol as the key for financial instruments. Then the company changes
the name, and gets a new symbol. With the symbol as key, you have to update
all tables where the symbol appears. With a surrogate key, there is no
need to.

Brian Selzer

unread,

Dec 3, 2005, 11:40:40 AM12/3/05

to

Immutable primary keys yield many benefits.

A mutable primary key will make your code less stable. It's possible in a
concurrent environment to read a row with a particular primary key value and
while you're working on it for another transaction or transactions to make
changes so that when you initiate your update, it appears that the row no
longer exists, or even worse, that the row exists even though it's really
another row. Here's a simple example: Assume that you're tracking a part
as it moves from machine to machine on an assembly line. Assume also that
only one part can be manipulated by one machine at one time. The mutable
primary key is this example is {PartNo, Location}. Now, you read the row
with key {'45G', 'PRESS01'} prior to some manipulation. While you're
working on the row, the part is moved from 'PRESS01' to 'FORMER01', so it
now appears to you that the row no longer exists. If another part '45G' is
moved onto 'PRESS01' prior to your update, then it will appear to you that
the row still exists even though it's a different part, and you may
erroneously update the row for the wrong part. To counter this, you must
either lock and hold the row when you read it (not a very attractive
prospect because it will severely reduce concurrency and will preclude the
use of disconnected datasets, message queues, etc.), or write a ton of code
on the client end to detect the change--which may not always be possible or
practical. Note that this problem increases in complexity when there are
related tables, because it's possible for a row to look the same on the
primary key table, but to actually refer to a different row with a different
set of related rows. With an IDENTITY primary key, as each part is placed
in production a new row is added and a new IDENTITY value is generated.
Because that value cannot change, when you go to perform your update, you
can determine not only that the part has moved, but exactly where it is now.
In addition, the problem with related tables cannot occur, because the
related rows refer to a value that cannot change.

Another problem lies with UPDATE triggers that are used for auditing or to
implement transition constraints. SQL Server update triggers have two
pseudotables, deleted and inserted, which contain the old and new values for
each row that was updated. When an update affects more than one row, there
is no supported mechanism to determine which row in the inserted pseudotable
corresponds to each row in the deleted pseudotable. (Oracle has a FOR EACH
ROW trigger, which I've been begging Microsoft to implement.) IDENTITY
solves this problem because since the key cannot change, you can join the
deleted and inseted pseudotables on the IDENTITY column and determine
exactly what happened to each row.

Cascading updates cause more problems than they're worth.

The most common form of avoidable deadlock is caused by multiple
transactions obtaining and holding locks on rows in more than one table in a
different order. To combat this, you must make sure that you obtain locks
on tables in the same order in every procedure, function, trigger, and
batch. Cascading updates throw a wrench into this. There is no way to
determine with any degree of certainty the order in which related rows will
be locked when there is a cascading heirarchy present. This makes it much
more difficult--if not impossible--to determine a locking order that will
eliminate avoidable deadlocks.

Another problem with cascading updates is that whenever a change is made,
the rowversion (timestamp) on each affected row is updated. Do you really
want to indicate that a Sales Order has been changed when only the
salesperson's employee number has been changed? It makes sense to indicate
a change when the Sales Order is assigned to another salesperson, but in
this case the change is cosmetic, not material, and in my opinion should not
occur.

This brings up another problem: triggers on every affected table throughout
the cascading heirarchy also fire. It's much more difficult to determine
whether they fired for simply a cosmetic change, or if there is a material
change that should be validated against business rules.

All of these problems occur because a database that uses natural primary
keys is riddled with redundancy. For this reason among others, I advocate
the use of surrogate keys and in particular, IDENTITY at the physical level.

"Lisa Pearlson" <n...@spam.plz> wrote in message

news:%23nPAi%2309FH...@TK2MSFTNGP10.phx.gbl...

--CELKO--

unread,

Dec 3, 2005, 9:06:36 PM12/3/05

to

>> Let's get back to the basics of an RDBMS. Rows are not records; fields are not columns; tables are not files.

If (a,b,c) is a key in the data model, then you have to make it a key
in the schema. Well, you want things screwed.

Ignoring that BY DEFINITION, the proprietary IDENTITY is not a data
type, not a relational and not verifiable, it is redundant in the face
of a real key.

How do you guarantee that you have the IDENTITY and the real key in
synch? Most "ID-iots" do not bother with a real key (they mimic a
sequential file and pointer chains instead) and get redundant rows when
someone posts the same data multiple times.

Answer: it is impossible and therefore data integrity is impossible. I
am just starting to do some SOX consulting work; I will flunk your
database for this. There was some discussion of this at CA a few weeks
ago.

A good RDBMS will handle the access for you, so that you do not have to
drop down to that level.

Erland Sommarskog

unread,

Dec 4, 2005, 6:47:09 AM12/4/05

to

--CELKO-- (jcel...@earthlink.net) writes:
> Ignoring that BY DEFINITION, the proprietary IDENTITY is not a data
> type, not a relational and not verifiable, it is redundant in the face
> of a real key.

New readers should note that this is Joe Celko's private definition, and
thus nothing to bother about.

I could say that by definition Joe Celko is always wrong, but that would
not be a very strong argument. A much stronger argument is that far too many
of his posts consists of inaccurate standard rants, and where the main
rationale for the post is to insult the person who asked.

> How do you guarantee that you have the IDENTITY and the real key in
> synch?

In Lisa's case, it appears to be simple: she should have a UNIQUE
cosntraints on the real key. However, there are plentyful of cases where
there is no real key, or where any "real key" is far beyond being
practically usable.

Tony Rogerson

unread,

Dec 4, 2005, 6:57:21 AM12/4/05

to

So, DB2, Oracle, Sybase and Microsoft SQL Server are not 'good RDBMS'? Seems
the market disagrees.

The whole point about using a surrogate key with the IDENTITY property is so
that you gain better performance, reduce complexity for backups, security
etc... as well as being able to change your natural key without having to
build a horrendous transaction of multiple update statements that will lock
up your tables while its being done - usually people cluster on the primary
key (natural key).

People only get redundant rows if they forget to add constraints to their
table, the surrogate key definition would be,..

create table individual (
id int not null identity constraint sk_individual unique
clustered,

last_name varchar(50) not null,
first_name varchar(50) not null,
dob int not null,
constraint pk_individual primary key nonclustered( last_name,
first_name, dob )
)

Mind you, we both know that the natural key on individual isn't the above,
in fact their isn't an easy one - consider a user group like my own, do you
want me to ask people for the NI (social security) number ? I wouldn't get
many members, I'd probably not even get my registration under the Data
Protection act here in the UK passed either.

Well fud - go out and get some bloody experience on real systems please!

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1133661996.6...@o13g2000cwo.googlegroups.com...

Brian Selzer

unread,

Dec 4, 2005, 10:20:51 AM12/4/05

to

According to C. J. Date in An Introduction to Database Systems, Seventh
Edition:

Surrogate keys are keys in the usual relational sense but have
the following specific properties:

- They always involve exactly one attribute.

- Their values serve solely as surrogates (hence the name)
for the entities they stand for. In other words, such values serve
merely to represent the fact that the corresponding entities exists--
they carry no additional information or meaning whatsoever.

- When a new entity is inserted into the database, it is given a
surrogate key value that has never been used before and will
never be used again, even if the entity in question is subsequently
deleted.

Ideally, surrogate key values would be system-generated, but
whether they are system- or user-generated has nothing to do
with the basic ideas of surrogate keys as such.

I believe that columns with the IDENTITY property meet this criteria.
Although the actual integer value may be the same in more than one table,
every column in which they are used should also participate in a declaritave
referential integrity constraint, which serves to make the value unique
throughout the database because a DRI constraint includes the table to which
it belongs or refers in its definition.

Even E. F. Codd embraced the concept of surrogates in a paper he wrote in
1979: Extending the Database Relational Model to Capture More Meaning,
which, by the way, you've misinterpreted in some of your rants.

"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1133661996.6...@o13g2000cwo.googlegroups.com...