>> Would any one care to explore the pros & cons of using Identity
numbers versus the uniqueidentifier in SQL2000? <<
Welcome to a religious war <g>! I stand on the side of relational
purity, good data modeling and all that is Holy. My opponents, who I
am sure will drawn here shortly, are in league with the Forces of
Darkness <g>.
Real SQL programmers use real keys and do not try to imitate 1950's
magnetic tape or punch card systems in a relational model. In the
early days of programming languages, we exposed a lot of the physical
and hardware implementation to the programmer. For example, COBOL and
FORTRAN depend on physically contiguous storage of data, which made a
ROWID (i.e. physical location reference) possible.
Later, we designed languages to independent of any physical and
hardware implementation. Thus, an INTEGER datatype did not have to be
eight binary bits in a two complement format. It just had to behave
like an integer in the program and the programmer did not have to worry
about how the hardware did its work -- or even know what the hardware
was.
SQL and other modern programming languages carry this idea further and
try to completely separate logical and physical implementations. The
idea is that SQL is based on sets (which have no ordering) and a
standard that defines its logical behavior. The behavior has nothing
to do with whether a product uses B-Tree indexes, bit vectors or
hashing; two complement or base ten arithmetic; whether the host
program calling the SQL is C or Cobol; etc.
The IDENTITY column in SQL Server is an attempt to return to those
thrilling days of yeasteryear, and the the conceptual junk left over
from the days when people did not know much about relational
databases. What we knew was sequential file systems -- punch cards and
magnetic tapes. Even the disk file systems mimicked these systems,
adding only simple indexes.
Sequence was a basic way of looking at data. People thought in terms
of them at a primitive level. A sequence of bits make a byte, a
sequence of bytes make a field, a sequence of fields make a record and
sequence of records make a file. Very low level, very close to the
machinery.
Then along comes the relational model. It is based on sets; a set is a
completed whole, without any ordering to it. No sequences! Very
abstract! Programmers did not know how to cope, so the vendors exposed
the physical implementation and called these things "features" and
locked their products to particular architectures. I can go into
details on that problem, but let me say that when we went to the bar
after ANSI X3H2 meetings, the vendors griped about what they had to do
to these extensions to preserve them in the next platform, how they
could not scale up to data warehouse size databases, etc.
The IDENTITY column is one of these mistakes.
1) It is not part of the SQL-92 Standard and it is highly proprietary
to the Sybase family. It is not portable -- not quite the same thing
as proprietary, since you can often translate one SQL dialect into
another with a simple replacement (i.e. the % operator becomes the MOD
() function). So your code will not move over to a new database.
2) IDENTITY looks like a datatype, but it is not. Create a table with
one column in it and make it an IDENTITY column. Insert a number into
the table and see what happens. Try to set it to NULL. If you cannot
insert, update and delete all the columns, then this is not a table!
3) IDENTITY looks like a constraint, but it is not. Try to create a
table with two IDENTITY columns and it fails. If you cannot add it to
a column, then it is not a constraint. It is possible to write a a set
of constraints that prohibit data from ever being put in the table
(their predicate is always FALSE). It is possible to write a a set of
constraints that allow anything in the table (their predicate is always
TRUE). But no constraint can prohibit the creation of the table
itself -- that is a meta-constraint.
4) It is not relational. Consider this statement on a table, Foo,
which has an identity column. Assume the query returns more than one
row.
INSERT INTO Foo (x)
SELECT a FROM Bar;
You will get a result like this:
IDENTITY X
============
1 'a'
2 'b'
3 'c'
but if the query changed an index or was put on the physical disk data
page differently, you might have gotten:
IDENTITY X
============
1 'b'
2 'c'
3 'a'
Explain why one result is the logically correct choice for an
identifier and all other choices are not, without any reference to the
physical implementation. You cannot.
Instead of treating the query as a set, you are doing 1950's sequential
processing using the underlying sequential file system the Sybase
family started with.
5) If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they
model. The IDENTITY column should be a redundant key. The reason
IDENTITY columns are popular as keys is that they are easy to declare.
This is also the same reason that people build non-normalized databases
and put pennies in fuse boxes -- easy is not right.
6) It is a bitch to do calculations on IDENTITY column values. Well,
it was hard to do direct math on the sequential position of a record in
a 1950's punch card system and that it what the IDENTITY is mimicking.
7) There is no check digit in an IDENTITY columns value, so you have no
way of verifying it if you use it as a key.
8) If you use IDENTITY as a key, the values tend to cluster on physical
data pages because they are sequential. The result is that if the most
recent rows are the most likely to be accessed, there will be locking
contention for control of those physical data pages. What you really
wanted in a key is some spread of the rows over physical storage to
avoid having every user trying to get to the same page at the same
time.
9) The actual implementation of the IDENTITY column has been
problematic since Version 7.0. You can look up threads in the news
groups to get assorted tales of woe.
There are other ways of getting a unique identifier for a table. The
most portable method for getting a new identifier number which is not
in the set is something like this:
INSERT INTO Foobar (keycol, a, b, c...)
VALUES(COALESCE((SELECT MAX(keycol) FROM Foobar) +1, 0),
aa, bb, cc, ...);
The scalar subquery expression returns the current high value for the
key column, and then increments it. If there is no maximum value (i.e.
this is the first row to be inserted), then it returns zero.
Using this basic idea, you can replace the increment with a different
constant or a random number generator. You can also add code to create
a check digit.
Another method is to hash the columns that make up a compound key so
that you have single short column that can be reconstructed if you need
to verify it.
--CELKO--
Joe Celko, SQL Guru & DBA at Trilogy
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc)
which can be cut and pasted into Query Analyzer is appreciated.
---
Trilogy, FORTUNE, and Goldman Sachs are hosting the E-Business event of
the year. Find out more by visiting www.battleroyale2001.com
Sent via Deja.com
http://www.deja.com/