Hi all, I'm enjoying reading C. J. Date's new book "Database in Depth", in which I know the www.dbdebunk.com website. I visited the site just now, and find the following statement in its "QUOTE OF THE WEEK" section:
** QUOTE ** Don't use primary keys that have meaning for the end user, such as invoice number or the ISBN value.
--Giuseppe DiMauro & Francesco Balena, PRACTICAL GUIDELINES AND BEST PRACTICES FOR MICROSOFT VISUAL BASIC AND VISUAL C# DEVELOPERS ** QUOTE **
It seems that they quote this idea to point out how wrong it is. But I myself always regard this idea as a good practice in database design! Am I right? If not, what's wrong with it? Please comment.
On Thu, 18 Aug 2005 00:25:57 -0700, Mike Meng wrote: > ** QUOTE ** > Don't use primary keys that have meaning for the end user, such as > invoice number or the ISBN value. [...] > ** QUOTE **
> It seems that they quote this idea to point out how wrong it is. But > I myself always regard this idea as a good practice in database design! > Am I right? If not, what's wrong with it? Please comment.
This is probably an area where Darwen/Date and Celko would agree(!).
It has often been discussed in this group and comp.databases. And it has been written about at various places on the web, eg
My position is that generated columns are sometimes needed because the world isn't perfect. But it's a stupid habit when people _automatically_ (by heart) create a generated "ID" column in tables and make it the primary key without thinking. Often, it will contain useless information because columns can be uniquely identified by other column(s) in the table. Sometimes, using such generated values for primary keys will allow wrong, redundant information to accumulate in the non-primary key columns (if the domain key(s) aren't also enforced with UNIQUE constraints).
> Hi all, > I'm enjoying reading C. J. Date's new book "Database in Depth", in > which I know the www.dbdebunk.com website. I visited the site just now, > and find the following statement in its "QUOTE OF THE WEEK" section:
> ** QUOTE ** > Don't use primary keys that have meaning for the end user, such as > invoice number or the ISBN value.
> --Giuseppe DiMauro & Francesco Balena, PRACTICAL GUIDELINES AND BEST > PRACTICES FOR MICROSOFT VISUAL BASIC AND VISUAL C# DEVELOPERS > ** QUOTE **
> It seems that they quote this idea to point out how wrong it is. But > I myself always regard this idea as a good practice in database design! > Am I right? If not, what's wrong with it? Please comment.
It might seem like an acceptable way of preventing an incorrect manually entered key value from being propagated through the database. If a user enters a licence plate number wrong, and that is used as a foreign key in dozens of tables, you've got dozens of places to correct the error (meaning lots of code has to exist or be written--and added to in perpetuity as the database design grows).
The better solution is to create foreign keys with an ON UPDATE CASCADE constraint so that when the users enter a wrong value and subsequently correct it, the DBMS automatically propagates the correction. This is preferable because you don't need a key generator function; you don't need a cross-reference table; you don't obfuscate data with surrogate values; and most importantly IMO, you don't conceal duplications. The latter allows apparently correctly-formed queries to give incorrect answers.
>>Hi all, >> I'm enjoying reading C. J. Date's new book "Database in Depth", in >>which I know the www.dbdebunk.com website. I visited the site just now, >>and find the following statement in its "QUOTE OF THE WEEK" section:
>>** QUOTE ** >>Don't use primary keys that have meaning for the end user, such as >>invoice number or the ISBN value.
>>--Giuseppe DiMauro & Francesco Balena, PRACTICAL GUIDELINES AND BEST >>PRACTICES FOR MICROSOFT VISUAL BASIC AND VISUAL C# DEVELOPERS >>** QUOTE **
>> It seems that they quote this idea to point out how wrong it is. But >>I myself always regard this idea as a good practice in database design! >>Am I right? If not, what's wrong with it? Please comment.
> It might seem like an acceptable way of preventing an incorrect manually > entered key value from being propagated through the database. If a user > enters a licence plate number wrong, and that is used as a foreign key in > dozens of tables, you've got dozens of places to correct the error (meaning > lots of code has to exist or be written--and added to in perpetuity as the > database design grows).
> The better solution is to create foreign keys with an ON UPDATE CASCADE > constraint so that when the users enter a wrong value and subsequently > correct it, the DBMS automatically propagates the correction. This is > preferable because you don't need a key generator function; you don't need a > cross-reference table; you don't obfuscate data with surrogate values; and > most importantly IMO, you don't conceal duplications. The latter allows > apparently correctly-formed queries to give incorrect answers.
> Roy
I agree.
I often take the following:
"one should not increase, beyond what is necessary, the number of entities required to explain anything"
and modify it to this:
"one should not increase, beyond what is necessary, the number of entities required to store anything"
> Hi all, > I'm enjoying reading C. J. Date's new book "Database in Depth", in > which I know the www.dbdebunk.com website. I visited the site just now, > and find the following statement in its "QUOTE OF THE WEEK" section:
> ** QUOTE ** > Don't use primary keys that have meaning for the end user, such as > invoice number or the ISBN value.
> --Giuseppe DiMauro & Francesco Balena, PRACTICAL GUIDELINES AND BEST > PRACTICES FOR MICROSOFT VISUAL BASIC AND VISUAL C# DEVELOPERS > ** QUOTE **
> It seems that they quote this idea to point out how wrong it is. But > I myself always regard this idea as a good practice in database design! > Am I right? If not, what's wrong with it? Please comment.
It depends. There are times when it's wise to use keys like ISBN and invoice number, and times when it's not.
The important factor is not whether the user understands the keys or does not, but whether new instances of real world entities come with dependable keys or not.
Examples: depending on Social Security # to identify persons is a bad idea, if you are going to be faced with a significant number of people that don't have an account with Social Security. example: the prison inmate population.
Depending on ISBN to key books in a library is a bad idea if you are going to have a significant number of books that come without an ISBN, like certain government publications.
And, of course, the generator has to avoid duplicates, among other issues.
There are other cases where surrogate keys are a good idea.
But, in general, Date is right. Natural keys are better, as long as they are dependable.
It's not a dogmatic question. There are situation where one choice is right, and others where the other choice is right.
And that's what I don't like about dogmatic responses. They tend to assume that design judgements fall on a simple linear scale from badness to goodness. Goodness is not a one dimensional measure. And that's why I stay way from dbdebunk. At a deeper level, they don't get it.
>> .. if you are going to be faced with a significant number of people that don't have an account with Social Security. example: the prison inmate population. <<
Actually, we either deport them or give them an SSN on release or parole as part of the attempt to do job placement. If they are deported then they get a number from the INS on the way out.
The fact that this question is being considered is a sign that we have a lot of really bad SQL programmers. By definition a relational key is a subset of attributes of an entity or relationship being modeled in the schema. This is absolute foundations of the Relational Model kind of stuff.
IDENTITY is an exposed PHYSICAL locator. What does that phrase mean? The value is created by looking at the internal state of the hardware at the time a PHYSICAL record containing a row is inserted into storage. Its purpose is to locate the row without any regard to what the data means.
Think about using a pointer or a track/sector number; same thing but different mechanism. But SQL does not have a pointer data type or the mechanisms to handle pointer operators, garbage collection and housekeeping, so 20+ years ago the original Sybase SQL Server exposed an integer that can map back to the contiguous storage model used under the covers.
IDENTITY is not an attribute in the data model and can never be an attribute in the data model because it does not exist in the reality from which you derive your data model. The purpose is to fake a sequential file's positional record number, so I can reference the PHYSICAL storage location. Sure, I lose all the advantages of an abstract data model, SQL set oriented programming, carry extra data, destroy the portability of code and have no data integrity. Who cares? It is easier to write than a real normalized RDBMS schema.
More and more programmers who have absolutely no database training are being told to design a database. It takes six years to become a Journeyman Union Carpenter in New York State. But a programmer with only a few years in a procedural or OO language is expected to produce a correct and usable database.
What is he going to do? The smart ones will get some help and beg for training, knowing they are over their heads and can ruin the company. Most of them simply start programming SQL as if it were their native programming language. They grab at GUIDs, IDENTITY, ROWID and other proprietary auto-numbering "features" in SQL products to imitate either a record number (sequential file system mindset) or OID (OO mindset) since they don't know anything else.
They write code with cursors to mimic record-at-a-time file handling -- it is easy do a global replace of READ() with FETCH. They write to temp tables to mimic scratch files in a series of procedural steps - it is hard to think of a single declarative statement to get the answer. They use dynamic SQL and let the user figure out how the system should work on the fly - they never had a software engineering course and don't know what coupling and cohesion are.
Experienced database designers look for industry standard codes for their keys first. Try to tell the IRS you don't have anybody's SSN at tax time but Cindy Lou Who was employee 42 when you put her into the Personnel table and you have the IDENTITY value to prove it. Try to sell a car without a VIN - using IDENTITY for this would be like identifying a vehicle by the current local parking space number (think about it - that is exactly what an IDENTITY value is -- a parking space in one physical location).
We know that we have to use industry standard codes, either de facto or de jure. We also appreciate the fact that we can exchange data with the rest of the world. We appreciate the fact that someone else will maintain and define these codes. A trusted external source is a good thing to have.
If you don't have an industry standard or natural key and have to design your own codes, it is hard work to do it right. I know that newbie programmers want to start coding first and thinking later. It does not work that way and all the "wish magic" in the world will not change that fact. I have several chapters in my books on how to design encoding schemes, but that is another topic.
The first practical consideration is that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other SQL products. Newbies actually think they will never port code! Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such crap nobody else want their application. Otherwise, you will port code; you will share data with some other database; data does not exist in isolation.
But let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a property which belongs to the PHYSICAL table, not the LOGICAL data in the table.
Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.
Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like
INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
To put a few rows into the table and notice that the IDENTITY sequentially numbered them in the PHYSICAL order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in pre-allocated sequential files in the 1950's, by the way. A utility program would then "pack" or "compress" the records that were flagged as deleted or unused to move the empty space to the PHYSICAL end of the PHYSICAL file. IDENTITY leaves the gaps unless you write your own routine to do the compression.
But now use a statement with a query expression in it, like this:
INSERT INTO Foobar (a, b, c) SELECT x, y, z FROM Floob;
Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one did you pick? Why? The answer has been to use whatever the *PHYSICAL* order of the result set happened to be. That non-relational phrase "PHYSICAL order" again!
But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different PHYSICAL order. Indexes and statistics are not part of the logical model.
Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.
How do you verify that an entity has the right key when you use an IDENTITY? If I use a VIN for a vehicle, I can go to parking lot and read it off the dashboard. If I use a UPC for a candy bar, I can read the bar code on the wrapper. But why is little Cindy Lou Who employee 42? If I turn her upside, will I find that she has 42 tattooed somewhere? If I call an external trusted source, will they know that she is employee 42?
In the Relational Model, you do not invent a key in the storage. You discover a key (and the other attributes) in the real world and model it. If you create your own encoding for a key, then you have to maintain it, provide audit trails and do all the work that an industry standard organization would do for you.
If I lean on a mouse button, I can insert the same data with a new IDENTITY over and over. Cindy Lou Who is now employed two times and none of my reports are right! Now I have to write some procedural code like a trigger or a UNIQUE constraint on her SSN to prevent this, thus making the IDENTITY redundant. But we were assuming that we use only IDENTITY as a key, so we are screwed.
Newbies often design tables without bothering to look for a relational key, so they are so surprised when they do a data warehouse and nobody else has any idea what they are doing in their subsystem.
Cindy Lou Who now has two rows in Personnel. When we sign her up for the Dental Plan, we get the row with 42. When we sign her up for the Bowling Team, we get the row with 43. We find our error, and delete the row with 42 because we have a row that was created later and we assume it is more current.
Another common way to get this is to have two procedures, one for inserting a new employee to the Dental Plan and one for inserting a new employee to the Bowling Team. Both procedures create a row in Personnel since they use only IDENTITY as a key.
Well, now we have an orphan row in Personnel. In fact, in such systems, you will find a lot of orphans. I worked for a company that used GUIDs for OIDs substitutes and our software fell apart in about a year of actual use by a client.
Another cute way to destroy data integrity:
BEGIN ATOMIC DELETE FROM Foobar WHERE id = <<some identity value>>; INSERT INTO Foobar VALUES (<<recreate deleted row>>) END;
Logically this should do nothing, but since IDENTITY has gaps, it trashes the data. When one query uses the IDENTITY and another uses the real key, you are like a man with two watches, you are never sure what time it is.
Finally, an appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in
...
> The fact that this question is being considered is a sign that we have > a lot of really bad SQL programmers. By definition a relational key is > a subset of attributes of an entity or relationship being modeled in > the schema. This is absolute foundations of the Relational Model kind > of stuff.
> IDENTITY is an exposed PHYSICAL locator. What does that phrase mean? > The value is created by looking at the internal state of the hardware > at the time a PHYSICAL record containing a row is inserted into > storage. Its purpose is to locate the row without any regard to what > the data means.
> Think about using a pointer or a track/sector number; same thing but > different mechanism. But SQL does not have a pointer data type or the > mechanisms to handle pointer operators, garbage collection and > housekeeping, so 20+ years ago the original Sybase SQL Server exposed > an integer that can map back to the contiguous storage model used under > the covers.
> IDENTITY is not an attribute in the data model and can never be an > attribute in the data model because it does not exist in the reality > from which you derive your data model. The purpose is to fake a > sequential file's positional record number, so I can reference the > PHYSICAL storage location. Sure, I lose all the advantages of an > abstract data model, SQL set oriented programming, carry extra data, > destroy the portability of code and have no data integrity. Who cares? > It is easier to write than a real normalized RDBMS schema.
> More and more programmers who have absolutely no database training are > being told to design a database. It takes six years to become a > Journeyman Union Carpenter in New York State. But a programmer with > only a few years in a procedural or OO language is expected to produce > a correct and usable database.
> What is he going to do? The smart ones will get some help and beg for > training, knowing they are over their heads and can ruin the company. > Most of them simply start programming SQL as if it were their native > programming language. They grab at GUIDs, IDENTITY, ROWID and other > proprietary auto-numbering "features" in SQL products to imitate either > a record number (sequential file system mindset) or OID (OO mindset) > since they don't know anything else.
> They write code with cursors to mimic record-at-a-time file handling -- > it is easy do a global replace of READ() with FETCH. They write to temp > tables to mimic scratch files in a series of procedural steps - it is > hard to think of a single declarative statement to get the answer. They > use dynamic SQL and let the user figure out how the system should work > on the fly - they never had a software engineering course and don't > know what coupling and cohesion are.
> Experienced database designers look for industry standard codes for > their keys first. Try to tell the IRS you don't have anybody's SSN at > tax time but Cindy Lou Who was employee 42 when you put her into the > Personnel table and you have the IDENTITY value to prove it. Try to > sell a car without a VIN - using IDENTITY for this would be like > identifying a vehicle by the current local parking space number (think > about it - that is exactly what an IDENTITY value is -- a parking space > in one physical location).
> We know that we have to use industry standard codes, either de facto or > de jure. We also appreciate the fact that we can exchange data with the > rest of the world. We appreciate the fact that someone else will > maintain and define these codes. A trusted external source is a good > thing to have.
> If you don't have an industry standard or natural key and have to > design your own codes, it is hard work to do it right. I know that > newbie programmers want to start coding first and thinking later. It > does not work that way and all the "wish magic" in the world will > not change that fact. I have several chapters in my books on how to > design encoding schemes, but that is another topic.
> The first practical consideration is that IDENTITY is proprietary and > non-portable, so you know that you will have maintenance problems when > you change releases or port your system to other SQL products. Newbies > actually think they will never port code! Perhaps they only work for > companies that are failing and will be gone. Perhaps their code is such > crap nobody else want their application. Otherwise, you will port code; > you will share data with some other database; data does not exist in > isolation.
> But let's look at the logical problems. First try to create a table > with two columns and try to make them both IDENTITY. If you cannot > declare more than one column to be of a certain data type, then that > thing is not a data type at all, by definition. It is a property which > belongs to the PHYSICAL table, not the LOGICAL data in the table.
> Next, create a table with one column and make it an IDENTITY. Now try > to insert, update and delete different numbers from it. If you cannot > insert, update and delete rows from a table, then it is not a table by > definition.
> Finally create a simple table with one IDENTITY and a few other > columns. Use a few statements like
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); > INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); > INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
> To put a few rows into the table and notice that the IDENTITY > sequentially numbered them in the PHYSICAL order they were presented. > If you delete a row, the gap in the sequence is not filled in and the > sequence continues from the highest number that has ever been used in > that column in that particular table. This is how we did record numbers > in pre-allocated sequential files in the 1950's, by the way. A utility > program would then "pack" or "compress" the records that were flagged > as deleted or unused to move the empty space to the PHYSICAL end of the > PHYSICAL file. IDENTITY leaves the gaps unless you write your own > routine to do the compression.
> But now use a statement with a query expression in it, like this:
> INSERT INTO Foobar (a, b, c) > SELECT x, y, z > FROM Floob;
> Since a query result is a table, and a table is a set which has no > ordering, what should the IDENTITY numbers be? The entire, whole, > completed set is presented to Foobar all at once, not a row at a time. > There are (n!) ways to number (n) rows, so which one did you pick? Why? > The answer has been to use whatever the *PHYSICAL* order of the result > set happened to be. That non-relational phrase "PHYSICAL order" again!
> But it is actually worse than that. If the same query is executed > again, but with new statistics or after an index has been dropped or > added, the new execution plan could bring the result set back in a > different PHYSICAL order. Indexes and statistics are not part of the > logical model.
> Can you explain from a logical model why the same rows in the second > query get different IDENTITY numbers? In the relational model, they > should be treated the same if all the values of all the attributes are > identical.
> How do you verify that an entity has the right key when you use an > IDENTITY? If I use a VIN for a vehicle, I can go to parking lot and > read it off the dashboard. If I use a UPC for a candy bar, I can read > the bar code on the wrapper. But why is little Cindy Lou Who employee > 42? If I turn her upside, will I find that she has 42 tattooed > somewhere? If I call an external trusted source, will they know that > she is employee 42?
> In the Relational Model, you do not invent a key in the storage. You > discover a key (and the other attributes) in the real world and model > it. If you create your own encoding for a key, then you have to > maintain it, provide audit trails and do all the work that an industry > standard organization would do for you.
> If I lean on a mouse button, I can insert the same data with a new > IDENTITY over and over. Cindy Lou Who is now employed two times and > none of my reports are right! Now I have to write some procedural code > like a trigger or a UNIQUE constraint on her SSN to prevent this, thus > making the IDENTITY redundant. But we were assuming that we use only > IDENTITY as a key, so we are screwed.
> Newbies often design tables without bothering to look for a relational > key, so they are so surprised when they do a data warehouse and nobody > else has any idea what they are doing in their subsystem.
> Cindy Lou Who now has two rows in Personnel. When we sign her up for > the Dental Plan, we get the row with 42. When we sign her up for the > Bowling Team, we get the row with 43. We find our error, and delete the > row with 42 because we have a row that was created later and we assume > it is more current.
> Another common way to get this is to have two procedures, one for > inserting a new employee to the Dental Plan and one for inserting a new > employee to the Bowling Team. Both procedures create a row in Personnel > since they use only IDENTITY as a key.
> Well, now we have an orphan row in Personnel. In fact, in such systems, > you will find a lot of orphans. I worked for a company that used GUIDs > for OIDs substitutes and our software fell apart in about a year of > actual use by a client.
> Another cute way to destroy data integrity:
> BEGIN ATOMIC > DELETE FROM Foobar > WHERE id = <<some identity value>>; > INSERT INTO Foobar > VALUES (<<recreate deleted row>>) > END;
> Logically this should do nothing, but since IDENTITY has gaps, it > trashes the data. When one query uses the IDENTITY and another uses the > real key, you are like a man with two watches, you are never sure what > time it is.
> Finally, an appeal to authority, with a quote from Dr. Codd: > "..Database users may cause the system to generate or delete a > surrogate, but they
JGE wrote: > ... > ...There are many situations where one needs to store information about people
where one has no right to ask for and the
> person has no obligation to deliver, their SSN. Also, is not the SSN simply > an artificial key, generated by the government? As such, how does it differ > from the lovely primary key or 'identity' that Access is so generous to > provide? In principle, rather than in scale. >...
i thought that the long message about IDENTITY being an exposed PHYSICAL indicator was more mystical than logical. i'd say your message is closer to natural logic, even if Access is flawed. don't claim to know the answer, since the question, if i've got it right makes no sense, ie. which should we prefer, a generated key, or a 'natural' one? all depends on the situation. my own opinion at the moment is that databases don't have to be global to be useful. in fact, local databases may be safer. a few months ago i was handcuffed by an Oregon state trooper who thought i was using stolen plates. fault of a database that didn't recognize canuck plates. on the other hand, i was released about an hour later after the canuck db was consulted. shudder to think what would have happened had there been only one, wrong, db.
> And that's what I don't like about dogmatic responses. They tend to assume > that design judgements fall on a simple linear scale from badness to > goodness. Goodness is not a one dimensional measure. And that's why I stay > way from dbdebunk.
I fear you miss the point of DBDebunk. I doubt Pascal's purpose is to tell us how to do things (since--for one thing--he knows very well we haven't the tools or products to do things "his" way). I believe his purpose is to show that a systematic, logical way of thinking can take us places where common sense fails or misleads. His purpose is to show that "design judgements" are improved when one has a grasp of a mathematically respectable theory for what one is doing (and I don't mean theory in the sense it is often used, as a synonym for conjecture or speculation). And his often stated purpose is to point out that there are good reasons to think that better tools are possible, and that we won't get them if we don't know to ask for them.
Making it a policy to stay away from DBDebunk seems rather dogmatic to me. There is lots there that is good, and most of the rest is worth testing intellectually even if you ultimately decide it's wrong. Staying away from DBDebunk as a matter of policy seems dogmatic at best, and may even be arrogant.
David Cressey wrote: > And that's what I don't like about dogmatic responses. They tend to assume > that design judgements fall on a simple linear scale from badness to > goodness. Goodness is not a one dimensional measure. And that's why I stay > way from dbdebunk. > At a deeper level, they don't get it.
But at dbdebunk they DO NOT dogmatically say avoid surrogates - e.g. here:
"We do not recommend the use of surrogate keys as a rule, but only in some rare circumstances and for very specific reasons. Chapter 3 in my book gives ah explanation of the criteria for choosing keys, which also help determining when those special circumstances apply."
It is the "quote of the week" that is dogmatic: "Don't use primary keys that have meaning for the end user..."
> On 2005-08-18, -CELKO- <jcelko...@earthlink.net> wrote: [..] > Hello Joe:
> I'd be an idiot to question anything above, but I will make a comment about > SSN's. Here in Canada, a SIN. There are many situations where one needs to > store information about people where one has no right to ask for and the > person has no obligation to deliver, their SSN. Also, is not the SSN simply > an artificial key, generated by the government? As such, how does it differ > from the lovely primary key or 'identity' that Access is so generous to > provide? In principle, rather than in scale.
There is no difference in its genesis - only in the interest that we attach to it after it is created!
Errr, Access - shudders violently - surely the single most damaging product ever to be foisted on an unsuspecting world when measured in terms of promoting the bad bad practices Joe is railing against. BTW (before I get flamed) I will declare that I have developed a few Access based apps that were quite viable outcomes - but it required lots of experience and discipline to get to achieve those results. Not exactly the skillset of the targeted market.
It does remind me of a recent experience with one of my internet banking facilities (sniggers at the memory). The account was suddenly locked because of 3 failed logins. Suspecting a keylogger hack I quickly rang the help desk to get some forensic info so I could start to id (sic) the affected PC. Oh no she said "its probably a customer who transposed a few digits, didn't notice and then used their password to get your account locked! We used to allocate the numbers sequentially but found it was causing too many problems, so we now use random numbers to avoid typo collisions". My paranoia faded instantly and my evil twin contemplated the DOS from hell - imagine the help desk meltdown trying to reset irate customer passwords if you started pounding from 1 onwards, and then did it again a day later! Simple decisions made by simpletons - BIG headaches later!
> The fact that this question is being considered is a sign that we have > a lot of really bad SQL programmers. By definition a relational key is > a subset of attributes of an entity or relationship being modeled in > the schema. This is absolute foundations of the Relational Model kind > of stuff.
[..] (or I would have top posted).
I understand all the points and concerns they generate...but in my own personal toolkit of techniques I have found I can use surrogate keys and suffer none of the grief you attribute to them.
I guess I should declare 27 years of involvement with gadgets starting in the age of PDP-8/11's, Altairs, Alpha Micro, CDC Cybers (what a machine that was) might have taught me a few tricks in this regard. So all you pups searching for gems should prolly accord 80 in an 80/20 measure of Joes dissertation.
>> Here in Canada, a SIN. There are many situations where one needs to store information about people where one has no right to ask for and the person has no obligation to deliver, their SSN. <<
We have some of the same privacy laws here, but you still need it to pay taxes.
>> Also, is not the SSN simply an artificial key, generated by the government? As such, how does it differ from the lovely primary key or 'identity' that Access is so generous to provide? In principle, rather than in scale. <<
TOTALLY different In principle! -- I have an article on how to validate and verify an SSN sitting at www.dbazine.com in the slush pile. The SSN has nothing to do with the internal state of a single piece of hardware. It is generated by a set of rules in a rigid fomat, verified by the Social Security Adminstration and issued numbers are posted once a month on the Internet so you can look them up. Certain SSN are never issued, certain ranges are available for assignment to persons without an SSN, etc. This is not just a serial number.
>> Oh no she said "its probably a customer who transposed a few digits, didn't notice and then used their password to get your account locked! We used to allocate the numbers sequentially but found it was causing too many problems, so we now use random numbers to avoid typo collisions". <<
I have to use that story! People keep bitching about my mania for check digits because they cannot or do not wish to do the math. Well, until the data is corrupted ..
> I understand all the points and concerns they generate...but in my own > personal toolkit of techniques I have found I can use surrogate keys and > suffer none of the grief you attribute to them.
I believe you are fooling yourself.
> I guess I should declare 27 years of involvement with gadgets starting in > the age of PDP-8/11's, Altairs, Alpha Micro, CDC Cybers (what a machine that > was) might have taught me a few tricks in this regard. So all you pups > searching for gems should prolly accord 80 in an 80/20 measure of Joes > dissertation.
I have been in this field just as long as you. (I still have a copy of the MITS Altair newsletter with a Chirstmas photo of Bill Gates and the rest of the Microsoft team--all 16 of them.)
In that time I have seen/written/endured untold mountains of spaghetti code do to foolish and misguided things, that had I known and followed the advice of people like Date and Pascal (and maybe even Celko on occasion) I could have mostly avoided.
If you were to explain to me how unnecessary surrogate keys silently conceal corruption, and how you guaranteed that they wouldn't, and that it all ended up being cheaper, simpler, and more maintainable than if you hadn't used them, then I might be prepared to accept that you haven't suffered any of the grief. But since you haven't explained that, I'm more inclined to think you just haven't noticed or recognized it. Not noticing that you have cancer is not the same as not having it.
It is not too late to explain now. I keep an open (but properly sceptical) mind.
> I have found I can use surrogate keys and > suffer none of the grief you attribute to them. > [...] > I guess I should declare 27 years of involvement
In my travels, I have been to all corners of North America. I've been atop mountans. I've peered out of skyscraper. I've even flown in a plane. With all my traveling experience, I have yet to find any evidence whatsoever that the earth is round; as far as I can tell, it's flat as a board (moutants et al withstanding).
Despite my personal observations, I have not joined the Flat Earth Society. I rely on the expertise of others to mathematically prove a round earth. I trust the experience of those who have personally observed that the earth is in fact round.
When you teach yourself something, you end up relying on your own observations and assumptions. When you rely on what you know, you lose out on volumes of collective knowledge and experience. Had you not read it in a science book, do you really think you'd come to the conclusion that the earth is round?
The same holds true for databases. When a programmer self-teaches himself databases, he applies his own experience: procedural programming. This is what happened in the fifties when there were no databases and, therefore, no experience or expertise to go by. Programmers back then modeled data exactly like you do now: like a programmer.
Fifty years, the data field has made a lot of progress. The best model we have now is the relational model. One of the fundamentals of this model is that data do not need to be accessed by pointers (meaningless keys); they can be accessed (and related) by their value.
When you ignore this fundamental, your model becomes closer and closer to the ineffective ways of the past. Worse still, you lose the tools they had back then (IDMS pointer functions, for example) to keep their models running. As Santayana said, "Those who cannot remember the past are condemned to repeat it."
> The fact that this question is being considered is a sign that we have > a lot of really bad SQL programmers. By definition a relational key is > a subset of attributes of an entity or relationship being modeled in > the schema. This is absolute foundations of the Relational Model kind > of stuff.
> IDENTITY is an exposed PHYSICAL locator. What does that phrase mean? > The value is created by looking at the internal state of the hardware > at the time a PHYSICAL record containing a row is inserted into > storage. Its purpose is to locate the row without any regard to what > the data means.
> Think about using a pointer or a track/sector number; same thing but > different mechanism. But SQL does not have a pointer data type or the > mechanisms to handle pointer operators, garbage collection and > housekeeping, so 20+ years ago the original Sybase SQL Server exposed > an integer that can map back to the contiguous storage model used under > the covers.
> IDENTITY is not an attribute in the data model and can never be an > attribute in the data model because it does not exist in the reality > from which you derive your data model. The purpose is to fake a > sequential file's positional record number, so I can reference the > PHYSICAL storage location. Sure, I lose all the advantages of an > abstract data model, SQL set oriented programming, carry extra data, > destroy the portability of code and have no data integrity. Who cares? > It is easier to write than a real normalized RDBMS schema.
> More and more programmers who have absolutely no database training are > being told to design a database. It takes six years to become a > Journeyman Union Carpenter in New York State. But a programmer with > only a few years in a procedural or OO language is expected to produce > a correct and usable database.
> What is he going to do? The smart ones will get some help and beg for > training, knowing they are over their heads and can ruin the company. > Most of them simply start programming SQL as if it were their native > programming language. They grab at GUIDs, IDENTITY, ROWID and other > proprietary auto-numbering "features" in SQL products to imitate either > a record number (sequential file system mindset) or OID (OO mindset) > since they don't know anything else.
> They write code with cursors to mimic record-at-a-time file handling -- > it is easy do a global replace of READ() with FETCH. They write to temp > tables to mimic scratch files in a series of procedural steps - it is > hard to think of a single declarative statement to get the answer. They > use dynamic SQL and let the user figure out how the system should work > on the fly - they never had a software engineering course and don't > know what coupling and cohesion are.
> Experienced database designers look for industry standard codes for > their keys first. Try to tell the IRS you don't have anybody's SSN at > tax time but Cindy Lou Who was employee 42 when you put her into the > Personnel table and you have the IDENTITY value to prove it. Try to > sell a car without a VIN - using IDENTITY for this would be like > identifying a vehicle by the current local parking space number (think > about it - that is exactly what an IDENTITY value is -- a parking space > in one physical location).
> We know that we have to use industry standard codes, either de facto or > de jure. We also appreciate the fact that we can exchange data with the > rest of the world. We appreciate the fact that someone else will > maintain and define these codes. A trusted external source is a good > thing to have.
> If you don't have an industry standard or natural key and have to > design your own codes, it is hard work to do it right. I know that > newbie programmers want to start coding first and thinking later. It > does not work that way and all the "wish magic" in the world will > not change that fact. I have several chapters in my books on how to > design encoding schemes, but that is another topic.
> The first practical consideration is that IDENTITY is proprietary and > non-portable, so you know that you will have maintenance problems when > you change releases or port your system to other SQL products. Newbies > actually think they will never port code! Perhaps they only work for > companies that are failing and will be gone. Perhaps their code is such > crap nobody else want their application. Otherwise, you will port code; > you will share data with some other database; data does not exist in > isolation.
> But let's look at the logical problems. First try to create a table > with two columns and try to make them both IDENTITY. If you cannot > declare more than one column to be of a certain data type, then that > thing is not a data type at all, by definition. It is a property which > belongs to the PHYSICAL table, not the LOGICAL data in the table.
> Next, create a table with one column and make it an IDENTITY. Now try > to insert, update and delete different numbers from it. If you cannot > insert, update and delete rows from a table, then it is not a table by > definition.
> Finally create a simple table with one IDENTITY and a few other > columns. Use a few statements like
> INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); > INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); > INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
> To put a few rows into the table and notice that the IDENTITY > sequentially numbered them in the PHYSICAL order they were presented. > If you delete a row, the gap in the sequence is not filled in and the > sequence continues from the highest number that has ever been used in > that column in that particular table. This is how we did record numbers > in pre-allocated sequential files in the 1950's, by the way. A utility > program would then "pack" or "compress" the records that were flagged > as deleted or unused to move the empty space to the PHYSICAL end of the > PHYSICAL file. IDENTITY leaves the gaps unless you write your own > routine to do the compression.
> But now use a statement with a query expression in it, like this:
> INSERT INTO Foobar (a, b, c) > SELECT x, y, z > FROM Floob;
> Since a query result is a table, and a table is a set which has no > ordering, what should the IDENTITY numbers be? The entire, whole, > completed set is presented to Foobar all at once, not a row at a time. > There are (n!) ways to number (n) rows, so which one did you pick? Why? > The answer has been to use whatever the *PHYSICAL* order of the result > set happened to be. That non-relational phrase "PHYSICAL order" again!
> But it is actually worse than that. If the same query is executed > again, but with new statistics or after an index has been dropped or > added, the new execution plan could bring the result set back in a > different PHYSICAL order. Indexes and statistics are not part of the > logical model.
> Can you explain from a logical model why the same rows in the second > query get different IDENTITY numbers? In the relational model, they > should be treated the same if all the values of all the attributes are > identical.
> How do you verify that an entity has the right key when you use an > IDENTITY? If I use a VIN for a vehicle, I can go to parking lot and > read it off the dashboard. If I use a UPC for a candy bar, I can read > the bar code on the wrapper. But why is little Cindy Lou Who employee > 42? If I turn her upside, will I find that she has 42 tattooed > somewhere? If I call an external trusted source, will they know that > she is employee 42?
> In the Relational Model, you do not invent a key in the storage. You > discover a key (and the other attributes) in the real world and model > it. If you create your own encoding for a key, then you have to > maintain it, provide audit trails and do all the work that an industry > standard organization would do for you.
> If I lean on a mouse button, I can insert the same data with a new > IDENTITY over and over. Cindy Lou Who is now employed two times and > none of my reports are right! Now I have to write some procedural code > like a trigger or a UNIQUE constraint on her SSN to prevent this, thus > making the IDENTITY redundant. But we were assuming that we use only > IDENTITY as a key, so we are screwed.
> Newbies often design tables without bothering to look for a relational > key, so they are so surprised when they do a data warehouse and nobody > else has any idea what they are doing in their subsystem.
> Cindy Lou Who now has two rows in Personnel. When we sign her up for > the Dental Plan, we get the row with 42. When we sign her up for the > Bowling Team, we get the row with 43. We find our error, and delete the > row with 42 because we have a row that was created later and we assume > it is more current.
> Another common way to get this is to have two procedures, one for > inserting a new employee to the Dental Plan and one for inserting a new > employee to the Bowling Team. Both procedures create a row in Personnel > since they use only IDENTITY as a key.
> Well, now we have an orphan row in Personnel. In fact, in such systems, > you will find a lot of orphans. I worked for a company that used GUIDs > for OIDs substitutes and our software fell apart in about a year of > actual use by a client.
> Another cute way to destroy data integrity:
> BEGIN ATOMIC > DELETE FROM Foobar > WHERE id = <<some identity value>>; > INSERT INTO Foobar > VALUES (<<recreate deleted row>>) > END;
> Logically this should do nothing, but since IDENTITY has gaps, it > trashes the data. When one query uses the IDENTITY and another uses the > real key, you are like a man with two watches, you are never sure what > time it is.
> Finally, an appeal to authority, with a quote from Dr. Codd: > "..Database users may cause the system to generate or delete a
>>I have found I can use surrogate keys and >>suffer none of the grief you attribute to them. >>[...] >>I guess I should declare 27 years of involvement
> In my travels, I have been to all corners of North America. I've been > atop mountans. I've peered out of skyscraper. I've even flown in a > plane. With all my traveling experience, I have yet to find any > evidence whatsoever that the earth is round; as far as I can tell, it's > flat as a board (moutants et al withstanding).
> Despite my personal observations, I have not joined the Flat Earth > Society. I rely on the expertise of others to mathematically prove a > round earth. I trust the experience of those who have personally > observed that the earth is in fact round.
> When you teach yourself something, you end up relying on your own > observations and assumptions. When you rely on what you know, you lose > out on volumes of collective knowledge and experience. Had you not read > it in a science book, do you really think you'd come to the conclusion > that the earth is round?
> The same holds true for databases. When a programmer self-teaches > himself databases, he applies his own experience: procedural > programming. This is what happened in the fifties when there were no > databases and, therefore, no experience or expertise to go by. > Programmers back then modeled data exactly like you do now: like a > programmer.
> Fifty years, the data field has made a lot of progress. The best model > we have now is the relational model. One of the fundamentals of this > model is that data do not need to be accessed by pointers (meaningless > keys); they can be accessed (and related) by their value.
> When you ignore this fundamental, your model becomes closer and closer > to the ineffective ways of the past. Worse still, you lose the tools > they had back then (IDMS pointer functions, for example) to keep their > models running. As Santayana said, "Those who cannot remember the past > are condemned to repeat it."
Dan wrote: > On 8/19/2005 10:15 AM, Alex Papadimoulis wrote:
>>> ... >> In my travels, I have been to all corners of North America. I've been >> atop mountans. I've peered out of skyscraper. I've even flown in a >> plane. With all my traveling experience, I have yet to find any >> evidence whatsoever that the earth is round; as far as I can tell, it's >> flat as a board (moutants et al withstanding).
> Ever watch a ship disappear over the horizon? > ...
Heh, all I was told was that the horizon was a line and the closer the ship got to it, the smaller it would seem until it disappeared. Didn't occur to me that the ship might be going downhill. I figured that the curved horizon must be a side-effect of the thick windows in airplanes. Personally, I think it's just that it disappeared because of my imperfect eyesight.
On Fri, 19 Aug 2005 02:14:47 GMT, paul c <toledobythe...@oohay.ac> wrote:
[snip]
>i thought that the long message about IDENTITY being an exposed PHYSICAL >indicator was more mystical than logical. i'd say your message is
Not mystical at all. If you expose physical details and they get used, you have lock-in.
>closer to natural logic, even if Access is flawed. don't claim to know >the answer, since the question, if i've got it right makes no sense, ie. >which should we prefer, a generated key, or a 'natural' one? all >depends on the situation. my own opinion at the moment is that >databases don't have to be global to be useful. in fact, local >databases may be safer. a few months ago i was handcuffed by an Oregon
You got handcuffed because of a local database. This was safer because why?
>state trooper who thought i was using stolen plates. fault of a >database that didn't recognize canuck plates. on the other hand, i was >released about an hour later after the canuck db was consulted. shudder >to think what would have happened had there been only one, wrong, db.
Think of the consequences if the database consulted in the first place had been right.
On Fri, 19 Aug 2005 10:51:51 -0500, Dan <d...@nospam.com> wrote: >On 8/19/2005 10:15 AM, Alex Papadimoulis wrote: >>>I have found I can use surrogate keys and >>>suffer none of the grief you attribute to them. >>>[...] >>>I guess I should declare 27 years of involvement
>> In my travels, I have been to all corners of North America. I've been >> atop mountans. I've peered out of skyscraper. I've even flown in a >> plane. With all my traveling experience, I have yet to find any >> evidence whatsoever that the earth is round; as far as I can tell, it's >> flat as a board (moutants et al withstanding).
>Ever watch a ship disappear over the horizon?
Did you figure out yourself that that meant the earth is round, or did you find out about that from elsewhere?
[snip]
>> When you teach yourself something, you end up relying on your own >> observations and assumptions. When you rely on what you know, you lose >> out on volumes of collective knowledge and experience. Had you not read >> it in a science book, do you really think you'd come to the conclusion >> that the earth is round?
Being "practical" as it is often used means ignoring what anyone else has to say.
>If you were to explain to me how unnecessary surrogate keys silently conceal
^^^ =Mr. Celko.
>corruption, and how you guaranteed that they wouldn't, and that it all ended >up being cheaper, simpler, and more maintainable than if you hadn't used >them, then I might be prepared to accept that you haven't suffered any of >the grief. But since you haven't explained that, I'm more inclined to think
He has explained it several times, and just recently (this week), too.
>you just haven't noticed or recognized it. Not noticing that you have >cancer is not the same as not having it.
>It is not too late to explain now. I keep an open (but properly sceptical) >mind.
>>Hi all, >> I'm enjoying reading C. J. Date's new book "Database in Depth", in >>which I know the www.dbdebunk.com website. I visited the site just now, >>and find the following statement in its "QUOTE OF THE WEEK" section:
>>** QUOTE ** >>Don't use primary keys that have meaning for the end user, such as >>invoice number or the ISBN value.
>>--Giuseppe DiMauro & Francesco Balena, PRACTICAL GUIDELINES AND BEST >>PRACTICES FOR MICROSOFT VISUAL BASIC AND VISUAL C# DEVELOPERS >>** QUOTE **
>> It seems that they quote this idea to point out how wrong it is. But >>I myself always regard this idea as a good practice in database design! >>Am I right? If not, what's wrong with it? Please comment.
> It depends. There are times when it's wise to use keys like ISBN and > invoice number, and times when it's not.
> The important factor is not whether the user understands the keys or does > not, but whether new instances of real world entities come with dependable > keys or not.
> Examples: depending on Social Security # to identify persons is a bad idea, > if you are going to be faced with a significant number of people that don't > have an account with Social Security. example: the prison inmate > population.
> Depending on ISBN to key books in a library is a bad idea if you are going > to have a significant number of books that come without an ISBN, like > certain government publications.
> And, of course, the generator has to avoid duplicates, among other issues.
> There are other cases where surrogate keys are a good idea.
> But, in general, Date is right. Natural keys are better, as long as they > are dependable.
> It's not a dogmatic question. There are situation where one choice is > right, and others where the other choice is right.
> And that's what I don't like about dogmatic responses. They tend to assume > that design judgements fall on a simple linear scale from badness to > goodness. Goodness is not a one dimensional measure. And that's why I stay > way from dbdebunk. > At a deeper level, they don't get it.
The main problem of dbdebunk is that they do not propose anything new and their main argument is that we do not need anything new because we have already one good theory and that is enough -- the main problem (in their opinion) is that this theory is not used correctly and there exist bad boys who do not understand it (bad programmers, bad database developers and so on). Those bad boys produce bad products and try to enhance this ultimate theory and its is crime. Why that theory is good? Just because it is based on set theory (stupid argument, is not it). I would qualify dbdebunk as religious fanatics who identify themselves with this one theory and speak on bahalf of it. They think their role consists in protecting this theory and the way they chose consists in isolating it from the outside world, frozing in its current state and fighting with any type of heretics. So it is really dogmatic and orthodox site (unfortunately). And anybody who wants to learn something new, unusual, surprising and interesting should keep a distance from this site because they think that they know what truth is.