I'd like to know - are you against the datatype "Identity" or are you really
against having an "artificial" key in *every* case?
Here is simplified example from the database I'm currently working on: (I am
using a relational DB - specifically MS Sql Server 2k - I have to work
within certain constraints - lots of responses say "use OODB", or "use this
product", those are not options for me)
I have a table called Employee - the candidate primary key is the Employee's
Internet Email Address, but I resorted to using an Identity column called
EmployeeId.
Furthermore - I also have a table called "Order" where 1 Employee can create
many orders - so - if the Employee primary key is EmailAddress does it make
sense to store something that large in the "Order" table? There are many
other tables where EmployeeId is a foreign key - am I wrong to think I
shouldn't be storing something large like email address all over?
I guess the advantage of using something like "EmailAddress" would be that I
don't have a "nonsense" number in my database - however - if EmailAddress is
a primary key then I can't change an Employee EmailAddress without changing
it *everywhere* - in fact - I'm not sure which Employee attribute could
never change.
Granted - I only use Identity columns as a "Last Resort" - for example:
An order can be in one of three states: Pending, Approved, Denied
So - I have a table called OrderStatus where the primary key is char(1) and
the values are 'P', 'A', 'D' It's nice because there is meaning to the
primary key - why didn't I do that for Employee? There are 100,000 of them.
Names can conflict with each other - I can't guarantee I'll always have an
email address, etc. So - I resorted to using an Identity column.
One more question - sometimes there are two columns making a table unique -
without creating a single artificial key - how would you *easily* create a
foreign key?
Thanks for any and all replies -
But, we can ignore that and generalize the question to "must I always
use a natural key"? I would say that you need to think about this at
two levels: your __relational__ design and your __implementation__
design. In your relational design you should strive to model your
relations so that that model is as close to reality as possible. So,
if you know that some subset of attributes of each relation is unique
and invariant, that is going to be a key for that relation. If you
don't have such a subset of unique invarient attributes, you should
probably ask yourself why not. This is when a unique identifier (e.g.
customer id) will infect your relational design.
When you implement your design, you should stay as close your your
relational design as possible. However, you might have to "break some
relational rules" in order to satisfy your application's requirements.
I don't see how anybody can argue with this because it seems obvious
to me that a usable design that isn't perfectly relational is
preferable to a perfectly relational design that isn't usable.
But, I really recommend that you get an expert's opinion about this.
- Brian
Most of my experience is in data warehouse design so my opinion may be a
little slanted.
Are you building the company's employee records system? If this is the
case, then you could create an EmployeeId field and your application would
be the source of record. If not, what does the employee records system use
to uniquely identify an employee? This would probably be your safest key.
But back to email address...Are all employees assigned an email address?
How soon after hire? Is there a standard by which the address is assigned
(ie first initial + middle initial + last name + tie breaker)? Is it
possible for the email address to change? Do all employees belong to the
same mail domain? If so, you could just carry the email id instead of the
entire address. I know I'll get flak for this question, but what about
using SSN, or SSN + last name? I am currently consulting for a large
government agency. They use first initial + middle initial + last initial +
the last four digits of SSN to uniquely identify it's employees. Once its
assigned, it doesn't change even with a name change.
When you create a surrogate key, you normally have a translation table that
will translate the natural key to the surrogate key. That is how you are
able to find the correct employee record when you need to change the
employee phone extension, or employee name. You really do need to come up
with a valid natural "identifier" whether you are going to use surrogate, or
natural keys in your database.
I usually don't let the extra storage space dictate whether I use natural
keys or not. Suppose you can use the email id portion of the address only.
Let's say it's an average of 10 bytes across all 100,000 employees. If the
100,000 employees generated 10,000,000 orders, the natural key would only
generate an additional 60 meg of storage, minus the half meg for not
carrying the surrogate key in the employee table. 60 meg is nothing (my
VLDB warehouse experience may be swaying my opinion though:).
As far as creating a multi-column foreign key. You would just migrate all
columns.
For what it's worth,
Russ
"John" <ru_re_re_move_zel@ecl_re_re_move_ipse.net> wrote in message
news:JnQY7.359$8j1.34...@news.netcarrier.net...
In the case where more than one coulumn is required to make a candidate key,
it would be ok to make a single artificial (surrogate) key (again observing
the constraint on the natural candidate key).
I would say that your own feelings for what is good and what is bad are
sound ...
"John" <ru_re_re_move_zel@ecl_re_re_move_ipse.net> skrev i en meddelelse
news:JnQY7.359$8j1.34...@news.netcarrier.net...
My remarks are addressed to the issue of "natural keys" versus "surrogate
keys". I have a vague suspicion that the "identity" issue covers more
ground than this, but it's useful to explore the surrogate key issue first.
Let's take a typical natural key, like "PRODUCT_ID". Further, let's say
that different departments assign product ID's independently of each other,
with some protocol in place to keep the same Id from being assigned twice.
But let's say that in the course of analysis, we come up with the fact
that, in the past, the same Product Id has, in fact, been assigned to more
than one product, by mistake.
Now the question becomes, "what do you do when that happens"? One answer
is that one of the two products gets its original Product Id pulled, a new
product Id is assigned, and all records that used the "wrong" product Id
are retroactively updated to refer to the corrected ID. Another answer is
that each department continues to use the product id, even though they
conflict, and "if you database people want to put them in one big table, I
guess that's your problem, isn't it?" This sounds absurd on its face, but
I have heard it too many times in the real world to dismiss it as a
consideration. In a case like this, I would use surrogate keys.
A second case is where the assigning authority is not accountable to the
owners of the database. Every now and then, the Social Security
administration has assigned the same SSN more than once. They are not
accountable to your company, so if you make SSN a primary key of a table,
and you get duplicates, it's you problem. I'll admit it: sometimes I've
gone ahead and risked it, and used SSN as a natural primary key.
Ralph Kimball, in his books on data warehousing, discusses the benefits of
using surrogate keys to link fact tables and dimension tables. I tend to
follow his advice.
Apart from the exceptions given above, I would use natural keys when
possible.
--
Regards,
David Cressey
www.dcressey.com
> I guess the advantage of using something like "EmailAddress" would be that I
> don't have a "nonsense" number in my database - however - if EmailAddress is
> a primary key then I can't change an Employee EmailAddress without changing
> it *everywhere* - in fact - I'm not sure which Employee attribute could
> never change.
My gut feeling is that having a "nonsense number" as a primary key is
often good - if it is not nonsense, then it has some meaning to non-db
people, and one day they will (quite legitimately) want to change it.
> Granted - I only use Identity columns as a "Last Resort" - for example:
> An order can be in one of three states: Pending, Approved, Denied
> So - I have a table called OrderStatus where the primary key is char(1) and
> the values are 'P', 'A', 'D' It's nice because there is meaning to the
> primary key
I often do this - it has worked quite well, but can sometimes cause
confusion - for example, if, later, it was decided to change the wording
of "Denied" to "Rejected".
> - why didn't I do that for Employee? There are 100,000 of them.
> Names can conflict with each other - I can't guarantee I'll always have an
> email address, etc. So - I resorted to using an Identity column.
I have yet to find any easily obtainable immutable property of people -
so you have to make one up.
--
Ben Clifford be...@hawaga.org.uk
http://www.hawaga.org.uk/ben/ GPG: 30F06950
webcam: http://barbarella.hawaga.org.uk/benc-cgi/watchers.cgi
Work required: http://www.hawaga.org.uk/~benc/resume
> They use first initial + middle initial + last initial +
> the last four digits of SSN to uniquely identify it's employees. Once its
> assigned, it doesn't change even with a name change.
There are user-interface problems with:
> Once its assigned, it doesn't change even with a name change.
People get upset that their unique ID still has their old name in it,
especially non-technical folk.
Although, I have seen systems (a US gov't system) where the unique ID
appeared to be a combination of SSN and name - and *changed* when the
students name was changed. This is even worse!
"Ben Clifford" <be...@hawaga.org.uk> wrote in message
news:3C34A168...@hawaga.org.uk...
> For people in the U.S., SSN is generally a perfect PK. Yes, there have been
> duplicates issued by the SSA, but they are extremely rare, and if your
> universe of people is known (i.e., not potentially anyone, but rather a
> company's employees), then SSN is the way to go. If you should happen upon a
> duplicate, change the first character to an A. If this happens, you'll need
> another column that contains the true SSN for tax reporting purposes.
It is possible to employ someone who has no SSN - eventually they will
get an SSN but possibly not for several months.
This means that you possibly have to make up a fake ID string for them
and deal with the consequences later - eg. human operators being unable
to cope with the ID not being the persons SSN.
Or you have to not put them in the database for several months, which
means that the benefits of the database are not provided for that
employee (presumably there are some benefits associated with being in
the database, either for the employee or employer - otherwise, there is
not much point in having the db in the first place)
And for people who wish to wrangle on law, a system so designed would
(indirectly) prejudice against immigrants and so may be in violation of
US Federal Law, I think. Although pretty much everyone discriminates in
this way and seems to get away with it.
Thanks for all of your responses.
I think my current database design is on the right track - I had used "email
address" as just an example of a "possible" candidate key for the Employee
table. The system we've built is an online ordering system for my companies
customers - so we get whatever information the customer can give us - we
don't have much control over the content of the Employee table.
One last question - does anybody have any design resources for building a
generic approval routing scheme? Right now - we have a table called "Req"
(short for Requisition - we've abreviated it everywhere in the database) and
another table called "ApprovalRouteQueue"
create table Req (ReqId int)
create table ApprovalRouteQueue (ReqId int, ApprovalCounter int)
So - obviously - there is a 1 to many relationship between Req and
Approvals. Each time a Req gets approved the current "Pending" row in
ApprovalRouteQueue gets updated, emails get sent to the Employee who created
the Req, who approved the Req, and the next Approver. Then the next
Approver clicks the link in the email, comes to the website and the process
starts all over - this keeps repeating until the Req is denies somewhere
along the way - or Approved - the number of times it needs to be approved is
defined differently for each customer. This is really a simplified model -
there are about 10-15 tables involved here that define the levels of
approval, email notices that get sent, etc.
The problem with this design - it only works well for Serial approvals.
Some customers would like "Parallel" approving - which means instead of an
Approval Queue I need an Approval Collection. Has anybody seen an exisiting
design of this?
First, I would name the table "Personnel" and not "Employee" -- it is
a set, so you want a plural or collective noun, not a singular
(scalar) one. Next, I assume that you have a business rule that says
everyone gets a company assigned email address when we hire them.
Yes, I would use it; it has a beautiful property that an artifacial
key does not -- you can verify it by sending out an email!! You know
that they guy was fired if it bounces back to you, you can aks him his
name, or whatever else you need to know. Next bst thing to putting
the person themselves in the database.
The IDENTITY column is a hold over from the early programming language
which were <i>very<i> close to the hardware. For example, the fields
in a COBOL or FORTRAN program were assumed to be physically located in
main storage in the order they were declared in the program. This
meant that you could define a template that overlaid the same physical
space and read the representation in several different ways. In
COBOL, the command was REDEFINES, EQUIVALENCE in FORTRAN and a union
in 'C'.
From a logical viewpoint, this redefinition makes no sense at all. It
is confusing the numeral with the number the numeral represents. The
history of programming after this point in time has been to divorce
the logical and physical models completely.
The early SQLs were based on existing file systems. The data was kept
in physically contiguous disk pages, in physically contiguous rows,
made up of physically contiguous columns. In short, just like a deck
of punch cards or a magnetic tape.
But physically contiguous storage is only one way of building a
relational database and it is not always the best one. But aside from
that, the whole idea of a relational database is that user is not
supposed to know how things are stored at all, much less write code
that depends on the particular physical representation in a particular
release of a particular product.
One of the biggest errors is the IDENTITY column in the Sybase family
(SQL Server and Sybase). People actually program with this "feature"
and even use it as the primary key for the table! Now, let's go into
painful details as to why this thing is bad.
The practical considerations are that IDENTITY is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products. It also
has some very strange bugs in both Sybase and SQL Server; go to a
newsgroup and do a search.
But let's look at the logical problems. First try to create a table
with two columns and try to make them both IDENTITY columns. If you
cannot declare more than one column to be of a certain datatype, then
that thing is not a datatype at all, by definition.
Next, create a table with one column and make it an IDENTITY column.
Now try to insert, update and delete different numbers from it. If
you cannot insert, update and delete rows from a table, then it is not
a table by definition.
Finally create a simple table with one IDENTITY column and a few other
columns. Use a few statements like
INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
to put a few rows into the table and notice that the IDENTITY column
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the
sequence continues from the highest number that has ever been used in
that column in that particular table.
But now use a statement with a query expression in it, like this:
INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;
Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick?
The answer has been to use whatever the physical order of the result
set happened to be. That non-relational phrase "physical order"
again.
But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order. Can you explain from a logical model why
the same rows in the second query get different IDENTITY numbers? In
the relational model, they should be treated the same if all the
values of all the attributes are identical.
Think about trying to do replication on two databases that differ only
by an index, or by cache size or something that occasionally gives
them different execution plans for the same statements. Want to try
to maintain such a system?
There are better ways of creating identifiers.
>Next, I assume that you have a business rule that says
> everyone gets a company assigned email address when we hire them.
Actually - we don't. The system is designed to let our customers employees
purchase from us. I can't always require that I get an email address from
our
customers that I can load into my database.
However - let's assume I *could* get an email address for each and every
employee.
This is the heart of my question - if I use email address - and then that
email address
changes (as they sometimes do) then I'm stuck with foreign keys all over the
database
that I have to change. Does that make sense? I can see your argument
against
the datatype IDENTITY - but I don't see the absolute need to elimonate
artificial
keys in *every* case.
I disagree.
The entity type and the base table derived from it should by named as an
abstract representation of its instances, thus a singular noun, such as
Employee.
Consult Date or another standard textbook.
> Next, I assume that you have a business rule that says
> everyone gets a company assigned email address when we hire them.
> Yes, I would use it; it has a beautiful property that an artifacial
> key does not -- you can verify it by sending out an email!! You know
> that they guy was fired if it bounces back to you, you can aks him his
> name, or whatever else you need to know.
I disagree. Your argument shows exactly why a key should not carry
information: what if the company wants to keep a record of fired employees.
A record in a table should be "a fact"; that is not the case if the person
no longer holds the email-address.
A key should be imutable.
> Next bst thing to putting
> the person themselves in the database.
How do you "put the person themselves in the database"?
You are right that IDENTITY in eg. SQL Server has some bad properties as
shown.
There are, however, better ways of creating surrogate keys as identifiers.
While a natural candidate key in any case should be controlled by a
constraint, and could of course be used as primary key, there can be very
good reasons to use a surrogate key. This has nothing to do with "old
programming languages" or physical dependence.
As an example you might think of records of invoices, which may have a
natural candidate key combined of foreign keys to Customer and Order.
Nevertheless, a running integer is often used as a primary key, eg. to
document (for the auditor) that no sale is forgotten .
Perhaps Celko might consult "SQL for Smarties". I'm not sure what it says
in there on the topic of
singular or plural table names. But I'll bet Celko knows. (Hi, Joe).
> As an example you might think of records of invoices, which may have a
> natural candidate key combined of foreign keys to Customer and Order.
> Nevertheless, a running integer is often used as a primary key, eg. to
> document (for the auditor) that no sale is forgotten .
>
>
I would argue that, in the case you outline here, Invoice number is not a
"surrogate key" but is, instead, a "natural key generated by the system".
The reason I'm calling it a natural key is that it has use outside the
system, eg, by the auditor.
BTW, I agree with you on the use of invoice number, and on the reasons
why.
Probably! - let's hear him :-)
> > As an example you might think of records of invoices, which may have a
> > natural candidate key combined of foreign keys to Customer and Order.
> > Nevertheless, a running integer is often used as a primary key, eg. to
> > document (for the auditor) that no sale is forgotten .
> >
> >
>
> I would argue that, in the case you outline here, Invoice number is not a
> "surrogate key" but is, instead, a "natural key generated by the system".
> The reason I'm calling it a natural key is that it has use outside the
> system, eg, by the auditor.'
Such a distinction between "surrogate key" and "natural key generated by
system" is a "by definition", and does not change the use of it in a
practical way.
I think we agree on the essence of the matter...
> BTW, I agree with you on the use of invoice number, and on the reasons
> why.
... as this indicate.
"--CELKO--" <71062...@compuserve.com> wrote in message
news:c0d87ec0.02010...@posting.google.com...
Oh boy, by inspection I must correct to: Consult a standard textbook, except
"Date: An introduction....". For some reason Date seem to use plural, while
"most others" use singular, eg: Elmasri & Navathe: "...use singular names
for entity types, rather than plural ones, because the entity type name
applies to each individual entity ....".
We migth take a vote on it <g> ...
I think that's a good warning in this case. The relationship between
inboxes and persons is too transient to count on it.
Others have said almost the same thing already.
I'm against using an artificial key where there is a usable natural key.
There are many situations where there is no natural key available. The
commonest one is where there is a need to identify a person, people have
no natural keys. The nearest equivalent is the SSN which is simply an
artificial key generated by a trusted third-party. Other artificial keys
are things like invoice numbers.
>
>Here is simplified example from the database I'm currently working on: (I am
>using a relational DB - specifically MS Sql Server 2k - I have to work
>within certain constraints - lots of responses say "use OODB", or "use this
>product", those are not options for me)
>
>I have a table called Employee - the candidate primary key is the Employee's
>Internet Email Address, but I resorted to using an Identity column called
>EmployeeId.
You could call it payroll number or some similar name. Using the
employee's e-mail address is only useful if there is a mechanism to make
sure that each employee has one and only one address and that each
address refers to one and only one employee. One way of doing this would
be to assign a payroll number to each employee and assign the number as
their e-mail address, making addresses similar to the old Compuserve
addresses.
In essence, use a natural key only when you have the ability to ensure
that it is unique.
>Furthermore - I also have a table called "Order" where 1 Employee can create
>many orders - so - if the Employee primary key is EmailAddress does it make
>sense to store something that large in the "Order" table? There are many
>other tables where EmployeeId is a foreign key - am I wrong to think I
>shouldn't be storing something large like email address all over?
There are overheads either way. Either you store a large key or you
store a smaller key and spend some processing time to decode it. Which
is more important to you?
>
>I guess the advantage of using something like "EmailAddress" would be that I
>don't have a "nonsense" number in my database - however - if EmailAddress is
>a primary key then I can't change an Employee EmailAddress without changing
>it *everywhere* - in fact - I'm not sure which Employee attribute could
>never change.
There are no easily available natural unchanging attributes of an
individual person that can be used as a primary key. If you insist on a
true natural key then get your employees to specify the place and time
of their birth. Getting the time to a precision of 1 second and the
latitude, longitude and altitude to a precision of ~1ft should guarantee
uniqueness.
>
>Granted - I only use Identity columns as a "Last Resort" - for example:
>An order can be in one of three states: Pending, Approved, Denied
>So - I have a table called OrderStatus where the primary key is char(1) and
>the values are 'P', 'A', 'D' It's nice because there is meaning to the
>primary key - why didn't I do that for Employee? There are 100,000 of them.
>Names can conflict with each other - I can't guarantee I'll always have an
>email address, etc. So - I resorted to using an Identity column.
>
>One more question - sometimes there are two columns making a table unique -
>without creating a single artificial key - how would you *easily* create a
>foreign key?
There's no theoretical requirement for a foreign key to be a single
column.
--
Bernard Peek
b...@shrdlu.com
In search of cognoscenti
The reason for sequential numbers was that they were printed on the
paper order forms, so they existed before the actual order. The
auditor wanted to be able to go to the PHYSICAL document for each
order in the PHYSICAL filing cabinet.
That one does not work unless you have insanely good precision in both
time and space. They did a study of Astrology looking at hospital in
New York City. The number of people born each day with EXACTLY the
same horoscope was quite high. An orthodox Jewish hospital is in the
same area as a public hospital that services a poor Latino
neightborhood, etc.
A sequential numbering is still used, regardless of physical storing of
documents in a filing cabinet or not.
The number (the sequence) is primarily used to ascertain that no sale is
left unaccounted for or deleted from the records.
If an invoice number 12 and number 14 is registered, so should an invoice
number 13 be registered. It thus serve an additional purpose than to order
and identify: it reflects a "business rule". This rule can not be reflected
by the "natural key", unless a strict 1-1 is demanded between orders and
invoices. I think this may be your assumption, as you refer to the
pre-printed order forms. In that case the same rule applies to orders: the
order number should be sequential with no gaps.
I do agree that it is a matter of opinion to regard the running number as a
surrogate key or as a natural key, as it originates
from a conceptual number "printed on a paper form". I chose to call it a
surrogate key because it is "invented" due to lack of a natural key
(whatever natural could be by orders and invoices...).
The practical consequence and advise will be the same: use the running
integer as the key. Whether a concrete DBMS has a usable built-in function
for autogenerating such numbers or not is another matter - a matter of
physical design.
>>> If you insist on a true natural key then get your employees to
>specify the place and time of their birth. Getting the time to a
>precision of 1 second and the latitude, longitude and altitude to a
>precision of ~1ft should guarantee
>uniqueness. <<
>
>That one does not work unless you have insanely good precision in both
>time and space.
Well, isn't that exactly what Joe said? It should be clear to anyone
that what he suggests is not even remotely possible.
--
Nis Jorgensen
Amsterdam
Please include only relevant quotes, and reply below the quoted text. Thanks
> If an invoice number 12 and number 14 is registered, so should an invoice
> number 13 be registered. It thus serve an additional purpose than to order
> and identify: it reflects a "business rule". This rule can not be
reflected
> by the "natural key", unless a strict 1-1 is demanded between orders and
> invoices.
Your point is a very good one. I think the "no gaps" rule is applied,
consciously or unconsciously, again and again in the world of business
processing. What follows should be taken in that light.
One view of invoices is to see them as a set. When viewing them as a set,
the invoice number has a feature that is indispensable for use as a primary
key. Invoice numbers have identity. That is, two different invoice
numbers never look alike. However, to determine "gaps" or their absence,
you need to rely on another feature that invoice numbers have, namely
sequence. What I'm calling "sequence" might really be two features,
namely "order" and "interval". I'm not sure.
In any event, the fact that 13 is the successor of 12 and the fact that 14
is greater than twelve, is inherent to the reasoning that says "if 12 and 14
are present, but 13 is not present, it means there is a gap." Now I will
claim that
in turn, depends on the following rule: "Invoices are issued in a
sequence." And that in turn is a small departure from viewing invoices as a
set and only as a set.
I'm pointing this out because I think that an excessive demand that
everything we do be expressed in set theory sometimes leads us to overlook
valuable possible design features like gap detection. We all know the
opposite tendency: people who have worked for years with sequential files,
and can't think of a table as a set. But sometimes "sequence" is a more
useful concept than "set" after all. Only sometimes, of course.
I say the above because Joe tends to harp on beginners who aren't thinking
in terms of sets when they should be. It's part of bringing beginners up to
speed. But sets aren't the only mathematical or ontological concept that is
useful in IT.
> In any event, the fact that 13 is the successor of 12 and the fact that
14
> is greater than twelve, is inherent to the reasoning that says "if 12 and
14
> are present, but 13 is not present, it means there is a gap." Now I will
> claim that
> in turn, depends on the following rule: "Invoices are issued in a
> sequence." And that in turn is a small departure from viewing invoices as
a
> set and only as a set.
Thank you for helping to clarify this.
> I'm pointing this out because I think that an excessive demand that
> everything we do be expressed in set theory sometimes leads us to overlook
> valuable possible design features like gap detection. We all know the
> opposite tendency: people who have worked for years with sequential
files,
> and can't think of a table as a set. But sometimes "sequence" is a more
> useful concept than "set" after all. Only sometimes, of course.
I agree; a "set" when ever applicable in the context of the business, but
not at any price.
>On 9 Jan 2002 10:26:14 -0800, 71062...@compuserve.com (--CELKO--)
>wrote:
>
>>>> If you insist on a true natural key then get your employees to
>>specify the place and time of their birth. Getting the time to a
>>precision of 1 second and the latitude, longitude and altitude to a
>>precision of ~1ft should guarantee
>>uniqueness. <<
>>
>>That one does not work unless you have insanely good precision in both
>>time and space.
>
>Well, isn't that exactly what Joe said? It should be clear to anyone
>that what he suggests is not even remotely possible.
No, of course it wasn't Joe who said it. It just read it within Joe's
quotes ... excuses to the people involved.
You can, in fact, see them as a set. What you do have to see as a sequence
is the act of issuing invoices. That's where the invoice number comes from.
The invoice number itself is a sequence, not just a set.
Something parallel goes on when you audit a DBMS to make sure all the
transactions have been applied to it. If each transaction is tagged with
something called a "transaction sequence number" or some such phrase, it
can be of interest to ensure that no transactions end up missing and
unaccounted for, that there are no "gaps". Given that some transactions
end in rollbacks, this needs to be accounted for, as well.
> However - let's assume I *could* get an email address for each and every employee.
> This is the heart of my question - if I use email address - and then that email
> address changes (as they sometimes do) then I'm stuck with foreign keys all over
> the database that I have to change. Does that make sense?
Why not leave it up to the customer (in this case the employee of the
partner company) to create an identifier for him/her? That is what often
seems to be used on discussion boards and e-shops.
In the case you have described there seems to be no absolute need for
identification in the sens that you need to allow only certain
customers. Your main problem seems to be how to contain a consistent
history of customer records.
What you need is an login dialog where the customer identifies himself.
If it is the first time or if the identification fails (perhaps he
forgot his password or customer number), you need to supply a
combination of apparent facts (=e-mail) and individual facts (=password).
This approach will not garantee that a customer will not appear with
several identities. But hopefully his identity will remain constant long
enough for each order.
jp
I noticed that you said "representation of its instances" and not
"set" or "class" and that might be important. OO people think of
instances rather than sets. I would say that a forest contains trees;
you would say that there is no such construct as a forest, but only
instances of trees. This goes back to the concepts that Cantor
introduced with set theory.
>> Consult Date or another standard textbook. <<
Date uses both plural and singular names. I am re-writing mine to
favor collective nouns. A quick look at the stuff in the SQL books on
my shelf shows more plurals in later books. Try the INCITS L8
Metadata Standards Committee data element naming rules. Scalars are
singular in their rules.
>> I disagree. Your argument shows exactly why a key should not carry
information: what if the company wants to keep a record of fired
employees.
A record in a table should be "a fact"; that is not the case if the
person
no longer holds the email-address. A key should be immutable. <<
A data warehouse is a different problem. If you REALLY want an
immutable key for a person, then we use fingerprints and DNA -- the
prison system again. Most employers are not that concerned about the
location of ex-employees; most of them are not crooks <g>.
But a company assigned email address can be a unique string that is
checked to see that it is issued only once. AOL, Yahoo, et al seem to
be able to do this with lots of people by adding a number after a name
the customer can remember. They have a high turn over rate as well.
>> As an example you might think of records of invoices, which may
have a
natural candidate key combined of foreign keys to Customer and Order.
Nevertheless, a running integer is often used as a primary key, eg. to
document (for the auditor) that no sale is forgotten. <<
If my business rules say that an invoice has one customer number and
one order number, then I want to have a UNIQUE(customer-nbr,
order_nbr) constraint in my schema.
The advantage of the separate invoice number is that you can relate it
to a physical piece of paper or to multiple orders on the same invoice
or multiple invoices for a single order. The customer has his own
purchase order number on the invoice, but it is possile that two
customers might duplicate a purchase order number.
> But a company assigned email address can be a unique string that is
> checked to see that it is issued only once. AOL, Yahoo, et al seem to
> be able to do this with lots of people by adding a number after a name
> the customer can remember. They have a high turn over rate as well.
This is a time-sensitive key: it is only a unique identifier between
two points in time. In AOL's case, your email address only uniquely
identifies (part of) an account between the time you choose it and six
months after you stop using it (AOL will put the email address back
into the "open" after six months, if I remember correctly). Usually,
AOL will ask you for your "primary screen name" (accountholder email
address) when doing business with them, and this is enough to identify
you in the majority of cases because usually they only care about the
one customer using that screen name at the current point in time.
If all of your queries only need information about the here and now,
then an email address is probably suitable, but I don't see the
benefit of email address versus a machine-generated employee id.
- Brian
I don't understand what you mean by "OO people think of instances
rather that sets".
Certainly, if the instances forms a collection (eg a "Bag"), we don't
think of it (the collection) as a set. But what is the point?
In OO-terminology "class" is the abstract representation of its "objects".
In the context of this debate, "class" and "entity type" may be regarded
as synonyms.
Why do you think that I would say, "that there is no such construct as a
forest"?
You are right that I might say "TREE" and not "FOREST" ("TREES")
in a semantic model for trees; that depends on whether the "unit of thought"
is "tree" or not in the problem domain.
> >> Consult Date or another standard textbook. <<
>
> Date uses both plural and singular names. I am re-writing mine to
> favor collective nouns. A quick look at the stuff in the SQL books on
> my shelf shows more plurals in later books. Try the INCITS L8
> Metadata Standards Committee data element naming rules. Scalars are
> singular in their rules.
Please acknowledge that I corrected my self; indeed Date is not quite
consistent.
In his introduction (Ch. 1, An overview ...) he shows and E/R-diagram
with plural nouns (Fig 1.6).
But anywhere else (as far as I can see), he uses the singular form, in
special in
his more complete treatment of semantic modelling and E/R (eg. Fig. 13.2).
I'm not quite sure what you mean by "SQL books". For me, SQL is a language,
not a tool for semantic modelling, and not a synonym for (relational)
databases
and data modelling. In any case, my shelf shows the opposite, and if I take
modern books on modelling in general, and OO and UML in particular,
in account, a reach the conclusion that the semantic entity-types/classes is
proposed to be named in the singular form. There are very good reasons for
that,
but this would lead this thread too far.
When it comes to naming the (relational) tables, I think it is at matter of
opinion
and taste; but it is certainly valid (and for me: advantageous) to make a
smooth
transition from the semantic model by keeping a singular wording. If need
be,
one could go EMPLOYEE -> EMPLOYEE_TBL (or even ->EMPLOYEES).
>> I disagree. Your argument shows exactly why a key should not carry
>> information: what if the company wants to keep a record of fired
>> employees.
>> A record in a table should be "a fact"; that is not the case if the
>> person
>> no longer holds the email-address. A key should be immutable. <<
> A data warehouse is a different problem. If you REALLY want an
> immutable key for a person, then we use fingerprints and DNA -- the
> prison system again. Most employers are not that concerned about the
> location of ex-employees; most of them are not crooks <g>.
I take it as you agree that no usable, non-mutable natural key exists
(DNA patterns might be expressed as a number, but not in a practical way
<g>).
I think you take the example too far in the wrong direction regarding the
topic
of this exchange - my point does not relate to whether employers are
concerned
about their ex-employees, but should be understood for the sake of the
argument:
the key should not change, and should certainly not lead to a "non-fact" in
the DB.
> But a company assigned email address can be a unique string that is
> checked to see that it is issued only once. AOL, Yahoo, et al seem to
> be able to do this with lots of people by adding a number after a name
> the customer can remember. They have a high turn over rate as well.
I frankly don't see your point here.
>> As an example you might think of records of invoices, which may
>> have a
>> natural candidate key combined of foreign keys to Customer and Order.
>> Nevertheless, a running integer is often used as a primary key, eg. to
>> document (for the auditor) that no sale is forgotten. <<
>
> If my business rules say that an invoice has one customer number and
> one order number, then I want to have a UNIQUE(customer-nbr,
> order_nbr) constraint in my schema.
Certainly. This was also implied in my presentation:
("But you must ... declare constraints (such as unique
index) on the (natural) candidate-keys to keep the database healthy")
> The advantage of the separate invoice number is that you can relate it
> to a physical piece of paper or to multiple orders on the same invoice
> or multiple invoices for a single order. The customer has his own
> purchase order number on the invoice, but it is possile that two
> customers might duplicate a purchase order number.
I think this issue is covered in another sub-thread.
It does work if your precision in time and space is adequate. I wasn't
proposing it as a practical way of identifying people. I still maintain
that there is no easy to use natural key for identifying people so if
you have a table that has to record people then you have to use an
artificial or surrogate key.
> They did a study of Astrology looking at hospital in
>New York City. The number of people born each day with EXACTLY the
>same horoscope was quite high. An orthodox Jewish hospital is in the
>same area as a public hospital that services a poor Latino
>neightborhood, etc.
From what I know of astrology it's possible to use the precise place and
time of birth to prepare an individual chart. I've seen charts that use
time to the nearest minute and position to the nearest minute of arc.
I'm not sure how the variables are quantised.
>>> There seems to be a lot of talk in this NG about the good, bad and
>ugly
>aspects of primary keys - especially Identity columns. <<
>
There may be good reasons why a particular proprietary system of
creating a numerical psuedo column is full of fleas; but having a
numerical, meaningless, system assigned unique primary key can be
a good thing, in my experience.
The arguments that win my heart for numerical "artificial" PKs are
that they are under our control, and that numerical indeces are likely
to perform better than indeces based on strings. Think of the PK as a
numerical 'handle" to your data row. The quickest way to fetch that
row (absent some proprietary thing such as Oracles's "rowid" ) would
be by its indexed numerical PK.
They make great Foreign keys, and chances are you will want indexes on
all the columns that are FK's to facilitate Joins.
One big problem with "natural" keys is that very often there's no such
unique animal. You'll need a fragmented PK which can be awkward , and
in any case, the values you user tells you never change, will change.
Your user is a fine fellow but he didnt think the odd case of the
transexual swapping M and F was worth mentioning when there were so
many more important issues he wanted you to get right. . It happens
every time and tell me it aint so?
There's a theoretical argument against them as follows:
It aint 3NF:
FirstName,Lastname in combination are unique in tje PERSON table. Your
numerical PK AND your unique First NAme Last NAme mean in effect that
you have two candidate keys, and shame on you!.
Rebuttal:
What happens when your PERSON "sees the light" and changes his name
to Malcolm X? The unique primary key goes bye-bye but your PK stays,
and all the records which have the PERSON_ID as FK are rock solid,
whereas all the ones with "Malcolm Small" are toast. Your "natural"
PK was an illusian, it was "accidentally" a PK because the current
values happened to be jointly unique, but they were subject to change
after all. Lest you think the Malcolm X example is far fetched...what
about Miss Piggy and Mrs Kermit T Frog? What about Mr Jones who
becomes Dr Jones, Rev Jones, or Capt Jones KCB? John Q
Kaddidlhopper Jr who decides he's a big boy now and uses John Q
Kaddidlhopper without the Jr?
The only down side is that instead of COLOR=red you see COLOR_ID=12345
(the PK of the RED column in the COLOR table). Easily fixed with a
view joining the base table to the color table on the numerical
COLOR_ID.
If you can see a serious down side, I'm all ears, but I havent run
into one since I started playing with (and getting paid to muck with)
this stuff in around '84 or so. Maybe I was just lucky?
YMMV but I'll take the numeric "meaningless" primary key, thank you
very much.
Why? Having two candidate keys is no problem in 3NF, or in any other of the
standard normal forms.
-- Jan Hidders
Here's why I go for singular:
When looking at table "Order" I see columns OrderId, Quantity, Price, etc.
If I make the table plural - shouldn't everything be plural?
I have a number of naming rules in place so my group doesn't go
nuts when writing database code - one of them being that if we make
a synthetic key for a table - it's always "TableNameId" and
OrdersId seems kinda awkward - but I know you hate those synthetic keys
anyway <s>
So how about another table called "ApprovalStatus" - I think of an orders in
one of many states which are defined in that table. Should it be
"ApprovalStati"
That may make more sense for the ApprovalStatus table but Orders would
have a foreign key called "ApprovalStatiCode" - but there is only 1
ApprovalStatus
associated with each Order - so ApprovalStateCode seems more appropriate.
However - I can see a big argument for your way.
On my computer I have a "Projects" folder - in fact - there are plenty of
things
that are sets and the name derives from that fact. But then there are
things like "The Rule Book" - which is really "A Set of Rules"
The more I thought about this - the more I really see it as an issue
of how you view a set - is it a "set of things" or "a set of that thing of a
certain type"
The funny thing about this thread - however - is how somebody brought
up the issue of books - alluding to how we should follow something because
it's in a book. One common coding convention in SQL books is to
use upper case for all sql keywords. The thing I found in project
maintenance
is that I don't care so much about the sql as I do about everything else
in the proc, view, query or whatever it is I'm looking at.
Case in point:
case1: SELECT OrderId, Quantity FROM Orders
-or-
case2: select OrderId, Quantity from Orders
I've been doing a little research into graphic design theory and it's
carried over into other areas of my development life. Having something
in UPPER CASE draws attention to it - is that where you want the
attention drawn? I'm pretty good with the sql syntax at this point so
I'd rather have attention drawn to the other parts of the sql code.
Funny you should mention it. I've seen the same thing myself. But I got
exposed to the opposite convention
early on by an "Extract" statement that composes DDL from the metadata.
This particular command put the sql keywords in lower case, and the data
names in upper case. I don't know whether it's just force of habit,
but the "keywords in lower case" convention has always made more sense to
me.
And I think it's as you say. By having the data names in upper case, my
attention is drawn to those
things I really need to see.
I'll note in passing that all the keywords in "C" are in lower case. So by
adopting a "keywords in lower case" convention, you do a little less mental
shifting when you switch back and forth.. Is this also true of Java?
>Here's why I go for singular:
>When looking at table "Order" I see columns OrderId, Quantity, Price, etc.
>If I make the table plural - shouldn't everything be plural?
Not necessarily
>I have a number of naming rules in place so my group doesn't go
>nuts when writing database code - one of them being that if we make
>a synthetic key for a table - it's always "TableNameId" and
>OrdersId seems kinda awkward - but I know you hate those synthetic keys
>anyway <s>
Well, in the case of plural table names, I tend to use singular for
the id field. The point: You have some rules you are currently using.
Joe is proposing some other rules. Then you shouldn't argue that his
rules don't work because they are incompatible with yours - but rather
that you prefer yours for reasons 1, 2 etc
>The funny thing about this thread - however - is how somebody brought
>up the issue of books - alluding to how we should follow something because
>it's in a book. One common coding convention in SQL books is to
>use upper case for all sql keywords. The thing I found in project
>maintenance
>is that I don't care so much about the sql as I do about everything else
>in the proc, view, query or whatever it is I'm looking at.
>Case in point:
>
>case1: SELECT OrderId, Quantity FROM Orders
>-or-
>case2: select OrderId, Quantity from Orders
>
>I've been doing a little research into graphic design theory and it's
>carried over into other areas of my development life. Having something
>in UPPER CASE draws attention to it - is that where you want the
>attention drawn? I'm pretty good with the sql syntax at this point so
>I'd rather have attention drawn to the other parts of the sql code.
My 2 cents: The important part is to have the keywords and the
identifiers easily distinguishable from one another. Poor man's syntax
colouring.
My first relational interface was a language called "RDO" (The acronym has
since been reused).
In that language, there was a construct that looked like this:
for EMPLOYEE in EMPLOYEES
with ... EMPLOYEE.EMPLOYEE_NAME ...
begin ... end.
"EMPLOYEE" was a local context variable, and not stored in the DB. The
general practice was to use plural relation names, but singular context
variables, or even abbreviations for context variables. The loop structure
of the language made such a construct natural, since, inside the loop, one
was generally thinking about one row at a time. (Don't yell at me, Joe
Celko... I'm not advocating that kind of thinking.)
SQL is different than RDO, and generally preferable, but my habit of
naming tables in the plural never went away.
> And I think it's as you say. By having the data names in upper case, my
> attention is drawn to those
> things I really need to see.
>
> I'll note in passing that all the keywords in "C" are in lower case. So by
> adopting a "keywords in lower case" convention, you do a little less mental
> shifting when you switch back and forth.. Is this also true of Java?
Another point -- since data names are often multiple words
strung together, the names are often more readable if mixed
case is used. If you force data names to upper case, it may
be more difficult to read.
Larry Coon
University of California
la...@assist.org
and lmc...@home.com
Good point. Also, if one uses indenting like so:
SELECT foo.bar, baz.bop
FROM foo, baz
WHERE foo.bar = baz.bar
AND foo.bop = baz.bop
Then the upper case keywords help you tell immediately which part of the
statement is referenced and it's easy to see the logic of the statement
at a glance. In long and complex statements, one wants the keywords to
be most visible to be able to follow the statement's logic and one wants
the data words to be most legible (i.e. mixed case) to be able to spot
typos.
--
Jeff
That is one I do no remember at all! I think I have a QUEL manual in
the garage, but I would really have to dig for it.
Datatrieve marked my transition from traditional programming to relational
databases.
A few years later, DEC made its entry into the relational database arena
with a product called "Rdb". Rdb was (and still is) a relational database.
Its interface language was called "RDO". RDO's data manipulation statements
were heavily influenced by Datatrieve.
In about 1986, a version of Rdb was released with support for SQL as well as
RDO. Eventually, SQL became the primary interface, and RDO was kept around
only for the sake of old applications. Rdb became known as "Rdb/VMS", and
eventually a version was produced to run under DEC Unix.
DEC's Rdb product was sold to Oracle in 1994. Oracle renamed it "Oracle
Rdb". Last time I checked, Oracle Rdb STILL had support for RDO as an
alternate interface to SQL.
I hope this helps place RDO in context.
Regards,
David Cressey
www.dcressey.com
"--CELKO--" <71062...@compuserve.com> wrote in message
news:c0d87ec0.02020...@posting.google.com...
I always use singular names too. Before you can define a table (set),
the element type must be defined. I think SQL is wrong to obfuscate
things by making you do both in one step and weave in set-constraints
like uniqueness and referential integrity to the element definition.
The element type definition is clearly singular, and in an O-O mapping
the associated class is defined in the singular also (these should match).
It also avoids the double-plural problem when you have a set of elements
that are intrinsically plural (each element identifies some collection).
--
Clifford Heath
I'm really just trying to figure out the best way to name database tables.
I'm not sure it's a compatability issue, If I pick one set of rules over
another
I'm not doing it because that's what I prefer, I'm choosing those rules on
the
basis of how it will make my project run smoothly in the long haul,
for everybody involved.
There doesn't seem to be a common concensus on this naming issue,
so in this case it seems as if you have to just pick one way or the other.
It's not like indenting your code - everybody knows that's important and
chooses
to indent over not indenting. This naming debate seems like it will linger
on for a while
before there is a generally accepted way of doing it.
Yes!
The semantic entities (in the logical model) should be named in singular
(this is not an O-O matter, but a general modelling/semantic matter)
I agree that the transition to tables should not obfuscate that by
"changing" the names.
It IS of course a matter of opinion and choice.
I think this makes the point that the entire notion of natural keys is
absurd. You mentioned using position, but there is no concept of
absolute position for anything, we all exist in relative terms.
That's Einstein for ya, and it's human nature too.
Humans define everything in relative terms, our vocabulary and very
means of learning seek to categorize things based upon their relative
position within the overall space of all things knowable. To say that
there is an absolute identifier of something is utterly impossible.
Talk about something? Talk about cars? Which car? A toyota? A ford?
Which is your favorite? the blue ford, the red ford? or the toyota?
the camry? or the thunderbird? the camry with the cd player or the
ford with the cooler body? cooler than what? the camry?
There is no natural key. It all depends on what I'm talking about,
what I'm doing, what my business is doing, what the needs are. All
information is relative.
For example, say I have to make a database on the shelf. On the shelf
I have a red ball, a blue ball, and a green ball.
Now, you or I could understand the context of the situation and walk
into a room and say: "hand me the green.", or just "green", and know
via the context of our conversation, what the object we are discussing
is.
Databases do not know context. We have system management tools that
try and transform a relative world into absolute terms. No wonder it
costs so much money for a big company to just find out how many
widgets it sold. what kind of widget? the red one, the red one I
made last year? the red one that I sold for 10 cents off during
christmas or the ones that I gave away when someone bought two blue
ones during the superbowl...
So I make a table called balls and I set the primary key to be the
color - red, green, or blue. Someone comes along and adds another
green ball to the shelf. In all respects this new green ball is
identical to the first ball, yet, there are four of them. How do I
uniquely identify which ball?
If it doesn't matter, then my database could not drive a robot arm to
grab me a single ball. Which green one? If my robot arm were Windows
based, that means GPF... technote.. Microsoft Robot Arm cannot do this
or that, so it crashes, this feature is by design. So, it does
matter. Maybe I could use the date that I added the ball to the
shelf. Maybe I could use the position of the ball.
Whatever I come up with for a "natural" primary key, I'm ultimately
selecting a single characteristic based upon the whole context on
behalf of the users of my database, and one of those users is bound to
be forced by circumstance to change the context I am working with and
bust my convention for uniqueness.
Any "natural key" I select, I'm making up, just as much as a guy who
uses one, two, three, or four. Neither I nor the guy choosing a
sequence is safe - he's got to contend with some other kind of thing
that somebody else might need to label as "one", and I've got to
contend with some other object being labelled as "ball closer" or
"ball farther". In both cases, the uniqueness of scheme is wrecked.
All keys are a synthetic imposition of the database designer's
understanding of the context in which objects must be unique. Natural
or synthetic, your guess is as good as mine. Maybe someday we will
have systems that painlessly let context be added to new keys to
differentiate them, rather than having to go destroy old ones.
I posit that the only natural key of a single object is the entire
object itself within its context. Because the context contains some
duplication, the entire key is not valid 1NF, and therefor, is not a
key. Natural keys are thus a contradiction in terms.
Perhaps we need to let go of the idea of objects being discrete, and
rather come up with systems that let objects be defined by the context
they are used in...
I wouldn't go as far as saying that the notion is absurd. A time and
latitude/longitude/altitude positioning system is perfectly adequate for
the task. The aim is to find a system, however artificial, where
uniqueness is guaranteed by the laws of physics. If you accept the
many-worlds hypothesis then even this doesn't guarantee uniqueness.
My point is that there is a spectrum from true natural keys (which are
rare) to completely useless keys. The SSN is close to being a natural
key because duplicates only exist in rare pathological conditions.
That's because there is an effective system that tries hard to maintain
uniqueness. Near the opposite end of the spectrum is the
forename/surname pair. Within a small group it will work most of the
time.
For a database designer the key issues are the probability of getting a
duplicate, and the amount of effort needed to handle the clash. In the
case of the SSN (if you can guarantee to get the true value) duplicates
will be so rare that the additional effort to program around duplicates
is unjustifiable.
At the opposite end of the spectrum you may have an algorithm to
generate a loginID from the users' names. When the second John Smith
joins the company a simple algorithm will fail.
Hmmm. I think you were replying in a confused way to the confused
thesis I had presented, and as a result the above is as confusing to
me as I'm sure that my writing is confusing to you. So...
I do actually agree that latitude / longitude would be appropriate to
the task. The operative phrase is "to the task". We say that task
gives the latitude, longitude context. Observe that the art of
database design separates task from the tools, there's no context in a
table of X,Y coordinates. To say that X,Y is better than sequence
number is incorrect because it implies that X,Y closer fits the task.
The real primary key of something in a database design is really
TASK+key, so TASK+X,Y or TASK+sequence really aren't all that
different, because the major distinguishing factor in the key is
completely outside of the design of your database... It's assumed in a
design that the thing you are designing is for a particular task, yet,
it's impact on the longevitiy of your systems is huge!
I would say that because TASK is so big in the overall unique
identification of TASK+key for a database object, the choice of key
being natural or autogenerated is really relevant. TASK is always
relative, in and of itself, so no matter what you pick key to be, it
is going to be ultimately relative too.
> For a database designer the key issues are the probability of getting a
> duplicate, and the amount of effort needed to handle the clash. In the
> case of the SSN (if you can guarantee to get the true value) duplicates
> will be so rare that the additional effort to program around duplicates
> is unjustifiable.
Saying that SSN is suitable to be a primary key does not automatically
make it a preferred key because it has some pseudo real world mapping,
and an id doesn't. It only makes it a convenient key for those humans
who happen to know everyone's social security number. The rest of us
go by names, which are duplicate, or names and addresses, which
violate 1nf.?