Until recently, my kneejerk reaction has been to use surrogate keys almost
exclusively: identity columns or manually-generated integers.
Now that I'm writing stored procedures instead of JET queries, I'm starting to
rethink my design habits.
What do "real" SQL Server developers do for things like tax codes, short vendor
names, short department names, short group names, states of residence, hours
types, and so forth? By "short" I mean names of, say, 15 characters or less.
On one hand, I hear that Oracle doesn't like joins on text keys - so I guess the
surrogate strategy favors migration.
On the other hand, when testing large stored procedures, I can see that all
those joins degrade response time noticeably. Also, it would seem that
cascading updates could give the same ability to change the text of something
like "REG HRS" to "Regular" as long as it wasn't done as a matter of course.
On the other other hand, when doing something like a pivot presentation of hours
worked into seven columns - one for each day of the week - surrogate keys can be
convenient when boiling many records down to one. You can just, for instance,
do a SUM on Tax Code ID to make all the tax codes come out in the pivoted data.
I would still, for instance, have a tlkpDepartment table: both for populating
combo boxes and to show the user that "T&D" is really "Transmission &
Distribution".
Finally, (pure FUD/CYA, I guess), I'd guess that if somebody brought a data
architect in to appraise a DB with lots of natural keys they'd write a critical
observation to the effect that data was not properly normalized.
What do most SQL Server developers do?
-----------------------
Pete Cresswell
Just an aside here, but neither SQL Server nor Jet support true surrogate
keys. A true surrogate key is an internally-generated hash code used as
shorthand to refer to a compound key. In the Access and SQL Server world we
have to fake it with an IDENTITY column or such, (I think this is properly
called an "Artificial" Key). Anyhow, that's all semantics.
I'm sure you know all the debates this topic has generated. But there's one
angle you may not have considered if you've recently moved to SQL Server and
that is clustering. In Jet the PK is always the clustered index. You have no
choice in the matter. In SQL Server on the other hand, you get to choose the
clustering key. This is important, since the clustered key defines physical
sort order on disk. In Jet one of the strongest arguments against an
Autonumber PK is that a unique, monotonically increasing integer makes a
poor choice for a clustered index. Firstly, it's way too selective to be
useful (how often are you going to sort on it?). Secondly, since new records are
added in close proximity, this can lead to contention or "hot spots" with
many users trying to access the most recent records. In SQL Server things
are different - you can avoid this problem by clustering on a non-primary
key. Here's an example
--------------
USE tempdb
GO
CREATE TABLE dbo.customers
(
    cust_id   INT IDENTITY(1,1) NOT NULL
              PRIMARY KEY NONCLUSTERED,
    cust_name VARCHAR(50) NOT NULL
)
GO

CREATE UNIQUE CLUSTERED INDEX customers_UK
    ON dbo.customers(cust_name)
GO
--------------
This design means you still get the advantages of using an artificial
(surrogate) key, without the disadvantages. If customer Shirley Jones gets
married and becomes "Shirley Smith" you just go and change her name.
Simple - as long as there's no other "Shirley Smith" :-)
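For instance, with the table above the rename is a single-row update and nothing else has to be touched (sample data invented):
--------------
INSERT dbo.customers (cust_name) VALUES ('Shirley Jones')
GO
-- she marries; one row in one table changes, and every foreign key
-- elsewhere still points at the unchanged cust_id
UPDATE dbo.customers
SET cust_name = 'Shirley Smith'
WHERE cust_name = 'Shirley Jones'
GO
--------------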
You only get one clustered index per table, and choosing it wisely is one of
the biggest performance boosts you can give to your application. My
experience is that going through many joins is an almost negligible
performance hit, provided you have chosen your indexes correctly (I have
some projects with 10 or so JOINs and I've never found the joins themselves
to be a significant contributor to poor performance).
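In practice, "chosen correctly" mostly means indexing the columns you join on. A hedged sketch continuing the customers example (the orders table and index name are invented):
--------------
CREATE TABLE dbo.orders
(
    order_id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    cust_id    INT NOT NULL REFERENCES dbo.customers(cust_id),
    order_date DATETIME NOT NULL
)
GO
-- index the foreign key column so the join back to customers
-- is an index seek rather than a scan
CREATE NONCLUSTERED INDEX orders_cust_id
    ON dbo.orders(cust_id)
GO
--------------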
There is really no one correct approach.
Purists such as Joe Celko will argue for natural keys, especially since SQL
Server 2000 supports the ON UPDATE CASCADE and ON DELETE CASCADE options when
defining a foreign key.
The usual argument against natural keys is that even with the CASCADE
options, if natural keys are used to link tables, and the primary key of a
table changes, there is an UPDATE on each of the linked tables ... and
UPDATEs are not free of charge. They do require locks and resources to
accomplish.
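For reference, in SQL Server 2000 the cascade option is declared on the foreign key side; a minimal sketch (table and column names invented, not from the thread):
--------------
CREATE TABLE dbo.tlkpState
(
    StateCode CHAR(2) NOT NULL PRIMARY KEY,
    StateName VARCHAR(40) NOT NULL
)
GO
CREATE TABLE dbo.tblEmployee
(
    EmployeeID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    EmployeeName VARCHAR(50) NOT NULL,
    StateCode    CHAR(2) NOT NULL
        REFERENCES dbo.tlkpState(StateCode)
        ON UPDATE CASCADE  -- changing tlkpState.StateCode rewrites every matching row here
)
GO
--------------
The convenience is declarative, but the UPDATEs still happen.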
IMHO, if you KNOW that you have fairly stable primary keys (ones that are
VERY unlikely to be changing ... the US Postal Service state abbreviations
being my favorite example), then go with natural keys. If your primary keys
are subject to change then you need to evaluate how often they might change,
and the potential impact (performance wise) on your application if they do.
If you are designing an application that will handle 5 transactions per
minute, it's all well and good to use natural keys. If your application
however has to handle 5000 transactions per second, then go with surrogate
keys.
Of course, as you noted, I'd be very unlikely to design a primary key that
is of significant length simply for the performance hit of matching on, say,
40 bytes, rather than 4 bytes (assuming something like an int IDENTITY as a
primary key) ... but again, even here, the performance hit on an application
handling 5 transactions per minute is probably going to be unnoticeable,
while it might definitely negatively impact an application that has to
handle 5000 transactions per second.
> What do most SQL Server developers do?
Bottom line ... most developers probably approach this somewhat from rote
... either always using natural keys or always using surrogate keys, because
this tends to be a "philosophical" issue for most developers rather than
just a technical one. But at least for me, it's like most anything else ...
evaluate the circumstances, factor in one's personal preferences, and arrive
at a decision based upon the facts at hand :-)
-------------------------------------------
BP Margolin
Please reply only to the newsgroups.
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc.) which
can be cut and pasted into Query Analyzer is appreciated.
"(Pete Cresswell)" <x@y.z> wrote in message
news:d8tjuuord47cvp1ah...@4ax.com...
> (I think this is properly called an "Artificial" Key)
Joe Celko insists upon the term "Artificial" key, and insists that even
though the rest of the world calls it a "Surrogate" key, that the rest of
the world is wrong, and he's right :-)
> In SQL Server things
> are different - you can avoid this problem by clustering on a non-primary
> key.
While you are correct that a surrogate key is frequently a poor candidate
for a clustered index, I think you have fallen for more of the crap that Joe
Celko has been known to post about SQL Server. Joe has been known to
repeatedly post about the performance hit of "hot spots" in SQL Server. His
statements are based upon the SQL Server 6.5 and earlier architecture, where
page locks were the default. Since SQL Server 7.0, row locks are the
default. Thus a "hot spot" can actually increase performance by causing the
relevant data to remain cached for longer periods of time, thereby reducing
physical I/O, and with minimal increased blocking or contention issues.
However you do make a good point about clustered indexes. Changing the value
of a column in a clustered index is very likely to cause the row to move
from one data page to another, and possibly cause a page split as well. Of
course, the "logical" extension is therefore that because of this a
clustered surrogate key ... which never gets changed ... is a good choice
for a clustered index :-) Of course, there are a number of reasons that
frequently overwhelm the logic of having a clustered surrogate key. But it
is an excellent example of the type of analysis that really needs to be done
in order to "tune" a database for the best performance.
-------------------------------------------
BP Margolin
Please reply only to the newsgroups.
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc.) which
can be cut and pasted into Query Analyzer is appreciated.
"John Winterbottom" <jo...@assaynet.co.m> wrote in message
news:hqtG9.36979$lj.7...@read1.cgocable.net...
I'm thinking about modifying my own rote to a hybrid configuration.
To wit:
State of residence e.g. PA = Pennsylvania: always a natural key.
Hours type e.g. REG=Straight time: ditto.
Local tax code e.g. PHL=Philadelphia: ditto.
Rule-of-thumb: if it is something that is less than 15 characters, is not
volatile in and of itself, and we can save some joins/complexity when presenting
the data in reports and screens, use the natural key. Otherwise surrogate.
i.e. the code PA for Pennsylvania will probably never change and something like
hours type REG would only change under unusual circumstances and would warrant
some sort of batch update in the off hours.
OTOH, I would never even *think* about using a person's name as a primary
key....
Am I on the Good, Right, and Holy path?
-----------------------
Pete Cresswell
What was your own rote method previously? In the original post you asked
about autonumber fields vs. strings for foreign key relationships. The
short answer is that you almost never use strings to join related fields,
though you often index on them and use them in composite primary keys.
|
| To wit:
|
| State of residence e.g. PA = Pennsylvania: always a natural key.
Not for a join. The primary key would be the string, but you would
normally still join on a long integer field (an autonumber in Access.)
|
| Hours type e.g. REG=Straight time: ditto.
|
| Local tax code e.g. PHL=Philadelphia: ditto.
Disagree in JET or SQL for all three.
|
| Rule-of-thumb: if it is something that is less than 15 characters, is not
| volatile in and of itself, and we can save some joins/complexity when
| presenting the data in reports and screens, use the natural key. Otherwise
| surrogate.
The number of characters is not the issue. Volatility doesn't really matter
either. I would not recommend using strings in foreign-key relationships
even if they are static and one character.
Take this from the previous posts in this thread (and the thousands of
similar threads before it):
1. Don't cluster on an autonumber or like field (this is obviously
self-defeating.) You accomplished this in Access by making the PK a string
or composite and the identity field an autonumber.
2. Don't join on a string field (remember that an identity field used in a
foreign-key relationship need not be the primary key nor clustered and
should never be the latter.) SQL Server 2000 supports cascading updates, but
with an obvious performance hit, so it is a moot point.
3. Even though clustered indexes that change infrequently are best (to
avoid page fragmentation), the fact that autonumber fields never change does
not make them good candidates for clustered indexes (for obvious reasons),
though it makes them absolutely splendid for foreign-key relationships (as
they are best kept static to avoid using cascading updates.)
That's about it.
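Pulling rules 1-3 together, a minimal sketch might look like this (all names invented, using the tlkp/tbl convention that shows up elsewhere in the thread):
--------------
-- lookup table: natural key still enforced, identity used only for relationships
CREATE TABLE dbo.tlkpDepartment
(
    DeptID   INT IDENTITY(1,1) NOT NULL UNIQUE,
    DeptCode VARCHAR(15) NOT NULL PRIMARY KEY NONCLUSTERED,
    DeptName VARCHAR(50) NOT NULL
)
GO
-- child table joins on the integer, never on the string (rule 2)
CREATE TABLE dbo.tblAssignment
(
    AssignmentID INT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    DeptID       INT NOT NULL REFERENCES dbo.tlkpDepartment(DeptID),
    AssignedDate DATETIME NOT NULL
)
GO
-- cluster on something you actually range-scan, not the identity (rules 1 and 3)
CREATE CLUSTERED INDEX tblAssignment_AssignedDate
    ON dbo.tblAssignment(AssignedDate)
GO
--------------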
[Snip]
|
| OTOH, I would never even *think* about using a person's name as a primary
| key....
Why not? See above (and previous posts and previous threads, etc.)
|
| Am I on the Good, Right, and Holy path?
Not quite yet.
I always use a blind number. Usually an identity column, but a manually-entered
int when it suits the situation.
Time for a little dancing: I think I left something out of my spiel on "15
chars or less, non-volatile, and so forth".
In the back of my mind was that these are fields that referential integrity is
enforced on, but that are never joined on.
Tax codes, for instance:
My "old" way:
===============================
tlkpHoursType
HoursTypeID   HoursTypeShort   HoursTypeLong
---------------------------------------------------------------------
1             REG              Regular Time
2             OT               Overtime

tblHour
HoursWorked   HoursWorkedDate   HoursTypeID
My proposed "new" way:
===============================
tlkpHoursType
HoursTypeCode   HoursTypeDescription
---------------------------------------------------------------------
REG             Regular Time
OT              Overtime

tblHour
HoursWorked   HoursWorkedDate   HoursTypeCode
===============================
Note that one would never actually join on HoursTypeCode except in some unusual
situation where somebody wanted to see HoursTypeDescription on a report.
Instead, HoursTypeDescription would be used 99% of the time just as a clarifier
in dropdown lists.
Referential integrity would, however, need to be enforced on
tblHour.HoursTypeCode against tlkpHoursType.HoursTypeCode.
Does this work or would the RI enforcement run up against the same problems as
joins?
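For concreteness, the proposed layout in DDL would be roughly the following (datatypes are assumptions, not from the post):
--------------
CREATE TABLE dbo.tlkpHoursType
(
    HoursTypeCode        VARCHAR(15) NOT NULL PRIMARY KEY,
    HoursTypeDescription VARCHAR(50) NOT NULL
)
GO
CREATE TABLE dbo.tblHour
(
    HoursWorked     DECIMAL(5,2) NOT NULL,
    HoursWorkedDate DATETIME NOT NULL,
    HoursTypeCode   VARCHAR(15) NOT NULL
        REFERENCES dbo.tlkpHoursType(HoursTypeCode)  -- the RI in question
)
GO
--------------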
The immediate situation that's got me going on this is a time reporting/project
tracking system that I am about 80% done on.
When I run an iterative test of the code that loads data for a timesheet into a
printed report or on to the user's screen there's a huge difference between
tblHour.HoursTypeID-->tlkpHoursType and having the HoursTypeCode already
in tblHour.
To be sure, there's another player with the same situation: a tax code.
Using 100% T-SQL with the tblHour.HoursTypeID-->tlkpHoursType scheme takes 2,400
seconds for 500 iterations. Whereas the same 500 iterations with HoursTypeCode
and TaxCode already embedded in tblHour takes only 285 seconds - almost a
10-fold difference.
Maybe I'm just joining wrong or something - but everything's indexed and it
seems to stand to reason that 20+ joins (three for each day of the week: tblHour,
tlkpHoursType, tlkpTaxCode) would take more time than only 7 joins.
Does any of this sound even remotely reasonable?
-----------------------
Pete Cresswell
So why even create the other table? Why not leave it to the odd report author
to create the verbiage in the report?
|
| Instead, HoursTypeDescription would be used 99% of the time just as a
| clarifier in dropdown lists.
|
| Referential integrity would, however, need to be enforced on
| tblHour.HoursTypeCode against tlkpHoursType.HoursTypeCode.
|
| Does this work or would the RI enforcement run up against the same
| problems as joins?
|
|
The latter.
| The immediate situation that's got me going on this is a time
| reporting/project tracking system that I am about 80% done on.
|
| When I run an iterative test of the code that loads data for a timesheet
| into a printed report or on to the user's screen there's a huge difference
| between tblHour.HoursTypeID-->tlkpHoursType and having the HoursTypeCode
| already in tblHour.
|
| To be sure, there's another player with the same situation: a tax code.
I normally use identity fields with them as well.
|
|
| Using 100% T-SQL with the tblHour.HoursTypeID-->tlkpHoursType scheme takes
| 2,400 seconds for 500 iterations. Whereas the same 500 iterations with
| HoursTypeCode and TaxCode already embedded in tblHour takes only 285
| seconds - almost a 10-fold difference.
|
| Maybe I'm just joining wrong or something - but everything's indexed and
| it seems to stand to reason that 20+ joins (three for each day of the week:
| tblHour, tlkpHoursType, tlkpTaxCode) would take more time than only 7 joins.
|
| Does any of this sound even remotely reasonable?
|
So does each row contain a week? There's the problem in my opinion. If it
is 80% done, you can safely proceed without using surrogate keys for your
tax codes. It's not like it will destroy the program if a two row lookup
table is used or not.
|
|
| -----------------------
| Pete Cresswell
Thanks, that's very good to know. Yes, you're right, I got the above from
the Joe C school of database design :-)
To populate dropdowns in a way that allows adding new values without code
changes.
>|
>| Instead, HoursTypeDescription would be used 99% of the time just as a
>| clarifier in dropdown lists.
>|
>| Referential integrity would, however, need to be enforced on
>| tblHour.HoursTypeCode against tlkpHoursType.HoursTypeCode.
>|
>| Does this work or would the RI enforcement run up against the same
>| problems as joins?
>The latter.
Unless you say otherwise, I'll take that as a "Yes, RI enforcement will run up
against the same problem as joins on alphanumeric fields".
This evening, I'm going to try an end run around the join/performance problem.
I'll ADO the pivoted timesheet lines with just MonHoursID, TueHoursID,
WedHoursID and so forth, then also ADO down to the app little tables containing the
original tblHour records, the tlkpTaxCode records, and the tlkpHoursType
records.
Then I'll write those to a local DAO/JET work table and do my monster join on
the user's machine - hoping that the very few records involved will make it go
pretty fast....stay tuned...
-----------------------
Pete Cresswell
Right, but if it is an island, you could enter all sorts of hours codes in
the main table unless you want to use cascading updates or triggers.
|
|
| >|
| >| Instead, HoursTypeDescription would be used 99% of the time just as a
| >| clarifier in dropdown lists.
| >|
| >| Referential integrity would, however, need to be enforced on
| >| tblHour.HoursTypeCode against tlkpHoursType.HoursTypeCode.
| >|
| >| Does this work or would the RI enforcement run up against the same
| >| problems as joins?
|
| >The latter.
|
| Unless you say otherwise, I'll take that as a "Yes, RI enforcement will run
| up against the same problem as joins on alphanumeric fields".
I wouldn't relate two string fields at all. I don't think I ever have. If
you do, cascading updates are necessary unless the table is static.
I settled on a compromise position.
I kept the relationships between tblHour and tlkpHoursType/tlkpTaxCode based on
integer surrogate keys.
What I did was add two de-normalized fields to tblHour: TaxCode and HoursType.
The cost of populating these when the timesheet is saved is close to nil. The
benefit of having them there when it's time to download a timesheet is a speed
increase of over ten times (i.e. 1,000 percent).
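The population step can be as simple as something like this at save time - a sketch only; the procedure name, the TimesheetID column and the tlkpTaxCode column names are invented:
--------------
CREATE PROCEDURE dbo.usp_DenormalizeTimesheet
    @TimesheetID INT
AS
-- copy the lookup codes into the denormalized columns
-- for the rows belonging to the timesheet just saved
UPDATE h
SET    h.HoursType = ht.HoursTypeShort,
       h.TaxCode   = tc.TaxCodeShort
FROM   dbo.tblHour h
       JOIN dbo.tlkpHoursType ht ON ht.HoursTypeID = h.HoursTypeID
       JOIN dbo.tlkpTaxCode   tc ON tc.TaxCodeID   = h.TaxCodeID
WHERE  h.TimesheetID = @TimesheetID
GO
--------------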
I tried just downloading the codes and then doing the joins locally in a DAO
work table - and it helped a lot, in fact it got me more than half the benefit
of the denormalized approach. But it also created a situation that I *think*
(don't know much about HTML) would have made it difficult for somebody to write
an HTML page to retrieve the timesheet's info. This way it's still happening
100% on the server, the speed is pretty good, referential integrity is mostly
guaranteed, and there's minimal hassle supporting the denormalized fields.
-----------------------
Pete Cresswell
There you go. Denormalize slightly for speed.
|
| The cost of populating these when the timesheet is saved is close to nil.
| The benefit of having them there when it's time to download a timesheet is
| a speed increase of over ten times (i.e. 1,000 percent).
|
| I tried just downloading the codes and then doing the joins locally in a
| DAO work table - and it helped a lot, in fact it got me more than half the
| benefit of the denormalized approach. But it also created a situation that
| I *think* (don't know much about HTML) would have made it difficult for
| somebody to write an HTML page to retrieve the timesheet's info. This way
| it's still happening 100% on the server, the speed is pretty good,
| referential integrity is mostly guaranteed, and there's minimal hassle
| supporting the denormalized fields.
As long as you don't rename an hours or tax code, you are fine on the RI
front. Sounds cool to me.
| -----------------------
| Pete Cresswell
> "BP Margolin" <bpm...@attglobal.net> wrote in message
> news:3dea6...@news1.prserv.net...
>> John,
>>
>>> (I think this is properly called an "Artificial" Key)
>>
>> Joe Celko insists upon the term "Artificial" key, and insists that
>> even though the rest of the world calls it a "Surrogate" key, that
>> the rest of the world is wrong, and he's right :-)
>>
> Thanks, that's very good to know. Yes, you're right, I got the
> above from the Joe C school of database design :-)
C.J. Date offers a great explanation of what the rest of the world
understands as a surrogate key.
From _An Introduction to Database Systems_, 7th edition, Addison Wesley
Longman, 1999, p. 444:
<excerpt>
"Surrogate keys are keys in the usual relational sense but have the
following specific properties:
"* They always involve exactly one attribute.
"* Their values serve solely as surrogates (hence the name) for the
entities they stand for. In other words, such values serve merely to
represent the fact that the corresponding entities exist—they carry no
additional information or meaning whatsoever.
"* When a new entity is inserted into the database, it is given a
surrogate key value that has never been used before and will never be
used again, even if the entity in question is subsequently deleted.
"Ideally, surrogate key values would be system-generated, but whether
they are system- or user-generated has nothing to do with the basic idea
of surrogate keys as such.
"It is worth emphasizing that surrogates are not (as some writers seem
to think) the same thing as "tuple IDs." For one thing—to state the
obvious—tuple IDs identify tuples and surrogates identify entities, and
there is certainly nothing like a one-to-one correspondence between
tuples and entities (think of tuple IDs for derived tuples in
particular). Furthermore, tuple IDs have performance connotations, while
surrogates do not; access to a tuple via its tuple ID is usually assumed
to be fast (we are assuming here that tuples—at least, tuples in base
relations—map fairly directly to physical storage, as is in fact the
case in most of today's products). Also, tuple IDs are usually concealed
from the user, while surrogates must not be (because of The Information
Principle); in other words, it is not possible to store a tuple ID as an
attribute value, while it certainly is possible to store a surrogate as
an attribute value.
"In a nutshell: Surrogates are a logical concept; tuple IDs are a
physical concept."
</excerpt>
Linda
Bas
"BP Margolin" <bpm...@attglobal.net> wrote in message
news:3dea6...@news1.prserv.net...
> I always use a blind number. Usually an identity column, but a manually-entered
> int when it suits the situation.
when is that?
> Tax codes, for instance:
> My proposed "new" way:
> ===============================
> tlkpHoursType
> HoursTypeCode HoursTypeDescription
> ---------------------------------------------------------------------
> REG Regular Time
> OT Overtime
Ok. Just for argument's sake, let's say a current valid value is
'FTO' for Flexible time off, and your company gets bought out and they
use "PTO".
Your system gets a little ugly. For sure fixable, but an example of
how something that is never changing ends up changing.
> Using 100% T-SQL with the tblHourHoursTypeID-->tlkpHoursType scheme takes 2,400
> seconds for 500 iterations.
How many records do you have? 5 seconds seems like a lot for each
iteration. I guess I question this difference in response times. I've
never seen this degree of difference unless the engine is choosing the
wrong index.
> Maybe I'm just joining wrong or something - but everything's indexed and it
> seems to stand to reason that 20+ joins (three for each weekday: tblHour,
> tlkpHoursType, tlkpTaxCode) would take more time than only 7 joins.
The joins themselves for lookup tables IMO take very little time, if
the main file is being searched efficiently.
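One way to check, for what it's worth: run a single iteration in Query Analyzer with the plan and I/O output switched on and look for table scans or an unexpected index choice (usp_LoadTimesheet below is an invented stand-in for whatever loads the timesheet):
--------------
-- show the estimated plan without executing
SET SHOWPLAN_TEXT ON
GO
EXEC dbo.usp_LoadTimesheet @TimesheetID = 1
GO
SET SHOWPLAN_TEXT OFF
GO
-- or execute it and compare logical reads per table instead
SET STATISTICS IO ON
GO
EXEC dbo.usp_LoadTimesheet @TimesheetID = 1
GO
--------------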
Here is my experience on the natural key issue.
I once was a FIRM believer in the natural key, and thought that for
efficiency it made total sense. I ended up working for 18 months on a
really normalized system where they used surrogate keys exclusively
(eg a surrogate key invoice id with an invoice number already there).
I did not design the system, and when I started I hated it because to
do anything was a 15 table join.
But, over time I grew to love it.
Yeah, it was a 15 table join to do even the littlest anything, but
OTOH you could do EVERYTHING with fairly simple joins, and the names
all made sense and I didn't have to hunt through crap to remember
which key values were the "special values" that got "special
treatment".
Today, I design systems using identity fields with a rigid adherence to
a naming standard for those identity fields.
So it may be a teeny bit faster to have fewer lookup tables, but when
it comes time to maintain it, and for longer term stability, I think
the identity fields actually make things easier.
YMMV.
Doug Miller
Actually, I am supposed to be at a conference in Sweden right now, but
O'Hare Airport had snow problems and I had to turn around. I really
hate sleeping in airports at my age.
Actually I want to define keys in more detail. I hope my ASCII graphics
don't get too messed up, but here is a little chart to start discussions:
                             natural   artificial   unique-ifier   surrogate
=============================================================================
Constructed from attributes |
in the reality              |
of the data model           |    Y          N             N            Y
                            |
verifiable in reality       |    Y          N             N            N
                            |
verifiable in itself        |    Y          Y             N            N
                            |
visible to the user         |    Y          Y             Y            N
A natural key is a subset of attributes which occur in a table and act
as a unique identifier. They are seen by the user. You can go to the
external reality and verify them.
Example: UPC codes on consumer goods (read the package barcode),
geographic co-ordinates (get a GPS).
An artificial key is "an extra attribute" added to the table and which
is seen by the user. It does not exist in the external reality, but can
be verified for syntax or check digits inside itself.
Example: the open codes in the UPC scheme which a user can assign to his
own local stuff. The barcode check digits still work, but you have to
verify them inside your own enterprise. I did a whole chapter on how
to design encoding schemes in DATA & DATABASES.
A "unique-ifier" (please get me a better word! "meaningless generated
key"? anything!) is not based on any attributes in the data model and is
exposed to the user. There is no way to predict it or verify it. The
system obtains a value thru some physical process totally unrelated to
the logical data model.
Example: IDENTITY columns, other auto-numbering devices.
A surrogate key is system generated to replace the actual key behind the
covers where the user never sees it. It is based on attributes in the
table.
Example: Teradata hashing algorithms.
This grid is a good place to start to talk about classifying types of
keys. There are 16 possible combinations of 'yes' and 'no' and I have
only mentioned four of them.
Date never talks about relating a table back to reality or about data
verification. His view of databases is very abstract and
self-contained; have you ever seen him write a CHECK() constraint or
DEFAULT clause in his books?
--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.
> Actually I want to define keys in more detail. I hope my ASCII graphics
> don't get too messed up, but here is a little chart to start discussions:
>                              natural   artificial   unique-ifier   surrogate
> =============================================================================
> Constructed from attributes |
> in the reality              |
> of the data model           |    Y          N             N            Y
>                             |
> verifiable in reality       |    Y          N             N            N
>                             |
> verifiable in itself        |    Y          Y             N            N
>                             |
> visible to the user         |    Y          Y             Y            N
Hmmmmm.
How about looking at "Guaranteed Unique" and "Optimal Physical
Performance"?
What are the advantages of Surrogate over the uniquifier?
If you can show the uniquifier, why can't you show them the Surrogate?
> A natural key is a subset of attributes which occur in a table and act
> as a unique identifier. They are seen by the user. You can go to the
> external reality and verify them.
Is this important?
> An artificial key is "an extra attribute" added to the table and which
> is seen by the user. It does not exist in the external reality, but can
> be verified for syntax or check digits inside itself.
>
> Example: the open codes in the UPC scheme which a user can assign to his
> own local stuff. The barcode check digits still work, but you have to
> verify them inside your own enterprise. I did a whole chapter on how
> to design encoding schemes in DATA & DATABASES.
But when you are all done, is this not just a way to have the users
type in a uniquifier? And I would think the odds of these changing
over time would be higher than some other key choices.
>
> A "unique-ifier" (please get me a better word! "meaningless generated
> key"? anything!) is not based on any attributes in the data model and is
> exposed to the user. There is no way to predict it or verify it. The
> system obtains a value thru some physical process totally unrelated to
> the logical data model.
I'm reading into the above that you think there is something bad in
the above. But I guess I'm not seeing the bad part?
There are advantages of a uniquifier. You aren't mentioning those?
> A surrogate key is system generated to replace the actual key behind the
> covers where the user never sees it. It is based on attributes in the
> table.
>
> Example: Teradata hashing algorithms.
Ick. Advantages are?
Disadvantages might be that they can still change depending on data, they
are probably longer than an integer, they are a pain for the DBA to type
in when going to look for stuff, they do not guarantee uniqueness, and they
require more processing to create and dig out.
????????
> Date never talks about relating a table back to reality or about data
> verification. His view of databases is very abstract and
> self-contained; have you ever seen him write a CHECK() constraint or
> DEFAULT clause in his books?
So I'm real. I know my phone number. I know my name. A salesperson can
look "me" up in his database based upon this stuff. Is it really
important how the database stores internal references to "me" as long
as "me" can be found, and all the information about "me" can be
retrieved quickly (quickEST?) ?
-doug miller
So I reread my post today, and was appalled at the tone that came
across.
My goal is to become a better database person. My tactic was to
question in order to learn WHY and WHAT the thoughts are behind all
this.
I really meant to have a discussion, not attack.
My apologies for how this came across.
-Doug Miller
Uniqueness is a basic requirement for any key or "unique-ifier", so it
does not belong in the chart; it is assumed. What might be there is a
consideration of re-use; is this key value ever going to be re-used for
a different entity after the original entity is gone? (yeah, yeah, that
means it wasn't a real key to start with since by definition a
particular set of attribute values defines one and only one entity).
"Optimal Physical Performance" does not belong in the chart because of
the word "physical"; implementation is never part of a logical
definition. I grant that a lot of Newbies cannot tell the difference
between logical and physical models, but ....
>> If you can show the uniquifier, why can't you show them the
Surrogate? <<
I think you are missing the point of the chart. I am trying to build a
taxonomy based on logical characteristics of these approaches. Being
exposed and being hidden are very different at a logical level. At the
physical level, some hacker could do a core dump and look at everything
down to the bits!
>> [natural keys: You can go to the external reality and verify them.]
Is this important? <<
Gee, a database that holds verifiably accurate data! What a concept!
Yes, it is important. If I don't have to have the right answers and be
able to prove they are the right answers, then I can make the damn thing
run like a bat out of hell.
>> [open UPC codes] But when you are all done, is this not just a way to
have the users type in a uniquifier? <<
The open UPC codes have a standard check digit built in so they can be
scanned. They have a standard way of identifying themselves as a local
id code to databases outside the local system. Things like IDENTITY or
other proprietary auto-increments do not have those fundamental
properties.
>> And I would think the odds of these changing over time would be
higher than some other key choices. <<
They are only as good as the company's internal policies about data
integrity.
>> There are advantages of a uniquifier. You aren't mentioning those? <<
Name some of those advantages, please; this is a discussion. All I can
come up with is that a newbie does not have to research industry
standard id codes, or think about how to create such things, because of
the vendor extensions.
>> [Teradata hashing] Ick. Advantages are? <<
1) The hashing and re-hashing are hidden from the programmer. You use
the natural keys; the system computes the hash under the covers. It
also re-orgs the database and does a lot of other housekeeping for you.
2) MASSIVELY parallel search algorithms are possible. Each hash bucket
can get a processor because their hashing creates a partitioning of the
set of data elements.
3) No data element is more than 2 probes; 70-80% are direct access.
Compare this to a B-tree index with a Petabyte sized table. If more than
15-19% of a table is returned from a query, then a table scan is preferred
over an index.
4) If you build a multi-column hash in Teradata, you get the effect of a
hash on all permutations of those columns. That is; HASH(a,b,c) implies
HASH(a), HASH(b), HASH(c), HASH(a,b), HASH (b,a), HASH(c,b,a), etc.
Compare this to CREATE INDEX Foo ON Bar(a,b,c) which implies only
INDEX(a,b,c), INDEX(a,b) and INDEX(a).
Remember that Teradata is for data warehouses, not OLTP.
>> So I'm real ... <<
... or a clever bot sent to this newsgroup to plug the latest movie from
Sony Pictures!!
>> A salesperson can look "me" up in his database based upon this stuff.
<<
Only if the database has that information. And only if that information
narrows the universe down to you with some degree of certainty. That's
why they ask you for your mother's maiden name and "who is your favorite
stooge?" (Moe! Same as everyone else!)
>> Is it really important how the database stores internal references to
"me" as long as "me" can be found, and all the information about "me"
can be retrieved quickly? <<
I know what you are saying. But "accurate" is more important than
"quick" and some internal schemes are error prone by their nature.
If the access method is exposed to a user, a user will screw it up.
"Against invincible stupidity, not even the gods themselves can prevail"
-- Roman proverb.
If it is redundant (e.g. a natural key and an artificial key for the same
data element); at best, they are a waste of storage and speed and at
worst, they will destroy data integrity.
Nobody ever apologizes to me in this newsgroup. I'm the guy with the
reputation for frying newbies.
And I actually like the adversarial approach to discussions -- formal
debates, jury trials, bar room fights -- the things that make Western
civilization work!!
So, one, I noted that you weren't offended by me.
two, please read my comments under the thread Re: SP Adding
Parent/Child recs: How to capture each parent's Identy val for
children?
Three, I'm learning lots here, which is my goal.
Hope you are finding me entertaining, or whatever your goal is!!!!!!
see below:
> "Optimal Physical Performance" does not belong in the chart because of
> the word "physical"; implementation is never part of a logical
> definition.
Ok. I'm talking about how to create a table to optimally meet business
requirements. I'm meaning which fields get created in the SQL Server
database to do this.
You are talking about the logical representation.
How about for your logical representation, we all agree that everyone
needs to think of a row being uniquely defined by natural keys, but
when it comes time to actually build the thing in EM, we use Identity
keys cuz it works better?
;-) ;-)
> I grant that a lot of Newbies cannot tell the difference
> between logical and physical models, but ....
Ahhhh. See my previous paragraph comments.
Are you building a logical model, or a business implementation?
> >> If you can show the uniquifier, why can't you show them the
> Surrogate? <<
>
> I think you are missing the point of the chart. I am trying to build a
> taxonomy based on logical characteristics of these approaches. Being
> exposed and being hidden are very different at a logical level. At the
> physical level, some hacker could do a core dump and look at everything
> down to the bits!
So maybe there are 3 levels. There is the optimal logical layer upon
which the "business implementation" is built. And there is the physical
disk junk which is how the database stores all this stuff.
The "business implementation" layer is what you actually type into
Enterprise Manager or SQL editor to create the tables.
My BI layer needs to take into account the logical stuff, the physical
stuff, the ability of the support guys to support this thing 4 years
from now, and still be flexible, fast, and understandable.
> Gee, a database that holds verifiably accurate data! What a concept!
> Yes, it is important. If I don't have to have the right answers and be
> able to prove they are the right answers, then I can make the damn thing
> run like a bat out of hell.
Is it enough to be able to defend that the data is not wrong? For
instance, it seems to me that most of the data proofs I've done are
easiest explained to the end users by showing them what they typed in,
how I kept the integrity of it, and how it is still the same as when
they typed it in.
Most go blank when you start talking set theory, let alone logical
design.
> >> And I would think the odds of these changing over time would be
> higher then some other key choices. <<
>
> They are only as good as the company's internal policies about data
> integrity.
Ahhhh. So your approach is only as good as a company's internal
policies. My approach is as good as SQL Server's identity
implementation.
I think I have you there. ;-)
> >> There are advantages of a uniqueifer. You aren't mentioning those? <<
>
> Name some of those advantgages, please; this is a discussion. All I can
> come up with is that a newbie does not have to research industry
> standard id codes, or think about how to create such things because the
> vendor extensions.
The use of a uniquifier (identity) allows simple changing of attribute
fields when they change (see internal company policies above) without
having to change the data elements of the children.
The identity is shorter, making for smaller indexes and faster seeks
for children tables.
The identity allows easier programming on the client side - to keep
track of one specific row, you only have to keep track of the
identity, not the columns that make up the natural key.
The identity is easier for me as an administrator, as I only have to
type in the integer to go to a certain row.
The identity makes a database or schema easier to learn - the
uniqueness is obvious, where there might be several unique column
definitions (this one is weak I know.)
> >> [Teradata hashing] Ick. Advantages are? <<
Uh oh. I had a different understanding of how this would be
implemented on sql server.
I'm gonna back pedal on this one - see my question below.
> 2) MASSIVELY parallel search algorithms are possible. Each hash bucket
> can get a processor because their hashing creates a partitioning of the
> set of data elements.
How do I implement this on SQL Server 2000? I did not know it was an
option to change the indexing algorithms in SQL Server 2000.
> >> A salesperson can look "me" up in his database based upon this stuff.
> <<
>
> Only if the database has that information.
Picking Nits, I did say "look "ME" up", I didn't say "Find ME". I was
trying to convey the thought, which I think we agree on, that there is
no guarantee that "ME" is in the database. And I was also trying to
convey that there is no guarantee that "ME" knows my natural key.
So you need a way to search other attributes or columns which describe
me and help me and salesman find me in the database.
> And only if that information
> narrows the universe down to you with some degree of certainty. That's
> why they ask you for your mother's maiden name and "who is your favorite
> stooge?" (Moe! Same as everyone else!)
Sorry, you left yourself open on this one.
I first read (Moe! Same as everyone else!) as
Moi`!
;-) ;-) ;-) ;-)
> I know what you are saying. But "accurate" is more important than
> "quick" and some internal schemes are error prone by their nature.
The use of natural keys does not make your method more accurate. At least
we agree that it is slower.
;-) ;-)
> If the access method is exposed to a user, a user will screw it up.
Balogna.
> "Against invincible stupidity, not even the gods themselves can prevail"
> -- Roman proverb.
> If it is redundant (e.g. a natural key and an artifical key for same
> data element); at best, they are a waste of storage and speed and at
> worst, they will destroy data integrity.
Storage is cheap.
Integer lookups are faster.
If it is more complicated than one field, UI programmers will screw it
up, get the fields wrong, and destroy data integrity.
;-) ;-)
-Doug Miller
drmil...@hotmail.com (Doug Miller) wrote in message news:<178db91.02120...@posting.google.com>...
Anything that makes me think is good at my age. Anything that makes
me go up stairs is bad.
>> I grant that a lot of Newbies cannot tell the difference between
logical and physical models, ... So maybe there are 3 levels. <<
I'd say yes, and the three levels:
1) Logical level, where "SQL the abstract language" and the
relational model live without reference to software or hardware.
2) Implementation level, where "Product X" tried to make that logical
model work; here we have software (which might be tied to hardware or
not).
3) Physical level, where the bits and bytes live; here we have
hardware.
The bad news is that we have "leakage" between the levels. A large
number of SQL products map Tables => files, rows => records and
columns => fields between the logical and implementation levels. Why?
Well, they already had a file system before we had SQL, so they used
it. Then the file system maps directly to physical storage -- I can
look at a block of contigous disk space and see the records that hold
the rows.
This is a terrible way to do SQL. Look at the Nucleus engine (SAND
Technology) and its N-Vectors for a model based on setting up domains,
then representing rows, columns and tables as bits in compressed bit
vectors. The database gets smaller, faster and every column is
automatically fully indexed. No contiguous storage, each value appears
once, etc.
>> Are you building a logical model, or a business implementation? <<
First I build a logical model; if I cannot do that, all is lost. Once
I have the model, I move to the particular product I have to use. But
I try to avoid changing the logical model or destroying portability.
>> My BI layer needs to take into account the logical stuff, the
physical stuff, the ability of the support guys to support this thing
4 years from now, and still be flexible, fast, and understandable. <<
Thank you for mentioning that it has to be maintained! Newbies keep
posting things like "I will never work with anything but <<SQL product
X, version Y>>, so I can use all kinds of proprietary stuff." Then
version (y+1) changes things or Product X goes out of business, or the
company moved to Product Z and they are screwed. Was everyone asleep
when the Software Engineering instructor told them that 80% or more of
the cost of a system is taking care of it after it is deployed? rant,
rant, rant!!
>> Most [users] go blank when you start talking set theory, let alone
logical design. <<
Sooner; a lot of them start that way. So talk in simple sentences and
use ORM techniques. Use names they know for the data elements,
attributes they can identify.
>> Ahhhh. So your approach is only as good as a company's internal
policies. My approach is as good as SQL Server's identity
implementation. <<
The company policies apply to ALL the systems and machines (I might be
an optimist in saying that for some companies, but ...) both current
and future, while an IDENTITY column is limited to the particular
machine upon which a database is currently residing.
If my machine goes down, I can find the internal UPC code we made up
for "Betty's Boiled Buzzard Eggs" somewhere else. I cannot find the
old IDENTITY value anywhere.
>> The use of a uniquifier (identity) allows simple changing of
attribute fields [sic] when they change (see internal company policies
above) without having to change the data elements of the children
[sic]. <<
We gotta get a better word than "uniquifier"!!
In a proper Standard SQL database I would use a CREATE DOMAIN for the
identifiers. All the constraints, default value and datatype
information would be in one and only one place in the schema.
Give a real world example: the 10 digit UPC codes on all your
packages are due to be replaced by the 15 digit GTIN (Global Trade
Identification Number) in a few years. If the company had designed
application databases with columns that all used UPC codes under
different names (product_code, product_id, upc, upc_code, to use some
actual names from a real company), they are in trouble -- enough
trouble to pay Joe Celko sized consulting fees <g>.
What I want to do is execute one ALTER DOMAIN statement and some code
to pack the front of the old UPC codes with zeros for now. The UCC
has a lot of material about the changes, so I can plan for it. Most
SQLs still don't have CREATE DOMAIN statements so I have to do it with
a table of codes and REFERENCES ...ON DELETE CASCADE ON UPDATE CASCADE
and a ton of scripts. But a clean logical design and a good data
dictionary help a lot.
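A rough sketch of the two shapes being described - the domain version is standard SQL (SQL Server doesn't support it), the workaround below it is T-SQL; all names are invented:
--------------
-- Standard SQL: the identifier is defined once and every table uses the domain
-- CREATE DOMAIN ProductCode AS CHAR(15)
--     CHECK (VALUE SIMILAR TO '[0-9]{15}');

-- SQL Server workaround: a table of valid codes plus cascading references
CREATE TABLE dbo.ProductCodes
(
    product_code CHAR(15) NOT NULL PRIMARY KEY
)
GO
CREATE TABLE dbo.Products
(
    product_code CHAR(15) NOT NULL
        REFERENCES dbo.ProductCodes(product_code)
        ON UPDATE CASCADE,  -- the "ton of scripts" shrinks to updates against one table
    product_name VARCHAR(40) NOT NULL
)
GO
--------------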
>> The identity is shorter, making for smaller indexes and faster
seeks for children [sic] tables. <<
Let's make a concrete example, say a city map grid where the x and y
values can be huge. Huge enough that to number each one of the squares
uniquely requires BIGINT. People also use BIGINT to avoid running out
of uniquifiers, but I want to start off really big.
attempt #1:
CREATE TABLE Locations
(location_id BIGINT IDENTITY (1,1) NOT NULL
PRIMARY KEY,
x_ord INTEGER NOT NULL,
y_ord INTEGER NOT NULL,
...);
There was nothing in Locations to prevent me from inserting a billion
identical (x,y) values. And I would never know about the identical
values until I tried to do a COUNT(*) or something; the queries would
be slower, but still correct. Moral to the story: You still have to
declare the natural keys to be unique because of the logical model.
Attempt #2:
CREATE TABLE Locations
(location_id BIGINT IDENTITY (1,1) NOT NULL
PRIMARY KEY,
x_ord INTEGER NOT NULL,
y_ord INTEGER NOT NULL,
...
UNIQUE (x_ord, y_ord));
But why bother with location_id at all? Nobody knows it, it is too
long to type even if you did, and everybody uses (x,y).
Attempt #3:
CREATE TABLE Locations
(x_ord INTEGER NOT NULL,
y_ord INTEGER NOT NULL,
...
PRIMARY KEY (x_ord, y_ord));
one index, smaller physical and logical table, and data that can be
verified.
>> The identity allows easier programming on the client side - to keep
track of one specific row, you only have to keep track of the
identity, not the columns that make up the natural key. <<
Wrong! A natural key is, well, natural -- it is how the users think.
The same logic would say that it is easier to type in one 40 digit
number than to type in the three column key ('1997', 'Ford',
'Escort'). And it is a lot easier to verify by looking at the
automobile in question.
>> The identity is easier for me as an administrator, as I only have
to type in the integer to go to a certain row. <<
How do you know what that integer is? How do you know that it is the
right one? If anything gets out of synch, which used to happen before
SQL Server 7.0, you are dead meat.
>> The identity makes a database or schema easier to learn - the
uniqueness is obvious, where there might be several unique column
definitions (this one is weak I know.) <<
No, it just adds confusion about the candidate keys; at least all the
candidate keys have to be valid and identify the same entity; the
IDENTITY has no verification. I have an example of multiple keys
based on a class schedule, if you want to see it.
>> [Teradata hashing] I had a different understanding of how this
would be implemented on sql server. <<
Teradata is totally different at the implementation and physical
levels. Hey, it is a data warehouse tool and SQL Server is an
application tool aimed at department level applications. None of the
VLDB products use anything close to the SQL Server file-system style
model -- that model does not scale very well when you are starting
with petabytes.
>> ... there is no guarantee that "ME" knows my natural key. <<
You stand a better chance of knowing it than an IDENTITY number ...
>> I know what you are saying. But "accurate" is more important than
"quick" and some internal schemes are error prone by their nature. <<
Medical records come to mind ... and credit history. This is why we
are moving to Biometric id. Kroger's just started fingerprinting for
check cashing. This gets into how "fuzzy" you are willing to let your
data be. Morgan-Kaufmann suggested that I write a book on data
quality at one point ... another thread ..
>> If it is more complicated than one field [sic], UI programmers will
screw it up, get the fields [sic] wrong, and destroy data integrity.
<<
The error rate is not related to the number of items until you get
over five of them (see a freshman psychology text for the details of
"the magic number five"), nor it is related to the length -- people
misspell short words more often than longer ones.
The important thing is the "natural-ness" of the code and how easy it
is to "chunk" (actual term!) -- that is, break into parts with some
meaning and how easy it is to get physical (visual, oral, audio)
verification.
> 2) Implementation level, where "Product X" tried to make that logical
> model work; here we have software (which might be tied to hardware or
> not).
Ok.
> The bad news is that we have "leakage" between the levels. A large
Totally agreed.
Following this thread further is why I like Identity fields.
But I'm more concerned between the leakage between logical and
Implementation. I'm willing to change things in the implementation so
it is easier long term and short term to deal with them.
So are you, but we differ in what we think makes it easier.
> This is a terrible way to do SQL. Look at the Nucleus engine (SAND
> Technology) and its N-Vectors for a model based on setting up domains,
> then representing rows, columns and tables as bits in compressed bit
> vectors. The database gets smaller, faster and every column is
> automatically fully indexed. No contiguous storage, each value appears
> once, etc.
Web site suggestions? Is this significantly different than a writable
OLAP cube? (A good cube, not most of the junk out there today).
> First I build a logical model; if I cannot do that, all is lost. Once
Agreed.
> I have the model, I move to the particular product I have to use. But
> I try to avoid changing the logical model or destroying portability.
Here is where we differ. Why do you try to avoid changing the logical
model?
> company moved to Product Z and they are screwed. Was everyone asleep
> when the Software Engineering instructor told them that 80% or more of
> the cost of a system is taking care of it after it is deployed? rant,
> rant, rant!!
Good developers never did support. Great database Developers still do
support.
> Sooner; a lot of them start that way. So talk in simple sentences and
> use ORM techniques. Use names they know for the data elements,
> attributes they can identify.
Agreed.
> The company policies apply to ALL the systems and machines (I might be
> an optimist in saying that for some companies, but ...) both current
> and future, while an IDENTITY column is limited to the particular
> machine upon which a database is currently residing.
Oh. Come now.
I used the term Uniquifier. One implementation of that is Identity.
You can write your own in 20-30 minutes using stored procedures and a
simple table to get the next value.
Sorry, that is NOT a valid argument against using Uniquifiers sometimes.
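A rough sketch of that sort of home-grown generator (all names invented; the single UPDATE both reads and bumps the counter, so two callers cannot get the same value):
--------------
CREATE TABLE dbo.KeyCounters
(
    table_name SYSNAME NOT NULL PRIMARY KEY,
    next_value INT NOT NULL
)
GO
CREATE PROCEDURE dbo.GetNextKey
    @table_name SYSNAME,
    @next_value INT OUTPUT
AS
-- return the current value and increment the counter in one atomic statement
UPDATE dbo.KeyCounters
SET    @next_value = next_value,
       next_value  = next_value + 1
WHERE  table_name  = @table_name
GO
--------------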
> If my machine goes down, I can find the internal UPC code we made up
> for "Betty's Boiled Buzzard Eggs" somewhere else. I cannot find the
> old IDENTITY value anywhere.
If the machine goes down you aren't too worried about entering or
finding stuff in the database.
Besides, how often does a replicated server cluster go down?
> We gotta get a better word than "uniquifier"!!
>
> In a proper Standard SQL database I would use a CREATE DOMAIN for the
> identifiers. All the constraints, default value and datatype
> information would be in one and only one place in the schema.
I'll fess up. I don't know enough SQL theory to argue this with you.
> Give a real world example: the 10 digit UPC codes on all your
> packages are due to be replaced by the 15 digit GTIN (Global Trade
> Identification Number) in a few years. If the company had designed
> application databases with columns that all used UPC codes under
> different names (product_code, product_id, upc, upc_code, to use some
> actual names from a real company), they are in trouble -- enough
> trouble to pay Joe Celko sized consulting fees <g>.
Yeah, but if you used a Uniquifier, and had the UPC as a description
or attribute in the product file, this would be a VERY simple change.
You would probably add a new field for the GTIN, and leave the UPC.
After all, both the UPC and GTIN are just descriptions for the
product, just like "Betty's Boiled Buzzard Eggs".
Another point in favor of uniquifier fields.
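In that design the change really is about one statement (table and column names invented; the length follows the figure quoted above):
--------------
-- add the new code alongside the old one; nothing that references
-- the product's surrogate key has to change
ALTER TABLE dbo.Products
    ADD gtin VARCHAR(15) NULL
GO
--------------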
>
> >> The identity is shorter, making for smaller indexes and faster
> seeks for children [sic] tables. <<
>
> Let's make a concrete example, say a city map grid where the x and y
> values can be huge. Huge enough that to number each one of the squares
> uniquely requires BIGINT. People also use BIGINT to avoid running out
> of uniquifiers, but I want to start off really big.
Great example where a single uniquifier does not make sense.
But isn't your grid simply a construct of 2 uniquifiers that have no
verifiable physical attributes?
In other words, is this not an example of my approach? Shouldn't you
be using street address that is physically verifiable and knowable by
a person?
> attempt #1:
>
> CREATE TABLE Locations
> (location_id BIGINT IDENTITY (1,1) NOT NULL
> PRIMARY KEY,
> x_ord INTEGER NOT NULL,
> y_ord INTEGER NOT NULL,
> ...);
>
> There was nothing in Locations to prevent me from inserting a billion
> identical (x,y) values.
You can't think of how to make a unique constraint on this?
PuuuulEaase. ;-) ;-) ;-)
> Attempt #3:
>
> CREATE TABLE Locations
> (x_ord INTEGER NOT NULL,
> y_ord INTEGER NOT NULL,
> ...
> PRIMARY KEY (x_ord, y_ord));
>
> one index, smaller physical and logical table, and data that can be
> verified.
There we go. I agree that this is the best solution. Somewhat
surprised that you believe so also.
> >> The identity allows easier programming on the client side - to keep
> track of one specific row, you only have to keep track of the
> identity, not the columns that make up the natural key. <<
>
> Wrong! A natural key is, well, natural -- it is how the users think.
But it is NOT necessarily how the programmer thinks.
> The same logic would say that it is easier to type in one 40 digit
> number than to type in the three column key ('1997', 'Ford',
> 'Escort'). And it is a lot easier to verify by looking at the
> automobile in question
If I'm a lazy programmer (and I actually often am), I'd rather keep
track of the one 40 digit number than the three columns probably of
different data types. That was my point. Not what the user sees.
> >> The identity is easier for me as an administrator, as I only have
> to type in the integer to go to a certain row. <<
>
> How do you know what that integer is?
I look it up using the indexes and fields I do know.
> How do you know that it is the right one?
Same way you do. Just because I have a uniquifier, I did not lose any
data elements.
>If anything gets out of synch, which used to happen before
> SQL Server 7.0, you are dead meat.
I don't use anything older than SQL Server 7.0. And if a database
can't keep track of primary keys, then your solution will have
problems at implementation also.
> Teradata is totally different at the implementation and physical
> levels. Hey, it is a data warehouse tool and SQL Server is an
> application tool aimed at department level applications. None of the
> VLDB products use anything close to the SQL Server file-system style
> model -- that model does not scale very well when you are starting
> with petabytes.
Ok. Out of my realm.
> >> ... there is no guarantee that "ME" knows my natural key. <<
>
> You stand a better chance of knowing it than an IDENTITY number ...
Not a good enough answer. You have to GUARANTEE that data can be
accurately found. My argument is that you have to provide the methods
for users to find their data. Not provide a "better chance".
A uniquifier does not eliminate the use of other indexes or search
criteria, and the fact that you keep implying it does weakens your
arguments IMO.
> Medical records comes to mind ...and credit history. This is why we
I did some credit history work at one point in my career.
Let's say you are pulling data for credit reports, and you want to have
a unique identifier for an applicant.
What is the natural key you propose?
I propose a uniquifier field, and then use a "status" or something for
when you figure out 3/4 of the way through the credit report that the
SSN you were given is incorrect and you already have the "person" in
the database.
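(A rough sketch of that approach, with made-up table and column names
-- not from any real credit system -- just to make the shape concrete:)
--------------
CREATE TABLE Applicants
(applicant_id INT IDENTITY (1,1) NOT NULL
    PRIMARY KEY,
 ssn CHAR(9) NULL,       -- recorded as given; it may turn out to be wrong
 full_name VARCHAR(100) NOT NULL,
 status VARCHAR(12) NOT NULL
    DEFAULT 'PENDING'
    CHECK (status IN ('PENDING', 'VERIFIED', 'DUPLICATE')),
 duplicate_of INT NULL   -- set when the "person" is already on file
    REFERENCES Applicants (applicant_id));
--------------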
> are moving to Biometric id.
Fingerprint? Retina? DNA? Which id will you use as the "biometric id"?
> Kroger's just started fingerprinting for
> check cashing. This gets into how "fuzzy" you are willing to let your
> data be. Morgan-Kaufmann suggested that I write a book on data
> quality at one point ... another thread ..
That is NOT a trivial subject. It's a pretty broad topic even for a
single book.
> The important thing is the "natural-ness" of the code and how easy it
> is to "chunk" (actual term!) -- that is, break into parts with some
> meaning and how easy it is to get physical (visual, oral, audio)
> verification.
A uniquifier code is shorter.
Again, thank you. I really enjoy discussions such as these, and I
learn a lot.
-doug miller
drmil...@hotmail.com (Doug Miller) wrote in message news:<178db91.02121...@posting.google.com>...
I agree that clustering is good for queries. Locking should not matter
in query-only situations. Updates, inserts and deletes are the
worry. If everyone is hitting the same page, the row level locking gets
promoted to page level locking. So you can get page locking problems
plus the extra overhead of handling the row locks.
Sybase did a white paper years ago to show that page level locking was
as good or better than row level locking. I don't buy this conclusion
as a general statement; I think they stacked the deck by working a hot
spot instead of randomly distributing the keys over the table space.
The gimmick is that row level locking has another kind of "hot spot"
problem. Expanding updates are those which cause a VARCHAR(n) column to
increase in size, so that rows have to be re-adjusted on the page, based
on the "percentage free/fill factor" in your product's data page. Keep
pushing and suddenly, you are in an overflow situation while juggling
page level and row level locks on two data pages.
A uniformly distributed set of keys would make each data page as likely
to be hit as another, so you don't have hot spots except by pure chance.
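(In SQL Server you can at least leave headroom for those expanding
updates when you build the index -- a minimal sketch, with a
hypothetical table and a fill factor chosen purely for illustration:)
--------------
CREATE TABLE dbo.Remarks
(remark_id INT NOT NULL,
 remark_text VARCHAR(500) NOT NULL,   -- the column that tends to grow
 CONSTRAINT remarks_pk PRIMARY KEY CLUSTERED (remark_id)
     WITH FILLFACTOR = 70)            -- leave roughly 30% free per leaf page
--------------
Keep in mind the fill factor only applies when the index is built or
rebuilt, so the free space has to be refreshed by periodic reindexing
(DBCC DBREINDEX, for example).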
Sorry to be late, but I was busy pouring boiling oil on Christmas
Carollers.
>> Web site suggestions? Is this significantly different than a writable
OLAP cube? (A good cube, not most of the junk out there today). <<
www.sand.com, but they have stopped putting up the white papers; you
might need to contact them and ask a human for help. OLAP cubes tend to
hold summary information instead of just a bit that represents a fact;
OLAP cubes tend to be dense, while nucleus vectors are sparse. But it is
the same sort of idea: dimensions that locate data in a space.
>> Here is where we differ. Why do you try to avoid changing the logical
model? <<
I put a lot of work into it; I feel pretty sure I got it right, and I
can explain it to other people. When I start to monkey with it, it too
often becomes the "illogical model" instead. The Great Gildersleeve was
an old radio program which had a character who had her own filing system
at the office. She did things like put the ink under "W" for "wet", on
the grounds that sacrificing the logical model to the implementation
would be an improvement.
>> Yeah, but if you used a Uniquifier, and had the UPC as a description
or attribute in the product file, this would be a VERY simple change.
You would probably add a new field [sic] for the GTIN, and leave the
UPC. After all, both the UPC and GTIN are just descriptions for the
product, just like "Betty's Boiled Buzzard Eggs". <<
GTIN, EAN and UPC are standard identifiers for the products, not
descriptions. Big difference. Because these codes are standards, there
are nice migration paths and plans for them. The first step is to add
leading zeroes to the UPC, then later to fill in those five new digits
with codes that will tell the country of origin.
Doing it with multiple columns, you become the man with many watches,
never quite sure which one is right. Unlike an auto-incremented number,
I could verify that the manufacturer's digits are valid, that they do in
fact package "Betty's Boiled Buzzard Eggs", run the check digits, etc.
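(To make the check-digit point concrete, here is a sketch that assumes
12-digit UPC-A codes and SQL Server 2000, which supports scalar
user-defined functions; the function name is made up:)
--------------
CREATE FUNCTION dbo.IsValidUPC (@upc CHAR(12))
RETURNS BIT
AS
BEGIN
    -- UPC-A check digit: weight the odd positions (1,3,...,11) by 3 and
    -- the even positions (2,4,...,10) by 1, then compare the result
    -- against the twelfth digit.
    DECLARE @pos INT, @sum INT
    SELECT @pos = 1, @sum = 0
    WHILE @pos <= 11
    BEGIN
        SET @sum = @sum
                 + CAST(SUBSTRING(@upc, @pos, 1) AS INT)
                 * CASE WHEN @pos % 2 = 1 THEN 3 ELSE 1 END
        SET @pos = @pos + 1
    END
    IF (10 - (@sum % 10)) % 10 = CAST(SUBSTRING(@upc, 12, 1) AS INT)
        RETURN 1
    RETURN 0
END
--------------
For example, SELECT dbo.IsValidUPC('036000291452') returns 1, while
changing any single digit makes it return 0.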
Remember ZIP+4 migration? Basically, the same thing. You validated the
new codes with state and city digit ranges. You bought a mag tape from
USPS, etc.
> If everyone is hitting the same page, the row level locking gets
> promoted to page level locking.
You wish to document this statement, perhaps ?
You're only too willing to "document" other assertions, so I'm sure you can
point to some document that states this for Microsoft SQL Server 7.0 / 2000,
and not for Sybase or some other database. As you excerpted in the quote
from my earlier post, the Microsoft architecture for SQL Server 7.0 / 2000
differs significantly from the one inherited from Sybase, so I presume you
will point to a document pertaining specifically to SQL Server 7.0 / 2000.
BPM
"Joe Celko" <71062...@compuserve.com> wrote in message
news:3dff634f$0$16018$7586...@news.frii.net...
BTW, if you are unable to find a reference to a document that states this, I
will of course be only too happy to accept code that can be cut and pasted
into Query Analyzer that will prove your assertion. Since the code has to
work in SQL Server 7.0 and/or SQL Server 2000, please be sure to use T-SQL
and not ANSI SQL :-)
BPM
"BP Margolin" <bpm...@attglobal.net> wrote in message
news:3dffe...@news1.prserv.net...
> Joe,
>
>> If everyone is hitting the same page, the row level locking gets
>> promoted to page level locking.
>
> You wish to document this statement, perhaps ?
> You're only too willing to "document" other assertions, so I'm sure you can
> point to some document that states this for Microsoft SQL Server 7.0 / 2000,
> and not for Sybase or some other database. As you excerpted in the quote
> from my earlier post, the Microsoft architecture for SQL Server 7.0 / 2000
> differs significantly from the one inherited from Sybase, so I presume you
> will point to a document pertaining specifically to SQL Server 7.0 / 2000.
>
> BPM
http://support.microsoft.com/default.aspx?scid=KB;en-us;323630& ?
In any case, it's pointless to talk about this in the abstract.
Sometimes transactions concentrated within a contiguous portion
of the table (i.e. disk) is a good thing, since you might be hitting
many blocks multiple times and be able to hold the entire active
section of the table in memory. If, on the other hand, you access
distinct disk blocks for each transaction on a huge table with narrow rows,
you might well have hundreds of gigabytes of i/o when only a tiny
fraction of that is relevant.
Of course it depends. It's just as easy to come up with an
example where the opposite is true, maybe or maybe not
having anything to do with locks and escalation.
What's silly is for anyone to claim that there is one right way to
key a table, or (what seems to be more popular) that there is one
wrong way to key a table. What would be far more interesting
would be to see real implementations of real models, not unrealistic
benchmarks, implemented in various ways and profiled.
SK
> In any case, it's pointless to talk about this in the abstract.
But I don't know any way to talk about it in the real.
> Sometimes transactions concentrated within a contiguous portion
> of the table (i.e. disk) is a good thing, since you might be hitting
> many blocks multiple times and be able to hold the entire active
> section of the table in memory.
That is fine if you are reading. Writing to the table (either update
or insert) means that you have to WRITE to the disk.
And I believe what Celko was trying to say was that you are probably
better off to have those writes scattered across the disk than to be
concentrated in one place. Striping comes to mind. Multiple disks
writing is always going to be faster than one disk writing multiple
times.
Just like years ago, duplexing on Novell was faster than a single
drive.
> If, on the other hand, you access
> distinct disk blocks for each transaction on a huge table with narrow rows,
> you might well have hundreds of gigabytes of i/o when only a tiny
> fraction of that is relevant.
That is a fact of life.
So let's look at the real world for a complex, high-usage database.
For every write, there are probably 50 or more logical reads, what with
indexes, tables, joins, hashes, etc. Those reads are answered mostly
by cache, but the rest are scattered across the entire set of disk
drives. You can't do much for those reads except have good indexes and
limit table scans.
So the drives are accessing all over the place, arms are flailing
about, and in the middle of it (for maybe every 5 actual physical
reads), you have a write. What are the odds that the last physical
read you made is somewhere close to the physical write? ZERO with a
clustered index, but basically the point is that there is no intrinsic
value in having all the writes happen in one place, except the
POSSIBILITY of hot spots.
If you have one user typing away into one table, then the database
will probably keep up no matter what you do.
But for a true multi-user OLTP system, I believe you are better off to
just scatter stuff to the winds and let the caching and write-queue
mechanisms do their jobs.
-doug miller
> What's silly is for anyone to claim that there is one right way to
> key a table, or (what seems to be more popular) that there is one
> wrong way to key a table. What would be far more interesting
> would be to see real implementations of real models, not unrealistic
> benchmarks, implemented in various ways and profiled.
Agreed.
-Doug Miller
I can almost picture this... ;-) ;-)
> >> Here is where we differ. Why do you try to avoid changing the logical
> model? <<
>
> I put a lot of work into it, I feel pretty sure I got it right and I can
It isn't right if it does not meet the business needs.
> explain it to other people. When I start to monkey with it, it too
> often becomes the "illogical model" instead. The Great Gildersleve was
> an old radio program which had a character who had her own filing system
> at the office. She did things like put the ink under "W" for "wet", on
> the grounds that you could sacrifice the logical model for
> implementation and it would be an improvement.
So I'm with her. She should have a method to find her ink under the
W's. If there is a business need for me to be able to find it, I'd
prefer to look for it under "I". The final logical model should
support both, if the business needs it.
> GTIN, EAN and UPC are standard identifiers for the products, not
> descriptions. Big difference. Because these codes are standards, there
> are nice migration paths and plans for them. The first step is to add
> leading zeroes to the UPC, then later to fill in those five new digits
> with codes that will tell the country of origin.
Spoken like a consultant who gets paid by the hour to make databases
overly complex, obtuse, and overly difficult to understand and modify.
Sorry, but I'm calling BS.
> Doing it with multiple columns, you become the man with many watches,
> never quite sure which one is right.
They are all right. If they are not, then you fix them as best you
can.
> Unlike an auto-incremented number,
> I could verify that the manufacturer's digits are valid, that they do in
> fact package "Betty's Boiled Buzzard Eggs", run the check digits, etc.
> Remember ZIP+4 migration? Basically, the same thing. You validated the
> new codes with state and city digit ranges. You bought a mag tape from
> USPS, etc.
Before my time.
This discussion is on the ragged edge of meaningless for me when you
refuse to respond to my discussion points pointing out what I perceive
as holes in your logic.
I'm trying to learn. As you stated, discussion and argument are
extremely efficient ways to learn. I really don't mind the
pompousness as long as you are responding to my questions
(challenges?).
Would offline be better?
-doug miller
The KB article states that there are circumstances where SQL Server will
perform lock escalation. This is well-known. This does not prove that the
set of circumstances described by Joe WILL result in lock escalation.
Perhaps, since you are an employee of Oracle, you are not sufficiently familiar
with the behavior of SQL Server in this matter. If you would like to prove
that in the set of circumstances described by Joe, SQL Server WILL result in
a lock escalation of row locks to a page lock, then perhaps posting a script
that can be cut and pasted into Query Analyzer would be appropriate.
Otherwise, I'll just chalk this up to another attempt by Oracle to mislead
the general database community about the capabilities of the alternatives to
your product.
-------------------------------------------
BP Margolin
Please reply only to the newsgroups.
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc.) which
can be cut and pasted into Query Analyzer is appreciated.
"Mark Townsend" <markbt...@attbi.com> wrote in message
news:BA254215.4767%markbt...@attbi.com...
BTW ... posting that link reminds me of the Oracle salesperson who contended
that corruption NEVER happens in an Oracle database, but that Oracle, in
response to the Oracle database community, added a command to check for
corruption in an Oracle database ... even though it served no purpose
whatsoever.
It appears to me that there is an unfortunate tendency among Oracle
personnel to resort to the distribution of true but very misleading
information in order to bolster the position of Oracle in the marketplace.
Does sort of bring to mind the phrase "Statistics don't lie, but liars use
statistics" :-(
-------------------------------------------
BP Margolin
Please reply only to the newsgroups.
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc.) which
can be cut and pasted into Query Analyzer is appreciated.
"Mark Townsend" <markbt...@attbi.com> wrote in message
news:BA254215.4767%markbt...@attbi.com...
And, last BTW ... it's exactly posts such as these that have led
me to consistently "out" you as an employee of Oracle.
Without the knowledge that you are an Oracle employee, it is just possible
that someone might be led to believe that you are posting "unbiased" and
"objective" information. The information in the KB article is true. The
**context** in which you provide the link is inappropriate. Posting "true"
information is not sufficient. Context is important.
For example, I could post: "Oracle does not support the CASE expression
defined in ANSI SQL-92". That's a true statement. Of course, I "happened" to
leave out that the "Oracle" I'm talking about is Oracle 7, not the latest
version of Oracle. "Truth" is NOT an absolute. It exists within a context.
By consistently and repeatedly refusing to volunteer that you are an
employee of Oracle, one has to question the context of the information that
you choose to post.
-------------------------------------------
BP Margolin
Please reply only to the newsgroups.
When posting, inclusion of SQL (CREATE TABLE ..., INSERT ..., etc.) which
can be cut and pasted into Query Analyzer is appreciated.
"Mark Townsend" <markbt...@attbi.com> wrote in message
news:BA254215.4767%markbt...@attbi.com...
Hmm, it appears then that the one Oracle db that I "fixed" wasn't corrupted.
It was just being antsy.
(To Oracle's credit, their Disaster Recovery book was excellent and allowed
me, an Oracle novice but DB-knowledgeable, to save a customer's skin.)
If you want to make your point convincingly, I think you need real
examples that demonstrate the behavior you believe takes place. I'm
simply pointing out that there are no hard and fast rules about what
is good and what is bad - there are examples of good and bad
performance, and you're quite welcome to post examples of where
surrogate keys result in bad performance.
SK
Exactly; it would not be a logical design for the business if it
didn't. And the first business need is data accuracy; ask Enron about
that. The other stuff includes robustness, portability, reliability,
etc. with speed not too high on the list. We used to have to worry
about speed in the old days (mag tapes, punch cards, and all that jazz)
because:
1) Machinery was much slower -- "nanosecond" was not even an official
ISO unit.
2) Machinery was much more costly -- a minicomputer cost about the same
as a small middle-class house in the 1960s; don't even ask about
mainframes.
Today, you can throw hardware at a problem because it is cheap and
people are expensive. In fact, I saw a quote to the effect that IT and
subsistence farming have about the same ratio of labor to tools now.
If I have a good logical design, it will scale up to that cheap, fast
hardware.
>> So I'm with her. She should have a method to find her ink under the
W's. If there is a business need for me to be able to find it, I'd
prefer to look for it under "I". <<
And when she fails to show up at work, the whole system shuts down.
Amateurs write that kind of personalized code; professionals write code
for other users. Violating good programming practices to solve an
immediate problem is called a "kludge" and we don't do them.
Ed Yourdon (please tell me you know that name!) has a quote to the
effect that "there is nothing more permanent than a temporary fix,"
because nobody understands it and therefore nobody dares to touch it.
>> The final logical model should support both, if the business needs
it. <<
A man with two watches will never know what time it is. Read any book
on human interface design; you want one consistent interface that is
documented and easy to understand.
>> Spoken like a consultant who gets paid by the hour to make databases
overly complex, obtuse, and overly difficult to understand and modify.
<<
Actually I get paid for the *opposite* reason, when I am doing a
clean-up. Somebody with your attitude decided to put the ink under "W"
for "wet" because their "immediate business need" could be met by a
kludge.
One kludge after another on top of a bad or forgotten logical model adds
up, and the database is overly complex, obtuse, and overly difficult to
understand and modify by the time I walk in the door.
If you want to find out how bad your databases have gotten, do a data
warehouse project. Suddenly, you have to pull all the applications
together and make them consistent across the entire enterprise. Then
you see why you should have had a logical model that wasn't focused on
one and only one area.
> performance, and you're quite welcome to post examples of where
> surrogate keys result in bad performance.
I don't think anyone will ever post an example where surrogate keys
lead to bad performance. I think surrogate keys should certainly be a
normal way of doing things -- not necessarily used on every table, but
certainly used on many (most?).
Perhaps we are in agreement there?
I was trying to show examples where clustered indexes, on whatever
key, but especially a surrogate key (often implemented using identity
in SQL Server), might be slower than not using a clustered index.
I wonder how you would quantifiably measure the difference in the real
world for a large multi-user database.
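(For a single query, at least, Query Analyzer itself gives a rough,
repeatable way to compare two physical designs -- a sketch only, with
hypothetical table names, and it says nothing about concurrent load:)
--------------
-- Compare reads and elapsed time for the same query against two copies
-- of a table, one with a clustered index and one left as a heap.
-- (OrdersClustered and OrdersHeap are made-up names.)
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT COUNT(*) FROM dbo.OrdersClustered WHERE order_date = '20021201'
SELECT COUNT(*) FROM dbo.OrdersHeap WHERE order_date = '20021201'

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
--------------
A Profiler trace over a representative workload would be closer to the
real-world answer, but even this shows logical and physical reads side
by side.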
-doug miller
As I understand it, your argument for never using surrogate keys boils
down to two main points:
> We used to have to worry
> about speed in the old days (mag tapes, punch cards, and all that jazz)
> because:
> Today, you can throw hardware at a problem because it is cheap and
> people are expensive.
Did I get that right?
-Doug Miller
Keep in mind that companies like Oracle and Microsoft create technology
that has limitations and may permit compromises to the relational
paradigm, but that does not mean those features are good relational
elements. Consider the difference between relational joins written in
the WHERE clause and those written with the JOIN clause. The WHERE-clause
syntax does not provide a mechanism to implement all the features of the
relational model. But that's what we had to work with 10 years ago. I
still tend to fall back to it for convenience. But I don't then claim I'm
using the true relational syntax -- only an approximation that works a
lot of the time.
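(A small illustration of the two styles, with made-up table names. The
old outer-join shorthand (*=) was never standard and can be ambiguous,
which is part of what the WHERE-clause style cannot express cleanly:)
--------------
-- Old style: the join condition lives in the WHERE clause
SELECT i.invoice_id, v.vendor_name
FROM dbo.Invoices i, dbo.Vendors v
WHERE i.vendor_id = v.vendor_id

-- ANSI style: the join condition is explicit, and outer joins are
-- expressed unambiguously
SELECT i.invoice_id, v.vendor_name
FROM dbo.Vendors v
LEFT OUTER JOIN dbo.Invoices i
    ON i.vendor_id = v.vendor_id
--------------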
As for me, I avoid artificial keys as often as possible. I've seen far
too many problems and too much hair-pulling caused by a quick-and-dirty
choice to throw in an identity as a primary key.
--
Posted via http://dbforums.com