IID       bigint         (identity)
GID       guid           (PK)
ParentID  guid
Value     nvarchar(400)

with indexes on
  ParentID, Value
  Value, ParentID
Now, if I put the clustered index on IID, I get great INSERT
performance because new records are always added to the end of
the table, but the column is kind of superfluous, as it is only
being used to control where new inserts go.
GID is the real PK, but if I put the clustered index on this column
and ditch IID to save some space and simplify the table design,
insert performance deteriorates as the number of records grows.
So I read this article:
http://www.informit.com/articles/printerfriendly.asp?p=25862
which advocates generating (sorta) sequential GUIDs to
provide the best of both worlds.
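For readers who haven't seen it, the COMB idea is easy to sketch in Python.
This is only an illustration of the technique: the 10/6 byte split and the
days/ticks encoding follow Nilsson's description of SQL Server's datetime
type, but I make no claim that this is his exact byte layout.

```python
import datetime
import uuid

SQL_EPOCH = datetime.datetime(1900, 1, 1)  # epoch of SQL Server's datetime

def comb_guid(now=None):
    """Sketch of a COMB GUID: 10 random bytes up front, a timestamp in
    the trailing 6 bytes. SQL Server weighs the trailing byte group
    most heavily when sorting uniqueidentifiers, so later COMBs sort
    (and therefore insert) later in the index."""
    if now is None:
        now = datetime.datetime.utcnow()
    delta = now - SQL_EPOCH
    days = delta.days.to_bytes(2, "big")  # days since 1900-01-01
    # time of day in 1/300-second ticks, as datetime stores it
    ticks = int(round((delta.seconds + delta.microseconds / 1e6) * 300))
    random_part = uuid.uuid4().bytes[:10]
    return uuid.UUID(bytes=random_part + days + ticks.to_bytes(4, "big"))
```

Two COMBs generated a day apart differ in the trailing bytes, so the later
one always sorts after the earlier one.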
But in a discussion on the topic here
http://forums.asp.net/780340/ShowPost.aspx
I find this comment:
---snip---
The InformIT article is nice I suppose, but the "COMB" datatype is
pointless. The fact that it had any effect at all means his Clustered
Index was on the Guid, which is bad. All he had to do was change the
Clustered Index to a Created (DateTime) or Int (Identity) column, and
he would have had the same (or slightly better) results.
---snip---
Now, the question is: if a guid column *is* needed and an index
on that column *is* required, won't inserts still be costly, because
the guid index needs to be updated (rebalanced) anyway? If so, for
a table like the one above, where an index such as "ParentID, Value"
is almost as wide as the table itself (the clustered index on IID or
GID), I would think the poster is incorrect and COMB GUIDs would
still be the way to go?
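As a back-of-the-envelope illustration of that balancing cost, here is a
Python sketch; the sorted list is only a stand-in for B-tree key order, not
how SQL Server actually stores pages:

```python
import bisect
import uuid

# Compare where new keys land in a sorted structure (a crude stand-in
# for B-tree key order): random GUIDs insert all over the place, while
# monotonically increasing keys only ever append at the end.
random_keys, seq_keys = [], []
random_pos, seq_pos = [], []

for i in range(1, 1001):
    k = uuid.uuid4().int
    p = bisect.bisect_left(random_keys, k)
    random_pos.append(p / len(random_keys) if random_keys else 0.0)
    random_keys.insert(p, k)

    p = bisect.bisect_left(seq_keys, i)
    seq_pos.append(p / len(seq_keys) if seq_keys else 0.0)
    seq_keys.insert(p, i)

# seq_pos is 1.0 after the first insert (always at the end);
# random_pos scatters across the whole range (mid-structure work,
# i.e. page splits, on almost every insert).
```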
And what is the deal with NewSequentialID? As I understand it,
this function would provide the same benefits as COMB GUIDs,
but straight out of the Yukon box. Yet, if I Google for this new
function, I get, like, 17 hits?!? (And if I Google for discussions
about the subject of sequential guids, I invariably end up at
the InformIT article.)
Isn't anyone using COMB GUIDs, or doesn't anyone have anything to
say about NewSequentialID()? I would have thought that this would
have spurred numerous discussions, flame wars, and exchanges of
experience. Instead, I find the usual Identity vs GUIDs
discussions?!?
I guess my main question is:
1) Are COMB GUIDs and the SequentialID()
function the way of the future?, or
2) Were they nice ideas that did not work in the real world?, or
3) Did the discussion slip under most people's radar?
Regards,
Joergen Bech
Btw, regarding your analysis about having a nc ix on the guid for a very narrow table: the nc index
is very much the same as the cl ix. Note, however, that the nc ix includes the cl ix key (under the
hood, as a row locator).
--
Tibor Karaszi, SQL Server MVP
http://www.karaszi.com/sqlserver/default.asp
http://www.solidqualitylearning.com/
"Joergen Bech @ post1.tele.dk>" <jbech<NOSPAMNOSPAM> wrote in message
news:pf6o51ljvulv49s09...@4ax.com...
I have used Nilsson's COMB algorithm in a database and it worked fine. The
difference between using purely a DateTime and the COMB Guid is that the COMB
guid adds additional information which significantly decreases the probability
of a collision. Further, you can adjust the percentage of the result represented
by the DateTime by adjusting the algorithm. Using less of the DateTime in the
Guid lowers the probability of a collision but weakens its usefulness in indexes.
>1) Are COMB GUIDs and the SequentialID()
>function the way of the future?, or
No, I wouldn't go that far. The COMB algorithm provides a way of significantly
improving performance by sacrificing some uniqueness probability. As with any
choice, it is a cost-benefit analysis. Can you sacrifice some uniqueness to get
better performance? Do the benefits you get by using Guids outweigh the costs?
Can you use something like a COMB algorithm to sacrifice a little uniqueness,
and if so, does that tip the scales?
> 2) Were they nice ideas that did not work in the real world?, or
The COMB algorithm does work nicely. I implemented it by making one change in my
middle-layer code and changing the default values in the various PKs to call a
user-defined function that is passed GetDate() and NewId().
> 3) Did the discussion slip under most peoples' radar?
Likely. There is a big stigma against using Guids. Admittedly, some of that
resistance is valid. They provide poorer performance. They might require
additional, meaningful or user-friendly key columns, making your table fatter.
They are tougher to work with in development. They are not ANSI-standard, etc.
The list goes on.
My biggest reason for using them is that they made certain problems
significantly easier to solve. In addition, they can save you the cost of adding
an additional Guid column if you know you are going to be replicating the database.
Beyond all of that, they have the benefit of allowing arbitrary keys to be
generated outside of the database system itself.
One example of an incredibly easy problem to solve is combining tables. For
example, suppose you have ParentTable1 and ParentTable2 which each have
ChildTable1 and ChildTable2 respectively associated. Now suppose, after you have
tons of data, that you are going to combine ParentTable1 and ParentTable2 using
a field to differentiate the values. With Guids as PKs, you can simply combine
the two ParentTables and the two ChildTables, reset the relationship and you are
done. If you want to do that with Identity values, you probably have to assign
new Identity values to ParentTable2 and then of course update ChildTable2 (and
any other related tables). It's a small thing, but with Guids the change is far
easier.
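That merge scenario can be sketched with dictionaries standing in for the
tables (hypothetical data, just to show the key-collision behavior):

```python
import uuid

# GUID-keyed "tables" from two databases can be unioned directly:
# the keys cannot collide, and child rows keep pointing at valid parents.
parent_table1 = {uuid.uuid4(): "row from ParentTable1" for _ in range(3)}
parent_table2 = {uuid.uuid4(): "row from ParentTable2" for _ in range(3)}
merged_guids = {**parent_table1, **parent_table2}  # all 6 rows survive

# Identity-keyed tables both start numbering at 1, so a naive union
# silently overwrites rows; ParentTable2 (and every child row that
# references it) would have to be re-keyed first.
parent_id1 = {1: "p1 row", 2: "p1 row", 3: "p1 row"}
parent_id2 = {1: "p2 row", 2: "p2 row", 3: "p2 row"}
merged_ids = {**parent_id1, **parent_id2}  # collisions lose 3 rows
```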
As convenient as Guids can be, I would generally not recommend their use except
in a small, select set of circumstances. Instead, I would choose Identity values
or an Identity table to generate arbitrary keys instead.
HTH,
Thomas
"Joergen Bech @ post1.tele.dk>" <jbech<NOSPAMNOSPAM> wrote in message
news:pf6o51ljvulv49s09...@4ax.com...
There is. Just as the OP mentioned, it's called NEWSEQUENTIALID().
Google it.
Agreed on the COMB stuff -- it can be quite useful and I've used it
with great success in two projects now which required GUIDs for various
reasons.
--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Thomas
"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:%23rlOMB6...@TK2MSFTNGP14.phx.gbl...
It's there, just not documented yet. Try running:
SELECT NEWSEQUENTIALID()
---snip---
>Btw, your analysis about having a nc ix on guid for a very narrow table, the nc index is very much
>the same as the cl ix. Note, however, that the nc ix includes the cl ix key (under the hoods, as a
>row locator).
Yes, I know it is using the whole clustered index key for bookmark
lookups. Haven't gotten through "Inside SQL Server 2000" completely,
but at least I have gotten past that part :)
But in terms of insert performance and the whole search-tree
balancing question, I would presume that it just makes the nc key
a bit wider under the hood, but does not affect the position of new
inserts. In order for nc indexes to work as, well, indexes, I would
guess that only the explicitly defined fields affect the insert
position, and that your comment only applies to the extra space they
take up.
Regards,
Joergen Bech
More than half a year ago. And it explicitly states Yukon Beta 2.
As for BOL, it might not be in sync with the code - or the information
might have been left out if the function is at the experimental stage.
Nothing to do but try to run the script in the post above and see
if it works - and not rely on it until it becomes official. I do not
have the latest Yukon installed myself, so I cannot verify if it is
(still) there.
Regards,
Joergen Bech
But thanks anyway. Casual readers of this group need to
read this stuff if they haven't already come across it in other
places.
This is for a client-server solution with data possibly
being created and/or modified offline, so I have already
decided to use guids in one form or another in order to
make programming life easier on myself in general.
Only question was how best to solve the insert performance
problem with guids, but you covered that at the beginning
of your answer.
Thanks,
Joergen Bech
If that is not what you mean by a "physical" locator then I'd suggest devising a
different term and I would be interested in knowing what exactly you mean by
this concept.
Thomas
"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1113342062....@o13g2000cwo.googlegroups.com...
A way of getting to a row which is based on the physical state of the
machine at the time that row is inserted into the table.
Examples are Oracle ROWIDs, which are based on the disk architecture
(track, sector); pointer chains based on addresses; sequential
counters based on the arrival time of the data; hash tables; etc.
>> To me, a physical locator is something that would tell me the exact
location of a piece of data on disk or in memory. <<
Bingo! But it is maintained by the system; the (track, sector) rowid
might not be the same after a defragmentation; the hash code might not
be the same after a certain number of collisions (hash clash) forces a
re-org, etc.
By definition, a relational key is a subset of the attributes of a table
which are always unique by the nature of the data model. I do not care
what physical locators the particular product uses (Oracle track
and sector, Teradata hashing, Nucleus bit vectors, DB2 indexes, etc.)
as long as they are not exposed to the user.
>> I fail to see how a uniqueidentifier or anything type of data for
that matter qualifies as a "physical" locator. <<
So, find it in the LOGICAL data model. It ain't there; it is in the
PHYSICAL storage just as surely as a hash table value. The problem is
that it is exposed and can be used in queries and statements.
Finally, an appeal to authority, with a quote from Dr. Codd:
"...Database users may cause the system to generate or delete a
surrogate, but they have no control over its value, nor is its value
ever displayed to them..." (Codd, E. (1979), Extending the database
relational model to capture more meaning. ACM Transactions on
Database Systems, 4(4), pp. 409-410.)
This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.
Codd also wrote the following:
"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.
(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).
(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.
(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. an applicant for a job or a retiree).
These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates. Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).
References
Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434
> A way of getting to a row which is based on the physical state of the
> machine at the time the record that row is inserted into the table.
> Examples are Oracle ROWID which are based on the disk architecture
> (track, sector); pointer chains based on the addresses, sequential
> counters based on the arrival time of the data, hash tables, etc.
>>> To me, a physical locator is something that would tell me the exact
> location of a piece of data on disk or in memory. <<
>
> Bingo! But it is maintained by the system; the (track, sector) rowid
> might not be the same after a defragmentation; the hash code might not
> be the same after a certain number of collisions (hash clash) forces a
> re-org, etc.
1. Neither an identity value nor a guid tells me anything about where a piece of
data is physically located on disk. With either type of value, I have no
idea in which track or sector a piece of data is located. For all I know, SQL
Server gnomes move the physical location of the data in the middle of the night,
and the effect on the system would be no different than if purely user-entered
data were moved in the middle of the night.
2. Using your definition above, would a value generated on the client (but not
the user) be considered a "physical" locator?
3. It sounds like you are really talking about the ability to predict the next
value. In the case of an identity value or RowId, it is assumed that you cannot
predict the next value, as it is considered a surrogate. This, again, would not
be in line with my interpretation of "physical" locator.
> By definitin, a relational key is a subset of the attributes of a table
> which are always unique by the nature of the data model. I do not care
> about what physical locators the particular product uses (Oracle track
> and sector, Teradata hashing, Nucleus bit vectors, DB2 indexes, etc.)
> as long as they are not exposed to the user.
IMO, the core problem is that the concept of the "primary key" is really used
for two vastly different purposes. The first purpose is for users to uniquely
find instances of a fact in a table. The second purpose is as the glue that
binds relations together.
Unique indexes can be used to achieve the same functionality as the PK from the
user's perspective. From the system's perspective, however, a value(s) is needed
that uniquely identifies a row such that it can *never* change. The means by
which users uniquely identify a fact is subject to change over the lifetime of
the system.
>>> I fail to see how a uniqueidentifier or anything type of data for
> that matter qualifies as a "physical" locator. <<
>
> So, find it in the LOGICAL data model. It ain't there; it is in the
> PHYSICAL storage just as surely as a hash table value. The problem is
> that it is exposed and can be used in queries and statements.
I'm not disputing that the ISO specification does not provide for a guid. That
does not obviate the need for a value by which a programmatic system can find a
row that is insulated from specification change.
<snip>
> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user. That means never
> used in queries, DRI or anything else that a user does.
Firstly, let's make sure we understand each other. From previous discussions, I
take your use of the word "surrogate" to imply two different properties:
arbitrarily generated (from the user's perspective) and not visible to the user.
IMO, the visibility of the key is a trivial issue. The real discussion is whether
a given key should be arbitrary.
In this area, I respectfully disagree with Codd. He clearly never involved
himself in a massive and complex database system where the "unique, immutable"
definition of the primary key had to change because the original designer did
not account for the possibility that the users' definition of the term
"uniqueness" would change. Even adding another surrogate value doesn't account
for the massive code changes that are required so that existing code can
uniquely find fact instances.
<snip>
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities. A solution - proposed in part [4] and more fully in [14] -
> is to introduce entity domains which contain system-assigned
> surrogates.
So the solution is to introduce a surrogate key to get around the fact that you
didn't use a surrogate key to begin with? Of course, that doesn't account for
all the code that would have to change when that happens.
Thomas
Correct.
> In order for nc indexes to work as, well, indexes, I would
> guess that only the explicitly defined fields affect the insert
> position and your comment only applied to the extra space they
> take up.
>
Just think of it as SQL Server adding the cl column(s) to the end of your nc index. You can specify
the cl column(s) explicitly in the nc index, and SQL Server will not store those values twice. In
other words, you now have control over the order of the columns in the index. However, considering
this is to be a unique index (possibly along with a PK or UQ constraint), specifying uniqueness for
only the guid column is not the same thing as specifying it for the guid column together with the
cl key column(s). So in this particular situation, adding the cl column(s) explicitly is not really
an option...
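A toy model of that row-locator arrangement, in Python: a sorted list of
tuples stands in for the nc index and a dict for the clustered index. This is
of course a simplification of the real storage engine, but it shows the cl
key riding along in every nc entry.

```python
import bisect
import uuid

clustered = {}   # clustered index: iid -> (gid, value), in cl-key order
nc_index = []    # nc index on GID: sorted (gid, iid) pairs; the cl key
                 # (iid) is carried in every entry as the row locator

def insert_row(iid, gid, value):
    clustered[iid] = (gid, value)
    bisect.insort(nc_index, (gid, iid))   # nc entry includes the cl key

def lookup_by_gid(gid):
    # seek in the nc index, then do the "bookmark lookup" via the cl key
    i = bisect.bisect_left(nc_index, (gid,))
    found_gid, iid = nc_index[i]
    assert found_gid == gid
    return clustered[iid]
```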
Thomas
"Tibor Karaszi" <tibor_please.n...@hotmail.nomail.com> wrote in
message news:u6%23pOd5P...@TK2MSFTNGP15.phx.gbl...
And there are some other quirks associated with identity values, which have
some implications at both the theoretical and the implementation
levels. But then, such aspects are generally overlooked, since arguing from a
relational standpoint doesn't always hold water with SQL.
--
Anith
None of the surrogate key techniques you mention contains any direct
information whatsoever about the physical location of the data. Do you even
know how modern RDBMSs work? If all the data were moved on disk, none of
these values (Identity, Guid, RowNo) would change.
> Bingo! But it is maintained by the system; the (track, sector) rowid
> might not be the same after a defragmentation; the hash code might not
> be the same after a certain number of collisions (hash clash) forces a
> re-org, etc.
And even if you used a "Logical" key like you're pushing, the sector and track
numbers (which are just IRRELEVANT to this discussion!) would still change
after a defrag or reorg. Does this mean your "Logical" keys contain physical
data?
Both approaches use keys whose actual values are immutable under
physical storage movement, and both approaches MUST indirectly map to the
actual physical location (HOW ELSE CAN THE SYSTEM FIND THE DATA?).
There is no meaningful difference here, your arguments as to "Physical
Storage" are vacuous.
There are valid, debatable issues about Surrogate keys vs meaningful,
Logical keys, (I happend to think both are necessary) but your "Physical
Storage" argument is just silly.
You can't place the analytical knife between the Physical storage and the
not-Physical storage for one thought, and then place it between the
meaningful and the non-meaningful for the next thought. That leaves a
giant gap, because there are numerous approaches (like Identities/Guids)
which are not meaningful, yet are far from the physical (despite Celko's
proselytizing)...
<snip>... however changes in the physical model can affect the generation of
identity values that are supposedly at the logical level...</snip>
What does it mean that these are "supposed" to be "Logical"? Do you simply
mean not Physical, or do you mean "Meaningful"? And in any event, what
difference does it make? The whole point, and the advantage, of
non-Meaningful surrogate keys is that they ARE non-meaningful! What's the
problem?
<snip>And that would constitute a violation of physical data
independence</snip>
It in no way implies Physical Dependence, because the changes you describe
are far removed from the Physical Storage. What identity keys ARE
independent of is the meaningful data about the row... Again, I believe
your analysis confuses the Physical/Non-Physical analytical differentiation
with a Meaningful/Non-Meaningful analytical differentiation.
There are many points to be made pro and con about surrogate keys (I happen
to believe both kinds are useful), but this Physical/Logical differentiation
seems to me to be a seriously flawed argument.
Because the word "meaningful" has numerous interpretations, I would instead
interpret the discussion as Arbitrary keys vs. Attribute-based keys.
How the arbitrary key is generated is entirely irrelevant, and it is something
that I think Mr. Celko has a difficult time getting his head around. Whether
you are using SQL Server's Identity field or Oracle's RowId or some magic
function in a DLL based on astrological signs to generate a key value is
irrelevant. All that matters is whether the user determines the key or the
system does, and whether that key is made up of a fact's attributes or whether
it is purely arbitrary from the user's perspective.
The core problem is that the needs of a user and those of a programmatic system
are different. Programmatic systems need a way to access a row such that the
specification of the identifier will never change. Users also need a way they
can find a row; however, the set of information considered unique may change
over time.
As you said, both types of keys are really needed.
Thomas
--
Tibor Karaszi, SQL Server MVP
http://www.karaszi.com/sqlserver/default.asp
http://www.solidqualitylearning.com/
"Thomas" <tho...@newsgroup.nospam> wrote in message news:OHQWh7CQ...@TK2MSFTNGP15.phx.gbl...
I'd be very interested in knowing whether this will be the case. Is there
someone we can ask, or a SQL Server team member's blog that mentions this?
Thomas
"Tibor Karaszi" <tibor_please.n...@hotmail.nomail.com> wrote in
message news:eSV2WNEQ...@TK2MSFTNGP10.phx.gbl...
> Seems BOL hasn't been updated to include info on this yet. The TSQL function
> only exposes a Win32 API function, so it was probably not a big deal to
> implement for the SQL Server team.
>
> --
> Tibor Karaszi, SQL Server MVP
> http://www.karaszi.com/sqlserver/default.asp
> http://www.solidqualitylearning.com/
>
>
> "Thomas" <tho...@newsgroup.nospam> wrote in message
> news:OHQWh7CQ...@TK2MSFTNGP15.phx.gbl...
>> That's curious. I haven't found anything in the SQL 2005 BOL that mentions
>> anything about sequential Guids. Where did you hear this?
>>
>>
>> Thomas
>>
>> "Tibor Karaszi" <tibor_please.n...@hotmail.nomail.com> wrote in
>> message news:u6%23pOd5P...@TK2MSFTNGP15.phx.gbl...
>>> My guess is that most people haven't realized that 2005 will have seq guids.
>>> And for those who have, there haven't been good posts to comment on the
>>> feature. Sounds like a good thing to blog and write articles about.
>>>
>>> Btw, your analysis about having a nc ix on guid for a very narrow table, the
>>> nc index is very much the same as the cl ix. Note, however, that the nc ix
>>> includes the cl ix key (under the hoods, as a row locator).
<snip>
Not all GUIDs are sequential. It's a different call.
What I mean by meaningful is "containing some meaningful information about
the entity represented by the record."
And although we agree in principle on the main issue here, and I am open to
a better word than "meaningful", I wouldn't use the word "Arbitrary".
That word is probably more open to misinterpretation than "Meaningful" and
doesn't get as close to the meaning I intended... There can be columns in a
data record that are "arbitrary" yet contain meaningful information: Date
of Birth vs. Age, as an example...
To use <snip>...Arbitrary keys vs. Attribute-based keys...</snip> makes
a similar differentiation that is not clear... Non-Arbitrary does not imply
Attribute-based. They are not clearly opposites. This distinction also, in
a way, just shifts the semantic interpretation/definition to the word
"Attribute". Once a surrogate key column has been added to a table, how do
we define whether it is, or is not, an "Attribute" of the record? We're back
to whether or not it contains "Meaningful" information about the entity
represented by the record.
Charly
--
Tibor Karaszi, SQL Server MVP
http://www.karaszi.com/sqlserver/default.asp
http://www.solidqualitylearning.com/
"Thomas" <replyi...@anywhere.com> wrote in message news:OavDNREQ...@TK2MSFTNGP14.phx.gbl...
Perhaps it is your inability to recognize the difference, or an insufficiency
in your understanding of the foundations of the most useful and successful
data model ever. Perhaps it is a reflection of a more general, yet simple
principle the most vociferous ones in the newsgroups forget:
Read more than you write.
Since book references cannot easily be checked on this newsgroup, hopefully
some online references will open some eyes. We can educate someone who does
not know, but it is hard to educate someone who does not know that he does
not know.
>> What does it mean that these are "Supposed" to be "Logical"? Do you
>> simply mean not Physical? or do you mean "Meaningful"? And in any event,
>> what difference does it make ? The whole point of, and the advantedge of
>> "non-Meaningful", surrogate keys is that they ARE non-meaningful! What's
>> the problem?
Assigning the usual linguistic meaning to the term "logical" in this context
is moot. Equating "logical" with the term "meaningful" literally might
carry some baggage which requires further explanation of "meaningful": in
which scope, to whom, at what level, etc.
In the context of relational databases, logical stands for data
representation and/or manipulation in accordance with the nature of
first-order logic, or first-order predicate calculus. Since logic applied to
data is independent of the medium of materialization, it is separated from
any physical aspects of the data, and thus from the physical model. Some of
the FO logic incorporated in the relational model was borrowed from Alfred
Tarski, while the later proof equating the notation (algebra) with the
calculus was done by Codd himself.
Here are the references:
http://citeseer.ist.psu.edu/443292.html ( See section 2 )
If you would like to know why such separation is beneficial, I suggest you
google up some relational literature. Here is a quick one:
http://citeseer.ist.psu.edu/5334.html
Read the first 4 pages, and that will demonstrate that what you posted is BS.
Also, ISO/IEC document 2382 International Standard Database Vocabulary
details the three methodical levels for data representation.
1) A conceptual level, which captures an information model that can be
assimilated by humans.(business/conceptual model)
2) A logical level, which represents --in an abstract form--some or all of a
conceptual model suitable for machine manipulation without regard to actual
media or physical devices (logical model)
3) A physical level that maps the logical level to actual storage structures
and locations. This involves the actual encodings of data on physical media
or devices (the physical model)
Here is the same information, presented recently by Fabian Pascal in simple
terms:
http://www.dbdebunk.com/page/page/622537.htm
http://www.inconcept.com/JCM/May2003/Pascal.html
Do you need book references too? Post back.
>> It in no way implies Physical Dependance, because the changes you
>> describe are far removed from the Physical Storage.
The implication of physical dependence has been proven several times here,
if you cannot find a good example in the google archives of this newsgroup,
post back.
>> Again, I believe Your analysis confuses the Physical/NonPhyssical
>> Analytical differentiation with a Meaningful/non-Meaningful anlaytical
>> differentiation. There are many points to be made pro and cpn about
>> surrogate keys, (I happen to believe both are useful) but this Physical
>> Logical differentiation seems to me to be a seriously flawed argument.
Your belief is irrelevant. Other than sidetracking simple concepts over
irrelevant areas, you have not yet demonstrated sufficient knowledge of
relational databases or its different levels of representation at its core.
Nothing you have posted so far has shown you have sufficient grasp on the
subject of surrogates either.
--
Anith
No disrespect intended... But I believe you might be missing my point.
<snip>In the context of relational databases, logical stands for suitable
data
representation and/or manipulation in accordance with the nature of first
order logic, or first order predicate calculus. Since logic applied to data
is independent of the medium of materialization, it is separated from any
physical aspects of the data, thus separated as physical model. Some FO
logic incorporated in relational model was borrowed from Alfred Tarski while
some later proof equating the notation ( algebra ) with calculus was done by
Codd himself.</snip>
No offense intended, and I'm not trying to be personal, but you make my
point... The definition above makes it clear that the word "Logical" is
something much more than simply "not Physical". Nothing in the snippet above
has anything to do with physical storage...
Therefore, it is deceptive (and wrong) to even talk about a
Physical/Logical split. If Physical means what Joe says it means (such that
a defrag changes the physical data pointers, i.e., the sector and track
numbers) and Logical means what your quote above means, then there's an
enormous gap between the two...
But I will now read the links you suggest... to see if there's anything
onpoint there...
>> 2. Using your definition above, would a value generated on the
client (but not the user) be considered a "physical" locator? <<
Could be. Hell, I have seen Oracle programmers use the ROWID to move
the read/write heads at the hardware level. Terrible way to program,
but this was Oracle.
>> 3. It sounds like you are really talking about the ability to
predict the next value. In the case of an identity value or RowId, it
is assumed that you cannot predict the next value as it is considered a
surrogate. This again, would not be inline with my interpretation of
"physical" locator. <<
No, predictability is not the problem. Such internal machine states
are not *verifiable* in the real world. Example: I use a VIN to
identify an automobile and you use nothing but a GUID. Lightning hits
both our machines and takes out that table. I walk out to the
motor pool and read the VINs off the cars. You are screwed.
>> In this area, I respectively disagree with Codd. ..<<
Well, then you are going to have some problems in the RDBMS world :)
>> Even adding another surrogate value doesn't account for the massive
code changes that are required so that existing code can uniquely find
fact instances. <<
Right now the retail industry is moving from the 10-digit UPC code to
the 13-digit EAN code. Thanks to industry standards, they have a
migration path and time to do this. This is why we have CREATE DOMAIN
and ALTER DOMAIN statements in Standard SQL. If you have a product that
is up to Standards, the conversion on the database side is easy. The
poor bastards on the front end will have to redo their reports,
screens and all that stuff, however.
Look up the Nucleus database engine architecture. It uses domains to
hold the values, then creates tables on the fly from compressed bit
vectors that model the relations. The results are assembled one column
at a time. Everything is automatically compressed, fully indexed and
incredibly fast, which makes it a very good VLDB product.
Also, since the tone of your post was a bit confrontational: lest I not
be clear, and to ensure that my tone here is, and remains, appropriate, I
appreciate the time and effort you invested in replying to my post. My
opinion (despite your characterizations) is informed and thought out, and
represents a substantial investment of time and effort, but I wouldn't
participate in these threads if I wasn't interested in learning more, and
in correcting myself in those areas where my knowledge and/or opinions are
not as well informed as they should be.
Where?
The only relevant hit that Google returns is an article you wrote for
DBMS in 1996.
Again, you are presuming that the use of an identity field means its
sequential nature is specifically desired. Think in more general terms of a
system-generated key
that uniquely identifies a row. Nothing more.
>>> 2. Using your definition above, would a value generated on the
> client (but not the user) be considered a "physical" locator? <<
>
> Could be. Hell, I have seen Oracle programmers use the ROWID to move
> the read/write heads at the hardware level. Terrible way to program,
> but this was Oracle.
But no more so than any PK in any table. This type of example is entirely
orthogonal to the discussion. Get your mind out of thinking that a
system-generated key must be sequential. If said developers complained to me
because the values had gaps or were seemingly random, I would have no remorse.
>>> 3. It sounds like you are really talking about the ability to
> predict the next value. In the case of an identity value or RowId, it
> is assumed that you cannot predict the next value as it is considered a
> surrogate. This again, would not be inline with my interpretation of
> "physical" locator. <<
>
> No, predictability is not the problem. Such internal machine states
> are not *verifiable* in the real world. Example: I use a VIN number to
> identify an automobile and you use nothing but a GUID. Lightning hits
> both our machines and takes out that table. I walk out to the
> motorpool and read the VIN numbers off the cars. You are screwed.
Straw man argument. Without the computer system, simply having the VIN gives you
little. There are thousands of pieces of data stored in the system that relate
to the vehicle, not buried in the VIN, that would also need to be known but
can't be if the computer system is down. Further, there are better ways of
approaching disaster recovery than burying all the fact data in the PK.
>>> In this area, I respectfully disagree with Codd. ..<<
>
> Well, then you are going to have some problems in the RDBMS world :)
Not at all. Do you structure all of your databases to seventh normal form?
Doubtful. Just because there are elements of Codd's theory with which I disagree
does not mean I must throw out all of his theories.
>>> Even adding another surrogate value doesn't account for the massive
> code changes that are required so that existing code can uniquely find
> fact instances. <<
>
> Right now the retail industry is moving from the 10 digit UPC code to
> the 13 digit EAN code. Thanks to industry standards, they have a
> migration path and time to do this. This is why we have CREATE DOMAIN
> and ALTER DOMAIN statements in Standard SQL. If you have a product that
> is up to Standards, the conversion on the database side is easy. The
> poor bastards on the front end will have to re-do their reports,
> screens and all that stuff, however.
That presumes you have a huge monolithic organization that can guarantee the
makeup of your key. Most database systems do not have this luxury. Furthermore,
what do you imagine is the cost to those companies that built their database
systems with the core assumption that the UPC would be 10 digits that now have
to change? If you were adhering to SQL-92 (where the Domain construct does not
exist, if I'm not mistaken), then you might be looking at thousands of man-hours
to change your system to accommodate a PK structure change. Whereas, if you had
originally designed your system such that the UPC was another, unique attribute
on product in question, making this change would be trivial.
> Look up the Nucleus database engine architecture. It uses domains to
> hold the values, then creates tables on the fly from compressed bit
> vectors that model the relations. The results are assembled one column
> at a time. Everything is automatically compressed, fully indexed and
> incredibly fast, which makes it a very good VLDB product.
Is this an OLAP product?
Thomas
I re-read my post and am embarrassed about the tone and cockiness. I did not
mean to come across that way, but apparently I did, and I am sorry about it.
Stick around and I am sure we will have more meaningful exchanges :-)
--
Anith
Your second recommended link:
http://citeseer.ist.psu.edu/5334.html
says, essentially, exactly what I am saying.... And I quote from that post
on page 3, paragraph titled "Below the Logical Level"...
"The implementation of the database occurs below the logical level in a
HIERARCHY of levels of abstraction. At the BOTTOM of this hierarchy lies the
physical level, which consists of the data stored on some physical medium
(say, on disk), as a sequence of bits. " <my CAPS>
My point, which you missed (perhaps because I did not properly describe it),
is that surrogate keys (like Identities or GUIDs) lie not at the Physical
Level, but somewhere between the Physical and the Logical, and therefore
Joe's constant description of them as "physical" is just wrong..
Logical / Physical is NOT binary; they are just the top and bottom of the
hierarchy... There are other levels of abstraction between them that Joe's
Logical / Physical argument ignores...
Charly
My apologies for the knee-jerk reactions :-(
>> The definition above makes it clear that the word "Logical" is something
>> much more than simply "Not Physical" Nothing in the snippet above has
>> anything to do with Physical storage...
True, my point was to demonstrate the validity of such a distinction. A
physical model does not conclusively mean "physical storage", though storage
might be a part of it.
>> If Physical means what Joe says it means, (such that a defrag changes the
>> Physical data pointers...
Not sure. For instance, look at his classification of keys at:
http://www.intelligententerprise.com/030320/605celko1_1.jhtml
Does anyone know what the author's basis for such a classification is?
Unfortunately, there is a tendency in our industry to treat opinions as
facts.
--
Anith
Regards,
Charly
I confess that the math from the first link you listed above was above my
head. I was stopped at the point where he attempts to "formalize" the notion
of a "Logical" query... by the following:
"Tarski has shown us how. Consider the following cumulative hierarchy of
universes Uo, U1, U2, etc, and their Union U*:
Uo := U, Un+1 := U U P(Un), U* := U Un
(Please excuse the inability to differentiate between Universe U and Union U)
Here P denotes the PowerSet operation."
What is a PowerSet operation?
It is a basic set operation: the power set of A is the set of all
subsets of A, including the empty set and A itself. For example, with
the set {0, 1} the powerset operation gives:
P( {0, 1} ) = { [], {0}, {1}, {0,1} }
-- [] to denote the null set.
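For what it's worth, the same operation is a short sketch in Python
(illustrative only, not anything from the thread; `powerset` is my own name):

```python
from itertools import chain, combinations

def powerset(items):
    """Return every subset of `items`, from the empty set up to the full set."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [set(s) for s in subsets]

# powerset([0, 1]) yields [set(), {0}, {1}, {0, 1}],
# matching P( {0, 1} ) = { [], {0}, {1}, {0,1} } above.
```

As expected, a set of n elements always yields 2^n subsets.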
--
Anith
>The change came with W2K. But seq guid's use a different Win32 API call. So, we have two different
>Win32 calls to get guid's, and in SQL 2005, SQL Server exposes both of them.
Eh? I know CoCreateGuid is exposed by ole32.dll, but what is the name
of the API call used for generating sequential guids? Please? :)
TIA,
Joergen Bech
I looked back at my saved snippets of text, but I couldn't find it... Perhaps it isn't exposed in
OLE, but only in Win32 and probably also the .NET Framework?
... Had a quick look in the VS2003 docs but didn't find it; I'm not very good at the VS help
system...
--
Tibor Karaszi, SQL Server MVP
http://www.karaszi.com/sqlserver/default.asp
http://www.solidqualitylearning.com/
"Joergen Bech @ post1.tele.dk>" <jbech<NOSPAMNOSPAM> wrote in message
news:2urq51puhm7ktt88a...@4ax.com...
I think that using the generic term surrogates interchangeably with
product-specific features like identity, GUID etc. can cause some confusion.
Surrogates are logical concepts ( I did not miss this in your post ), since
their values belong to a type that is mapped, via a named attribute, to tuples
in a relation. Surrogate values can be user-provided or system-generated;
at the logical level there is no distinction between the two.
However, as I mentioned before, identity columns ( not surrogates in general,
just the SQL Server-specific identity columns ) come with certain quirks, in
that they violate certain general relational principles. For instance, we can
alter an underlying index to change the value that is generated for a given
row.
Identity values can be manipulated to provide gaps in its sequence by
manipulating uncommitted transactions.
Identity column values are non-updateable, by its nature. Does this hold
with the definition of a relational database where all data can be accessed
and manipulated by the user? How does Information Principle hold with this?
Another practical problem is the serialization of multi-row inserts when the
referencing key column is an identity column. Yet another one is deleting a
row and inserting the same row in a single transaction when there is a
referenced foreign key on the identity column. Cascading cannot work since
ON UPDATE actions on identity columns can fail.
We cannot discount all these as logical "features" of an identity column,
can we?
>> ...and therefore Joe's constant description of them as "physical" is just
>> wrong..
Possible, but the fact of the matter is that the distinction between logical
and physical models is not sufficiently demarcated in most DBMSs supporting
SQL. This is the real issue we have to deal with, when we meticulously try
to define what is logical and what is physical in current environments.
--
Anith
I agree with levels in between. That is why I tried to classify keys
into types, with natural keys as the best choice because you can verify
them externally, validate and port them. I prefer industry standards
over "homemade" encodings because they let me exchange data. The worst
things are those that depend on the particular hardware state of a
particular product at a particular time. And they are by definition not
keys because they do not model attributes in the data model. Not all
keys are created equal.
Just 2 points though: SQL Server doesn't guarantee that the values generated
will be sequential (that is, that there won't be any gaps) and does not give
you any way of knowing in advance what the next value generated will be.
These 2 points are what most people (well mostly newbies) in this forum have
problems with when using Identity.
Just for fun:
I could argue your point about not being able to update values in an
Identity column.
Why would someone want to update an Identity value?
"Anith Sen" <an...@bizdatasolutions.com> wrote in message
news:ujNFb7FQ...@tk2msftngp13.phx.gbl...
If all that is desired is a system-generated key, it shouldn't matter whether we
can predict or have the system alter the next value. You seem to be dwelling on
how the key is generated. If you are using SQL Server and move to Oracle or
Informix, you clearly will have to find another way of generating key values.
That can be handled behind the scenes to the users. Again if predictability of
the key value is irrelevant, it won't matter if the key values jump from 12345
to 54321.
> Identity values can be manipulated to provide gaps in its sequence by
> manipulating uncommitted transactions.
Again, you are presuming that the sole purpose of an identity field is its
sequence. Most people want a system-generated value and couldn't care less
whether the values are sequential with no gaps.
> Identity column values are non-updateable, by its nature. Does this hold with
> the definition of a relational database where all data can be accessed and
> manipulated by the user? How does Information Principle hold with this?
If we are now talking about SQL Server specifically, Identity values are in
fact updateable. Further, by definition a surrogate key is one that the user
cannot change, so they shouldn't be changing it anyway. Throughout these
discussions it is clear that people seem unable to understand that there are two
fundamentally different purposes for the primary key in database design.
Programmatic systems (business and data layers for example) want a way to find a
single row such that the specification of the key can *never* change. Users want
a way to uniquely find a fact in a table. Those are two entirely different
purposes. The latter purpose can be achieved through the use of various, and
even multiple, unique indexes. The former can only be achieved by a
system-generated (what I consider arbitrary) value.
> Another practical problem is the serialization of multi-row inserts when
> referencing key column is an identity column.
Please clarify. I'm not sure what you mean by the "serialization of multi-row
inserts."
> Yet another one is deleting a row and inserting the same row in a single
> transaction when there is a referenced foreign key on the identity column.
> Cascading cannot work since ON UPDATE actions on identity columns can fail.
I'm not sure I understand your example. Could you be more specific?
>>> ...and therefore Joe's constant description of them as "physical" is just
>>> wrong..
>
> Possible, but the fact of the matter is that the distinction between logical
> and physical models is not sufficiently demarcated in most DBMSs supporting
> SQL. This is the real issue we have to deal with, when we meticulously try to
> define what is logical and what is physical in current environments.
The need that SQL Server's Identity column fills is as a system-generated value.
That's it. That it might be sequential is irrelevant for the purposes of
generating key values. How it is generated is also irrelevant. The code that
locates a viable key value could be the Identity column, or a function call to a
DLL. What matters is that it is insulated from changes in user specification.
Thomas
Not that I know of. The column, yes -- but not the whole table.
(snip)
>Example: I use a VIN number to
>identify an automobile and you use nothing but a GUID. Lightening hits
>both our machines and takes out that table. I walk out to the
>motorpool and read the VIN numbers off the cars. You are screwed.
Hi Joe,
You seem to have missed the simple fact that Thomas was discussing
identity or guid as SURROGATE KEY (that is: in addition to a verifiable
business key), not as an ONLY KEY (that is: screwing up the data model).
>Right now the retail industry is moving from the 10 digit UPC code to
>the 13 digit EAN code.
(snip)
>If you have a product that
>is up to Standards, the conversion on the database side is easy. The
>poor bastards on the front end will have to re-do their reports,
>screens and all that stuff, however.
And if you have used UPC as business key plus some system generated
meaningless surrogate key in the Products table, the conversion on the
database is even easier: just one table to modify, no more. The poor
bastards that used UPC as the only key in Products will have to change
all tables that have a relationship to the Products table as well.
(Note: I would be one of those poor bastards, since I seldom use a
surrogate key - but I do see the validity of Thomas' point!)
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)
In the old network databases, the system kept the pointer chains
correct for us. They had utility programs for re-building the chains,
garbage collection, etc. When you have both a surrogate that you have
to maintain and a real key, how do you keep them in sync?
DROP TABLE Foobar;
CREATE TABLE Foobar
(id INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
real_key INTEGER NOT NULL UNIQUE);
INSERT INTO Foobar VALUES (10);
INSERT INTO Foobar VALUES (20);
INSERT INTO Foobar VALUES (30);
SELECT * FROM Foobar;
BEGIN -- this does nothing logically with real_key,
-- but it trashes the physically derived id.
DELETE FROM Foobar WHERE id = 2;
INSERT INTO Foobar (real_key) VALUES (20);
SELECT * FROM Foobar;
END;
Look at postings on this newsgroup about IDENTITY and BCP errors, or go
to the Oracle groups and see how many programmers hardcoded a rowid
into code whose physical storage got changed.
>> And if you have used UPC as business key plus some system generated
meaningless surrogate key in the Products table, the conversion on the
database is even easier: just one table to modify, no more.<<
If I had full SQL-92, I would do a single ALTER DOMAIN statement and
the engine would do the work for me. My only worry would be that
someone wrote DDL with a UPC code, called it something else and did not
use the DOMAIN feature.
For managed code, you can just call Guid.NewGuid - this is equivalent to
UuidCreate.
e.g.
[DllImport("Rpcrt4.dll")]
public static extern int UuidCreateSequential(out Guid guid);
[DllImport("Rpcrt4.dll")]
public static extern int UuidCreate(out Guid guid);
static void GUIDTest()
{
    for (int i = 0; i < 5; i++)
    {
        Guid myGuid = Guid.NewGuid();
        Console.WriteLine("myGuid={0} ", myGuid);
    }
    Console.ReadLine();
    for (int j = 0; j < 5; j++)
    {
        Guid myGuid2;
        int ret = UuidCreateSequential(out myGuid2);
        Console.WriteLine("UuidCreateSequential={0} ", myGuid2);
    }
    Console.ReadLine();
    for (int k = 0; k < 5; k++)
    {
        Guid myGuid3;
        int ret = UuidCreate(out myGuid3);
        Console.WriteLine("UuidCreate={0} ", myGuid3);
    }
    Console.ReadLine();
}
--result--
myGuid=ecdef4d9-680a-49a1-bb5d-472972f0eb04
myGuid=d7e661bf-7e1c-40bf-9aff-6699306aacfc
myGuid=d2555cda-0daa-4ce6-995c-1d743a096846
myGuid=69b296a0-4efc-439c-a5b3-507afc11aaf7
myGuid=19b1840d-92aa-41ea-a526-86cbcd6d85cd
UuidCreateSequential=d2c72daa-ac73-11d9-97a7-00301bb4afc8
UuidCreateSequential=d2c72dab-ac73-11d9-97a7-00301bb4afc8
UuidCreateSequential=d2c72dac-ac73-11d9-97a7-00301bb4afc8
UuidCreateSequential=d2c72dad-ac73-11d9-97a7-00301bb4afc8
UuidCreateSequential=d2c72dae-ac73-11d9-97a7-00301bb4afc8
UuidCreate=b07b7a2b-eb96-40b5-835b-138a5c69a7b6
UuidCreate=7667d40e-a201-494d-bf50-cacbc7e16619
UuidCreate=f4085de2-3dda-43d1-acda-90892ee4efc7
UuidCreate=40ac5c81-8692-4c44-aa41-f7ab977471bc
UuidCreate=4bf67ec0-6264-441e-9f90-f84e73eade3c
--
-oj
"Tibor Karaszi" <tibor_please.n...@hotmail.nomail.com> wrote in
message news:u%23ehUAGQ...@TK2MSFTNGP10.phx.gbl...
I also think keys can be classified into types, and I agree that
"natural" keys are generally the best choice. In fact, I would agree with
you that they are generally the best choice "by far". For one of the main
purposes for which keys are created and used (that of eliminating the
possibility of having more than one record which represents the SAME data
entity), there is NO Substitute for a "natural" key. And in fact, I believe
that every "primary" *table (relation) should have at least one "natural" key
established on it.
But there is another function that keys are used for, which is fundamentally
different from the one mentioned above, and that is to act as the "glue" (in
the form of Foreign Keys) that bind relations to each other. For this
purpose, it just doesn't matter whether the key has meaning, whether the
values are sequential or not, or whether they can be updated, or any of the
above. For this purpose All that matters is that the key value be able to
uniquely identify the parent record. So the key must be unique. After that,
the "Best" key is one that is simplest to use (in SQL code, or joins) and
gives the best performance... And a simple integral Identity value has
significant advantages in this regard. (It is of course totally useless
for the first function...)
Clearly when one key can be put in place which satisfies the objectives of
both functions in the optimum way, we are in the best of all worlds. When
this is NOT the case, then, unless the overhead of keeping two unique indices
updated outweighs the advantages, it makes sense to have two keys.
* An exception, (for me), might be those tables that are added to the
implementation, in addition to the tables in the business data model, to
facilitate DRI - like the many-to-many "association" tables...
<snip>...Surrogates are logical concepts...</snip>
I use the term "surrogate" strictly to mean functionally adequate,
"alternative", but not "natural", just as a surrogate mother is functionally
a real mother but not the natural mother... Does the word have an
established, well-understood meaning within database-related computer science?
(I was not a CS Major, as I'm sure you've guessed by now!)
<snip>...For instance, we can alter an underlying index to change the value
that is generated for a given row...</snip>
You of course only mean for future records, not for those which have an
already assigned identity value.
As to the rest of your post, I agree with all you say.
Looking at keys as serving multiple functions, the aspects you mention are
significant to the function of preventing duplicate records, true... but
I don't believe any of these factors affect the DRI function of acting as
foreign Key references to join child table records to a parent record in
another table.
Regards
Charly
INSERT INTO Foobar VALUES (10);
INSERT INTO Foobar VALUES (20);
INSERT INTO Foobar VALUES (30);
SELECT * FROM Foobar;
BEGIN -- this does nothing logically with real_key,
-- but it trashes the physically derived id.
DELETE FROM Foobar WHERE id = 2;
INSERT INTO Foobar (real_key) VALUES (20);
SELECT * FROM Foobar;
END;
How does this trash the "physically derived id" (which, by the way, is no
more "Physical" than the real key)?
Joe, Since you don't believe in using IDENTITIES, GUIDS, etc... and other
surrogate keys, it seems like your solution is to always use ON UPDATE
CASCADE and ON DELETE CASCADE for every table PK-FK relationship. EVERY
single example that I've ever seen you write has the exact same code: "ON
UPDATE CASCADE, ON DELETE CASCADE", and on and on. You let the "system" take care of
the relationship.
However, don't you realize that all that updating/deleting comes at a price?
Obviously, it's NOT free. By using man-made "surrogate" keys (IDENTITY,
GUIDS) you can get around all that database "churning". And if it really
burns you, you can just let the "natural" key be the primary key, and still
use an IDENTITY, GUID as a unique index key that you send out to other
tables as the FK.
That seems like a more logical approach than all this CASCADE churning. Any
problems with that?
> In the old network databases, the system kept the pointer chains
> correct for us. They had utility programs for re-building the chains,
> garbage collection, etc. When you have both a surrogate that you have
> to maintain and a real key, how do you keep them in synch?
>
> DROP TABLE Foobar;
> CREATE TABLE Foobar
> (id INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
> real_key INTEGER NOT NULL UNIQUE);
>
> INSERT INTO Foobar VALUES (10);
> INSERT INTO Foobar VALUES (20);
> INSERT INTO Foobar VALUES (30);
> SELECT * FROM Foobar;
>
> BEGIN -- this does nothing logically with real_key,
> -- but it trashes the physically derived id.
> DELETE FROM Foobar WHERE id = 2;
> INSERT INTO Foobar (real_key) VALUES (20);
> SELECT * FROM Foobar;
> END;
Yes, it trashes the key. Whether the key is really considered "physical" is
another matter. Further, what does this example prove? The next value holds no
magic meaning, nor is it required to be sequential. It doesn't matter from the
system's standpoint whether the next number is 12345, 42 or 2222222. The only
thing that matters is that the number is unique.
> Look at postings on this newsgroup about IDENTITY and BCP errors, or go
> to the Oracle groups and see how many programmers hardcoded a rowid
> into code whose physical storage got changed.
Again, the implication that people are using an Identity PK for some other
nefarious purpose is entirely orthogonal to the discussion. There is nothing to
stop someone from using any number of fields for nefarious purposes like this.
The purpose of the identity value is to provide a simple, system-generated key.
Presuming you apply no cosmic meaning to its sequence, your example in no way
helps to support your argument against their use.
>>> And if you have used UPC as business key plus some system generated
> meaningless surrogate key in the Products table, the conversion on the
> database is even easier: just one table to modify, no more.<<
>
> If I had full SQL-92, I would do a single ALTER DOMAIN statement and
> the engine would do the work for me. My only worry would be that
> someone wrote DDL with a UPC code, called it something else and did not
> use the DOMAIN feature.
And what about the thousands of lines of code that were designed around querying
for a Char(10) value that now have to change and be tested to a Char(13)? Every
foreign key that points to this value would have to be changed. What about the
thousands, millions perhaps tens of millions of records that would have to be
changed to allow for this change? Changing the characteristics of a central PK
is not as simple as changing the field size.
Thomas
Programmatic systems, on the other hand, need a way of uniquely finding a row in
such a way that the manner in which it locates that row *NEVER* changes.
Changing the way that a row is located can be a major disaster for applications
built on top of a database system. Every system built on a table whose PK design
changes has to be updated, tested, deployment tested and deployed. That could
cost weeks or months in a huge company. That could easily be thousands of
man-hours to fix what should be an expected change (like the UPC going from
Char(10) to Char(13)).
Your UPC example really substantiates my point. Using a natural key as the value
that glues relations together and is used by the system (not necessarily the
user) to locate rows is a decision fraught with peril. If a searchable value that
can extend beyond the company is necessary (e.g. UPC, VIN), fine. I just
wouldn't make it the value that is used to bind tables together internally to
the system.
Thomas
If we had full SQL-92, then maybe that would work. However, we're
discussing SQL Server, so we don't. So we have to have work-arounds,
which may require more intervention by the database programmer.
For instance, what's being advocated here is to have two keys in the
table - the natural key (with uniqueness constraint), that is only
stored/maintained/(have its nature altered, e.g. char(10) -> char(13))
in the one table, and a second key which allows the relationships
between tables to be stored/maintained *without being impacted by
changes required to the nature of the natural key*.
The fact that the programmer has to be aware of this second key is
unavoidable, but the programmer has made this decision to make
maintenance *easier* in the future.
Actually, the point of my original post was to find the best way
to generate surrogate keys for maintaining relations between records,
but doing it in such a way as to not cause insert performance to
deteriorate over time as the number of records grow.
In a scenario with disconnected clients, some kind of unique key
is required when creating new records so they won't collide with
other records when data is eventually merged.
4-byte integers have their problems. 8-byte integers are a bit more
useful, as they leave more room for a central server to assign
identity ranges to individual clients (though I would probably use
a function to eat away at an assigned range, rather than using the
built-in Identity function).
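That range-assignment idea can be sketched in a few lines of Python (my own
illustrative names - `RangeAllocator`, `Client` - not any product's API): a
central allocator hands each client a disjoint block of integers, and the
client burns through it locally, only returning to the server when the block
is exhausted.

```python
class RangeAllocator:
    """Central server piece: hands out disjoint, ever-increasing id blocks."""

    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_block = 0

    def allocate_block(self):
        start = self.next_block * self.block_size
        self.next_block += 1
        return start, start + self.block_size - 1   # inclusive range


class Client:
    """Disconnected client piece: consumes its assigned range locally."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.next_id = None
        self.block_end = -1

    def new_id(self):
        # Fetch a fresh block only when the current one is exhausted.
        if self.next_id is None or self.next_id > self.block_end:
            self.next_id, self.block_end = self.allocator.allocate_block()
        nid = self.next_id
        self.next_id += 1
        return nid
```

This is essentially the hi/lo pattern: ids stay sequential per client, yet
two clients can never collide when their rows are merged.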
The main problem with GUIDs, as they exist today in SQL Server 2000,
is that their random nature causes insert performance to degrade
over time, as the task of balancing the search tree becomes greater
and greater. Even when using an Identity field for the clustering
index, just for the sake of making sure new records are added to the
end of a table rather than all over the place, non-clustered indexes
on GUID relation fields still need to be updated/balanced.
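The "all over the place" effect is easy to demonstrate even outside the
database. A rough Python illustration (nothing SQL Server specific): with an
ever-increasing key, every new value sorts after all previous ones, so it
would land at the end of an index; with random GUIDs only about half of
consecutive pairs arrive in sorted order, so inserts scatter across the
whole tree.

```python
import uuid

def fraction_in_sorted_order(ids):
    """Fraction of consecutive pairs (a, b) with a < b.
    ~1.0 for an ever-increasing key, ~0.5 for random values."""
    pairs = list(zip(ids, ids[1:]))
    return sum(1 for a, b in pairs if a < b) / len(pairs)

increasing = list(range(1000))                    # identity-style key
random_guids = [uuid.uuid4() for _ in range(1000)]
# fraction_in_sorted_order(increasing) is exactly 1.0;
# fraction_in_sorted_order(random_guids) hovers around 0.5.
```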
So one should be careful when using NEWID() (SQL Server) or
Guid.NewGuid (.Net) for generating GUID values.
Which is why I was looking into COMB GUIDs and the new
NEWSEQUENTIALID() Yukon function.
So far, my own tests have shown that COMB GUIDs work great
for SQL Server. Mileage for other database systems might vary.
At least, the above is what my research so far has led me to
believe.
/Joergen Bech
--
-oj
"Joergen Bech @ post1.tele.dk>" <jbech<NOSPAMNOSPAM> wrote in message
news:lq6s515dvuic5iccg...@4ax.com...
>A small correction to your statement regarding randomness of newid() in
>sql2k, in that newid() randomness depends entirely on the OS sqlserver is
>running on (i.e. it does not return a random guid on NT4 for each row).
Yes, of course. The MAC address was ditched at some point due
to security concerns (anonymity, etc) and the fact that MAC addresses
were not guaranteed to be unique. Something like that.
I had my head stuck in 2000/XP/2003-land when I wrote the first
post (since I do not support other operating systems for my app).
Would be nice to have a copy of the >=2000 GUID algorithm
so I could write a .Net version of it, though. Just to be sure that
the GUIDs would be generated the same way regardless of the
OS.
/JB
http://groups.google.co.uk/groups?hl=en&lr=&q=sqlserver+oj+Uuidcreatesequential
--
-oj
"Joergen Bech @ post1.tele.dk>" <jbech<NOSPAMNOSPAM> wrote in message
news:fk8s51hh5d3ldq3c3...@4ax.com...
I am working strictly in VB.Net, so I had to convert the C# examples
out there. Just in case anyone else would want it, I am posting
the code and results here.
Your code would translate to something like:
---snip---
Public Declare Function UuidCreateSequential Lib "Rpcrt4.dll" _
    (ByRef guid As Guid) As Integer

Public Function VBNewSequentialID() As Guid
    Dim g As Guid
    Dim ret As Integer = UuidCreateSequential(g)
    Return g
End Function
---snip---
The COMB code would become something like
---snip---
Public Function GenerateComb() As Guid
    Dim guidArray() As Byte = Guid.NewGuid.ToByteArray
    Dim baseDate As DateTime = New DateTime(1900, 1, 1)
    Dim now As DateTime = DateTime.Now

    ' Get the days and milliseconds which will be used to build the byte string
    Dim days As TimeSpan = New TimeSpan(now.Ticks - baseDate.Ticks)
    Dim msecs As TimeSpan = New TimeSpan(now.Ticks _
        - (New DateTime(now.Year, now.Month, now.Day).Ticks))

    ' Convert to a byte array
    Dim daysArray() As Byte = BitConverter.GetBytes(days.Days)
    ' Note that SQL Server is accurate to 1/300th of a millisecond,
    ' so we could divide by 3.333333
    'Dim msecsArray() As Byte = _
    '    BitConverter.GetBytes(CType(msecs.TotalMilliseconds / 3.333333, Long))
    Dim msecsArray() As Byte = _
        BitConverter.GetBytes(CType(msecs.TotalMilliseconds, Long))

    ' Reverse the bytes to match SQL Server's ordering
    Array.Reverse(daysArray)
    Array.Reverse(msecsArray)

    ' Copy the bytes into the guid
    Array.Copy(daysArray, daysArray.Length - 2, guidArray, _
        guidArray.Length - 6, 2)
    Array.Copy(msecsArray, msecsArray.Length - 4, guidArray, _
        guidArray.Length - 4, 4)

    Return New Guid(guidArray)
End Function
---snip---
I don't really see the point of making client-generated COMBs match
SQL Server ones by dividing by 3.333333, thereby decreasing the
randomness of the value?!? Inside SQL Server, there is little choice,
but that does not mean clients need to be restricted by those
limitations.
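For what it's worth, the COMB idea ports directly to other environments. A
hypothetical Python rendering (my own sketch, not the article's code;
`generate_comb` is an assumed name) makes the layout explicit: keep the first
10 random bytes of a GUID and overwrite the last 6 with days-since-1900 plus
milliseconds-since-midnight, big-endian so later timestamps compare higher:

```python
import uuid
from datetime import date, datetime

def generate_comb(now=None):
    """COMB-style GUID: random 10-byte head, time-derived 6-byte tail."""
    now = now or datetime.now()
    guid_bytes = bytearray(uuid.uuid4().bytes)

    # 2 bytes: days since the SQL Server epoch (1900-01-01).
    days = (now.date() - date(1900, 1, 1)).days
    # 4 bytes: milliseconds since midnight.
    midnight = datetime(now.year, now.month, now.day)
    msecs = int((now - midnight).total_seconds() * 1000)

    guid_bytes[10:12] = days.to_bytes(2, "big")
    guid_bytes[12:16] = msecs.to_bytes(4, "big")
    return uuid.UUID(bytes=bytes(guid_bytes))
```

As with the client-side VB version, no division by 3.333333 is applied; the
tail only needs to be monotonic, not to match SQL Server's datetime precision.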
I have not worked on shortening or optimizing the functions above.
In terms of performance, calling the VBNewSequentialID() function
is actually faster than plain old Guid.NewGuid(). GenerateComb,
of course, is far worse, as in
Guid.NewGuid(): 10 million in 3.2 seconds.
VBNewSequentialID(): 10 million in 2.4 seconds.
GenerateComb(): 10 million in 28 seconds.
In terms of using the values for inserts, Guid.NewGuid gives me
1000 records in 2-12 seconds (ouch), whereas VBNewSequentialID
and GenerateComb consistently results in 1000 records in approx
0.5 second, sometimes spiking to 0.7 or (rarely) 1 second.
So: Even though VBNewSequentialID and GenerateComb create
values with quite different bit patterns, I cannot tell any difference
in insert performance:
VBNewSequentialID sample values:
-----------------
07ba2900-acbe-11d9-8e69-444553544200
07ba2901-acbe-11d9-8e69-444553544200
07ba2902-acbe-11d9-8e69-444553544200
07ba2903-acbe-11d9-8e69-444553544200
07ba2904-acbe-11d9-8e69-444553544200
GenerateComb sample values:
-----------------
278a760c-dd50-4e75-a551-96360237e211
728b815e-3529-462a-8c05-96360237e211
48f2dc13-f5ab-4c03-a471-96360237e21b
a34869be-daa1-4802-a5d3-96360237e21b
e802886e-e900-4fea-be42-96360237e21b
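The VBNewSequentialID samples above increment only in the first (time-low) field, with every other field frozen. Purely as an illustration of that bit pattern (not a substitute for UuidCreateSequential), it can be imitated in a few lines of Python; the suffix below is copied from the sample values:

```python
import itertools
import uuid

# Frozen fields taken from the sample values; a real UuidCreateSequential
# derives these from the clock sequence and the MAC address.
_SUFFIX = "acbe-11d9-8e69-444553544200"
_counter = itertools.count(0x07BA2900)

def fake_sequential_id():
    """Emit GUIDs whose string form increments only in the first field,
    mirroring the VBNewSequentialID samples (illustration only)."""
    return uuid.UUID(f"{next(_counter) & 0xFFFFFFFF:08x}-{_SUFFIX}")
```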
In terms of using the values for joins, I understand from the
original article that the values are evaluated from the right,
in which case GenerateComb might have an edge, but if so,
this is likely to be insignificant.
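For reference, the "evaluated from the right" behavior can be modeled as a sort key. The grouping below is the commonly documented comparison order for uniqueidentifier (last six bytes first, then working leftward); treat it as an assumption rather than gospel, since the within-group byte significance is an implementation detail:

```python
import uuid

def sql_guid_sort_key(g):
    """Approximate SQL Server uniqueidentifier ordering: compare the
    last six bytes first, then bytes 8-9, 6-7, 4-5, and the first four
    bytes last. Commonly documented behavior; assumption, not spec."""
    b = g.bytes
    return b[10:16] + b[8:10] + b[6:8] + b[4:6] + b[0:4]
```

This is why both COMB GUIDs and NewSequentialID put their sequential portion in the trailing bytes: that is the part SQL Server compares first.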
Good enough for me. For now, I'll keep both functions around
and stick with VBNewSequentialID unless something causes
me to switch.
Regards,
Joergen Bech
On Thu, 14 Apr 2005 01:09:25 -0700, "oj" <nospam...@home.com>
wrote:
Almost sounds to me like the new Yukon NewSequentialID() function
is the same as CoCreateGuid before Win2000.
Regards,
Joergen Bech
On Thu, 14 Apr 2005 10:50:25 +0200, Joergen Bech
<jbech<NOSPAM>@<NOSPAM>post1.tele.dk> wrote:
>
>Yes yes. I was working on it and doing some tests. Thanks :)
>
>I am working strictly in VB.Net, so I had to convert the C# examples
>out there. Just in case anyone else would want it, I am posting
>the code and results here.
>
>Your code would translate to something like:
>
>---snip---
> Public Declare Function UuidCreateSequential Lib "Rpcrt4.dll" (ByRef guid As Guid) As Integer
>
> Public Function VBNewSequentialID() As Guid
>     Dim g As Guid
>     Dim ret As Integer = UuidCreateSequential(g)
>     Return g
> End Function
>---snip---
---snip---
I don't disagree with you that changes to keys in the logical model can
pose problems for application design, but the issue is mainly due to
the popularity of development techniques, tools and interfaces that
don't provide an easy way to bind data access code to keys. There is no
fundamental reason why this HAS to be so. In some environments this
problem is solved by code generation, in others by abstracting the
logical design through an object-based data access tier.
<quote>
In this area, I respectfully disagree with Codd. He clearly never
involved himself in a massive and complex database system where the
"unique, immutable" definition of primary key had to change because the
original designer did not account for the possibility that the
definition of the term "uniqueness" according to the users would change.
</quote>
It seems like a bit of a stretch to blame Codd for a failure to account
for inadequacies in development tools and middleware. I don't think
this is something that comes within the scope of the articles that Joe
referred to.
--
David Portas
SQL Server MVP
--
That is an implementation issue. You are manually doing what the
engine should be doing, but without any safety features or much help.
In Sybase SQL Anywhere (née Watcom SQL), the PK value is stored in
one and only one place. The FKs that reference it are pointer chains
back to that value. The update cascades are one write operation, while
the delete cascades are a matter of going down a pointer chain list.
Likewise, the compressed bit vector used in Nucleus has the same
effect.
The contiguous storage model in SQL Server makes IDENTITY easy to
implement, but it also makes each table a "standalone file" rather than
part of a whole schema. Doesn't it bother you that you have to
manually add an index to all the FKs yourself to speed up joins?
Does this actually exist, or is it something you hope to invent some
day?
Thomas
"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1113486025.9...@f14g2000cwb.googlegroups.com...
Technically, yes, if all the development teams across the entire enterprise
had all code designed in such a way as to account for this type of change that
might be possible. But come back to reality for a moment. Firstly, remember that
most enterprises have massive amounts of legacy code out there. Structural
changes can cause tsunami sized ripples throughout the enterprise. Secondly,
application developement tools use an entirely different paradigm than database
systems. There has always been a certain degree of disconnect between those
tools and their developers and the database system and its developers. Lastly,
ORMs imply a certain design approach which may be avoided for valid reasons,
but that is another religious war for another time. It is dangerous to build your
database with the assumption that everyone is going to use an ORM even if they
are available to everyone.
<snip>
> It seems like a bit of a stretch to blame Codd for a failure to account
> for inadequacies in development tools and middleware. I don't think
> this is something that comes within the scope of the articles that Joe
> referred to.
It sounds like you are echoing the disconnect between theory (Codd) and
application. In a purely theoretical sense, I agree that surrogate keys should
never be needed. I agree that system generated anything is not a fact of the
world and should not be represented in the database. That's great in theory. In
reality, however, that type of system handles change and evolving systems
poorly.
In theory, as Joe pointed out, if the key makeup changes, you simply change the
Domain constraint and because all of your relationships have Cascade Update, you
are golden. The reality is that in any product, the implication of that kind of
change to legacy systems, middleware, reporting systems and the volume of data
changes might be enormous and as such enormously expensive to effect.
Thomas
Excellent point about the consequences of having to change the size of
the key. Hadn't thought of that, but it is another practical issue which is
very relevant to this discussion. Needing to change the datatype (although
not as common) is also a consideration.
Still, it's important to remember: constraints are a good thing!
Reminds me of a funny run-in I had with a tech support department. In
maybe '97, I was tasked with doing an audit of a large banking corporation's
desktop computers, for a buyout. I don't remember the name of the software
that I used, but it required every user to put a floppy disk into their
machine and run a program for a few minutes that would collect all sorts of
info. Then the floppy disks were mailed back to headquarters and we
tabulated all of the data (this corporation didn't yet have a WAN). Anyway,
each computer was assigned a "unique" identifier when the program was run --
10 digits IIRC. So lo and behold, I get a collision and can't get one of
the branches loaded into the database. I called tech support and was told
that collisions were absolutely impossible! -- this program had been TESTED,
and so far had worked fine for everyone else -- and there was no way I could
have possibly gotten a collision! After a while, the support guy got so
angry he hung up on me. They never did admit that it was a collision, and
we ended up having to re-run the numbers for that branch. But I always
think about it when I'm working with so-called "unique" identifiers ;)
Most of us understand the concept of surrogates and become opinionated based
on the existing provisions for surrogates in the DBMS products we are
familiar with. That is also why usage of surrogate keys becomes a hot topic in
product related discussions/forums, whereas in database related scientific
circles it is generally a non-issue. Regarding your question, see if the
following helps:
http://groups.google.ca/groups?selm=%23UJYBUA9EHA.2112%40TK2MSFTNGP14.phx.gbl
--
Anith
"Joergen Bech @ post1.tele.dk>" wrote:
--
-oj
"CBretana" <cbre...@areteIndNOSPAM.com> wrote in message
news:38AB8C05-386A-4239...@microsoft.com...
Sand Analytic Server. They no longer use the Nucleus name.
"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:%23vfW2BF...@TK2MSFTNGP09.phx.gbl...
> "--CELKO--" <jcel...@earthlink.net> wrote in message
> news:1113412834.5...@f14g2000cwb.googlegroups.com...
>>
>> Look up the Nucleus database engine architecture. It uses domains to
>> hold the values, then creates tables on the fly from compressed bit
>> vectors that model the relations. The results are assembled one column
>> at a time. Everything is automatically compressed, fully indexed and
>> incredibly fast, which makes it a very good VLDB product.
>
> Where?
>
> The only relevant hit that Google returns is an article you wrote for
> DBMS in 1996.
Hi Joe,
(snip)
>BEGIN -- this does nothing logically with real_key, but it trashes the
>physically derived id.
>DELETE FROM Foobar WHERE id = 2;
>INSERT INTO Foobar (real_key) VALUES (20);
>SELECT * FROM Foobar;
>END;
I'll steer clear of the "physically derived or not" debate this time. But I
do want to reply that this code "trashes" nothing. It does "change"
something, namely the value of id.
But since id is a system-generated key that is not exposed to the end
user, but only used for internal processing - who cares if it's changed?
Not me, I don't!
BTW, if the code above is supposed to be issued by a user, the DELETE
should have read:
DELETE FROM Foobar WHERE real_key = 20
or, even better:
EXEC DelFoobar 20
Remember: the real key is what the user gets to see and use; the
surrogate key is internal.
>Look at postings on this newsgroup about IDENTITY and BCP errors,
Look at postings on this newsgroup making the most horrific abuse of
CASE. Does that mean that CASE in itself is bad? No, of course not - it
means that the abuse is bad.
If you have children, do you remove all knives from your house, or do
you tell your children to not touch the knives until they're old enough
to know how to handle them without getting hurt?
> or go
>to the Oracle groups and see how many programmers hardcoded a rowid
>into code whose physical storage got changed.
I know zilch about Oracle, so I see no reason to check those groups. But
I have no reason to doubt that the Oracle community is blessed with just
as large a number of bad programmers as the SQL Server community, the
kind who abuse facilities without any consideration for those who get
to clean up after they get fired.
Everything can be abused. MS' IDENTITY is no exception, and Oracle's
rowid is undoubtedly no exception either.
>>> And if you have used UPC as business key plus some system generated
>meaningless surrogate key in the Products table, the conversion on the
>database is even easier: just one table to modify, no more.<<
>
>If I had full SQL-92, I would do a single ALTER DOMAIN statement and
>the engine would do the work for me. My only worry would be that
>someone wrote DDL with a UPC code, called it something else and did not
>use the DOMAIN feature.
Yeah? Would ALTER DOMAIN convert the values for you? Both in the
products table and in ALL tables that refer to the products table?
Anyway - you DON'T have full SQL-92. We're in a MS SQL Server group
here, and until Yukon gets released, we'll have to assume SQL Server
2000 as the default.
To get this conversion done without surrogate keys with the minimum
downtime and the minimum effort, I think these would be the steps:
1. Shut off user access to the system, to prevent erroneous input while
some constraints are temporarily dropped;
2. Drop all FOREIGN KEY constraints referring to the UPC;
3. Use ALTER TABLE to change the datatype on all tables that hold a UPC
column;
4. Recreate all FOREIGN KEY constraints dropped in step 2 (with update
cascading);
5. Restore user access to the system;
6. Run the conversion script to change the old codes to the new codes in
the Products table (the cascading option will take care of propagating
the change to all other tables). Users will soon start experiencing
timeouts, since this change will acquire locks on all rows in all tables
that have at least one column holding a UPC.
7. Done. System has been shut off during steps 2, 3, and 4, and
effectively useless during step 6.
Now, if we had designed the Products table with UPC as UNIQUE column and
some unexposed, internally generated and meaningless ProdID as PRIMARY
KEY column, with the latter being used for all foreign key relationships,
the same conversion would require considerably less downtime:
1. Use ALTER TABLE to add the new UPC column, with a UNIQUE constraint
of course.
2. Shut off user access to the products table to prevent erroneous input
during the conversion. Other tables can still be used.
3. Run the conversion script to change the old codes to the new codes in
the Products table.
4. Restore user access to the products table, but make sure to disallow
updates to the old UPC column.
5. Old UPC column can be removed at a later time, during scheduled
outage.
6. Done. System has never been completely shut off; usage has been
restricted during step 3 only. (And that step would also run lots
quicker than step 6 from the previous plan).
Supposing you're the manager of an international company that uses its
website to do business, and that requires as close to 24x7 operation as
possible (since each minute of down time costs orders) - which version
would you choose?
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)
On 14 Apr 2005 06:40:26 -0700, --CELKO-- wrote:
(snip)
> Doesn't it bother you that you have to
>manually add an index to all the FKs yourself to speed up joins?
Yes, it does - but I don't see how that relates to this discussion. Or
are you trying to tell me that I only need to manually add an index if I
use identity values, and that the database engine will add an index for
me if I use the natural key in all FKs?
It definitely exists: http://www.sand.com
I have the white paper that describes the technology, although I cannot
find it on their site any more.
Someone else pointed me to the site yesterday -- I have requested the
white paper, but what I'd really like is a demo!
No, but they did have a KEY JOIN operator that tells the compiler to
use the pointers instead of the column. I am not sure if the optimizer
has gotten good enough not to need that now.
Another thing WATCOM SQL did which I liked was let you put in an
estimate of how often you expected a predicate to be true -- (sex = 1,
85) meant that you thought that 85% of the time, the ISO sex code would
be "male". The optimizer would then estimate cost with your numbers.
As a bit of geek trivia, when Stonebraker's grad students were writing
Ingres, the default assumption for a search argument expression (<col> =
<constant>) was 5% true. It was a "magic number" that someone made up
on the fly. Because the grad students who worked on that project
designed the first commercial optimizers, the "magic number" came with
them. This tended to mess up execution plans when the column had a
small number of actual values, like gender, or a skewed distribution,
like 85% male.
You get the effect of having an index on everything. They even put a
CREATE INDEX statement in the parser which did absolutely nothing, but
it made porting DDL from other products a bit easier. Besides the huge
physical storage savings, you get the ability to do some weird data
mining queries. You can look at the values in a domain to pull out the
rows in every table that have that value -- "Show me all the red things
in the schema!", "show me everything that happened on 2005 March 31",
etc.
They used to have a white paper on the internals of the first version,
but I have not seen anything about the later work. The trick was in the
algorithms for working with the bit vectors while they were still
compressed. The results were then materialized one column (domain) at
a time on screen or to storage. It was really weird to see a table
appear like that, instead of one row at a time.
A guy I met in Salt Lake City is trying to use the engine for HIPAA and
other massive text data projects. I don't know the details.
Sand Technology in Canada. Google it. The original engine was designed
by Ted Glazer, one of the Burroughs 5000 guys and the founder of the
Case Western Computer Science department. The company was Marcus
System, named after his seeing eye dog. It changed to Nucleus and was
written up in a "Things to Watch" article in DATAMATION one year.
They even had a custom chip set burned to run the engine as well as the
software version.
Ted's wife died and he followed her shortly thereafter. The patents
bounced around and finally settled down at Sand.
The real trick in the engine was the ability to work with the bit
vectors in their compressed format. They were expanded only in the
last step, one domain at a time. This meant that the display on the
screen materialized column by column and not row by row. You could
also search down a domain to ask questions like "What red things do we
have in the schema?" or "Tell me everything that happened on 2004 March
31" as fast as a regular SELECT statement. Great for data mining.
The fewer values in a domain, the better the performance and the smaller
the database. For example, instead of keeping phone numbers in one
column, break them into area code, exchange, and number columns. The
SAND Searchable Archive typically stores data in less than 10% of the
space required by an equivalent relational data warehouse.
If that were widely available, that would solve this semantic debate on the use
of surrogates. In essence, the system would handle the creation and mapping of
surrogates, and "keys" per se would only be things applicable to the user
requirements. It's a shame that more database systems do not provide this
capability.
RE: Search argument assumption
Wouldn't that serve the same purpose as SQL Server's column statistics except
that they'd be DBA-settable? That is kind of a cool idea actually.
Thomas
"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1113594628....@f14g2000cwb.googlegroups.com...
Did using the Key Join operator syntax require the programmer to specify
which Key pointer to use?
How did you have to specify the columns in the Child Table that represented
the FK side of the RI Constraint?
You implied that these columns are not actually IN the child table, right?
Since you said the values are stored only once, in Parent PK Columns. Or did
you just have to specify the Reference Table, and the Unique Key in the
reference Table that the FK DRI Constraint is referencing ?