CREATE TABLE [dbo].[ccTo] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[codePractice] [int] NULL ,
[codeDemographics] [int] NULL ,
[codeProvider] [int] NULL
) ON [PRIMARY]
GO
all colums are indexed individually. tried a few 2 column indexes as well.
The join takes FOREVER! (30+ minutes).
These queries
select * from ccTo where exists (select * from ccTo c2 where
c2.codeDemographics = ccTo.codeDemographics and c2.codePractice =
ccTo.codePractice and c2.codeProvider = ccTo.codeProvider and c2.id < ccTo.id)
select count(*) from ccTo c1 join ccTo c2 on c1.codeDemographics =
c2.codeDemographics and
c1.codePractice = c2.codePractice and c1.codeProvider = c2.codeProvider and
c1.id > c2.id
are effectively essentially the same and take 30+ minutes.
Even this
select count(*) from ccTo where id < 10000 and exists (select * from ccTo c2
where id = ccTo.id)
Takes about 30 seconds. There is little to no load on the db at this time.
What could the problem be? Most self joins I do like this on indexed tables
are very fast.
Thanks
Chris
What's the DDL for your indexes? Do you have a clustered index on this bad
boy or are they all non-clustered indexes? What does the query execution
plan look like? Maybe the query engine determined it was cheaper to ignore
your indexes and use a table scan...
If that IDENTITY column never had been added, maybe those duplicates
would never sneaked in? :-) Whatever, once you are done, add a
UNIQUE constrint on (codePractice, codeDemographics, codeProvider).
> all colums are indexed individually. tried a few 2 column indexes as well.
Invididual indexes are not useful to your query. You need a clustered
index on (codePractice, codeDemographics, codeProvider, id).
And this query may be better:
SELECT a.id, a.codePractice, a.codeDemographics, a,codeProvider
FROM ccTo a
JOIN (SELECT codePractice, codeDemographics, codeProvider,
minid = MIN(id)
FROM ccTO
GROUP BY codePractice, codeDemographics, codeProvider) AS b
ON a.codePractice = b.codePractice
AND a.codeDemograohics = b.codeDemographics
AND a.codeProvider = b.codeProvider
--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Also, you are right in that my non clustered indexes were ignored in favor
of a scan- but again, why would the optimizer ignore the regular indexes? If
it chooses a scan, shouldn't the scan be fast?
Thanks
Chris
Chris
You can't just put a non-clustered index on every column in the table and
expect all queries to be fast.
As Erland said, the query you wrote doesn't find any of your indexes useful,
and the engine determined that a scan would be a faster way to get all the
data.
A
"querylous" <quer...@discussions.microsoft.com> wrote in message
news:62D3B752-E7F1-4C40...@microsoft.com...
Chris
Maybe, but you're not trying to do such a lookup against 1 index, you're
trying to do it against 4 simultaneously.
A
I love that (often repeated) myth. The duplicates have nothing to do with
the identity column. They are there because there was no constraint on the
other 3 columns.
Michael
Ah, but a lot of times an IDENTITY column is added because it's the only way
people know how to prevent client tools (like Open Table in EM or SSMS) to
not complain about needing to know which row they meant to update/delete.
:-)
Possibly so but identity is still not the cause of the duplicates and has
nothing to do with them. The duplicates are there because the constraint is
missing.
BTW, I didn't think EM complained about lack of a primary key?
Michael
It does more often than you think. I can't count how many people have told
me they added an identity column because they needed a primary key.
A
It does more often than you think. I can't count how many people have told
Correct. However, my suspicion is that whoever put those that IDENTITY
column there did so as a matter of routine. If the data-modelling had
included more thought, the process would have gone "is there any obvious
primary key we can use here?", the odds are good that the table would have
been created without the ID column at all.
Actually, it should not matter whether the index I suggested is the
clustered index or not, and if the table would have more columns a
non-clustered index would be better.
What is important is that the index comprises all four columns with the
Id column as the last index. Then SQL Server can read one row, use the
index to quick find the collection of all rows with the same values in
(codePractice, codeDemographics, codeProvider).
If the columns are only indexed individually, SQL Server could choose
to look at all rows with the same value in, say, codePractice. It would
then have to perform a bookmark lookup for all those rows to see if
they match the rest of criterias. But too many bookmark lookups are more
expensive than just scanning tbe table, while the optimizer may prefer to
perform a second table scan.
Maybe so, but the identity column is still not the issue. Again, the lack of
a constraint on the other fields IS the issue. This is NOT a valid reason to
use a natural primary key.
Michael
It's also possible they realised the extra work and complexity involved in
using a natural primary key. ;-)
Michael
I am not saying the IDENTITY column *is* the source of the issue. I'm just
suggesting *WHY* Erland made a statement that *alluded* to the fact. Relax
already!
Ok, fair enough but maybe you should re-read your post. I posted that
identity is not the cause of the issue. Your response was "It does more
often than you think".
Michael
Being lazy is not an excuse. With Google, not checking for industry
standard is not an excuse. If you have a natural key, then you must
enforce it anyway if you care about data integrity. But you probably
think that data integrity is also a myth.
I think we are all certain that the redundant duplicates came from
inserting the same data twice (or more often). Newbies who think that
IDENTITY can ever be a key fall into this trap within one year of
deployment.
His other problems are the use of camelCase (even MS stopped this),
violation of ISO-11179 rules for data element names, all NULL-able
columns and a lack of DEFAULTs and other constraints. And
over-indexing.
Based on a total lack of specs, my guess would be something like this:
CREATE TABLE Foobar
(provider_id INTEGER DEFAULT 0 NOT NULL PRIMARY KEY,
practice_code INTEGER DEFAULT 0 NOT NULL,
demographics_code INTEGER DEFAULT 0 NOT NULL);
Provider sounds like an entity, which would have a practice and
demographics as attributes. Just a wild guess. If you do a lot of
reports grouped by demographics_code, then put a clustered index on it.
But why are there no constraint on the other 3 columns? Because
newbies believe that IDENTITY is a primary key when it cannot ever be
a key. That is the real myth that kills DBs, as we are seeing now.
NO, the lack of the constraint and the IDENTITY are totally seperate issues.
Michael
I really dislike this statement. There seams to be a myth that because
identity primary keys make working with databases easier in pretty much
every way that the programmer is somehow lazy.
> With Google, not checking for industry
> standard is not an excuse.
The standard these days seems to be non-natural pks.
> If you have a natural key, then you must
> enforce it anyway if you care about data integrity.
Of course.
> But you probably
> think that data integrity is also a myth.
No, that's something you invented.
> I think we are all certain that the redundant duplicates came from
> inserting the same data twice (or more often). Newbies who think that
> IDENTITY can ever be a key fall into this trap within one year of
> deployment.
This is another statement I see here often and quite dislike. Because
newbies generally use identity pks without thinking, the technique is
associated with beginners. This, the statement about laziness and the one
about unique data are so often repeated yet make absolutely no sense. It
reminds me of when those religious nutters who come to my door, they're
trying to sell me their ideas but their arguements are so flawed they
convince me of the opposite.
On the other hand the arguements against natural pks make complete sense.
Changing data is much easier, changing the field definitions is much easier,
joins are faster, space is reduced, programming model is more consistant,
data is stored once (I thought this was a big thing with database purists,
another thing that doesn't make sense).
We've had this conversation many times before. I've asked you to give me
something concrete on the natural key side and you've been unable to, just
giving me theoretical reasons instead. I've also asked you to answer the
question regarding the difficulty of changing the primary key data or it's
schema and you've sidestepped that many times.
Michael
Anyway, thanks for everyone's expert input...
Chris
You said "and has nothing to do with them" and I said "it does more often
than you think."
I was merely referring to the *fact* that, often, IDENTITY is used as an
alternative to a real key, and that's one way that people can get a table
full of redundant data.
I wasn't saying that was the cause in *this* case, just giving supporting
data for Erland's comment (which did come with a smiley, by the way).
A
I use IDENTITY all the time.
But I also make sure to model the data in such a way that redundant data is
avoided. Just because I use IDENTITY as my PK does not mean I can't apply
unique constraint(s) to the column(s) that may have been my natural key.
Unfortunately, not everyone knows to take that next step, and think that
defining a PK on the IDENTITY column is enough.
This is the difference between:
(a) choosing identity as a key because it makes sense for your model, and
ensuring data integrity at the same time; and,
(b) choosing identity as a key and not worrying and/or knowing about data
integrity.
Nobody is saying that everyone who uses IDENTITY is doing (b). But there is
a strong argument that a lot of people who end up with redundant data *are*
doing (b).
I've been involved in these groups for years, and I can assure you that you
can set your watch to this.
But that's what we're here for, to help them with that knowledge. Not to
bury our heads in the sand and pretend that nobody has anything to learn.
You say that the IDENTITY column and duplicate data have nothing to do with
each other. I still stand by my statement that, sometimes, they do.
IDENTITY gives *some* people an easy out, and they don't have to think about
natural keys or unique constraints. They don't realize the problem until
they are in here asking us how to remove duplicates. Stick around here for
a while, you will see it again and again -- I'm not making this up.
A
I use IDENTITY all the time.
This is exactly my point.
> This is the difference between:
>
> (a) choosing identity as a key because it makes sense for your model, and
> ensuring data integrity at the same time; and,
>
> (b) choosing identity as a key and not worrying and/or knowing about data
> integrity.
>
> Nobody is saying that everyone who uses IDENTITY is doing (b). But there
> is a strong argument that a lot of people who end up with redundant data
> *are* doing (b).
That is correct but a lot of the natural key crowd seem to assume that only
B exists. Many times i've seen people use the duplicate data issue as a
reason to use natural pks.
> But that's what we're here for, to help them with that knowledge. Not to
> bury our heads in the sand and pretend that nobody has anything to learn.
> You say that the IDENTITY column and duplicate data have nothing to do
> with each other. I still stand by my statement that, sometimes, they do.
It depends how you look at it. The identity does not cause the duplicate
data, the lack of a constraint on the other columns does.
> IDENTITY gives *some* people an easy out, and they don't have to think
> about natural keys or unique constraints. They don't realize the problem
> until they are in here asking us how to remove duplicates. Stick around
> here for a while, you will see it again and again -- I'm not making this
> up.
I've seen it many times, I've also seen many times the response "you should
use a natural pk to solve this".
Michael
I think you mean Celko. I don't know of anyone else who believes that. I
think we're just aware that it's common.
> It depends how you look at it. The identity does not cause the duplicate
> data, the lack of a constraint on the other columns does.
That's true, but please read my point again, that a lot of people have the
misconception that having a unique column on IDENTITY is enough data
integrity - and so, because they have an IDENTITY column, they don't bother
looking at what other constraints they should have.
A
In SQL 2005, you can do as below:
CREATE TABLE dubblisar (a int NOT NULL,
b int NOT NULL,
c int NOT NULL)
INSERT dubblisar (a, b, c) VALUES (1, 2, 3)
INSERT dubblisar (a, b, c) VALUES (1, 2, 3)
INSERT dubblisar (a, b, c) VALUES (11, 22, 33)
INSERT dubblisar (a, b, c) VALUES (11, 22, 33)
INSERT dubblisar (a, b, c) VALUES (16, 23, 33)
INSERT dubblisar (a, b, c) VALUES (13, 23, 39)
INSERT dubblisar (a, b, c) VALUES (13, 23, 39)
INSERT dubblisar (a, b, c) VALUES (14, 22, 31)
go
WITH CTE AS (
SELECT row_number() OVER (PARTITION BY a, b, c ORDER BY c) AS rn
FROM dubblisar
)
DELETE CTE WHERE rn > 1
go
SELECT * FROM dubblisar
go
DROP TABLE dubblisar
Your non-clustered indexes probably were generating a lot of Bookmark
Lookups. The SQL engine uses a cost-based optimizer, so if the "cost" of a
Table Scan is less than the cost of an Index Seek + Bookmark Lookups + Sort
operation(s) + etc., then the engine will use the Table Scan. When you
added the clustered index you probably eliminated Bookmark Lookups and
possibly some Sort operation(s) which make the cost of using the Clustered
Index less than the cost of a Table Scan or non-clustered index seeks/scans
for the same query. SQL might have even found a more efficient method of
combining your join results than it previously had - though it's hard to
tell without all the information.
For this kind of troubleshooting your best bet is to use the Estimated and
Actual Query Execution Plan displays in QA. They can help you locate
bottlenecks and long-running operations in your queries and provide hints on
how to optimize your indexes.