Small self join takes long time

querylous

unread,

Oct 29, 2006, 4:26:01 PM10/29/06

to

Hi- I am trying to do a self join to find duplicate entries on a table
containing only 26,000 rows of 4 columns of integer only data. Creation
script here...

CREATE TABLE [dbo].[ccTo] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[codePractice] [int] NULL ,
[codeDemographics] [int] NULL ,
[codeProvider] [int] NULL
) ON [PRIMARY]
GO

all colums are indexed individually. tried a few 2 column indexes as well.
The join takes FOREVER! (30+ minutes).

These queries

select * from ccTo where exists (select * from ccTo c2 where
c2.codeDemographics = ccTo.codeDemographics and c2.codePractice =
ccTo.codePractice and c2.codeProvider = ccTo.codeProvider and c2.id < ccTo.id)

select count(*) from ccTo c1 join ccTo c2 on c1.codeDemographics =
c2.codeDemographics and
c1.codePractice = c2.codePractice and c1.codeProvider = c2.codeProvider and
c1.id > c2.id

are effectively essentially the same and take 30+ minutes.

Even this

select count(*) from ccTo where id < 10000 and exists (select * from ccTo c2
where id = ccTo.id)

Takes about 30 seconds. There is little to no load on the db at this time.

What could the problem be? Most self joins I do like this on indexed tables
are very fast.

Thanks

Chris

Mike C#

unread,

Oct 29, 2006, 6:02:49 PM10/29/06

to

"querylous" <quer...@discussions.microsoft.com> wrote in message
news:87328BF5-FC83-43AE...@microsoft.com...

> Hi- I am trying to do a self join to find duplicate entries on a table
> containing only 26,000 rows of 4 columns of integer only data. Creation
> script here...

What's the DDL for your indexes? Do you have a clustered index on this bad
boy or are they all non-clustered indexes? What does the query execution
plan look like? Maybe the query engine determined it was cheaper to ignore
your indexes and use a table scan...

Erland Sommarskog

unread,

Oct 29, 2006, 6:34:55 PM10/29/06

to

querylous (quer...@discussions.microsoft.com) writes:
> Hi- I am trying to do a self join to find duplicate entries on a table
> containing only 26,000 rows of 4 columns of integer only data. Creation
> script here...
>
> CREATE TABLE [dbo].[ccTo] (
> [id] [int] IDENTITY (1, 1) NOT NULL ,
> [codePractice] [int] NULL ,
> [codeDemographics] [int] NULL ,
> [codeProvider] [int] NULL
> ) ON [PRIMARY]
> GO

If that IDENTITY column never had been added, maybe those duplicates
would never sneaked in? :-) Whatever, once you are done, add a
UNIQUE constrint on (codePractice, codeDemographics, codeProvider).

> all colums are indexed individually. tried a few 2 column indexes as well.

Invididual indexes are not useful to your query. You need a clustered
index on (codePractice, codeDemographics, codeProvider, id).

And this query may be better:

SELECT a.id, a.codePractice, a.codeDemographics, a,codeProvider
FROM ccTo a
JOIN (SELECT codePractice, codeDemographics, codeProvider,
minid = MIN(id)
FROM ccTO
GROUP BY codePractice, codeDemographics, codeProvider) AS b
ON a.codePractice = b.codePractice
AND a.codeDemograohics = b.codeDemographics
AND a.codeProvider = b.codeProvider

--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

querylous

unread,

Oct 29, 2006, 6:44:01 PM10/29/06

to

Thanks Mike- you are correct- it was the need for a clustered index... but I
don't get it - why? Why wouldn't a non clustered index still be fast?

Also, you are right in that my non clustered indexes were ignored in favor
of a scan- but again, why would the optimizer ignore the regular indexes? If
it chooses a scan, shouldn't the scan be fast?

Thanks

Chris

querylous

unread,

Oct 29, 2006, 6:47:01 PM10/29/06

to

Thanks Erland, please see my response to Mike- it was indeed need for
clustered index / primary key.. but why? Why wouldn't an index on id make
this just as fast (or almost as fast, or at least MUCH faster) than no index?
The optimizer DID ignore my indexes, but why would it do that? Though I think
I understand the concept of clustered index (i.e. the table is the index, the
data is "right there" for retrieval once the row is located), I'm still
missing something of the concepts over all...

Chris

Aaron Bertrand [SQL Server MVP]

unread,

Oct 29, 2006, 7:41:59 PM10/29/06

to

Indexing is not a process of pointing and clicking in the GUI; you need to
understand how indexes work, and how you are going to be using your data, in
order to determine the best indexing strategy.

You can't just put a non-clustered index on every column in the table and
expect all queries to be fast.

As Erland said, the query you wrote doesn't find any of your indexes useful,
and the engine determined that a scan would be a faster way to get all the
data.

A

"querylous" <quer...@discussions.microsoft.com> wrote in message

news:62D3B752-E7F1-4C40...@microsoft.com...

querylous

unread,

Oct 29, 2006, 7:49:01 PM10/29/06

to

Hmmm... I think I have the rudiments... But, a non clustered index is
essentially an ordered lookup table. So, if I'm trying to even just determine
if another row exists which has the same id as the one I'm looking at in a
self join, which of course it does, even a non clustered index should greatly
speed this process. Instead, it's ignored by the optimizer... Seems like the
index should help, and the optimizer should know that?

Chris

Aaron Bertrand [SQL Server MVP]

unread,

Oct 29, 2006, 9:41:53 PM10/29/06

to

> Hmmm... I think I have the rudiments... But, a non clustered index is
> essentially an ordered lookup table. So, if I'm trying to even just
> determine
> if another row exists which has the same id as the one I'm looking at in a
> self join, which of course it does, even a non clustered index should
> greatly
> speed this process.

Maybe, but you're not trying to do such a lookup against 1 index, you're
trying to do it against 4 simultaneously.

A

Michael C

unread,

Oct 29, 2006, 10:01:57 PM10/29/06

to

"Erland Sommarskog" <esq...@sommarskog.se> wrote in message
news:Xns986BEFE38...@127.0.0.1...

> If that IDENTITY column never had been added, maybe those duplicates
> would never sneaked in? :-)

I love that (often repeated) myth. The duplicates have nothing to do with
the identity column. They are there because there was no constraint on the
other 3 columns.

Michael

Aaron Bertrand [SQL Server MVP]

unread,

Oct 29, 2006, 11:05:21 PM10/29/06

to

> I love that (often repeated) myth. The duplicates have nothing to do with
> the identity column. They are there because there was no constraint on the
> other 3 columns.

Ah, but a lot of times an IDENTITY column is added because it's the only way
people know how to prevent client tools (like Open Table in EM or SSMS) to
not complain about needing to know which row they meant to update/delete.
:-)

Michael C

unread,

Oct 29, 2006, 11:14:43 PM10/29/06

to

"Aaron Bertrand [SQL Server MVP]" <ten...@dnartreb.noraa> wrote in message
news:evarhh9%23GHA...@TK2MSFTNGP05.phx.gbl...

> Ah, but a lot of times an IDENTITY column is added because it's the only
> way people know how to prevent client tools (like Open Table in EM or
> SSMS) to not complain about needing to know which row they meant to
> update/delete. :-)

Possibly so but identity is still not the cause of the duplicates and has
nothing to do with them. The duplicates are there because the constraint is
missing.

BTW, I didn't think EM complained about lack of a primary key?

Michael

Aaron Bertrand [SQL Server MVP]

unread,

Oct 29, 2006, 11:22:45 PM10/29/06

to

> Possibly so but identity is still not the cause of the duplicates and has
> nothing to do with them.

It does more often than you think. I can't count how many people have told
me they added an identity column because they needed a primary key.

A

Aaron Bertrand [SQL Server MVP]

unread,

Oct 29, 2006, 11:22:45 PM10/29/06

to

> Possibly so but identity is still not the cause of the duplicates and has
> nothing to do with them.

It does more often than you think. I can't count how many people have told

Erland Sommarskog

unread,

Oct 30, 2006, 3:14:24 AM10/30/06

to

Correct. However, my suspicion is that whoever put those that IDENTITY
column there did so as a matter of routine. If the data-modelling had
included more thought, the process would have gone "is there any obvious
primary key we can use here?", the odds are good that the table would have
been created without the ID column at all.

Erland Sommarskog

unread,

Oct 30, 2006, 3:20:25 AM10/30/06

to

querylous (quer...@discussions.microsoft.com) writes:
> Thanks Erland, please see my response to Mike- it was indeed need for
> clustered index / primary key.. but why? Why wouldn't an index on id
> make this just as fast (or almost as fast, or at least MUCH faster) than
> no index? The optimizer DID ignore my indexes, but why would it do that?
> Though I think I understand the concept of clustered index (i.e. the
> table is the index, the data is "right there" for retrieval once the row
> is located), I'm still missing something of the concepts over all...

Actually, it should not matter whether the index I suggested is the
clustered index or not, and if the table would have more columns a
non-clustered index would be better.

What is important is that the index comprises all four columns with the
Id column as the last index. Then SQL Server can read one row, use the
index to quick find the collection of all rows with the same values in
(codePractice, codeDemographics, codeProvider).

If the columns are only indexed individually, SQL Server could choose
to look at all rows with the same value in, say, codePractice. It would
then have to perform a bookmark lookup for all those rows to see if
they match the rest of criterias. But too many bookmark lookups are more
expensive than just scanning tbe table, while the optimizer may prefer to
perform a second table scan.

Michael C

unread,

Oct 30, 2006, 4:56:40 PM10/30/06

to

"Aaron Bertrand [SQL Server MVP]" <ten...@dnartreb.noraa> wrote in message

news:%23$IJQr9%23GHA...@TK2MSFTNGP02.phx.gbl...

Maybe so, but the identity column is still not the issue. Again, the lack of
a constraint on the other fields IS the issue. This is NOT a valid reason to
use a natural primary key.

Michael

Michael C

unread,

Oct 30, 2006, 4:58:52 PM10/30/06

to

"Erland Sommarskog" <esq...@sommarskog.se> wrote in message

news:Xns986C5DFF1...@127.0.0.1...

> Correct. However, my suspicion is that whoever put those that IDENTITY
> column there did so as a matter of routine. If the data-modelling had
> included more thought, the process would have gone "is there any obvious
> primary key we can use here?", the odds are good that the table would have
> been created without the ID column at all.

It's also possible they realised the extra work and complexity involved in
using a natural primary key. ;-)

Michael

Aaron Bertrand [SQL Server MVP]

unread,

Oct 30, 2006, 5:00:18 PM10/30/06

to

>> It does more often than you think. I can't count how many people have
>> told me they added an identity column because they needed a primary key.
>
> Maybe so, but the identity column is still not the issue. Again, the lack
> of a constraint on the other fields IS the issue. This is NOT a valid
> reason to use a natural primary key.

I am not saying the IDENTITY column *is* the source of the issue. I'm just
suggesting *WHY* Erland made a statement that *alluded* to the fact. Relax
already!

Michael C

unread,

Oct 30, 2006, 5:51:09 PM10/30/06

to

"Aaron Bertrand [SQL Server MVP]" <ten...@dnartreb.noraa> wrote in message

news:%23VMtM6G$GHA....@TK2MSFTNGP05.phx.gbl...

> I am not saying the IDENTITY column *is* the source of the issue. I'm
> just suggesting *WHY* Erland made a statement that *alluded* to the fact.
> Relax already!

Ok, fair enough but maybe you should re-read your post. I posted that
identity is not the cause of the issue. Your response was "It does more

often than you think".

Michael

--CELKO--

unread,

Oct 30, 2006, 6:04:47 PM10/30/06

to

>> It's also possible they realised the extra work and complexity involved in using a natural primary key. ;-) <<

Being lazy is not an excuse. With Google, not checking for industry
standard is not an excuse. If you have a natural key, then you must
enforce it anyway if you care about data integrity. But you probably
think that data integrity is also a myth.

I think we are all certain that the redundant duplicates came from
inserting the same data twice (or more often). Newbies who think that
IDENTITY can ever be a key fall into this trap within one year of
deployment.

His other problems are the use of camelCase (even MS stopped this),
violation of ISO-11179 rules for data element names, all NULL-able
columns and a lack of DEFAULTs and other constraints. And
over-indexing.

Based on a total lack of specs, my guess would be something like this:

CREATE TABLE Foobar
(provider_id INTEGER DEFAULT 0 NOT NULL PRIMARY KEY,
practice_code INTEGER DEFAULT 0 NOT NULL,
demographics_code INTEGER DEFAULT 0 NOT NULL);

Provider sounds like an entity, which would have a practice and
demographics as attributes. Just a wild guess. If you do a lot of
reports grouped by demographics_code, then put a clustered index on it.

--CELKO--

unread,

Oct 30, 2006, 6:09:08 PM10/30/06

to

>> I love that (often repeated) myth. The duplicates have nothing to do with the IDENTITY column. They are there because there was no constraint on the other 3 columns. <<

But why are there no constraint on the other 3 columns? Because
newbies believe that IDENTITY is a primary key when it cannot ever be
a key. That is the real myth that kills DBs, as we are seeing now.

Michael C

unread,

Oct 30, 2006, 6:23:03 PM10/30/06

to

"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1162249747....@f16g2000cwb.googlegroups.com...

NO, the lack of the constraint and the IDENTITY are totally seperate issues.

Michael

Michael C

unread,

Oct 30, 2006, 6:40:17 PM10/30/06

to

"--CELKO--" <jcel...@earthlink.net> wrote in message

news:1162249485.4...@e3g2000cwe.googlegroups.com...

> Being lazy is not an excuse.

I really dislike this statement. There seams to be a myth that because
identity primary keys make working with databases easier in pretty much
every way that the programmer is somehow lazy.

> With Google, not checking for industry
> standard is not an excuse.

The standard these days seems to be non-natural pks.

> If you have a natural key, then you must
> enforce it anyway if you care about data integrity.

Of course.

> But you probably
> think that data integrity is also a myth.

No, that's something you invented.

> I think we are all certain that the redundant duplicates came from
> inserting the same data twice (or more often). Newbies who think that
> IDENTITY can ever be a key fall into this trap within one year of
> deployment.

This is another statement I see here often and quite dislike. Because
newbies generally use identity pks without thinking, the technique is
associated with beginners. This, the statement about laziness and the one
about unique data are so often repeated yet make absolutely no sense. It
reminds me of when those religious nutters who come to my door, they're
trying to sell me their ideas but their arguements are so flawed they
convince me of the opposite.

On the other hand the arguements against natural pks make complete sense.
Changing data is much easier, changing the field definitions is much easier,
joins are faster, space is reduced, programming model is more consistant,
data is stored once (I thought this was a big thing with database purists,
another thing that doesn't make sense).

We've had this conversation many times before. I've asked you to give me
something concrete on the natural key side and you've been unable to, just
giving me theoretical reasons instead. I've also asked you to answer the
question regarding the difficulty of changing the primary key data or it's
schema and you've sidestepped that many times.

Michael

querylous

unread,

Oct 30, 2006, 6:54:01 PM10/30/06

to

Hi guys, just thought I'd weigh back in- I actually added the identity column
AFTER I realized we'd mistakenly introduced duplicates; what easier way to
eliminate all but one of each duplicate?

Anyway, thanks for everyone's expert input...

Chris

Aaron Bertrand [SQL Server MVP]

unread,

Oct 30, 2006, 11:37:28 PM10/30/06

to

> Ok, fair enough but maybe you should re-read your post. I posted that
> identity is not the cause of the issue. Your response was "It does more
> often than you think".

You said "and has nothing to do with them" and I said "it does more often
than you think."

I was merely referring to the *fact* that, often, IDENTITY is used as an
alternative to a real key, and that's one way that people can get a table
full of redundant data.

I wasn't saying that was the cause in *this* case, just giving supporting
data for Erland's comment (which did come with a smiley, by the way).

A

Aaron Bertrand [SQL Server MVP]

unread,

Oct 30, 2006, 11:52:51 PM10/30/06

to

> This is another statement I see here often and quite dislike. Because
> newbies generally use identity pks without thinking, the technique is
> associated with beginners. This, the statement about laziness and the one
> about unique data are so often repeated yet make absolutely no sense. It
> reminds me of when those religious nutters who come to my door, they're
> trying to sell me their ideas but their arguements are so flawed they
> convince me of the opposite.
>
> On the other hand the arguements against natural pks make complete sense.
> Changing data is much easier, changing the field definitions is much
> easier, joins are faster, space is reduced, programming model is more
> consistant, data is stored once (I thought this was a big thing with
> database purists, another thing that doesn't make sense).

I use IDENTITY all the time.

But I also make sure to model the data in such a way that redundant data is
avoided. Just because I use IDENTITY as my PK does not mean I can't apply
unique constraint(s) to the column(s) that may have been my natural key.
Unfortunately, not everyone knows to take that next step, and think that
defining a PK on the IDENTITY column is enough.

This is the difference between:

(a) choosing identity as a key because it makes sense for your model, and
ensuring data integrity at the same time; and,

(b) choosing identity as a key and not worrying and/or knowing about data
integrity.

Nobody is saying that everyone who uses IDENTITY is doing (b). But there is
a strong argument that a lot of people who end up with redundant data *are*
doing (b).

I've been involved in these groups for years, and I can assure you that you
can set your watch to this.

But that's what we're here for, to help them with that knowledge. Not to
bury our heads in the sand and pretend that nobody has anything to learn.
You say that the IDENTITY column and duplicate data have nothing to do with
each other. I still stand by my statement that, sometimes, they do.
IDENTITY gives *some* people an easy out, and they don't have to think about
natural keys or unique constraints. They don't realize the problem until
they are in here asking us how to remove duplicates. Stick around here for
a while, you will see it again and again -- I'm not making this up.

A

Aaron Bertrand [SQL Server MVP]

unread,

Oct 30, 2006, 11:52:51 PM10/30/06

to

> This is another statement I see here often and quite dislike. Because
> newbies generally use identity pks without thinking, the technique is
> associated with beginners. This, the statement about laziness and the one
> about unique data are so often repeated yet make absolutely no sense. It
> reminds me of when those religious nutters who come to my door, they're
> trying to sell me their ideas but their arguements are so flawed they
> convince me of the opposite.
>
> On the other hand the arguements against natural pks make complete sense.
> Changing data is much easier, changing the field definitions is much
> easier, joins are faster, space is reduced, programming model is more
> consistant, data is stored once (I thought this was a big thing with
> database purists, another thing that doesn't make sense).

I use IDENTITY all the time.

Michael C

unread,

Oct 31, 2006, 12:08:10 AM10/31/06

to

"Aaron Bertrand [SQL Server MVP]" <ten...@dnartreb.noraa> wrote in message

news:eAP3tgK$GHA....@TK2MSFTNGP02.phx.gbl...

> I use IDENTITY all the time.
>
> But I also make sure to model the data in such a way that redundant data
> is avoided. Just because I use IDENTITY as my PK does not mean I can't
> apply unique constraint(s) to the column(s) that may have been my natural
> key. Unfortunately, not everyone knows to take that next step, and think
> that defining a PK on the IDENTITY column is enough.

This is exactly my point.

> This is the difference between:
>
> (a) choosing identity as a key because it makes sense for your model, and
> ensuring data integrity at the same time; and,
>
> (b) choosing identity as a key and not worrying and/or knowing about data
> integrity.
>
> Nobody is saying that everyone who uses IDENTITY is doing (b). But there
> is a strong argument that a lot of people who end up with redundant data
> *are* doing (b).

That is correct but a lot of the natural key crowd seem to assume that only
B exists. Many times i've seen people use the duplicate data issue as a
reason to use natural pks.

> But that's what we're here for, to help them with that knowledge. Not to
> bury our heads in the sand and pretend that nobody has anything to learn.
> You say that the IDENTITY column and duplicate data have nothing to do
> with each other. I still stand by my statement that, sometimes, they do.

It depends how you look at it. The identity does not cause the duplicate
data, the lack of a constraint on the other columns does.

> IDENTITY gives *some* people an easy out, and they don't have to think
> about natural keys or unique constraints. They don't realize the problem
> until they are in here asking us how to remove duplicates. Stick around
> here for a while, you will see it again and again -- I'm not making this
> up.

I've seen it many times, I've also seen many times the response "you should
use a natural pk to solve this".

Michael

Aaron Bertrand [SQL Server MVP]

unread,

Oct 31, 2006, 8:56:45 AM10/31/06

to

> That is correct but a lot of the natural key crowd seem to assume that
> only B exists.

I think you mean Celko. I don't know of anyone else who believes that. I
think we're just aware that it's common.

> It depends how you look at it. The identity does not cause the duplicate
> data, the lack of a constraint on the other columns does.

That's true, but please read my point again, that a lot of people have the
misconception that having a unique column on IDENTITY is enough data
integrity - and so, because they have an IDENTITY column, they don't bother
looking at what other constraints they should have.

A

Erland Sommarskog

unread,

Oct 31, 2006, 5:56:53 PM10/31/06

to

querylous (quer...@discussions.microsoft.com) writes:
> Hi guys, just thought I'd weigh back in- I actually added the identity
> column AFTER I realized we'd mistakenly introduced duplicates; what
> easier way to eliminate all but one of each duplicate?

Adding an IDENTIY column to sort out the accident when it has already
happened may be necessary in SQL 2000.

In SQL 2005, you can do as below:

CREATE TABLE dubblisar (a int NOT NULL,
b int NOT NULL,
c int NOT NULL)
INSERT dubblisar (a, b, c) VALUES (1, 2, 3)
INSERT dubblisar (a, b, c) VALUES (1, 2, 3)
INSERT dubblisar (a, b, c) VALUES (11, 22, 33)
INSERT dubblisar (a, b, c) VALUES (11, 22, 33)
INSERT dubblisar (a, b, c) VALUES (16, 23, 33)
INSERT dubblisar (a, b, c) VALUES (13, 23, 39)
INSERT dubblisar (a, b, c) VALUES (13, 23, 39)
INSERT dubblisar (a, b, c) VALUES (14, 22, 31)
go
WITH CTE AS (
SELECT row_number() OVER (PARTITION BY a, b, c ORDER BY c) AS rn
FROM dubblisar
)
DELETE CTE WHERE rn > 1
go
SELECT * FROM dubblisar
go
DROP TABLE dubblisar

Mike C#

unread,

Nov 1, 2006, 10:15:54 AM11/1/06

to

"querylous" <quer...@discussions.microsoft.com> wrote in message

news:83E95845-EC70-44DC...@microsoft.com...
> Thanks Mike- you are correct- it was the need for a clustered index... but
> I
> don't get it - why? Why wouldn't a non clustered index still be fast?
>
> Also, you are right in that my non clustered indexes were ignored in favor
> of a scan- but again, why would the optimizer ignore the regular indexes?
> If
> it chooses a scan, shouldn't the scan be fast?

Your non-clustered indexes probably were generating a lot of Bookmark
Lookups. The SQL engine uses a cost-based optimizer, so if the "cost" of a
Table Scan is less than the cost of an Index Seek + Bookmark Lookups + Sort
operation(s) + etc., then the engine will use the Table Scan. When you
added the clustered index you probably eliminated Bookmark Lookups and
possibly some Sort operation(s) which make the cost of using the Clustered
Index less than the cost of a Table Scan or non-clustered index seeks/scans
for the same query. SQL might have even found a more efficient method of
combining your join results than it previously had - though it's hard to
tell without all the information.

For this kind of troubleshooting your best bet is to use the Estimated and
Actual Query Execution Plan displays in QA. They can help you locate
bottlenecks and long-running operations in your queries and provide hints on
how to optimize your indexes.