Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

List of bad practices

2 views
Skip to first unread message

anon...@discussions.microsoft.com

unread,
Sep 19, 2004, 1:54:39 AM9/19/04
to
I have a list of bad practices in SQL Server below. If I
am missing something can you please share it?

1) Do not use cursors. Cursor "type" opersations should
be done in a middle tier.

2) Do not use temp-tables. Use derived-table or view

3) Too much normalization is not good.

4) Too many joins in quesries is not good.

5) Avoid triggers if you can. Use constraints

Anith Sen

unread,
Sep 19, 2004, 2:44:08 AM9/19/04
to
>> I have a list of bad practices in SQL Server below.

It would help others to understand your perspective better, if you care to
post the reasons behind each points in your "list".

--
Anith


David Portas

unread,
Sep 19, 2004, 4:18:17 AM9/19/04
to
I don't think that defining "bad practices" as blanket statements like this
is very useful to anyone. For what it's worth, I almost entirely disagree
with everything you've written! Are these ideas something that you actually
practice today or are you looking for guidance on what good and bad
practices *should* be?

--
David Portas
SQL Server MVP
--


Eric Sabine

unread,
Sep 19, 2004, 8:48:35 AM9/19/04
to
This is a weird list IMO. I disagree with the "absolute-ness" of 2-5 and my
only feeling about cursors is I've never had a compelling reason to turn to
using them. Only once I've seen a cursor out perform a set-based solution
and that's because I saw Itzik Ben-Gan prove it in a class. The problem
with a blanket list is everything should contain proofs and caveats. The
"it depends" mantra in SQL Server is true because it really does depend.

Eric


<anon...@discussions.microsoft.com> wrote in message
news:166a01c49e0d$26906fe0$a501...@phx.gbl...

Joe Celko

unread,
Sep 19, 2004, 9:10:33 AM9/19/04
to
>> I have a list of bad practices in SQL Server below. If I
am missing something can you please share it? <<

This is funny! I am about 6000 words into my next book on SQL
programming style and explain some of these things with a rationale and
an exception. Here is some feedback:

2) One problem with temp tables in T-SQL is that they do not follow the
model for Standard SQL temp tables, so they will not port easily.

But I agree that most of the time programmers use temp tables to hold
steps in a process rather than writing a single declarative statement.
Why learn RDBMS when you can fake your old skills in the new language?

In fact, a lot T-SQL is structured exactly like mag tape file systems --
the old "WORKnnn" tape files are simply replaced by "#temp_nn" tables.

3) I am not sure what "over-normalize" means. If you split up a table
beyond a certain point, you destroy the model for entity it was supposed
to represent. Properly normalized to AT LEAST 3NF and preferably to 5NF
is the minimum.

5) I think you meant DRI actions should replace triggers when possible.

--CELKO--
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are. Sample data is also a good idea, along with clear
specifications.


*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

anon...@discussions.microsoft.com

unread,
Sep 19, 2004, 10:09:30 AM9/19/04
to
Well, I didn't want to turn this into a religous debate.
You are right, these statements are not true if you are
dealing with a mom and pop database. It really doesn't
hinder any performance. Even if it did you can always
set up an SQL Server cluster using Active/Active. I
should refresh my quote. It should be:

What are "optimal" practices versus "other" practices in
SQL Server.

>.
>

Jeff Cochran

unread,
Sep 19, 2004, 11:57:46 AM9/19/04
to

The problem is, all these "bad" practices have valid reasons for use.
Kind of like the "Thou shalt not kill" thing getting in the way when
someone is intent on taking your own life.

A better list would be good practices, starting with "know why you're
using a technique and what the advantages and disadvantages are in
your particular environment."

Jeff

WKidd

unread,
Sep 19, 2004, 12:01:10 PM9/19/04
to
Cursors implemented on the client side can still hold locking transactions.
Views are not a replacement for temporary tables.

<anon...@discussions.microsoft.com> wrote in message
news:166a01c49e0d$26906fe0$a501...@phx.gbl...

anon...@discussions.microsoft.com

unread,
Sep 19, 2004, 12:35:34 PM9/19/04
to
I agree. That is more appropriate

Nigel Rivett

unread,
Sep 19, 2004, 1:37:03 PM9/19/04
to
See
http://www.nigelrivett.net/BadThings.html
Especially the paragraph at the top.

Think I disagree with all your statements but particularly 2 which in some
cases is just wrong.

David Portas

unread,
Sep 19, 2004, 1:50:56 PM9/19/04
to
> 1) Do not use cursors. Cursor "type" opersations should
> be done in a middle tier.

Cursor operations should be avoided altogether, which is possible with
set-based SQL in at least 99% of cases no matter what the size of your data.
Writing cursor-like data manipulation operations in the middle tier is
unlikely to scale well.

> 3) Too much normalization is not good.

This statement is pretty meaningless because you don't say how much is too
much. Third Normal Form should be required for any relational design.
De-normalization below 3NF is a mistake every time in my experience.

> 4) Too many joins in quesries is not good.

Why? Joins in a normalized database can improve performance by reducing the
number of physical reads required. Your statement implies that you have a
choice in the number of joins you might implement in a query. Can you give
an example of this? I'm not sure I understand what kind of scenario you have
in mind.

> 5) Avoid triggers if you can. Use constraints

Triggers and Constraints are different solutions to different problems. DRI
should usualy be implemented through constraints where possible. I do agree
that minimizing the use of triggers is wise; the obvious alternative is to
implement DML logic in SPs.

Message has been deleted

Nigel Rivett

unread,
Sep 19, 2004, 3:49:03 PM9/19/04
to
lol.
1st and 3rd are the same really.
The whole point is so that you can change the database structure without
changing the application - often for performance, as long as the SPs have the
same interface then application will never know anything has happened. I do
it quite often. Saying that it will never happen probably means that you have
bound the application to the database structure. That sort of thing usually
means that everything is designed by an application programmer which is what
the warning is really about.
Just because microsoft does something doesn't mean it's a good idea - look
at the system SPs for previous versions of sql server.

2nd one
So you're using SP to genrate the change scripts which is fine - the problem
is when people use it to make changes without knowing what it is really going
to do and don't save any record of what it has done.

Read the first paragraph again. These are just things to think about and not
necessarily bad in all cases.

"Richard Quinn" wrote:

> On Sun, 19 Sep 2004 10:37:03 -0700, Nigel Rivett <sq...@hotmail.com>
> wrote:
>
> >See
> >http://www.nigelrivett.net/BadThings.html
> >Especially the paragraph at the top.
> >
>

> Let's see...
>
> > Stored procedures to access tables rather than business entities.
>
> I disagree. Exactly why do you consider this as bad? AFAIK, this
> practice is a cornerstone of Microsoft's DNA architecture, FWIW.
>
> > Use of enterprise manager for data or schema updates.
>
> I disagree. I always store the changes to a file and then put that in
> VSS or add it to my Vis.Studio "Change Scripts" folder.
>
> > Application bound to the database structure (application should have no knowledge of tables).
>
> I disagree (mostly). Why do you consider this as bad?
> If my application knows nothing about the database (at any layer),
> then why am I using a relational database in the first place? I have
> yet to see an example of a "database application" which will continue
> to work if somebody changes the schema behind its back.
>
>
> - Richard
>
> ---
> Richard Quinn
> MCSD, MCDBA firstname...@ieee.org
>

Andrew J. Kelly

unread,
Sep 19, 2004, 4:17:22 PM9/19/04
to
> It really doesn't
> hinder any performance. Even if it did you can always
> set up an SQL Server cluster using Active/Active.


I must be missing the point here. Clusters have absolutely nothing to do
with performance. They are for hardware failover.


--
Andrew J. Kelly SQL MVP


<anon...@discussions.microsoft.com> wrote in message
news:30ef01c49e52$47d6c1a0$a401...@phx.gbl...

Hugo Kornelis

unread,
Sep 19, 2004, 5:41:34 PM9/19/04
to
Hi Anonymous,

On Sat, 18 Sep 2004 22:54:39 -0700, <anon...@discussions.microsoft.com>
wrote:

>I have a list of bad practices in SQL Server below. If I

>am missing something can you please share it?
>
>1) Do not use cursors. Cursor "type" opersations should
>be done in a middle tier.

I do think that cursors should be avoided as much as possible. But the
same holds true for blanket statements.


>2) Do not use temp-tables. Use derived-table or view

Depends on what you want to do. Not everything can be done with derived
tables or views. And even when it can be done, there are cases where using
a temp table gives better performance. To prove my point, I'll show you
three different queries that will find any couple of authors that have the
exact same earnings (based on sales_ytd, price and royalty percentages),
but only if the earnings are between $4,000 and $4,500.

(Note - there might be better ways to solve this query; that's not my
point. My point is that the optimizer will not choose the best execution
plan if the same view or derived table is used multiple times in the
query).

Try the below queries. Compare the difference from the set statistics io
output. Check the execution plans as well. You'll see that the solutions
with view and derived table actually construct the contents of the derived
table (or view) twice, making execution more expensive than the version
with the temp table. (And the last version can actually be optimized even
further by adding a primary key constraint and an extra index - commented
out in the code below).

--Solution #1, using a view:

create view earnings
as
select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) as earnings
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper /
100) between 4000 and 4500
go
set statistics io on
select e1.au_id, e2.au_id, e1.earnings
from earnings AS e1
join earnings AS e2
on e1.earnings = e2.earnings
and e1.au_id < e2.au_id
drop view earnings
set statistics io off

-- Solution #2, using a derived table:

set statistics io on
select e1.au_id, e2.au_id, e1.earnings
from (select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) AS earnings
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) between 4000 and 4500) AS e1
join (select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) AS earnings
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) between 4000 and 4500) AS e2
on e1.earnings = e2.earnings
and e1.au_id < e2.au_id
set statistics io off

-- Solution #3, using a temp table:

set statistics io on
create table #tmp (au_id varchar(11) not null
,earnings money not null
-- ,primary key (au_id)
)
--create index ixtmp on #tmp(earnings)
insert #tmp (au_id, earnings)
select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100)
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper /
100) between 4000 and 4500
select e1.au_id, e2.au_id, e1.earnings
from #tmp AS e1
join #tmp AS e2
on e1.earnings = e2.earnings
and e1.au_id < e2.au_id
drop table #tmp
set statistics io off


>3) Too much normalization is not good.

I disagree. In OLTP databases, the design should be normalized down to
fifth normal form, with only very few exception (and these should be
documented, as they'll probably prove to be the future bottleneck anyway).

Requirements for OLAP databases are different though. That might be a
place where carefull denormalization has it's place. But denormalization
should (as the name implies) follow normalization - stopping before you
have a completely normalized design won't yield a "denormalized" design -
you'll end up with just another "unnormalized" design.


>4) Too many joins in quesries is not good.

Define "too many".

If you mean: don't join in tables that you don't need in the query, then I
would aagree.

If you mean: no more than (fill in your arbitrary number) of joins, then
this is nonsense. If the query needs data from many tables, you'll just
have to get them all into the query, or you'll never get the desired
results. (Unless of course you use temp tables for intermediate results,
but in THIS case, that WOULD be bad).


>5) Avoid triggers if you can. Use constraints

Use constraints and DRI if they can achieve what you need. Use triggers if
you need business logic that constraints and DRI don't offer.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Lucas Tam

unread,
Sep 19, 2004, 7:25:43 PM9/19/04
to
"David Portas" <REMOVE_BEFORE_R...@acm.org> wrote in
news:66edncmusab...@giganews.com:

> I don't think that defining "bad practices" as blanket statements like
> this is very useful to anyone. For what it's worth, I almost entirely
> disagree with everything you've written!

Very true...

There are times when one must use these bad practices because it's the best
way to do things.

--
Lucas Tam (REMOV...@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/

Joe Celko

unread,
Sep 19, 2004, 9:30:04 PM9/19/04
to
>> Only once I've seen a cursor out perform a set-based solution and
that's because I saw Itzik Ben-Gan prove it in a class. <<

I have written five cursors in my career and I know that if we had a
CASE expression in the old days, I could have avoided at least three of
them. The other two were NP complete things (a version of knapsack and
traveling salesman problems) where we wanted to find any answer under a
certain cost. The relational solution would find ALL the answers --
sometime after we were all dead.

Do you happen to have that example and is it generic enough to use for
Standard SQL classes?

Nigel Rivett

unread,
Sep 19, 2004, 9:30:13 PM9/19/04
to
Odd - I'm at a place now where someone implemented a cluster to improve
performance then used one half for something else when it didn't.
Still have half a cluster.

Nigel Rivett
www.nigelrivett.net

Itzik Ben-Gan

unread,
Sep 20, 2004, 12:21:04 AM9/20/04
to
I think that Eric refers to the following problem:

http://www.winnetmag.com/Article/ArticleID/37636/37636.html

If you can't access the complete problem, here goes:

---------------------------------------------------------------------
-- Concurrent time intervals
---------------------------------------------------------------------
CREATE TABLE Sessions
(
key_col INT NOT NULL PRIMARY KEY,
app VARCHAR(10) NOT NULL,
usr VARCHAR(10) NOT NULL,
host VARCHAR(10) NOT NULL,
starttime SMALLDATETIME NOT NULL,
endtime SMALLDATETIME NOT NULL,
CHECK(endtime > starttime)
)

INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 08:30',
'20030212 10:30')
INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 08:30',
'20030212 08:45')
INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 09:00',
'20030212 09:30')
INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 09:15',
'20030212 10:30')
INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 09:15',
'20030212 09:30')
INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 10:30',
'20030212 14:30')
INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 10:45',
'20030212 11:30')
INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 11:00',
'20030212 12:30')
INSERT INTO Sessions VALUES(9, 'app2', 'user8', 'host1', '20030212 08:30',
'20030212 08:45')
INSERT INTO Sessions VALUES(10, 'app2', 'user7', 'host1', '20030212 09:00',
'20030212 09:30')
INSERT INTO Sessions VALUES(11, 'app2', 'user6', 'host2', '20030212 11:45',
'20030212 12:00')
INSERT INTO Sessions VALUES(12, 'app2', 'user5', 'host2', '20030212 12:30',
'20030212 14:00')
INSERT INTO Sessions VALUES(13, 'app2', 'user4', 'host3', '20030212 12:45',
'20030212 13:30')
INSERT INTO Sessions VALUES(14, 'app2', 'user3', 'host3', '20030212 13:00',
'20030212 14:00')
INSERT INTO Sessions VALUES(15, 'app2', 'user2', 'host4', '20030212 14:00',
'20030212 16:30')
INSERT INTO Sessions VALUES(16, 'app2', 'user1', 'host4', '20030212 15:30',
'20030212 17:00')

-- Calculate the maximum number of concurrent sessions for each
user,application
-- The datetime ranges provided include starttime and exclude datetime
-- i.e., the session was active >= starttime < endtime
-- Desired results:
app mx
---------- -----------
app1 4
app2 3

---------------------------------------------------------------------
-- Concurrent time intervals
---------------------------------------------------------------------
Here's one (of several) pure set-based solution:

-- Set-based solution
SELECT app, MAX(concurrent) AS mx
FROM (SELECT app, starttime,
(SELECT COUNT(*)
FROM Sessions AS S2
WHERE S1.app = S2.app
AND S1.starttime >= S2.starttime
AND S1.starttime < S2.endtime) AS concurrent
FROM Sessions AS S1) AS C
GROUP BY app

The cursor-based solution is a cursor based on a query that marks events as
increasing or decreasing the count of concurrent sessions.
I haven't found a better performing pure set-based solution but I'd be more
than glad if I someone did (test against larger tables, 10000, 100000,
1000000 rows).

-- Iterative solution
DECLARE
@app AS varchar(10),
@prevapp AS varchar (10),
@ts AS datetime,
@type AS int,
@concurrent AS int,
@mx AS int

DECLARE @AppsMx TABLE
(
app varchar (10) NOT NULL PRIMARY KEY,
mx int NOT NULL
)

DECLARE sessions_cur CURSOR FAST_FORWARD FOR
SELECT app, starttime AS ts, 1 AS type
FROM Sessions

UNION ALL

SELECT app, endtime, -1
FROM Sessions

ORDER BY app, ts, type

OPEN sessions_cur

FETCH NEXT FROM sessions_cur
INTO @app, @ts, @type

SET @prevapp = @app
SET @concurrent = 0
SET @mx = 0

WHILE @@FETCH_STATUS = 0
BEGIN
IF @app <> @prevapp
BEGIN
INSERT INTO @AppsMx VALUES(@prevapp, @mx)
SET @concurrent = 0
SET @mx = 0
SET @prevapp = @app
END

SET @concurrent = @concurrent + @type
IF @concurrent > @mx SET @mx = @concurrent

FETCH NEXT FROM sessions_cur
INTO @app, @ts, @type
END

IF @prevapp IS NOT NULL
INSERT INTO @AppsMx VALUES(@prevapp, @mx)

CLOSE sessions_cur

DEALLOCATE sessions_cur

SELECT * FROM @AppsMx

This problem is one of two examples I show in the course where a cursor
outperforms a pure set-based query. The second problem has to do with
calculating, dare I say, "row numbers" ;-) based on some desired sort
criteria. e.g., a simple scenario of calculating row numbers for orders
based on orderid sort:

USE Northwind

SELECT orderid,
(SELECT COUNT(*)
FROM Orders AS O2
WHERE O2.orderid <= O1.orderid) AS rownum
FROM Orders As O1

These problems where a cursor outperforms a set-based solution are typically
characterized by an exponential performance degradation as the table grows
larger. e.g., for the latter, the optimizer counts the number of rows that
have an orderid which is <= the outer orderid, so even with an index you get
(n+n^2)/2 rows scanned. The cursor performs better in such a case because
the data is scanned only once. So the overhead of the curser is negligible
compared to the overhead of the multiple scans.
As an aside, this was addressed in SQL Server 2005 by providing the ANSI-SQL
99 ROW_NUMBER() function:

SELECT orderid, ROW_NUMBER() OVER(ORDER BY orderid) AS rownum
FROM Orders

ROW_NUMBER() scans the data only once. It has allows many alternate very
efficient solutions to existing problems that had slow set-based solutions.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


"Joe Celko" <jcel...@earthlink.net> wrote in message
news:OUusbFrn...@TK2MSFTNGP15.phx.gbl...

Eric Sabine

unread,
Sep 20, 2004, 8:03:50 AM9/20/04
to
Thanks for posting that Itzik. Also, thanks for the 2005 - Row_Number()
information.

Hugo Kornelis

unread,
Sep 20, 2004, 8:17:14 AM9/20/04
to
On Sun, 19 Sep 2004 23:21:04 -0500, Itzik Ben-Gan wrote:

>I haven't found a better performing pure set-based solution but I'd be more
>than glad if I someone did (test against larger tables, 10000, 100000,
>1000000 rows).

Hi Itzik,

I guess you focussed too much on the SQL and forgot to look at indexes.

CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime)

I added this line to your creation script, then ran the following lines
several times until I had 65536 rows in the table:

INSERT Sessions (key_col, app, usr, host, starttime, endtime)
SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host,
starttime, endtime
FROM Sessions

INSERT Sessions (key_col, app, usr, host, starttime, endtime)
SELECT key_col + (SELECT MAX(key_col) FROM Sessions),
app + 'X', usr, host, starttime, endtime
FROM Sessions
SELECT COUNT(*) FROM Sessions

With this bloated table, I ran your own setbased SQL and your own
cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC
FREEPROCCACHE, of course. Both produced the same output:

app mx
---------- -----------
app1 256
app1X 1536
app1XX 3840
app1XXX 5120
app1XXXX 3840
app1XXXXX 1536
app1XXXXXX 256
app2 192
app2X 1152
app2XX 2880
app2XXX 3840
app2XXXX 2880
app2XXXXX 1152
app2XXXXXX 192

The execution time for the setbased SQL was 0.606 seconds on my test
system; the cursorbased SQL required 5.483 seconds.

I don't have the time available right now to test against even larger
tables.

Hugo Kornelis

unread,
Sep 20, 2004, 8:21:51 AM9/20/04
to
On Mon, 20 Sep 2004 14:17:14 +0200, Hugo Kornelis wrote:

>CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime)
>
>I added this line to your creation script

And I added the NONCLUSTERED option on the primary key constraint, of
course. Apologies for leaving this out of my post.

Joe Celko

unread,
Sep 20, 2004, 9:52:02 AM9/20/04
to
>> The execution time for the set-based SQL was 0.606 seconds on my test
system; the cursor-based SQL required 5.483 seconds. <<

Another victory for non-procedural code!

I was thinking about building a table of points in time, like a more
detailed calendar table, then using BETWEEN predicates to count how many
sessions each of those points falls inside of. It gives you control of
grandularity and you can ask other questions with it. But if all we
needed was an index ...

Anith Sen

unread,
Sep 20, 2004, 10:16:51 AM9/20/04
to
>> I guess you focussed too much on the SQL and forgot to look at indexes.

I was about to comment on Itzik's choice of indexes before I saw your post
and it is to the point.

Even otherwise, there may be several examples where we find cursors in
certain special cases perform better than corresponding set-based solutions.
However these "out-performances" are simply attributions of the physical
model which vary from OS to OS, hardware to hardware, DBMS to DBMS,
underlying dataset variations, indexes involved & lots of factors, which
makes any generic performance claims suspect.

In general, I wouldn't give much encouragement to loop-based solutions,
since it is a deviation from the functional programming approach which is
desirable in data management.

--
Anith


Itzik Ben-Gan

unread,
Sep 20, 2004, 2:20:46 PM9/20/04
to
Well, I'll be damned! :-)
An index on app, starttime of course seems like a natural choice here.

I tested against larger tables, and both solutions seem to have fairly
linear performance degradation.
I'll go to my original benchmarks and see what went wrong there. The
set-based solution did logically seem to have the potential for exponential
performance degradation...
Here's the results I got from the new benchmark with the index:

#rows set-based cursor-based
---------- ------------- ---------------
16 0 0.01
16,000 0.08 1.463
160,000 0.96 12.68
1,600,000 10.423 136.686
16,000,000 130.503 still running...

I used the following code to populate the test tables.

SELECT IDENTITY(INT, 1, 1) AS key_col, app, usr, host, starttime, endtime
INTO Sessions16k
FROM Sessions, Nums
WHERE n <= 1000 -- also 10000, 100000, 1000000

CREATE CLUSTERED INDEX ImproveIt ON Sessions16K (app, starttime)
ALTER TABLE Sessions16k ADD PRIMARY KEY NONCLUSTERED(key_col)

Remind me to get you a beer when we meet next time.
Mind if I shared this with SQLMag's readers?

Now that this problem was solved, maybe someone can figure out a way to
solve the row-numbers problem with a faster set-based solution?
Against a table with 100K rows (with an index on orderid), I got 26 seconds
with a cursor based solution, and about half an hour for the set-based
solution.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


"Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
news:5mgtk0tv1bssjkii6...@4ax.com...

Itzik Ben-Gan

unread,
Sep 20, 2004, 3:52:12 PM9/20/04
to
Re the row-numbers problem, solutions with IDENTITY are not valid because
SQL Server 2000 doesn't guarantee that identity values would be generated
according to the sort specified in an ORDER BY clause.
Identity values sometimes are generated prior to the sort, and there's no
way to force SQL Server to assign them otherwise.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


"Itzik Ben-Gan" <it...@REMOVETHIS.SolidQualityLearning.com> wrote in message
news:uq9Bx6zn...@TK2MSFTNGP10.phx.gbl...

Hugo Kornelis

unread,
Sep 20, 2004, 5:53:09 PM9/20/04
to
On Mon, 20 Sep 2004 06:52:02 -0700, Joe Celko wrote:

>I was thinking about building a table of points in time, like a more
>detailed calendar table, then using BETWEEN predicates to count how many
>sessions each of those points falls inside of.

Hi Joe,

That was my first thought too. But how many point would you need? Every 5
minutes? Every minute? Every second?

The beauty of Itzik's set-based query is that the derived table in effect
builds a collection of all points in time that might be relevant for the
query. I've been thinking about the puzzle some more this evening, but I
don't think I could come up with anything better than Itzik's query.

Hugo Kornelis

unread,
Sep 20, 2004, 6:02:32 PM9/20/04
to
On Mon, 20 Sep 2004 13:20:46 -0500, Itzik Ben-Gan wrote:

>I tested against larger tables, and both solutions seem to have fairly
>linear performance degradation.

Hi Itzik,

Thanks for runnning these tests. I've already been thinking about how I
could create more test data for this puzzle, but your post save me the
hassle (and the time).


>Remind me to get you a beer when we meet next time.

Okay. But do you expect to be in the Netherlands soon? :-)


>Mind if I shared this with SQLMag's readers?

Not at all.


>Now that this problem was solved, maybe someone can figure out a way to
>solve the row-numbers problem with a faster set-based solution?
>Against a table with 100K rows (with an index on orderid), I got 26 seconds
>with a cursor based solution, and about half an hour for the set-based
>solution.

Hmmm. Adding a non-clustered index on orderid should improve the set-based
version, but pprobably not enough to beat the cursor-based version. I'll
think about this before going to sleep, maybe my subconsious gets creative
tonight.

Don't hold your breath for it, though :-)

Hugo Kornelis

unread,
Sep 20, 2004, 6:46:49 PM9/20/04
to
On Tue, 21 Sep 2004 00:02:32 +0200, Hugo Kornelis wrote:

>>Mind if I shared this with SQLMag's readers?
>
>Not at all.

Forgot to add: When your article is published, I'd like to get a
notification and a URL, if possible. My email adress is Hugo at perFact
dot info.

Gert-Jan Strik

unread,
Sep 20, 2004, 6:50:43 PM9/20/04
to
Itzik (and Hugo),

the problem is, that the testdata is not realistic, and this influences
the results in a big way. There are too few distinct starttime values.
This doesn't really affect the cursor based solution but masively
changes the efficiency of the set based solution.

I have a table with realistic data that I transformed to the DDL that
was posted earlier (I converted some INTs to VARCHAR(10)s and DATETIME
to SMALLDATETIME). I indexed it the way Hugo described. The table had
10803979 rows in it (782976 KB, 1369 distinct apps).

If I understand the performance table correctly, the set based solution
should have finished anywhere between 10.423 and 130.503 seconds. Well,
I had to cancel it after 30 minutes. (BTW: I also had to cancel the
cursor based solution after 30 minutes)

So then I kept the first 1832314 rows (129760 KB) which contain (the
first) 399 distinct apps and reindexed the table.

The cursor based solution then took 18 minutes and 15 seconds to finish.
During the execution, there was hardly any CPU activity.

Then I ran the set based solution. I cancelled it after 40 minutes. At
that point, it had been overloading 3 of the available 4 CPU's for about
the entire period.

I ran the test on SQL7 with 2 HT CPU's and enough memory with the
following query and query plan.

SELECT app, MAX(concurrent) AS mx
FROM (SELECT app, starttime,
(SELECT COUNT(*)
FROM Sessions AS S2
WHERE S1.app = S2.app
AND S1.starttime >= S2.starttime
AND S1.starttime < S2.endtime) AS concurrent
FROM Sessions AS S1) AS C
GROUP BY app

option (maxdop 3)

|--Hash Match(Aggregate, HASH:([S1].[app]),
RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002])))
|--Parallelism(Gather Streams)
|--Hash Match(Aggregate, HASH:([S1].[app]),
RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002])))
|--Compute Scalar(DEFINE:([Expr1002]=[Expr1002]))
|--Nested Loops(Inner Join)
|--Sort(ORDER BY:([S1].[starttime] ASC,
[S1].[app] ASC))
| |--Parallelism(Repartition Streams,
PARTITION COLUMNS:([S1].[app], [S1].[starttime]))
| |--Index
Scan(OBJECT:([mydb].[dbo].[Sessions].[PK_Sessions] AS [S1]))
|--Table Spool
|--Stream
Aggregate(DEFINE:([Expr1002]=Count(*)))
|--Clustered Index
Seek(OBJECT:([mydb].[dbo].[Sessions].[ImproveIt] AS [S2]),
SEEK:([S2].[app]=[S1].[app] AND [S2].[starttime] <= [S1].[starttime]),
WHERE:([S2].[endtime]>[S1].[starttime]) ORDERED)


This should get some people thinking again... :-)

Gert-Jan

--
(Please reply only to the newsgroup)

Itzik Ben-Gan

unread,
Sep 20, 2004, 7:06:15 PM9/20/04
to
Very interesting... and makes sense when thinking about the exponential
growth in the number of rows scanned. The hash table probably holds a
significantly smaller number of elements in the buckets when the
distribution is non-realistic (i.e., just duplicate rows).

Well, thanks for having a look!

I still didn't go over my original benchmark, but I'm quite sure I had an
index on app, starttime, but probably more realistic distribution of values.

I'll be back after going over this more thoroughly tonight.

Thanks again!


--
BG, SQL Server MVP
www.SolidQualityLearning.com


"Gert-Jan Strik" <so...@toomuchspamalready.nl> wrote in message
news:414F5EC3...@toomuchspamalready.nl...

Michael D. Long

unread,
Sep 20, 2004, 8:05:54 PM9/20/04
to
I'm sure that Mr. Anonymous believes that 4 is an obvious conclusion based
on 3 - a big flat table doesn't require joins. ;-)

p.s. - It takes a real lamer to post under an anonymous identity. Secure in
his opinions? I wouldn't count on it. Follow the advice of such a bold
soul? I'd vote for John "Feel-the-wind" Kerry first.

--
Michael D. Long


Itzik Ben-Gan

unread,
Sep 21, 2004, 2:05:48 AM9/21/04
to
After examining the issue more thoroughly and running a realistic benchmark,
the original assumption still stands, i.e., the cursor based solution has
linear performance degradation, while the set-based solution has an
exponential one. The set-based solution is much slower even with a small
number of rows.
I'll try to explain the logic behind the assumption, with supporting
numbers, and also why the non-realistic sample data caused the set-based
solution to run faster then the cursor-based solution, and to also incur a
linear performance degradation.
I'm afraid this problem remains open, but of course I welcome anyone that
likes challenges to try and find a set-based solution that outperforms the
cursor-based. I'll keep on looking myself, of course.

Here's a more realistic benchmark.
Suppose we're tracking a month worth of sessions data, with 10 applications,
and an average session duration of ~50.5 minutes (ranging from 1 through
100).
The most efficient index (at least the one that I came up with) is actually
a covering non-clustered index placed on (app, starttime, endtime).
Examining the execution plan, the number of rows that are scanned for each
base row are the actual number of matching rows. i.e., if we have r rows in
the table, and each session overlaps with o rows in average, the total
number of rows that are going to be scanned are ~ r x o. The exceptions are
duplicate rows (same app, starttime) for which the cached aggregate
information in the hash table reduces the need to actually scan duplicates
multiple times, hence the fast results of the non-realistic sample data.

Now for realistic sample data. I'll try to explain the reason for the
exponential performance degradation.
To populate the table with realistic sample data, you need to calculate
random start times within the month, and end times which are start time +
random duration (e.g., in my benchmark in the range 1 through 100 minutes).
The more sessions there are for each application, the probability for
duplicates gets higher, and this will have an increasing positive
performance affect on the set-based solution because of the hashed
aggregates. But in practical terms even tables with large numbers of rows,
taking into consideration production systems that store session data in
DATETIME datatypes with an accuracy of 3.333 ms, we can basically ignore
duplicates.
I already mentioned that the number of rows that are going to be scanned is
roughly r x o (minus a bit because of duplicates). Now, populate the table
in a factor of f, still using random values within the same month. The table
is now f times larger (r x f), but the probability for session overlap also
increased by a factor of f (o x f), so the total number of rows that are
going to be scanned is ~ r x o x f^2 (-duplicates affect). I/O has a close
to linear affect on run-time; if it takes t seconds to query r rows, we
should expect ~ t x f^2 seconds to query r x f rows (assuming a
non-significant number of duplicates), hence the exponential performance
degradation.

Here are some estimations based on measured run time of 1 second (floored
for the sake of simplicity):

estimates:

rows factor runtime
---- ------------- -------
10K 1^2 = 1 1
20K 2^2 = 4 4
40K 4^2 = 16 16
80K 8^2 = 64 64
160K 16^2 = 256 256
1M 100^2 = 10000 10000

And here are the actuals from a benchmark I ran (code below):

rows factor runtime reads
---- ------ ------- -----------
10K 1 1.4 39851
20K 3.5 5 120628
40K 11.4 16 403296
80K 41 57 1454080
160K 150 210 40263
1M still running

As you can see the actuals closely resemble the exponential pattern I tried
to illustrate. I can attribute the increase in a slightly smaller factor
than a power of 2 to duplicates, but there may be other influencing factors
as well.

So, Hugo, Celko, others, want to take a shot at it and come up with a
creative set-based solution that does run fast?
BTW, I did try many variations including an auxiliary table of time slots
(which is of course less efficient than using the actual start times as time
slots).

Benchmark's code:

USE tempdb
SET NOCOUNT ON
GO
IF OBJECT_ID('Sessions') IS NOT NULL
DROP TABLE Sessions
GO

DECLARE @numrows AS INT
SET @numrows = 10000
-- test with 10K - 160K increasing by a factor of 2

SET STATISTICS IO OFF

SELECT
IDENTITY(INT, 1, 1) AS key_col,

D.*,
DATEADD(
minute,
CAST(ABS(RAND(CHECKSUM(NEWID()))*100-0.0000000001) AS INT)+1,
starttime) AS endtime
INTO Sessions
FROM
(
SELECT
'app' + CAST(CAST(ABS(RAND(CHECKSUM(NEWID()))*10-0.0000000001)
AS INT)+1 AS VARCHAR(10)) AS app,
'user1' AS usr,
'host1' AS host,
DATEADD(
minute,
CAST(ABS(RAND(CHECKSUM(NEWID()))*44540-0.0000000001)
AS INT)+1,
'20040101') AS starttime
FROM Nums
WHERE n <= @numrows
) AS D

ALTER TABLE Sessions ADD PRIMARY KEY(key_col)
CREATE NONCLUSTERED INDEX IDX_NCI_app_se ON Sessions (app, starttime,
endtime)

DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

SET STATISTICS IO ON

DECLARE @dt AS DATETIME
SET @dt = GETDATE()

SELECT app, MAX(concurrent) AS mx
FROM (SELECT app, starttime,
(SELECT COUNT(*)
FROM Sessions AS S2
WHERE S1.app = S2.app
AND S1.starttime >= S2.starttime
AND S1.starttime < S2.endtime) AS concurrent
FROM Sessions AS S1) AS C
GROUP BY app

PRINT CAST(@numrows AS VARCHAR(10)) + ' rows: '
+ CAST(DATEDIFF(ms, @dt, GETDATE()) / 1000. AS VARCHAR(30))
+ ' seconds.'

/*
-- Auxiliary table of numbers
CREATE TABLE Nums(n INT NOT NULL PRIMARY KEY)

SET NOCOUNT ON

DECLARE @max AS INT, @rc AS INT

SET @max = 1000000
SET @rc = 1

BEGIN TRAN
INSERT INTO Nums VALUES(1)

WHILE @rc * 2 <= @max
BEGIN
INSERT INTO Nums
SELECT n + @rc FROM Nums

SET @rc = @rc * 2
END

INSERT INTO Nums
SELECT n + @rc FROM Nums
WHERE n + @rc <= @max
COMMIT TRAN
*/

--
BG, SQL Server MVP
www.SolidQualityLearning.com

"Gert-Jan Strik" <so...@toomuchspamalready.nl> wrote in message
news:414F5EC3...@toomuchspamalready.nl...

Itzik Ben-Gan

unread,
Sep 21, 2004, 2:06:44 AM9/21/04
to
You're welcome. I'm glad to see you active in the newsgroup!

--
BG, SQL Server MVP
www.SolidQualityLearning.com


"Eric Sabine" <mopar41@mail_after_hot_not_before.com> wrote in message
news:eks1lnwn...@TK2MSFTNGP15.phx.gbl...

Itzik Ben-Gan

unread,
Sep 21, 2004, 2:15:24 AM9/21/04
to
I'm afraid not this time. See my reply to Gert-Jan. But, it was an
interesting review of the problem I must say. And it's a good thing that we
practice logic and not other stuff. Or, in the words of Arthur Conan Doyle:
"Crime is common. Logic is rare. Therefore it is upon the logic rather than
upon the crime that you should dwell."


:-)

--
BG, SQL Server MVP
www.SolidQualityLearning.com

"Joe Celko" <jcel...@earthlink.net> wrote in message

news:%23mX9Ckx...@TK2MSFTNGP11.phx.gbl...

Eric Sabine

unread,
Sep 21, 2004, 8:36:14 AM9/21/04
to
I try to contribute as much as I can. Maybe I will see you at PASS next
week if you're there but I don't see you as a presenter. I will buy _you_ a
beer if you're there.

Eric


Itzik Ben-Gan wrote:
> You're welcome. I'm glad to see you active in the newsgroup!
>
>

Itzik Ben-Gan

unread,
Sep 21, 2004, 9:03:24 AM9/21/04
to
I'm afraid I'm not attending PASS this year.

--
BG, SQL Server MVP
www.SolidQualityLearning.com

"Eric Sabine" <mopar41@mail_after_hot_not_before.com> wrote in message

news:%23vwZWe9...@TK2MSFTNGP10.phx.gbl...

Gert-Jan Strik

unread,
Sep 21, 2004, 6:33:25 PM9/21/04
to
Small update in the performance comparison:

I executed
CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime,endtime)
with drop_existing

on the existing table of 1832314 rows (399 distinct apps) and then reran
the cursor based solution and set based solution.

The cursor based solution benefited from the changed index. It now
finished in 0:16:57.

The set based solution finished in 1:15:28. The stats of this run were:
Table 'Sessions'. Scan count 812412, logical reads 93894032, physical
reads 1317, read-ahead reads 26329.
SQL Server Execution Times:
CPU time = 12529391 ms, elapsed time = 4526536 ms.

I will now let the server cool off :-)

Gert-Jan

0 new messages