Writing 'fast' SQL - where to start?

Roger Moran

unread,

Mar 19, 2002, 11:29:09 PM3/19/02

to

I'm not new to SQL, but am new to large databases. So I have to start
worrying about writing fast SQL, and perhaps other performance issues
(from a developer's point of view, not DBA's).

My question is rather basic and naive, but please also look at the
concrete example later below - I don't know HOW I'd have figured out
how to make that query go faster, if I was on my own.

I don't have access to a large database right now, but I know I will
be soon. So I can't yet test anything.

In a number of SQL books, I've noticed that they often solve a problem
using subqueries, whereas I can usually (not always, of course)
easily solve the same problem with just ordinary table joins.

I'm kind of assuming that normal joins would be the fastest, then
subqueries, then database cursors. Is that far too naive? Does it
totally depend on indexes, table partitioning, and other things found
in the "performance" section of books?

Or can I in fact find documentation (on the internet,
or in a book) which describes how to write fast SQL, which will help
regardless of where the indexes, etc, are?

-----
Ok, here's a concrete example, but it isn't necessarily related to
anything that I wrote above.

Someone recently had to do a SQL query against a large database. It had
to perform a join of two tables, and one of the columns was an
aggregate. It's a uses a standard Group By clause (i.e. all
non-aggregate columns appear in the Group By clause, as they should)
to get a sum of all hours a person worked, within a certain date range:

select b.col1, a.col2, b.col3,
convert(datetime,convert(varchar(10),b.the_datetime,101)) ,
sum(b.num_hours)
from table1 a,table2 b
where (b.col4 = a.col4) and (b.col1 in (<somelist>))
and (the_datetime >= DateAdd(Day,-7, <enddate>))
and (the_datetime <= <enddate>)
group by b.col1, a.col2, b.col3,
convert(datetime,convert(varchar(10),b.the_datetime,101))

(This is Sybase, but I don't think that it matters. The effect of the
two convert()'s is to change the_datetime to a varchar in order to
strip its time component - so as to leave just the date component -
then convert it back to a datetime type.
)

For the date range:
Instead of the <enddate>, substitute the end date. It then works
out the start date by going back 7 days, in this example.

Anyhow, this query took a very long time to run. What this experienced
person is to write code to put the above query into a temporary
table, but without the date range check (i.e. without the the_datetime
checks in the WHERE clause). The convert()'s still remain as before.

Only then does the person perform the date range check, i.e. on
the temporary table.

Lo and behold, it's way, way faster. Why?

I asked him how he knew to do this, and he just said "experience" = he
came across a similar problem a number of years ago.

I think that he said that there WAS an index on the the_datetime
column. His opinion was that there were too many values in that
index, and that's why the initial query was too slow. Well, I guess
that makes sense, but then what's the point of having an index on
a datetime field in the first place, if it indexes every millisecond?

Is there a standard way to approach these kinds of performance issues,
that is documented somewhere? There is documentation on indexes, of
course, and so forth. Is that the only thing available? I'm not sure
if I'd be able to apply that to the problem above. Is there documentation
that just compares various SQL techniques (normal SQL vs subqueries
vs cursors, for example) which assumes no indexes and so forth?

Or am I supposed to approach the problem like my experienced SQL
friend, who seems to put it down to just "experience" - i.e. just
try things until something gives me good speed, and learn as I go?

Todd Gillespie

unread,

Mar 21, 2002, 5:17:11 PM3/21/02

to

Roger Moran <rog7-m-...@ar.com.au> wrote:
<snip>
: I'm kind of assuming that normal joins would be the fastest, then

: subqueries, then database cursors. Is that far too naive? Does it
: totally depend on indexes, table partitioning, and other things found
: in the "performance" section of books?

Subqueries are frequently faster than joins not least because they avoid
stupid join path choices by the optimizer, but more significantly because
execution of a subquery can stop as soon as the predicate is satisfied. A
join does not have that easy way out.

Aakash Bordia

unread,

Apr 1, 2002, 6:54:05 PM4/1/02

to

You can look into vendor specific features such as summary
tables/materialized views etc. VLDB can take advantage of such db features.
Thanks
Aakash
"Roger Moran" <rog7-m-...@ar.com.au> wrote in message
news:3c97f...@bn.ar.com.au...

--CELKO--

unread,

Apr 2, 2002, 3:16:40 PM4/2/02

to

>> .. I have to start worrying about writing fast SQL, and perhaps

other performance issues (from a developer's point of view, not
DBA's). <<

Get a copy of SQL FOR SMARTIES, which has a lot of progrmaming tips in
Standad SQL-92.

>> In a number of SQL books, I've noticed that they often solve a
problem
using subqueries, whereas I can usually (not always, of course) easily
solve the same problem with just ordinary table joins. <<

A lot of times, the author just wants to demonstrate the use of a
subquery. These days, most optimizers are pretty good aobut
"flattening" a query into joins, so it is not a big problem. In the
old days, this was really true, especially in Oracle.

>> Ok, here's a concrete example ... a join of two tables, and one of

the columns was an aggregate. It's a uses a standard Group By clause

.. <<

NO! NO! NO! That is not anywhere near a Standard piece of code; in
SQL-92, you cannot do a GROUP BY on an expression-- it has to be a
column name.

SELECT B.col1, A.col2, B.col3, B.the_datetime, SUM(B.num_hours)
FROM Table1 AS A,
Table2 AS B
WHERE B.col4 = a.col4
AND B.col1 IN (<somelist>)
AND the_datetime
BETWEEN DATEADD (DAY,-7, :enddate) <== proprietary but
portable
AND the_datetime
GROUP BY B.col1, A.col2, B.col3, B.the_datetime;

The CONVERT() is CAST() in SQL-92. If you want to strip the time
component, do not do it by casting to a long, complex datatype like
VARCHAR(n). The real answer is that The original table should have
stripped the time back to 00:00:00.0000 if this column was to hold
only a date. You can do that with this bit of code in SQL Server, and
I assume it still works in Sybase.

UPDATE Tabl2
SET the_datetime
= CAST(CEILING(CAST(the_datetime AS FLOAT)) AS DATETIME);

Then change the DDL to show:

CREATE TABLE Table2
( ...
the_datetime DATEIME NOT NULL
CHECK (the_datetime
= CAST(CEILING(CAST(the_datetime AS FLOAT)) AS
DATETIME)

A lot of the slowness in a query is from a bad schema design.