SELECT DISTINCT slow, how can I speed up

Barry Bulsara

unread,

Aug 14, 2008, 2:18:41 PM8/14/08

to bbuls...@hotmail.com

Hello. I have a question about how I can structure a SQL query to make
the results come back faster. I have experimented a bit and include
some results here.

We have read-only SELECT access to tables for a payroll and absence
system running Oracle 10.0.1 on (I think) a dedicated Windows 2003
Server in the IT department on the same site. We do not have
privileges to UPDATE, DELETE, or create anymore indexes and if I did,
it would probably invalidate our software support contract.

We have staff using Access 2003 as a front end to the Oracle tables
for the purposes of ad hoc queries. These staff regularly enter
queries across 5-6 tables with various bits added to the SQL to limit
the number of rows returned and match criteria. All the rows starts to
come back within a second of the SQL starting to run.

When the SQL is like this (I have anonymised the table names).

SELECT a1,a2,a3,a5,a10
FROM table1, table2, table3, table4
WHERE table1.empID=table2.empID
AND table2.empID=table3.empID
AND table3.empID=table4.empID
AND table4.absenceT IN (15,19)
AND table4.year=2008;

The rows start to come back immediately. Around 1700-1800 rows
typically come back and there is a lot of network IO. If I type

SELECT COUNT(*) FROM
(
SELECT a1,a2,a3,a5,a10
FROM table1, table2, table3, table4
WHERE table1.empID=table2.empID
AND table2.empID=table3.empID
AND table3.empID=table4.empID
AND table4.absenceT IN (15,19)
AND table4.year=2008
)

it returns in 2.14seconds saying it found 1792 rows.

If I type SELECT DISTINCT a1,a2,a3,a5,a10

only 20-50 rows come back and the query takes 28 seconds and there is
a lot less network IO.

I only want the 20-50 rows but I don't want to wait 28 seconds. As
Oracle seems to find all 1792 rows in 2.14seconds, why does it take 28
seconds only to return the unique ones.

These are ad hoc queries so I am not trying to optimize a single
query. My question is Is there a more quick way of generally returning
a DISTINCT set of rows other than using DISTINCT in the SELECT
statement.

Thank you
Barry

hpuxrac

unread,

Aug 14, 2008, 3:28:10 PM8/14/08

to

Explain plans as well as output from a 10046 trace would be helpful
for something like this.

In general SELECT DISTINCT is going to have some additional overhead
in oracle.

Having a SELECT COUNT return in 2+ seconds is not quite the same as
having 2000 rows come back as oracle may be coming up with quite
different execution plans between the SELECT COUNT and the SELECT
columns.

Can you code the query in such as way that you don't need to use
DISTINCT? The EXISTS/NOT EXISTS and IN/NOT IN constructs are often
useful in situations like this if you cannot force uniqueness just by
correct table joins.

sybr...@hccnet.nl

unread,

Aug 14, 2008, 3:35:54 PM8/14/08

to

On Thu, 14 Aug 2008 11:18:41 -0700 (PDT), Barry Bulsara
<bbuls...@hotmail.com> wrote:

>If I type SELECT DISTINCT a1,a2,a3,a5,a10
>
>only 20-50 rows come back and the query takes 28 seconds and there is
>a lot less network IO.
>

Sure, every query consists of
parse
execute
fetch

Distinct requires SORT and SORT is in the execute phase. Execute is on
the database server.

>I only want the 20-50 rows but I don't want to wait 28 seconds. As
>Oracle seems to find all 1792 rows in 2.14seconds, why does it take 28
>seconds only to return the unique ones.
>

using DISTINCT requires a SORT operation.
Sorts can be tuned.

>These are ad hoc queries so I am not trying to optimize a single
>query. My question is Is there a more quick way of generally returning
>a DISTINCT set of rows other than using DISTINCT in the SELECT
>statement.

You would still need a SORT, wouldn't you?

--
Sybrand Bakker
Senior Oracle DBA

ora...@msn.com

unread,

Aug 14, 2008, 3:45:10 PM8/14/08

to

You might try this, although I don't know how much faster, if at all,
it will be:

SELECT a1,a2,a3,a5,a10
FROM table1, table2, table3, table4
WHERE table1.empID=table2.empID
AND table2.empID=table3.empID
AND table3.empID=table4.empID
AND table4.absenceT IN (15,19)
AND table4.year=2008

group by a1, a2, a3, a5, a10;

Possibly you should ask about increasing the hash_area_size parameter
value as 10g uses hashing algorithms to process such requests, rather
than the old-style sort operations.

David Fitzjarrell

ora...@msn.com

unread,

Aug 14, 2008, 3:54:50 PM8/14/08

to

> David Fitzjarrell- Hide quoted text -
>
> - Show quoted text -

With a 8 million records and no index I find the group by executes in
less than half of the time of distinct:

SQL> select distinct ename
2 from dist_tst;

ENAME
--------------------
NANCY10
NANCY2
NANCY7
NANCY9
NANCY5
NANCY3
NANCY6
NANCY1
NANCY4
NANCY8

10 rows selected.

Elapsed: 00:00:14.32

Execution Plan
----------------------------------------------------------
Plan hash value: 3730636855

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost
(%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8049K| 92M| |
45750 (9)| 00:09:10 |
| 1 | HASH UNIQUE | | 8049K| 92M| 308M|
45750 (9)| 00:09:10 |
| 2 | TABLE ACCESS FULL| DIST_TST | 8049K| 92M| | 5918
(10)| 00:01:12 |
---------------------------------------------------------------------------------------

Note
-----
- dynamic sampling used for this statement

Statistics
----------------------------------------------------------
49 recursive calls
1 db block gets
47196 consistent gets
11429 physical reads
1680880 redo size
347 bytes sent via SQL*Net to client
246 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10 rows processed

SQL>
SQL> select ename
2 from dist_tst
3 group by ename;

ENAME
--------------------
NANCY9
NANCY10
NANCY2
NANCY6
NANCY4
NANCY7
NANCY5
NANCY3
NANCY1
NANCY8

10 rows selected.

Elapsed: 00:00:05.79

Execution Plan
----------------------------------------------------------
Plan hash value: 1228703371

-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8049K| 92M| 9067 (42)|
00:01:49 |
| 1 | HASH GROUP BY | | 8049K| 92M| 9067 (42)|
00:01:49 |
| 2 | TABLE ACCESS FULL| DIST_TST | 8049K| 92M| 5918 (10)|
00:01:12 |
-------------------------------------------------------------------------------

Note
-----
- dynamic sampling used for this statement

Statistics
----------------------------------------------------------
4 recursive calls
0 db block gets
23699 consistent gets
11462 physical reads
0 redo size
348 bytes sent via SQL*Net to client
246 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10 rows processed

SQL>

So it may help you, it may not. You'll need to test this on your own
data.

David Fitzjarrell

William Robertson

unread,

Aug 16, 2008, 6:32:29 PM8/16/08

to

A COUNT query can often use a very different execution plan to the
corresponding SELECT <actual columns>, because without the need to
access the columns it may be able to get the required result entirely
from indexes, avoiding some table accesses.