anti-joins and performance

Fredrik Bertilsson

unread,

Nov 13, 2012, 12:39:22 AM11/13/12

to

Are there any way for an anti-join like below to perform better than O(N), N=number of rows in table1? In my case (Informix) the number of rows in both table1 and table2 are huge, and the expected result is only a few rows. Are there any chance that a database engine could return a result without actually having to make a full table scan in table1?

select *
from table1 t1
where not exists (select * from table2 t2 where t2.key=t1.key)

paul c

unread,

Nov 14, 2012, 9:24:10 PM11/14/12

to

I presume this is the same as asking is it possible to have a row in
table1 whose "key" doesn't match the key of any row in table2 and avoid
"reading" that table1 row? I'd say the answer to that is no, ie., it's
not logically possible to know that a table1 row's key doesn't match a
table2 key without knowing the table1 key.

So my guess is that you are really intending a situation that has
physical nuances not mentioned in the question as written.

As far as I know, "table scan" means sequential reading of all the
physical blocks and thus all the rows of a table. I would think any dbms
that supports clustered indexes would give O(N) performance, by scanning
both tables in parallel, also that this would be pretty good performance
too. Is it possible that the keys don't have clustered indexes? If
neither does, performance could be pretty poor. Non-clustered indexes
wouldn't be good enough, nor would some pathologically fragmented
underlying filesystem where every physical block was far away from the
sequentially adjacent block.

I'm saying this without knowing anything about Informix, not even what
terminology it uses.

Cimode

unread,

Dec 21, 2012, 3:02:21 AM12/21/12

to

Try the opposite.

select *
from table1
where key not in (select key from table1 t1 inner join table2 t2 on t1.key=t2.key )

Hope this helps.

@paulc
I would like to correct a few misconceptions about direct image systems, you unknowingly are spreading in your comment:

> On modern storage IO subsystems such as RAID arrays, it does not make sense to speak of sequentiality regarding rows but it could make sense to speak about adjacency. All IO read/write operations of rows presented to user can result in non physically sequential IO's decided by the IO subsystem (such as the IO controller). In other words, what is usually *presented* as a sequential read most of the time in fact ends up as a physical read on multiple disks in a *time sequence* determined by the IO subsystem algorthythmics. This is a consequence of logical physical layer confusion direct image system induce since they do not achieve true physical data independence.
> On direct image systems, the difference between clustered vs full table scans simply imply the quantity of logical IO's performed to determine which one meets a specific predicate. WIth a clustered index scan, such quantity is reduced by keeping range value statistics about values stored and skipping the one that do not meet the predicate. In other words, a clustered index scan is only more effective/economic method than full scan but not more efficient.
> A direct image system implementation can not be a relational implementation because it *inherently* the principle of data independence since the data physical encoding is directly bound the user data presentation. The latest transrelational model is an another example of direct image system, which explains it was doomed to fail.

Regards