Dupe values vs. nulls in join results

Paul Smith

unread,

Jun 7, 2002, 12:33:25 AM6/7/02

to

Greetings -

I'm having trouble isolating what factors control whether
certain columns are populated with NULLS vs. duplicated
values from one of the joined tables in a resultset obtained
from a join. From my experience it does not seem to be the
inner/outer/full nature of the join itself, but rather has
something to do with swapping criteria between the [join...
on] clause and a [where] clause. I've gotten the differing
results I want for different situations via trial and error a
time or two, but I'm not comfortable with that long-term...

Basically, it seems easy enough to get <NULLs> in the
appropriate columns when I want them, but occasionally
even with inner joins I've seen values from one of the joined
tables repeat in columns where <NULLs> would otherwise
be, even though the source rows for those values should not
strictly be included in the join. Usually, this is only due
to my hosing up something in the SQL late at night, and it's
pretty quick to get rid of given a good cup of coffee. However,
there are actually rare situations where I'd like to take
advantage of this or similar behavior if I can isolate the
cause and it happens to be dependable...

Thanks in advance if anyone can help me define this!

Tom Moreau

unread,

Jun 7, 2002, 6:54:52 PM6/7/02

to

If you post your DDL plus INSERTS plus expected results, we can help. There
are some issues around OUTER JOIN's and filter criteria, i.e. you put the
filter criteria for the unpreserved table in the ON predicate of the
LEFT/RIGHT JOIN clause. What specific problems are you having?

--
Tom

----------------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA, MCT
SQL Server MVP
Columnist, SQL Server Professional

Toronto, ON Canada
www.pinnaclepublishing.com/sql
www.apress.com
---
Paul Smith wrote in message
<1HWL8.130206$ux5.1...@rwcrnsc51.ops.asp.att.net>...

Paul Smith

unread,

Jun 7, 2002, 9:49:23 PM6/7/02

to

> "Tom Moreau" wrote:
> If you post your DDL plus INSERTS plus expected results, we can help. There
> are some issues around OUTER JOIN's and filter criteria, i.e. you put the
> filter criteria for the unpreserved table in the ON predicate of the
> LEFT/RIGHT JOIN clause. What specific problems are you having?

Paul Smith replies:
I keep forgetting to send my SQL to my home email address; I have Usenet
access only at home (except via Google Groups), and the SQL stuff only at
work!

Anyway, I think I've figured this out...

My problem was that one of my joins used a parameter instead of a related
column in the other table as the [on] criteria. What I discovered in doing so
was that this sort of "join" produces duplicate values from the left table in
the results, even with inner joins. e.g. when joining [temptbl] and [othertbl],
if my parm name is @ID1, then an [on] clause of type "where temptbl.id =
@id1" causes the repeats, whereas an [on] clause of type "where temptbl.id =
othertbl.id" would cause <NULL> to appear in the same columns. This
makes perfect sense now that I'm awake.

Thanks!

--CELKO--

unread,

Jun 8, 2002, 1:24:08 PM6/8/02

to

>> I'm having trouble isolating what factors control whether certain
columns are populated with NULLS vs. duplicated values from one of the
joined tables in a resultset obtained from a join. <<

Here is how a SELECT works in SQL ... at least in theory. Real
products will optimize things when they can.

a) Start in the FROM clause and build a working table from all of the
joins, unions, intersections, and whatever other table constructors
are there. The table expression> AS <correlation name> option allows
you give a name to this working table which you then have to use for
the rest of the containing query.

b) Go to the WHERE clause and remove rows that do not pass criteria;
that is, that do not test to TRUE (reject UNKNOWN and FALSE). The
WHERE clause is applied to the working in the FROM clause.

c) Go to the optional GROUP BY clause, make groups and reduce each
group to a single row, replacing the original working table with the
new grouped table. The rows of a grouped table must be group
characteristics: (1) a grouping column (2) a statistic about the group
(i.e. aggregate functions) (3) a function or (4) an expression made up
of the those three items.

d) Go to the optional HAVING clause and apply it against the grouped
working table; if there was no GROUP BY clause, treat the entire table
as one group.

e) Go to the SELECT clause and construct the expressions in the list.
This means that the scalar subqueries, function calls and expressions
in the SELECT are done after all the other clauses are done. The AS
operator can give a name to expressions in the SELECT list, too.
These new names come into existence all at once, but after the WHERE
clause, GROUP BY clause and HAVING clause has been executed; you
cannot use them in the SELECT list or the WHERE clause for that
reason.

If there is a SELECT DISTINCT, then redundant duplicate rows are
removed. For purposes of defining a duplicate row, NULLs are treated
as matching (just like in the GROUP BY).

f) Nested query expressions follow the usual scoping rules you would
expect from a block structured language like C, Pascal, Algol, etc.
Namely, the innermost queries can reference columns and tables in the
queries in which they are contained.

Here is how OUTER JOINs work in SQL-92. Assume you are given:

Table1 Table2
a b a c
====== ======
1 w 1 r
2 x 2 s
3 y 3 t
4 z

and the outer join expression:

Table1
LEFT OUTER JOIN
Table2
ON Table1.a = Table2.a <== join condition
AND Table2.c = 't'; <== single table condition

We call Table1 the "preserved table" and Table2 the "unpreserved
table" in the query. What I am going to give you is a little
different, but equivalent to the ANSI/ISO standards.

1) We build the CROSS JOIN of the two tables. Scan each row in the
result set.

2) If the predicate tests TRUE for that row, then you keep it. You
also remove all rows derived from it from the CROSS JOIN

3) If the predicate tests FALSE or UNKNOWN for that row, then keep the
columns from the preserved table, convert all the columns from the
unpreserved table to NULLs and remove the duplicates.

So let us execute this by hand:

Let @ = passed the first predicate
Let * = passed the second predicate

Table1 CROSS JOIN Table2
a b a c
=========================
1 w 1 r @
1 w 2 s
1 w 3 t *
2 x 1 r
2 x 2 s @
2 x 3 t *
3 y 1 r
3 y 2 s
3 y 3 t @* <== the TRUE set
4 z 1 r
4 z 2 s
4 z 3 t *

Table1 LEFT OUTER JOIN Table2
a b a c
=========================
3 y 3 t <= only TRUE row
-----------------------
1 w NULL NULL Sets of duplicates
1 w NULL NULL
1 w NULL NULL
-----------------------
2 x NULL NULL
2 x NULL NULL
2 x NULL NULL
3 y NULL NULL <== derived from the TRUE set - Remove
3 y NULL NULL
-----------------------
4 z NULL NULL
4 z NULL NULL
4 z NULL NULL

the final results:

Table1 LEFT OUTER JOIN Table2
a b a c
=========================
1 w NULL NULL
2 x NULL NULL
3 y 3 t
4 z NULL NULL

The basic rule is that every row in the preserved table is represented
in the results in at least one result row.

There are limitations and very serious problems with the extended
equality version of an outer join used in some diseased mutant
products. Consider the two Chris Date tables

Suppliers SupParts
supno supno partno qty
========= ==============
S1 S1 P1 100
S2 S1 P2 250
S3 S2 P1 100
S2 P2 250

and let's do an extended equality outer join like this:

SELECT *
FROM Supplier, SupParts
WHERE Supplier.supno *= SupParts.supno
AND qty < 200;

If I do the outer first, I get:

Suppliers LOJ SupParts
supno supno partno qty
=======================
S1 S1 P1 100
S1 S1 P2 250
S2 S2 P1 100
S2 S2 P2 250
S3 NULL NULL NULL

Then I apply the (qty < 200) predicate and get

Suppliers LOJ SupParts
supno supno partno qty
===================
S1 S1 P1 100
S2 S2 P1 100

Doing it in the opposite order

Suppliers LOJ SupParts
supno supno partno qty
===================
S1 S1 P1 100
S2 S2 P1 100
S3 NULL NULL NULL

Sybase does it one way, Oracle does it the other and Centura (nee
Gupta) lets you pick which one -- the worst of both non-standard
worlds! In SQL-92, you have a choice and can force the order of
execution. Either do the predicates after the join ...

SELECT *
FROM Supplier
LEFT OUTER JOIN
SupParts
ON Supplier.supno = SupParts.supno
WHERE qty < 200;

... or do it in the joining:

SELECT *
FROM Supplier
LEFT OUTER JOIN
SupParts
ON Supplier.supno = SupParts.supno
AND qty < 200;

Another problem is that you cannot show the same table as preserved
and unpreserved in the extended equality version, but it is easy in
SQL-92. For example to find the students who have taken Math 101 and
might have taken Math 102:

SELECT C1.student, C1.math, C2.math
FROM (SELECT * FROM Courses WHERE math = 101) AS C1
LEFT OUTER JOIN
(SELECT * FROM Courses WHERE math = 102) AS C2
ON C1.student = C2.student;

Tom Moreau

unread,

Jun 9, 2002, 7:59:57 AM6/9/02

to

What? You mean you don't do SQL in your sleep? ;-)

--
Tom

----------------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA, MCT
SQL Server MVP
Columnist, SQL Server Professional

Toronto, ON Canada
www.pinnaclepublishing.com/sql
www.apress.com
---

Paul Smith wrote in message ...