Informix SQL - How to place a condition on COUNT( DISTINCT)

David Grove

unread,

Jan 24, 2001, 2:58:16 PM1/24/01

to

Folks,

How does one apply a HAVING condition to a COUNT( DISTINCT col_name ) in a
SELECT expression?

For instance, if I wish to identify and resolve duplications between two
attributes, say "cust_no" and "ssn" (neither is a PK for the table), I might
try:

SELECT cust_no, COUNT( DISTINCT ssn )
FROM customer
GROUP BY cust_no;

I could examine the results and determine which cust_no's had multiple ssn's
associated with them. However, if the table were millions of rows, and
there were only a few cust_no's that had multiple ssn's, I would prefer to
see in the result only those rows in which the number of distinct ssn's was
greater than 1, as in:

SELECT cust_no, COUNT(DISTINCT ssn )
FROM customer
GROUP BY cust_no
HAVING COUNT( DISTINCT ssn ) > 1;

But that is a syntax error because "DISTINCT" is used more than once in the
query, and Informix seems not to permit the use of
"position-number-in-the-SELECT" as a synonym for the expression itself.

Can anyone suggest a way to formulate this query, properly, for Informix?

I'm wondering if it can be done without using multiple SELECT's and saving
intermediate results in a temporary table.

Thank you for any suggestions.

Regards,

David Grove

Jay Roberts

unread,

Jan 24, 2001, 6:04:11 PM1/24/01

to

Does running it like this give you the results you need?

SELECT cust_no,ssn, COUNT( ssn )
FROM customer
GROUP BY cust_no,ssn
HAVING COUNT( ssn ) > 1;

Andrew Hamm

unread,

Jan 24, 2001, 7:10:09 PM1/24/01

to

David Grove wrote in message ...

>Folks,
>
>How does one apply a HAVING condition to a COUNT( DISTINCT col_name ) in a
>SELECT expression?
>

<SNIP>

I've mucked around with this and can't find any way of getting it working
either. Very curious! It should work, 'cos the query makes sense.

The closest I can get is

SELECT cust_no, COUNT(ssn )

FROM customer
GROUP BY cust_no
HAVING COUNT( DISTINCT ssn ) > 1

but then the selected count is wrong, although at least you only see the
matching cust_nos.

Hmmm - maybe a self-join to apply that as a filter?

Tried this (using on of my tables - I don't know what your key fields are so
substitute...):

select x0.cust_no, count(distinct x0.ssn)
from customer x0, customer x1
where x1.KEYFIELD = x0.KEYFIELD
group by 1
having count(distinct x1.ssn) > 1

Bastard! (oops, sorry Earle) No luck.

or a sub-query? That's starting to sound ugly... at least it's not
correlated:

select x0.cust_no, count(distinct x0.ssn)
from customer x0
where x0.cust_no in (
select x1.cust_no
from customer x1
group by 1
having count(distinct x1.ssn) > 1
)
group by 1

WOO HOO!
success...

Codd knows how efficient it will be with your tables and indexes...

David Grove

unread,

Jan 24, 2001, 7:17:22 PM1/24/01

to

Thanks for your suggestion.

I think your proposal is close, but not quite what I'm looking for.
Consider a customer table (with many other unshown attributes, including a
sequence number which serves as PK) that contains rows with the following
values for two of the attributes (cust_no and ssn):

cust_no ssn
123 123-45-6789
123 123-45-6789
123 123-45-6789
123 987-65-4321
123 987-65-4321
123 567-89-1234

Note that there are 3 distinct ssn's for a particular cust_no. I believe
the result of your query would be as follows (note the single ssn is
missing):

cust_no ssn COUNT(ssn)
123 123-45-6789 3
123 987-65-4321 2

Further, if there were, say, 3 different ssn's, but each one occurred only
once, then that particular cust_no wouldn't show up at all in the results, I
think.

What I would prefer is either of the following:

cust_no # of distinct ssn's
123 3

or else:

cust_no distinct ssn's
123 123-45-6789
123 987-65-4321
123 567-89-1234

I appreciate you taking your valuable time to help. If I misunderstood your
thoughts, I apologize.

Regards,

DG

Folks,

Regards,

David Grove

"Jay Roberts" <Jay_R...@bourns.com> wrote in message
news:94nnm1$51p$1...@news.xmission.com...

David Grove

unread,

Jan 24, 2001, 7:29:18 PM1/24/01

to

Sorry for redundancy (or should I say reDUNCEancy) in the previous post.

DG

David Grove

unread,

Jan 24, 2001, 7:47:13 PM1/24/01

to

Thank you, Mr. Hamm.

I finally arrived at:

SELECT DISTINCT cust_no
FROM customer as c1
WHERE 1 < (SELECT COUNT(DISTINCT c2.ssn)
FROM customer as c2
WHERE c1.cust_no = c2.cust_no);

Which doesn't do the complete job (it provides a list of cust_no's only,
with no indication of the actual number or values of duplicate ssn's. And
besides, it's correlated. Not pretty.

I like your proposal much better.

Still, it seems odd to me that Informix won't let me add a constraint to a
SELECTed aggregate function containing DISTINCT. As I recall, even MS
Access permits that (through the use of a synonym in the HAVING clause), but
I could be mistaken.

Does anyone know whether this feature (inability to constrain in HAVING
clause a DISTINCT aggregate function in the SELECT clause) is a consequence
of a SQL Standard, or just an Informix implementation issue?

Regards,

DG

"Andrew Hamm" <ah...@sanderson.net.au> wrote in message
news:3a6f6da6$1...@news.iprimus.com.au...

Andrew Hamm

unread,

Jan 24, 2001, 8:27:48 PM1/24/01

to

David Grove wrote in message ...

>Thank you, Mr. Hamm.
You are welcome!

>Still, it seems odd to me that Informix won't let me add a constraint to a
>SELECTed aggregate function containing DISTINCT. As I recall, even MS
>Access permits that (through the use of a synonym in the HAVING clause),
but
>I could be mistaken.
>
>Does anyone know whether this feature (inability to constrain in HAVING
>clause a DISTINCT aggregate function in the SELECT clause) is a consequence
>of a SQL Standard, or just an Informix implementation issue?

I'd guess an implementation issue. My minor experience with select distinct
in the past showed that it's really cranky, as if there's a boolean in the
optimiser that says "OK - seen a DISTINCT keyword! Reject all others at all
cost!" and it looks like there is no consideration shown to occasional
situations where it makes sense. I'm sure your original query makes sense,
unless someone can prove otherwise.

brow...@my-deja.com

unread,

Jan 25, 2001, 6:38:44 PM1/25/01

to

In article <t6uck19...@corp.supernews.com>,

"David Grove" <david...@health.state.ak.us> wrote:
> Folks,
>
> How does one apply a HAVING condition to a COUNT( DISTINCT col_name )
in a
> SELECT expression?

This works in 9.21.

DROP TABLE Test_Data;
--
CREATE TABLE Test_Data (
Id SERIAL PRIMARY KEY,
Val CHAR(2) NOT NULL,
Grp CHAR(2) NOT NULL
);
--
INSERT INTO Test_Data
( Val, Grp )
SELECT T1.C || T2.C,
T3.C
FROM TABLE(SET{'A','A','B','B','C','C','D'}) T1 ( C ),
TABLE(SET{'A','A','B','B','C','C','D'}) T2 ( C ),
TABLE(SET{'A','A','B','B','C','C','D'}) T3 ( C )
WHERE T3.C < T2.C;
--
SELECT T.Val,
COUNT(DISTINCT T.Grp )
FROM Test_Data T
GROUP BY T.Val;
--
SELECT C.Val, C.Cnt
FROM TABLE(MULTISET( SELECT T.Val,
COUNT(DISTINCT T.Grp )
FROM Test_Data T
GROUP BY T.Val )) C ( Val, Cnt )
WHERE C.Cnt > 2;

This is an example of a "closed" query expression, which means
putting another SELECT statement in the FROM list. The implementation
isn't as good as it could be, but it will work pretty well for most
things, and you avoid all the temp table stuff.

Hope this helps!

KR

Pb

Sent via Deja.com
http://www.deja.com/