Cluster analysis on dataset with ordinal and nominal data

cybersurferus

Feb 16, 2005, 3:28:45 PM
Hi,
I am required to perform cluster analysis on a dataset which has
ordered-category (Likert scale) data as well as ordinal (e.g., age) and
nominal (e.g., race) data. In order to perform the analysis, I plan to
transform the ordinal and Likert-scale data to a continuous scale x
with the following function:
x = (r - 1)/(N - 1), where r = 1, 2, ..., N are the ordinal ranks.
Also, each nominal variable can be transformed to m binary (0,1)
variables where m is the number of categories of that nominal variable.
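
A rough sketch of that recoding (shown here in Python/pandas with
made-up column names, rather than SAS):

import pandas as pd

# Toy data: one Likert item (ranks 1..5), one ordered age band (ranks 1..4),
# and one nominal variable.
df = pd.DataFrame({
    "satisfaction": [1, 3, 5, 2, 4],
    "age_band":     [1, 2, 4, 3, 2],
    "race":         ["A", "B", "A", "C", "B"],
})

# Ordinal/Likert: x = (r - 1)/(N - 1), mapping the ranks onto [0, 1]
for col, n_ranks in [("satisfaction", 5), ("age_band", 4)]:
    df[col] = (df[col] - 1) / (n_ranks - 1)

# Nominal: m binary (0/1) indicator variables, one per category
df = pd.get_dummies(df, columns=["race"], dtype=int)
print(df)
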
I then plan to use PROC CLUSTER with some sort of distance measure. My
questions are:
1) What distance measure would be best for these types of variables?
2) Is there any alternative method of cluster analysis for this dataset
containing both nominal and ordinal variables?
3) Does SAS 9 incorporate latent class analysis models?


Thanks in advance,
Pat

Eugene Gallagher

Feb 16, 2005, 6:13:44 PM
cybersurferus wrote:

The only index that I know of specifically designed to utilize interval-scale,
ordinal, and nominal data together is Gower's generalized similarity. There is
a nice description in Legendre & Legendre's Numerical Ecology book. I know
it is available in the commercial MVSP analysis package. A quick Google
search brought up this page for clustering with SAS. There appear to be a fair
number of hits on the web.


http://ftp.sas.com/techsup/download/stat/distnew.html
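
For reference, a minimal sketch of Gower's coefficient between two records
(a simplification: equal weights, no missing values, and ordinal variables
treated like quantitative ones after rank-scaling; Python, not the SAS macro):

import numpy as np

def gower_similarity(x, y, kinds, ranges):
    # x, y: two records; kinds: "num" or "nom" per variable;
    # ranges: range of each numeric variable over the whole data set.
    sims = []
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == "nom":
            sims.append(1.0 if xi == yi else 0.0)      # match / mismatch
        else:                                          # quantitative or rank-scaled ordinal
            sims.append(1.0 - abs(xi - yi) / rng if rng > 0 else 1.0)
    return np.mean(sims)

# Example: (age, race, rescaled Likert item) for two respondents
s = gower_similarity((34, "A", 0.75), (41, "B", 0.50),
                     kinds=("num", "nom", "num"),
                     ranges=(60, None, 1.0))
print(1 - s)  # Gower distance = 1 - similarity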

Richard Ulrich

Feb 17, 2005, 12:04:36 AM
On 16 Feb 2005 12:28:45 -0800, "cybersurferus" <pat...@gmail.com>
wrote:

In case you are not committed to exactly that -

I would suggest, for a start:

Confine your grouping analyses to the set of similar data,
that is, the Likert-type data. Use age and race as part of
your description after the fact. The silly part of clustering,
it has always seemed to me, is that you have to FIGURE OUT
what you have after the fact, by ANOVA or whatever.
- If nothing else, limiting the analysis to the Likert items removes most
of the question of what "distance" to worry about.

Sets of Likert-type items are very often reduced by using
factor analysis, which is where I would probably start when
I have a set of them.
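
As a rough sketch of that workflow (Python/scikit-learn with made-up data,
purely for concreteness - reduce the Likert items, cluster in the factor
space, then describe the clusters with age and race after the fact):

import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(rng.integers(1, 6, size=(n, 8)),        # 8 Likert items, 1..5
                  columns=[f"item_{i}" for i in range(1, 9)])
df["age"] = rng.integers(18, 80, size=n)
df["race"] = rng.choice(list("ABC"), size=n)

# Reduce the Likert items by factor analysis, then cluster the factor scores
scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(
    df.filter(like="item_"))
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

# Describe the clusters after the fact with the variables left out of them
print(df.groupby("cluster")["age"].mean())
print(pd.crosstab(df["cluster"], df["race"]))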


--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

Art Kendall

Feb 17, 2005, 8:54:14 AM
The TWOSTEP cluster procedure in SPSS handles variables at different
levels of measurement. It produces a simple clustering, not a
hierarchical clustering.

Details of how to proceed depend on specifics such as what questions
you are using the data to answer, any logical sets of variables, etc.

If you are using Likert scales (the sum or mean of a set of Likert items),
the results are very unlikely to be discrepant from the interval level of
measurement.

If you have Likert items meant to be part of scales, you might want
to do item analysis (e.g., with RELIABILITY) and see if they do comprise
a scale or scales.
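
Outside of SPSS, the core of that item analysis is something like Cronbach's
alpha; a minimal sketch (Python, toy data):

import numpy as np

def cronbach_alpha(items):
    # items: 2-D array, rows = respondents, columns = items
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy example: 5 respondents, 4 items on a 1..5 scale
items = [[4, 5, 4, 5],
         [2, 2, 3, 2],
         [5, 4, 5, 4],
         [1, 2, 1, 2],
         [3, 3, 4, 3]]
print(cronbach_alpha(items))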

Age is usually considered a ratio level variable. There is an intrinsic
meaning to zero.

Art
A...@DrKendall.org
Social Research Consultants
University Park, MD USA
(301) 864-5570

Data Matter

Mar 2, 2005, 12:08:24 AM
I agree with Richard. Even though there are procedures for different
types of data, in practical applications, there are problems when you
mix continuous, ordinal and nominal data.

Given the mixture, you've got to standardize the variables. Now if you
have a binary variable (call it X) mixed with continuous variables, for
instance, then the difference in X between any two records is either 0
or 1 (with everything standardized to [0, 1], say), whereas for any
continuous variable it is a fraction. If you don't then weight the
variables, the binary variables will dominate. This gets more acute as
the number of categories grows.
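
A toy illustration of the point, with made-up numbers:

import numpy as np

# Two records, both already min-max scaled to [0, 1]:
a = np.array([0.62, 1.0])   # (scaled income, gender dummy) for record A
b = np.array([0.58, 0.0])   # for record B

diffs = np.abs(a - b)
print(diffs)                        # [0.04, 1.0] -- the dummy dominates
print(np.sqrt((diffs ** 2).sum()))  # Euclidean distance, approx. 1.0008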

Best,
Kaiser

Reef Fish

Mar 2, 2005, 10:16:39 PM

Data Matter wrote:
> I agree with Richard. Even though there are procedures for different
> types of data, in practical applications, there are problems when you
> mix continuous, ordinal and nominal data.

There are problems with any data if the person analyzing it doesn't
know what s/he is doing!

There's nothing wrong with data that consist of nominal, ordinal,
and interval-valued (or continuous) data. That is LIFE.

When one uses that kind of data in cluster analysis (or numerical
taxonomy), one simply has to make sure the measure used makes
substantive sense.

Read a few books on Numerical Taxonomy and you'll get a better
idea of what's valid/useful and what's not.

>
> Given the mixture, you've got to standardize the variables.

That's a complete red herring. Even for data that are not binary or
nominal or ordinal, standardizing the variables would only
over-weight the variables that are highly correlated.

That's all I am going to say on a subject which takes years to
learn well and is not going to be learned in a newsgroup such as
this one. Mahalanobis distance, Manhattan and the other Minkowski
and Minkowski-like metrics, and a hundred other metrics that
have been used in the clustering literature immediately come to
mind -- and most of them would take very careful scrutiny before
one can decide whether any of them is sensible for the data at hand.

-- Bob.

Data Matter

Mar 3, 2005, 1:19:05 AM
Bob,

We are in agreement. What I see in practice, especially with
prepackaged software like SAS, is that people tend to let the software
decide what similarity measure to use, how to weight the variables, etc.
A lot of this is due to time constraints (I'm speaking of a corporate
environment, not an academic/scientific setting).

If you need to read a few books on taxonomy before attempting the
analysis, why not think carefully before using nominal data at all?
Perhaps those variables are highly correlated with other, continuous
variables, for example.

As for standardizing or not, there is of course no hard and fast rule.
If you remove correlated variables before you standardize, or use
Mahalanobis distance, correlation is not a problem. Let's say this is a
financial services problem: the continuous variable is wealth and the
binary variable is male/female coded as 0/1. I don't see how you can
proceed without standardizing.
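
A toy illustration of that case, with made-up numbers:

import numpy as np

# (wealth in dollars, gender coded 0/1) for three customers
records = np.array([[250_000.0, 0],
                    [255_000.0, 1],
                    [ 60_000.0, 0]])

# Raw Euclidean distance: wealth swamps the 0/1 code entirely
raw_dist = np.linalg.norm(records[0] - records[1])      # approx. 5000

# z-score each column first, and gender actually contributes
z = (records - records.mean(axis=0)) / records.std(axis=0)
std_dist = np.linalg.norm(z[0] - z[1])
print(raw_dist, std_dist)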

K
