Thanks in advance,
Pat
The only index that I know of specifically designed to handle scale
(interval/ratio), ordinal, and nominal data together is Gower's generalized similarity. There is a
nice description in Legendre & Legendre's Numerical Ecology book. I know
it is available in the commercial MVSP analysis package. A quick google
brought up this page for clustering with SAS. There appears to be a fair
number of hits on the web.
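For readers who want to see the idea, here is a minimal sketch of a Gower-style similarity between two records. It is not taken from MVSP or SAS; the function name, the "numeric"/"nominal" labels, and the sample records are made up for illustration, and ordinal items are simply treated as range-scaled numbers, which is only one of several conventions.

```python
def gower_similarity(a, b, kinds, ranges):
    """Gower-style similarity between two records (hypothetical helper).

    kinds[k] is "numeric" or "nominal"; ranges[k] is the observed range of
    variable k over the whole data set (ignored for nominal variables).
    """
    total = 0.0
    count = 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # a missing value gives this variable zero weight
        if kind == "nominal":
            score = 1.0 if x == y else 0.0  # simple match / no match
        else:
            # range-scaled closeness; a constant variable counts as a match
            score = 1.0 - abs(x - y) / rng if rng else 1.0
        total += score
        count += 1
    if count == 0:
        raise ValueError("no comparable variables")
    return total / count


# Illustrative records: (age, race, one Likert item scored 1..5)
kinds = ("numeric", "nominal", "numeric")
ranges = (40.0, None, 4.0)  # assumed observed ranges: age 20..60, item 1..5
print(gower_similarity((30, "A", 4), (50, "B", 2), kinds, ranges))
```

Each variable contributes a score between 0 and 1, so no single type dominates by construction; the result here is about 0.33.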
In case you are not committed to exactly that -
I would suggest, for the start:
Confine your grouping analyses to the set of similar data,
that is, the Likert-type data. Use age and race as part of
your description after the fact. The silly part of clustering,
it has always seemed to me, is that you have to FIGURE OUT
what you have after the fact, by ANOVA or whatever.
- If nothing else, limiting the analysis to Likert removes most
of the question of what "distance" to worry about.
Sets of Likert-type items are very often reduced by using
factor analysis, which is where I would probably start when
I have a set of them.
--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html
Details of how to proceed depend on the details such as what questions
you are using the data to answer, any logical sets of variables, etc.
If you are using Likert scales (sum or mean of a set of Likert items),
the results are very unlikely to be discrepant from the interval level of
measurement.
If you have Likert items meant to be part of scales, you might want
to do item analysis (e.g., with RELIABILITY) and see if they do comprise
a scale or scales.
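As a rough illustration of what such an item analysis reports, the sketch below computes Cronbach's alpha, a common internal-consistency statistic. It is an assumption that this is the statistic of interest here, and the function name and sample scores are invented for the example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # sample variance of each item across respondents
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # sample variance of the summed scale score
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("scale scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores: three respondents, two items that move together exactly
print(cronbach_alpha([[1, 2], [2, 3], [3, 4]]))
```

Items that rise and fall together push alpha toward 1; items that do not belong on the same scale pull it down.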
Age is usually considered a ratio level variable. There is an intrinsic
meaning to zero.
Art
A...@DrKendall.org
Social Research Consultants
University Park, MD USA
(301) 864-5570
Given the mixture, you've got to standardize the variables. Now if you
have a binary variable (call it X) mixed with continuous variables for
instance, then the difference in X between any two records is either 0
or 1 (everything standardized to [0,1] say) whereas for any continuous
variable it is a fraction. If you don't then weight the variables, the
binary variables will dominate. This gets more acute with more levels.
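A small sketch of the effect described above, using invented numbers: both variables are put on [0,1], and the average squared pairwise difference is compared. The helper names are hypothetical.

```python
from itertools import combinations


def min_max_scale(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_squared_difference(values):
    """Average squared difference over all pairs of records."""
    pairs = list(combinations(values, 2))
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


continuous = min_max_scale([10, 20, 30, 40, 50])  # made-up continuous variable
binary = [0, 1, 0, 1, 0]                          # made-up 0/1 variable

print(mean_squared_difference(continuous))
print(mean_squared_difference(binary))
```

On this toy data the binary variable contributes about 0.6 per pair against about 0.31 for the continuous one, so it carries roughly twice the weight in a squared Euclidean distance unless it is down-weighted.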
Best,
Kaiser
There are problems with any data if the person analyzing it doesn't
know what s/he is doing!
There's nothing wrong with data that consist of nominal, ordinal,
and interval-valued (or continuous) data. That is LIFE.
When one uses that kind of data in cluster analysis (or numerical
taxonomy), one simply has to make sure the measure used makes
substantive sense.
Read a few books on Numerical Taxonomy and you'll get a better
idea of what's valid/useful and what's not.
>
> Given the mixture, you've got to standardize the variables.
That's a complete red herring. Even for data that are not binary,
nominal, or ordinal, standardizing the variables would only
overweight the variables that are highly correlated.
That's all I am going to say, on a subject which takes years to
learn well and is not going to happen in a newsgroup such as
this one. Mahalanobis distance, Manhattan and other Minkowski
and Minkowski-like metrics, and a hundred other metrics that
have been used in the clustering literature immediately come to
mind -- and most of them would take very careful scrutiny before
one can decide whether it's sensible to use any of them or not.
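For reference, the Minkowski family mentioned above can be sketched in a few lines; order 1 gives the Manhattan distance and order 2 the Euclidean distance. The function name is made up, and nothing here says which order, if any, is sensible for a given data set.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric records."""
    if p < 1:
        raise ValueError("p must be at least 1 for a proper metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


print(minkowski((0, 0), (3, 4), 1))  # Manhattan: 3 + 4
print(minkowski((0, 0), (3, 4), 2))  # Euclidean: the 3-4-5 triangle
```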
-- Bob.
We are in agreement. What I see in practice, especially when using
prepackaged software like SAS, is that people tend to let the software
decide what similarity measure to use, how to weight the metrics, etc. A
lot of this is due to time constraints (I'm speaking of a corporate
environment, not an academic/scientific setting).
If you need to read a few books on taxonomy before attempting the
analysis, why not think carefully before you use nominal data?
Perhaps these variables are highly correlated with other continuous
variables, for example.
As for standardizing or not, there is of course no hard and fast rule.
If you remove correlated variables before you standardize, or use
Mahalanobis, correlation is not a problem. Let's say this is a
financial services problem. The continuous variable is wealth and the
binary variable is male/female coded as 0/1. I don't see how you can
proceed without standardizing.
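A toy version of that example, with invented wealth figures, shows how much of the squared Euclidean distance between two records comes from the 0/1 variable before and after z-scoring. The helper names are hypothetical, and z-scoring is just one possible standardization.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def binary_share(rec_a, rec_b):
    """Fraction of squared Euclidean distance due to the second variable."""
    d_wealth = (rec_a[0] - rec_b[0]) ** 2
    d_sex = (rec_a[1] - rec_b[1]) ** 2
    return d_sex / (d_wealth + d_sex)


wealth = [20_000, 250_000, 1_000_000]  # made-up values in currency units
sex = [0, 1, 0]                        # male/female coded 0/1

raw = list(zip(wealth, sex))
std = list(zip(z_scores(wealth), z_scores(sex)))

raw_share = binary_share(raw[0], raw[1])  # first two records, raw units
std_share = binary_share(std[0], std[1])  # same records after z-scoring
print(raw_share, std_share)
```

In raw units the binary variable contributes essentially nothing to the distance. After z-scoring it contributes most of it on this particular toy data, which suggests that standardizing is itself a weighting decision rather than a neutral step.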
K