cluster analysis with nominal variables

Lolay

Oct 23, 2001, 5:11:41 AM

I have to do a cluster analysis with only nominal variables. If I
transform them all into binary dummy variables and use these to do the
clustering I actually give the nominal variables with a lot of
categories a higher weight in the cluster analysis. How do I get
around this problem? If you have any thoughts/techniques/tricks for
this, please let me know.

Joseph Saint Pierre

Oct 23, 2001, 5:30:18 AM

lola...@yahoo.com.tw (Lolay) writes:

When I want to perform a cluster analysis with nominal variables
in SPSS, I start with optimal scaling using the HOMALS command
(available in the Categories module), save the object scores as new
variables, and perform the cluster analysis on these new variables.
Optimal scaling is similar to correspondence analysis, which can
be seen as a method for transforming categorical variables into
quantitative variables.
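
A minimal sketch of the same idea outside SPSS, assuming multiple correspondence analysis on the dummy (indicator) matrix as a stand-in for HOMALS; the scaling of the scores may differ from what SPSS saves. The toy `survey` data and the function names are invented for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

def indicator_matrix(rows):
    """Build one 0/1 column per (variable, category) pair."""
    n_vars = len(rows[0])
    columns = [(j, level)
               for j in range(n_vars)
               for level in sorted({row[j] for row in rows})]
    return np.array([[1.0 if row[j] == level else 0.0 for j, level in columns]
                     for row in rows])

def object_scores(rows, n_dims=2):
    """Row (object) coordinates from a correspondence analysis of the indicator matrix."""
    Z = indicator_matrix(rows)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    scores = (U * sing) / np.sqrt(r)[:, None]  # principal row coordinates
    return scores[:, :n_dims]

# Hypothetical nominal data: three variables with 3, 2 and 2 categories.
survey = [
    ("red", "small", "yes"),
    ("red", "small", "yes"),
    ("blue", "large", "no"),
    ("green", "large", "no"),
    ("blue", "small", "yes"),
]
scores = object_scores(survey)
print(scores)
```

The quantitative `scores` can then be passed to any clustering routine (k-means, hierarchical) in place of the raw dummy variables.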

--
Joseph Saint Pierre
http://www.cict.fr/cict/personnel/stpierre

christian

Oct 23, 2001, 6:28:08 AM

to Joseph Saint Pierre

And what is the best way if I have a mixed data set
with some ordinal variables (ratings) and some ratio variables like age?

My current strategy is to transform all ordinal variables
into dummy variables, because one has a 3-point scale and another
a 10-point scale. The ratio variables I transform to z-standardized
values with mean 0 and standard deviation 1.

Is this an acceptable way?
(I use k-means and want to avoid variables having different impacts
merely because of their different codings {Euclidean distance}.)

Thanks in advance & regards, christian

P.S. The trick with HOMALS is new to me, thanks!
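
The preprocessing described in this post can be sketched as follows. This is only an illustration with made-up values: `ages` and `rating3` are hypothetical, and the sample standard deviation (n - 1) is an assumption about how the z-scores are computed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def dummy_code(values, levels):
    """One 0/1 indicator per level; note that this discards the ordering of the levels."""
    return [[1 if v == level else 0 for level in levels] for v in values]

ages = [23, 35, 47, 59]        # ratio variable
rating3 = [1, 3, 2, 3]         # ordinal variable on a 3-point scale

z_age = z_scores(ages)
rating_dummies = dummy_code(rating3, levels=[1, 2, 3])

# One row per case, ready for k-means with Euclidean distance.
rows = [[z] + d for z, d in zip(z_age, rating_dummies)]
for row in rows:
    print(row)
```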

Joseph Saint Pierre

Oct 23, 2001, 10:49:31 AM

to christian

christian <c.sc...@metafacts.de> writes:

> And what is the best way if I have a mixed data set
> with some ordinal variables (ratings) and some ratio variables like age?

I never put ratings (Likert scales) together with biographical
variables such as sex, age, marital status, etc. in a cluster analysis,
a factor analysis, or any other multivariate analysis. The reason is
that clusters (or factors) could then be defined by associations or
correlations between such variables, which are strong and usually
already known. For example, in a developed country women live longer
than men and are more often widowed. If you include these three
variables (age, sex, marital status), you get associations among being
a woman, old age, and widowhood which, IMHO, should not be mixed with
rating variables.

Usually I put only rating variables in the multivariate analysis and
then study the links between the factors or clusters and variables like
age using simple statistical techniques.

> My current strategy is to transform all ordinal variables
> into dummy variables, because one has a 3-point scale and another
> a 10-point scale. The ratio variables I transform to z-standardized
> values with mean 0 and standard deviation 1.

When there are two or more different types of ordinal scales, I run
separate analyses for the different sets of variables. In social
science data sets I have often used separate analyses for different
sets of variables even when they were all 4-point scales. I prefer
not to mix different sets in one multivariate analysis: very commonly
one set of variables has much higher internal associations
than another, and in that situation the clusters or factors are often
defined only by the variables of the set with the highest associations...

> Is this an acceptable way?

I do not know exactly what an acceptable way is. I consider that
multivariate techniques should not be used systematically.
I very often recommend "The Mismeasure of Man" by Stephen Jay Gould;
this book contains an excellent critique of how factor analysis is used.
I have written (in French) many comments on my web pages about the
use of multivariate statistical analysis in the social sciences.
I suggest basic, simple analyses, and I consider SPSS a great package
for its simplicity in recoding, calculating, aggregating, etc.

> (I use k-means and want to avoid variables having different impacts
> merely because of their different codings {Euclidean distance}.)

I am not a specialist in cluster analysis and I still do not understand
how it really works as a mathematical model; the choice of method and
distance is still a kind of magic :-))

Rich Ulrich

Oct 23, 2001, 3:24:08 PM

First, I do like the notion of not mixing variable types.

Second, I like the notion of doing correspondence analysis
and forgetting all about cluster analysis.

Third, there are special problems and limits for computing
distances with binary variables; read up on those.
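
One commonly cited example of such a problem, given here as an illustration and not necessarily the one meant above: measures based on simple matching (or Euclidean distance) count joint absences (0/0) as agreement, while the Jaccard measure ignores them. The two vectors are made up.

```python
def simple_matching_dissimilarity(a, b):
    """Share of positions where the two 0/1 vectors disagree; 0/0 counts as agreement."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

def jaccard_dissimilarity(a, b):
    """1 - (joint presences / positions with at least one presence); 0/0 pairs are ignored."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    if either == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return 1 - both / either

# Two cases that share one rare attribute and are both 0 on seven others.
case_a = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
case_b = (1, 0, 1, 0, 0, 0, 0, 0, 0, 0)

sm = simple_matching_dissimilarity(case_a, case_b)
jac = jaccard_dissimilarity(case_a, case_b)
print(sm, jac)
```

With many rarely endorsed items, the two measures can rank the same pairs of cases quite differently, so the choice matters before any clustering starts.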

On the other hand, here is an answer about *weights*
[this is an answer, not a recommendation] --

You can use and manipulate the old default:
let the scores determine the weight. For example, use dummy
scores of 0-10 for X2 if you want X2 to matter 100 times
as much (in squared distance) as X1, with X1 scored 0-1.
Et cetera, et cetera.
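
The arithmetic behind this can be checked with a small sketch. The helper `rescale` is a hypothetical name introduced here: multiplying a 0/1 dummy by the square root of the desired weight makes a mismatch on that variable add exactly that weight to the squared Euclidean distance.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rescale(scores, weight):
    """Multiply 0/1 scores by sqrt(weight) so a mismatch adds `weight` to the squared distance."""
    factor = math.sqrt(weight)
    return [s * factor for s in scores]

# X1 is a dummy scored 0/1, X2 a dummy scored 0/10.
only_x1_differs = squared_distance((0, 0), (1, 0))
only_x2_differs = squared_distance((0, 0), (0, 10))
print(only_x2_differs / only_x1_differs)

# The same scoring obtained from an ordinary 0/1 dummy.
x2 = rescale([0, 1, 1, 0], weight=100)
print(x2)
```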

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

ckdon...@sheffield.ac.uk

Oct 22, 2018, 1:02:09 PM

I was searching for a solution to my worries about cluster analysis, and the search led me here. Looking at the similar issues raised, I believe this thread will help.

I am currently modelling 20 variables to find out how they naturally group into clusters, and I later add one more (the 21st) variable for region. This is to understand whether region has an effect on how the variables are distributed into clusters, compared with the clusters obtained without the region variable. Now:

1. I expected personal, behavioural and community variables to be grouped separately, but there is a mix. Can I use the result the way it is produced? When I added region as a variable, a new cluster emerged with region in it (is that its influence?).

2. Why do the dendrograms produced with Ward linkage, between-groups linkage and complete linkage look different? Which is best for use on binary data?

3. Can TwoStep cluster analysis be used for binary data? I found that the categorical-variable option provides the results I expect. For instance, it reports the degree of influence specific variables have in the model and the percentage distribution of the impact each has in the cluster it belongs to. If not:

4. How can one approach hierarchical cluster analysis with binary variables to determine how a particular variable affects the distribution of the groups?

I may have asked many questions. This is because I learnt the statistical tool only recently. I have not found anyone who has used this approach, nor an article that discusses my specific concern.
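
A minimal sketch bearing on question 2, assuming squared Euclidean distance on 0/1 data: complete linkage uses the largest pairwise distance between two clusters, between-groups (average) linkage uses the mean pairwise distance, and Ward's method uses the increase in the within-cluster sum of squares. Because the merge criterion differs, the merge order and the heights in the dendrogram can differ too. The two toy clusters are invented for illustration.

```python
from itertools import product

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def complete_linkage(a, b):
    """Largest distance between any point of a and any point of b."""
    return max(sq_dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Mean distance over all pairs with one point in a and one in b."""
    return sum(sq_dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))

cluster_a = [(0, 0, 0, 0), (1, 0, 0, 0)]
cluster_b = [(1, 1, 0, 0), (1, 1, 1, 0)]

complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
ward = ward_increase(cluster_a, cluster_b)
print(complete, average, ward)
```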

Rich Ulrich

Oct 22, 2018, 6:14:04 PM

Okay, you are new to this. How sure are you that you want
"cluster" analysis and not "factor" analysis?

In my thinking, clusters get constructed out of subjects
who get lumped together - afterwards, you may try to
characterize the groups on the variables. (I find this awkward
and have always tried to avoid it.)

By contrast, factors show groupings of variables that work
together. Once I have a few composite scores (instead of
dozens of variables), I can look for combinations of extremes, etc.
I can see if Regions differ on the scores.
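
The distinction can be sketched on a toy data matrix (rows are subjects, columns are variables; the values are made up). Clustering of subjects starts from distances between rows, while factor analysis starts from correlations between columns.

```python
from itertools import combinations
from math import sqrt

def sq_dist(a, b):
    """Squared Euclidean distance between two rows (subjects)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pearson(x, y):
    """Pearson correlation between two columns (variables)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

data = [
    (1, 1, 0, 0),
    (1, 1, 0, 1),
    (0, 0, 1, 1),
    (0, 0, 1, 0),
]

# Input to a cluster analysis of subjects: row-by-row distances.
subject_distances = {(i, j): sq_dist(data[i], data[j])
                     for i, j in combinations(range(len(data)), 2)}

# Input to a factor analysis: column-by-column correlations.
columns = list(zip(*data))
variable_correlations = {(i, j): pearson(columns[i], columns[j])
                         for i, j in combinations(range(len(columns)), 2)}

print(subject_distances)
print(variable_correlations)
```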

Since you say you want to put together variables, "factors"
seems to me to be the choice. On the other hand, I don't
have any idea how you are getting dendrograms, since I don't
have any idea of what analysis you are running if you don't
have "subjects" in the analysis.

It sounds like you might be asking if the same factor structure
exists in various samples (regions). And that question is
complicated by having dichotomous indicator variables
instead of continuous variables. And maybe the dichotomous
variables represent multiple categories of "nominal variables",
which complicates things more.

Please describe the variables in a line of data - and who or
what the line represents. And then, perhaps, provide the
lines of some syntax that you have run. (If you are using the
graphical interface, "paste" the generated syntax to a file and
save the file. Copy lines to here.)

--
Rich Ulrich