FAMD - Explained variance and Encoding Issues

Jul 16, 2021, 1:46:14 PM
to FactoMineR users

I have an issue with the current FAMD implementation. I used FAMD on a dataset in which one categorical feature has a very large number of unique values (~8000 zip codes).

The problem is that when I apply FAMD to the dataset, this single feature dominates all the others: with around 20 components (ncp = 20), the cumulative explained variance is only about 0.01%.

I think this is because, internally, every unique zip code is one-hot encoded, and the scaling by the square root of the category probability somehow no longer works.
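
To make the scale of the effect concrete, here is a minimal Python sketch of the inertia arithmetic. It assumes the usual FAMD scaling, in which each indicator column is divided by the square root of its category proportion and each quantitative variable is standardized, so that a quantitative variable contributes 1 to the total inertia and a categorical variable with K categories contributes K - 1. The function names and the figure of 10 quantitative variables are illustrative assumptions; they come neither from FactoMineR nor from the dataset described above.

```python
from collections import Counter


def categorical_inertia(values):
    """Inertia of one categorical variable under the assumed FAMD scaling.

    An indicator column with proportion p has variance p * (1 - p);
    dividing the column by sqrt(p) rescales that variance to (1 - p).
    Summing over K categories gives K - 1.
    """
    n = len(values)
    return sum(1.0 - count / n for count in Counter(values).values())


def inertia_shares(n_quantitative, categorical_columns):
    """Share of total inertia per group of variables.

    Each standardized quantitative variable is assumed to contribute 1.
    """
    contributions = {"quantitative (all)": float(n_quantitative)}
    for name, values in categorical_columns.items():
        contributions[name] = categorical_inertia(values)
    total = sum(contributions.values())
    return {name: value / total for name, value in contributions.items()}


# Hypothetical data: 8000 distinct zip codes, each observed twice,
# alongside 10 quantitative variables (an assumed number).
zip_codes = [f"zip{i:04d}" for i in range(8000)] * 2
shares = inertia_shares(n_quantitative=10, categorical_columns={"zip": zip_codes})
print(shares)
```

Under these assumptions the zip code column carries 7999 of the 8009 units of total inertia, which is consistent with very low cumulative percentages for the first components.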

A solution would be to add an option to use binary encoding instead of one-hot encoding. However, I lack the skill to implement this myself. What do you think of this "solution", or am I completely wrong about this?
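
For readers unfamiliar with the term, the following Python sketch shows what binary encoding means here: each category is mapped to an integer index, and that index is written out as bits, so K categories need about log2(K) columns instead of K. The helper name `binary_encode` is hypothetical, and this is only an illustration of the encoding itself, not an option that FAMD is known to offer.

```python
def binary_encode(values):
    """Encode categories as fixed-width bit vectors.

    Returns (rows, width): one list of 0/1 bits per input value, and the
    number of bit columns used. The category order (sorted here) is
    arbitrary, so the bit patterns carry no intrinsic meaning.
    """
    categories = sorted(set(values))
    # Number of bits needed to represent the largest index (at least 1).
    width = max(1, (len(categories) - 1).bit_length())
    index = {category: i for i, category in enumerate(categories)}
    rows = [
        [(index[value] >> bit) & 1 for bit in reversed(range(width))]
        for value in values
    ]
    return rows, width


# Hypothetical data: 8000 distinct zip codes.
zip_codes = [f"zip{i:04d}" for i in range(8000)]
rows, width = binary_encode(zip_codes)
print(width)  # number of bit columns replacing 8000 indicator columns
```

Note that two categories sharing a bit are treated as similar on that column even though the assignment of indices is arbitrary, which is a trade-off of this encoding.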

Best regards

Francois Husson

Oct 5, 2021, 3:18:17 AM
to factomin...@googlegroups.com
If a column has unique values, then it is not a variable but the name of the individuals. So you have to either treat the column as the names of the individuals or remove it; you cannot use such a column as a variable, whatever the method (FAMD, MCA, but also analysis of variance, logistic model, etc.).

Francois Husson
Department Statistics & Computer science
65 rue de St-Brieuc - 35042 RENNES
Tel: +33 2 23 48 58 86