PCA vs NMDS and categorical variables

kenste...@gmail.com

May 16, 2014, 3:32:04 PM

to pc-...@googlegroups.com

I am modeling stream fish and microhabitat relationships using a focal species. I have 24 variables: 3 are continuous and the rest are categorical. I ran PCA and NMDS ordinations on my entire data set (~1900 random points and 800 occurrence points). Stress was high for the NMDS ordination, ~35. The graph shows all the variables clustered together at the center of the plot and the data points in a random cloud. The PCA ordination indicated two valid axes (broken stick method) that explained about 40% of the variance. Environmental gradients were about what was expected and the graphic was informative.

I reduced the number of variables and ran PCA and NMDS again. PCA returned very similar results. NMDS found a solution with a stress of about 15. The graphic showed weak gradients that were roughly consistent with PCA but much less informative (some variables still clustered in the middle). The NMDS graph also shows clusters of data points with shared values of categorical variables.

My question is: does the high proportion of categorical variables have much to do with the performance of NMDS? Or are my data truly uninformative, with weak gradients (which doesn't seem to be the case according to other analyses)?

br...@salal.us

May 16, 2014, 9:13:58 PM

to pc-...@googlegroups.com

There might be an issue with the categorical variables, but most likely the difference between PCA and NMS is happening because PCA has built-in standardizations while NMS does not. Assuming that the variables are on different scales, you might try first relativizing by standard deviates of the variables (i.e. re-expressing each variable as number of standard deviations from the mean). This is what PCA does. Then, in NMS select Euclidean distance.
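As a rough illustration of the relativization described above, here is a minimal Python sketch (not PC-ORD itself) that re-expresses each variable as standard deviations from its mean and then builds a Euclidean distance matrix that an NMS routine could take as input. The function names and the small depth/velocity table are hypothetical, and the sketch assumes no variable is constant (a constant column would have a standard deviation of zero).

```python
import math
import statistics

def relativize_by_standard_deviates(rows):
    """Re-express each column as (value - column mean) / column SD.

    Assumes every column varies; a constant column has SD 0 and would
    raise ZeroDivisionError here.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # sample SD (n - 1)
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean_distance_matrix(rows):
    """Pairwise Euclidean distances between sample units (rows)."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical sample units: depth (m) and velocity (cm/s) on very different scales.
raw = [[0.2, 10.0], [0.4, 50.0], [0.6, 30.0]]
z = relativize_by_standard_deviates(raw)
distances = euclidean_distance_matrix(z)
```

Without the first step, velocity would dominate the distances simply because its numbers are larger; after it, both variables contribute on the same scale.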

Btw, either of these makes sense only if the categories are binary or ordered categories. If they are not ordered categorical variables, you should recode them to a series of binary variables.
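A minimal sketch of that recoding in Python, assuming a single unordered categorical variable stored as a list of labels. The function name, the `is_` column prefix, and the substrate values are hypothetical and only for illustration.

```python
def recode_to_binary(values):
    """Turn one unordered categorical variable into one 0/1 column per category."""
    levels = sorted(set(values))
    names = [f"is_{level}" for level in levels]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return names, rows

# Hypothetical substrate classes for four sample points.
substrate = ["sand", "gravel", "cobble", "sand"]
names, binary = recode_to_binary(substrate)
```

Each sample unit ends up with a 1 in exactly one of the new columns, so no artificial ordering among the categories is implied.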

Also, on the PCA, I recommend not using the broken stick criterion -- it's not nearly as reliable as the randomization test using the eigenvalues.
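For readers unfamiliar with the idea, the following is a generic sketch of an eigenvalue randomization test in Python with NumPy. It is an assumption about the general approach, not PC-ORD's exact procedure: each variable is shuffled independently to break the correlations among variables, the eigenvalues of the correlation matrix are recomputed, and each observed eigenvalue is compared with the randomized eigenvalues of the same rank. The function name, the number of randomizations, and the simulated data are all hypothetical.

```python
import numpy as np

def eigenvalue_randomization_test(data, n_randomizations=199, seed=0):
    """Return observed eigenvalues (descending) and a p-value for each axis."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    def sorted_eigenvalues(matrix):
        corr = np.corrcoef(matrix, rowvar=False)
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    observed = sorted_eigenvalues(data)
    count = np.zeros_like(observed)
    for _ in range(n_randomizations):
        # Shuffle each column on its own so among-variable structure is destroyed.
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        count += sorted_eigenvalues(shuffled) >= observed
    # Proportion of runs (counting the observed one) with an eigenvalue at least as large.
    p_values = (count + 1) / (n_randomizations + 1)
    return observed, p_values

# Simulated data: three variables share one gradient, the fourth is pure noise.
sim = np.random.default_rng(1)
gradient = sim.normal(size=100)
columns = [gradient + 0.3 * sim.normal(size=100) for _ in range(3)]
columns.append(sim.normal(size=100))
observed, p_values = eigenvalue_randomization_test(np.column_stack(columns))
```

An axis would be treated as interpretable when its p-value is small, meaning the observed eigenvalue is rarely matched by shuffled data.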

One other thing... You didn't mention if the NMS was beating the randomization test. A high stress solution would be more acceptable if it beat the randomization test, but at stress=35 I'm doubtful that it would. The randomization test will be slow with that many data points, but if you can run it over a weekend, it would be nice to have that information.

Good luck,

Bruce McCune


Ken Sterling

May 20, 2014, 1:20:34 PM

to pc-...@googlegroups.com

Thank you for the help. It never did occur to me to relativize the variables, though I really should have thought of that. Thanks for the additional information, too. I will follow your recommendations and see what happens.