Hi Zach,
It would be good to know a little more about what you are trying to accomplish. What type of regression do you plan to use? What is your response? And how large is "very large"?
Without knowing more, all I can say is that numbers constrained to lie between 0 and 1 can cause problems at the boundaries, because the model will assume that values less than 0 or greater than 1 are possible. You can transform the data to an unconstrained space (for example, with a logit) to avoid some of these concerns. That is not a Box-Cox transform: power transformations are not what you want here.
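As a rough illustration of moving proportions to an unconstrained space, here is a minimal Python sketch of a logit transform. The function name `logit_transform` and the clipping value `eps` are my own assumptions, not anything from your data; the clipping is only there because exact 0s and 1s would otherwise map to infinity, and the right way to handle zeros depends on your analysis.

```python
import numpy as np

def logit_transform(p, eps=1e-6):
    """Map proportions in [0, 1] to the real line via log(p / (1 - p)).

    eps is an assumed clipping value (hypothetical choice) that keeps
    exact 0s and 1s finite; it is not a recommended default.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Made-up relative abundances, including the boundary values 0 and 1.
rel_abund = np.array([0.0, 0.02, 0.5, 0.98, 1.0])
z = logit_transform(rel_abund)
print(z)
```

The transformed values are symmetric around 0 (a proportion of 0.5 maps to 0) and are no longer bounded, which is what a regression model implicitly assumes.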
There are also more general issues with using relative frequencies: they depend heavily on the total count per sample, which is itself subject to sampling variability. If possible, using the "raw" counts can be more robust, depending on what you are doing.
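To make the dependence on count number concrete, here is a small Python sketch under a simple binomial sampling assumption (my assumption for illustration, not a claim about how your data were generated). The helper `proportion_std_error` and the depths used are hypothetical.

```python
import numpy as np

def proportion_std_error(p, depth):
    """Standard error of an observed proportion under binomial sampling.

    Assumes `depth` independent draws, each hitting the taxon with
    probability p; real sequencing data may be noisier than this.
    """
    return np.sqrt(p * (1.0 - p) / depth)

p_true = 0.01  # made-up true relative abundance
shallow = proportion_std_error(p_true, depth=1_000)
deep = proportion_std_error(p_true, depth=100_000)
print(shallow, deep)
```

Under this assumption the same underlying abundance is estimated about ten times less precisely at 1,000 counts than at 100,000, and that information is lost once only the relative frequency is kept.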
Matt
On Wednesday, May 18, 2016 at 12:01:16 PM UTC-7, Zachary Pierce wrote:
Hi all,
I am working with a very large dataset that involves microbial relative abundance data (values between 0 and 1). I am TOLD that it would be wise to transform these values, particularly for regression, but there seems to be some contention here. Perhaps someone has some insight into that matter (using relative abundance data as predictors).
Anyway, I want to transform all columns of relative abundance data, using a Box-Cox procedure to determine the best transform and then applying that function to all specified columns that contain relative abundance data.
Does anyone have any familiarity with this process? I'm still learning here, so go easy on me.
Thanks very much!!
Zach