Dear Miriam,
Moccasin will build a linear model of the factors provided in the model matrix. Thus, it usually makes sense to binarize categorical variables, since each factor will then get its own coefficient in the linear model, for example: is_female, is_in_family_22, etc. I would not recommend inputting the numerical family ID as a factor value (e.g. family_ID = 22) since Moccasin will then treat it linearly, e.g. 22 has twice the effect of 11. This brings us to your age factor question. It’s up to you, but you should input the numerical age as a factor value only if you believe the splicing effect has a contribution which is linear in the age of each individual (e.g. age 10 has 2x the effect of age 5). It probably would make more sense either to use binarized age categories (e.g. is_between_age_20_and_40, is_between_age_40_and_60 and so forth; or a single age threshold, as you suggested) or otherwise to transform age onto a scale for which you believe a linear effect is appropriate.
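As a rough illustration of the binarization described above, here is a minimal Python sketch. The sample records, the bin edges, and the `binarize` helper are all made up for illustration; it only shows how the indicator values could be derived and says nothing about the file format Moccasin expects.

```python
# Hypothetical sample annotations (made-up values).
samples = [
    {"sample": "s1", "sex": "F", "age": 25},
    {"sample": "s2", "sex": "M", "age": 45},
    {"sample": "s3", "sex": "F", "age": 65},
]

# Half-open age bins [low, high); ages outside every bin get 0 in all
# age columns and so act as the left-out baseline category.
AGE_BINS = [(20, 40), (40, 60)]

def binarize(sample):
    """Turn one sample's categorical/numeric annotations into 0/1 indicators."""
    row = {"is_female": int(sample["sex"] == "F")}
    for low, high in AGE_BINS:
        row[f"is_between_age_{low}_and_{high}"] = int(low <= sample["age"] < high)
    return row

rows = [binarize(s) for s in samples]
for s, row in zip(samples, rows):
    print(s["sample"], row)
```

Each indicator column then gets its own coefficient, instead of forcing one coefficient to scale with the raw numeric value.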
An intercept allows the model to learn an overall bias (average), so then each confounding factor coefficient is essentially the effect magnitude relative to this overall bias (i.e., each factor increases or decreases PSI from the mean by a certain amount). You usually want to have an intercept. Without an intercept, the linear factors must explain the splicing at each junction up from zero.
Regarding full-rank: the full-rank criterion says essentially that the factor columns of the model matrix must be linearly independent, i.e. no column can be written as a linear combination of the other columns. A common pitfall is to set up the confounding factors in a way that violates this criterion, for example: suppose your model matrix is the three columns “intercept”, “is_male”, and “is_female”. Intercept will be all 1s, and is_male + is_female (added together element-wise) will also be all 1s, which violates linear independence. So you can include either the intercept with only one of the gender columns, or both gender columns with no intercept. In general this will mean “leaving one out” of each category set of binarized factors.
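To check a model matrix before running anything, one option is to compare its rank with its number of columns. The sketch below does this in plain Python with exact fractions; `matrix_rank` is a helper written for this example (it is not part of Moccasin), and the four samples are made up.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix (list of rows) via exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

is_male = [1, 0, 1, 0]
is_female = [1 - v for v in is_male]

# One row per sample: intercept, is_male, is_female.
with_both = [[1, m, f] for m, f in zip(is_male, is_female)]
# One row per sample: intercept, is_male (is_female left out).
leave_one_out = [[1, m] for m in is_male]

rank_with_both = matrix_rank(with_both)          # 2, but there are 3 columns
rank_leave_one_out = matrix_rank(leave_one_out)  # 2, matching its 2 columns
print(rank_with_both, rank_leave_one_out)
```

A rank smaller than the number of columns means the matrix is not full rank, which is what happens when the intercept and both gender indicators are included together.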
Dear Miriam,
Suppose gender is the only confounding factor you wish to model. With no intercept, you could have is_male and is_female indicators (as described before) without violating the full-rank criterion. But if you include the intercept, you can have only one of those.
Suppose you use the two columns is_male and is_female as confounders (with no intercept). Moccasin will learn a linear model with male and female effects. To calculate the adjusted values, Moccasin’s default behavior is to remove both effects, i.e. Moccasin will set is_male = 0 and is_female = 0 for all samples. This usually is not a desired result, as it simply removes all the modeled variation. Instead, if an intercept were used together with one indicator e.g. is_male, the intercept would implicitly learn the female effect and is_male would learn the difference from it; and then Moccasin’s default behavior would set is_male = 0 for all samples, so the adjusted values will include the effect of the intercept (female). On the other hand, you could include both columns is_male and is_female (with no intercept) and also set the adjusted value for is_male to 1 (vs the default 0), which tells Moccasin to actually include the male effect when calculating the adjusted values. Similarly, with continuously-valued variables like RIN, you could include a RIN column and then set the adjusted value for RIN to 10 in the adjustment, which tells Moccasin to output values for (modeled) high-integrity samples.
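The three gender-only setups above can be compared on toy numbers. This is only a sketch of the arithmetic under a simplifying assumption (adjusted value = observed value minus each coefficient times the difference between the sample's factor value and the chosen adjusted value); it is not Moccasin's code, and the PSI values and the `adjust` helper are made up. For indicator-only designs like these, least squares recovers the group means, so the coefficients are computed directly from them.

```python
from statistics import mean

# Made-up PSI values for one junction: two male samples, two female samples.
psi = [0.30, 0.34, 0.50, 0.54]
is_male = [1, 1, 0, 0]
is_female = [1 - m for m in is_male]
intercept = [1] * len(psi)

male_mean = mean(p for p, m in zip(psi, is_male) if m)
female_mean = mean(p for p, m in zip(psi, is_male) if not m)

def adjust(values, columns, coefs, adjusted_values):
    """Shift each sample from its own factor values to the adjusted values."""
    out = []
    for i, v in enumerate(values):
        shift = sum(c * (col[i] - a)
                    for col, c, a in zip(columns, coefs, adjusted_values))
        out.append(round(v - shift, 6))
    return out

# (a) No intercept, both indicators, default adjusted values of 0:
#     only the residuals remain.
no_intercept_default = adjust(psi, [is_male, is_female],
                              [male_mean, female_mean], [0, 0])

# (b) Intercept plus is_male: the intercept carries the female level and
#     is_male carries the male-minus-female difference. The intercept stays at 1.
with_intercept = adjust(psi, [intercept, is_male],
                        [female_mean, male_mean - female_mean], [1, 0])

# (c) No intercept, both indicators, adjusted value of is_male set to 1:
#     every sample is placed at the male level.
no_intercept_male = adjust(psi, [is_male, is_female],
                           [male_mean, female_mean], [1, 0])

print(no_intercept_default)  # residuals around zero
print(with_intercept)        # all samples at the female level
print(no_intercept_male)     # all samples at the male level
```

In (a) the adjusted values are centered on zero, which is rarely what you want; (b) and (c) both keep a meaningful baseline level and differ only in which group defines it.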
Now suppose you want to model both age and gender: male/female, and over 50 / not over 50. If you include an intercept, you must leave one factor label out of each pair or the matrix won’t be full rank. Without an intercept, you could include both factor labels for *one* of the two sets, e.g. is_male, is_female, and is_over_50. With the three included columns, you would want to set the adjusted value of some binary indicators to be 1 (similar to the gender-only case above) in order to avoid simply removing all the modeled variation in Moccasin’s adjustment. But if you tried to include a fourth column is_not_over_50, then the matrix would violate the full rank criterion, because the element-wise vector sum (is_male + is_female) will be all 1s, and (is_over_50 + is_not_over_50) also will be all 1s, hence a nonzero linear combination of these is the 0 vector (is_male + is_female - is_over_50 - is_not_over_50).
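The dependency among the four columns can be verified element-wise. The sample values below are made up; any assignment of samples to the two pairs of categories gives the same result.

```python
# Made-up indicators for four samples.
is_male = [1, 1, 0, 0]
is_female = [1 - v for v in is_male]
is_over_50 = [1, 0, 1, 0]
is_not_over_50 = [1 - v for v in is_over_50]

# is_male + is_female - is_over_50 - is_not_over_50, sample by sample.
combination = [
    m + f - o - n
    for m, f, o, n in zip(is_male, is_female, is_over_50, is_not_over_50)
]
print(combination)  # all zeros: the four columns are linearly dependent
```

Because a combination with nonzero weights gives the zero vector, at least one of the four columns has to be dropped when there is no intercept.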
Thus, Moccasin doesn’t strictly need to include an intercept in order to execute. But, if you don’t use an intercept, then you probably will want to carefully set non-default adjusted values for some of your confounders in order to avoid simply removing all modeled variation in the adjustment. We find it’s simpler for users to include the intercept, which implicitly solves the problem of having to add more command line arguments in order to define a meaningful “non-confounded” case. In fact, as of now, the coming major update to MAJIQ (MAJIQ V3) does not include the option to set adjusted values for confounders, which makes including the intercept effectively required to get meaningful results for most use-cases. It also adds to the need for users to consider transforming confounding factors before executing Moccasin so that the adjusted value of 0 is desirable (e.g. 10 - RIN rather than RIN). We would consider re-adding this option if there is strong interest.
Best Regards,
Barry