MOCCASIN model matrix doubts

Miriam Martínez

May 8, 2025, 10:08:45 AM
to Biociphers
Hi biociphers team,

I have some questions about the best way to include some variables in the model matrix, and I was wondering if you could help me clarify them.

I would like to correct my data for three variables: gender, age and family. I'm treating gender as a binary factor, using 1 and 0 for male and female. For age and family, I was wondering whether I should treat those variables as binary too.

Regarding age: should I treat it as a continuous variable and then try to remove its effect (taking the RIN example in the documentation as a reference), or should I pick an age threshold and treat it as a binary factor?

Regarding family: I tried to treat this variable as binary by setting, for each sample, whether it was part of a given family or not (0 or 1), but I ended up with a model matrix that was not full rank. Should I instead treat it as a single variable, writing the family ID for each sample, and then try to correct for it? (Example: samples 1, 2 and 3 are from family 10; samples 4 and 5 are from family 22; and samples 6, 7 and 8 are from family 25.)

Finally, I don't fully understand when it is advisable to use an intercept with MOCCASIN; could you shed some light on this too?

I hope the questions are clear enough, and sorry if they are too specific. If you need me to elaborate on anything, please let me know.

Thank you in advance!

Best,

Miriam Martínez

Barry Slaff

May 8, 2025, 5:50:29 PM
to Miriam Martínez, Biociphers

Dear Miriam,


Moccasin will build a linear model of the factors provided in the model matrix. Thus, it usually makes sense to binarize categorical variables, since each factor will then get its own coefficient in the linear model, for example: is_female, is_in_family_22, etc. I would not recommend inputting the numerical family ID as a factor value (e.g. family_ID = 22), since Moccasin will then treat it linearly, e.g. family 22 would have twice the effect of family 11. This brings us to your age factor question. It’s up to you, but if you input the numerical age as a factor value, it would need to be because you believe the splicing effect has a contribution which is linear in the age of each individual (e.g. age 10 has 2x the effect of age 5). It probably would make more sense either to use binarized age categories (e.g. is_between_age_20_and_40, is_between_age_40_and_60 and so forth; or a single age threshold as you suggested) or otherwise to transform age onto a scale for which you believe a linear effect is appropriate.
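As a rough illustration of the binarization described above (plain Python, not Moccasin code), the sketch below turns the family IDs from the original question into 0/1 indicator columns. The ages, the threshold of 50, the choice of family 10 as the left-out reference level, the presence of an intercept column, and the helper name `indicator_columns` are all assumptions made for the example.

```python
# Family IDs follow the example in the question: samples 1-3 are family 10,
# samples 4-5 are family 22, samples 6-8 are family 25.
family = [10, 10, 10, 22, 22, 25, 25, 25]
age = [34, 61, 58, 29, 45, 52, 38, 70]  # hypothetical ages, for illustration only


def indicator_columns(values, prefix, reference):
    """Return one 0/1 column per distinct level, skipping the `reference` level.

    `indicator_columns` is a hypothetical helper, not part of Moccasin.
    """
    levels = sorted(set(values))
    return {
        f"{prefix}{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Family 10 is left out as the reference level (assuming an intercept is used).
family_cols = indicator_columns(family, "is_in_family_", reference=10)

# A single age threshold; 50 is an arbitrary choice for this sketch.
age_cols = {"is_over_50": [1 if a > 50 else 0 for a in age]}

model_matrix = {"intercept": [1] * len(family), **family_cols, **age_cols}
for name, column in model_matrix.items():
    print(f"{name:18s}", column)
```

Each indicator column then gets its own coefficient, so no ordering or spacing between family IDs is implied.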


An intercept allows the model to learn an overall bias (average), so then each confounding factor coefficient is essentially the effect magnitude relative to this overall bias (i.e., each factor increases or decreases PSI from the mean by a certain amount). You usually want to have an intercept. Without an intercept, the linear factors must explain the splicing at each junction up from zero.


Regarding full-rank: the full-rank criterion of the model matrix says essentially that the factor columns must be linearly independent, i.e. no column is allowed to equal a linear combination of the other columns. A common pitfall is to set up the confounding factors in a way that violates this criterion, for example: suppose your model matrix is the three columns “intercept”, “is_male”, and “is_female”. Intercept will be all 1s, and is_male + is_female (added together element-wise) will be all 1s, which violates linear independence. So you can only have one of the gender columns, or no intercept. In general, when an intercept is included, this will mean “leaving one out” of each category set of binarized factors.
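The pitfall can be checked before running anything. The following is a minimal NumPy sketch (it does not use Moccasin itself); the six samples and the helper name `is_full_column_rank` are made up for the example.

```python
import numpy as np

is_male = np.array([1, 0, 1, 1, 0, 0])  # hypothetical samples
is_female = 1 - is_male
intercept = np.ones_like(is_male)

with_both = np.column_stack([intercept, is_male, is_female])
leave_one_out = np.column_stack([intercept, is_male])
no_intercept = np.column_stack([is_male, is_female])


def is_full_column_rank(X):
    """True when the columns of X are linearly independent."""
    return np.linalg.matrix_rank(X) == X.shape[1]


# intercept == is_male + is_female, so the three-column matrix is rank deficient.
print("intercept + both genders:", is_full_column_rank(with_both))
print("intercept + is_male:     ", is_full_column_rank(leave_one_out))
print("both genders, no intercept:", is_full_column_rank(no_intercept))
```

Running a check like this on your own model matrix before passing it to Moccasin should catch this kind of problem early.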


I hope this helps and please feel free to reach out with additional questions.
Barry

Miriam Martínez

May 12, 2025, 10:08:15 AM
to Biociphers
Hi Barry,

Thank you very much for your detailed explanation of how Moccasin works. It has been very enlightening, and with this information I have been able to define my variables more precisely and understand why binary categories are preferable. Nonetheless, I was wondering if you could elaborate a bit more on whether or not to add an intercept (I'm not really aware of how adding or omitting the intercept affects the correction).

1) The documentation says that "When an intercept is included, this means one factor label must be left out for each factor", but wouldn't that be the same even if an intercept is not added?

2) The documentation also says that "It is not necessary to use an intercept with MOCCASIN; the same effects can be accomplished by appropriately setting the adjustment values for the confounding factor", so when is it really useful to add an intercept?

I hope the questions make sense and sorry if this is explained already somewhere in the documentation. Thank you very much in advance!

Best,

Miriam

bsl...@seas.upenn.edu

May 12, 2025, 1:30:12 PM
to Biociphers

Dear Miriam,


Suppose gender is the only confounding factor you wish to model. With no intercept, you could have is_male and is_female indicators (as described before) without violating the full-rank criterion. But if you include the intercept, you can have only one of those.


Suppose you use the two columns is_male and is_female as confounders (with no intercept). Moccasin will learn a linear model with male and female effects. To calculate the adjusted values, Moccasin’s default behavior is to remove both effects, i.e. Moccasin will set is_male = 0 and is_female = 0 for all samples. This usually is not a desired result, as it simply removes all the modeled variation. Instead, if an intercept were used together with one indicator e.g. is_male, the intercept would implicitly learn the female effect and is_male would learn the difference from it; and then Moccasin’s default behavior would set is_male = 0 for all samples, so the adjusted values will include the effect of the intercept (female). On the other hand, you could include both columns is_male and is_female (with no intercept) and also set the adjusted value for is_male to 1 (vs the default 0), which tells Moccasin to actually include the male effect when calculating the adjusted values. Similarly, with continuously-valued variables like RIN, you could include a RIN column and then set the adjusted value for RIN to 10 in the adjustment, which tells Moccasin to output values for (modeled) high-integrity samples.
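To make the three gender-only scenarios above concrete, here is a toy ordinary-least-squares sketch in NumPy. It is not Moccasin's implementation: the PSI-like values are invented, the helper `adjust` is hypothetical, and the adjustment is assumed to be "observed value minus fitted value, plus the fitted value at the chosen adjusted confounder settings".

```python
import numpy as np

is_male = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
is_female = 1.0 - is_male
intercept = np.ones_like(is_male)
psi = np.array([0.62, 0.58, 0.60, 0.41, 0.39, 0.40])  # invented PSI-like values


def adjust(X, y, adjusted_values):
    """Fit y ~ X by least squares, then re-predict with some columns overridden.

    adjusted_values maps a column index to the value that column takes for
    every sample in the adjusted output (hypothetical helper, not Moccasin's API).
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    X_adj = X.copy()
    for col, value in adjusted_values.items():
        X_adj[:, col] = value
    # Keep each sample's residual; swap the modeled part for the adjusted one.
    return y - X @ beta + X_adj @ beta


# (a) No intercept, both indicators set to 0: only residuals remain.
a = adjust(np.column_stack([is_male, is_female]), psi, {0: 0.0, 1: 0.0})

# (b) Intercept + is_male, is_male set to 0: values sit at the female level.
b = adjust(np.column_stack([intercept, is_male]), psi, {1: 0.0})

# (c) No intercept, is_male set to 1 and is_female to 0: values sit at the male level.
c = adjust(np.column_stack([is_male, is_female]), psi, {0: 1.0, 1: 0.0})

for label, values in [("a", a), ("b", b), ("c", c)]:
    print(label, np.round(values, 3), "mean:", round(float(values.mean()), 3))
```

In this toy setting, case (a) collapses every sample toward zero, while (b) and (c) keep a meaningful baseline (female and male respectively).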


Now suppose you want to model both age and gender: male/female, and over 50 / not over 50. If you include an intercept, you must leave one factor label out of each pair or the matrix won’t be full rank. Without an intercept, you could include both factor labels for *one* of the two sets, e.g. is_male, is_female, and is_over_50. With the three included columns, you would want to set the adjusted value of some binary indicators to be 1 (similar to the gender-only case above) in order to avoid simply removing all the modeled variation in Moccasin’s adjustment. But if you tried to include a fourth column is_not_over_50, then the matrix would violate the full rank criterion, because the element-wise vector sum (is_male + is_female) will be all 1s, and (is_over_50 + is_not_over_50) also will be all 1s, hence a nonzero linear combination of these is the 0 vector (is_male + is_female - is_over_50 - is_not_over_50).
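The same argument can be verified numerically. This is a small NumPy sketch with six made-up samples; it only checks linear independence and does not call Moccasin.

```python
import numpy as np

# Hypothetical samples: gender and age group vary independently of each other.
is_male = np.array([1, 1, 0, 0, 1, 0])
is_over_50 = np.array([1, 0, 1, 0, 0, 1])
is_female = 1 - is_male
is_not_over_50 = 1 - is_over_50

three = np.column_stack([is_male, is_female, is_over_50])
four = np.column_stack([is_male, is_female, is_over_50, is_not_over_50])

rank_three = np.linalg.matrix_rank(three)
rank_four = np.linalg.matrix_rank(four)
print("three columns: rank", rank_three, "of", three.shape[1])
print("four columns:  rank", rank_four, "of", four.shape[1])

# is_male + is_female - is_over_50 - is_not_over_50, evaluated per sample.
combo = four @ np.array([1, 1, -1, -1])
print("linear combination:", combo)
```

The three-column matrix keeps full column rank, while adding is_not_over_50 does not raise the rank, which is exactly the dependency described above.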


Thus, Moccasin doesn’t strictly need to include an intercept in order to execute. But, if you don’t use an intercept, then you probably will want to carefully set non-default adjusted values for some of your confounders in order to avoid simply removing all modeled variation in the adjustment. We find it’s simpler for users to include the intercept, which implicitly solves the problem of having to add more command line arguments in order to define a meaningful “non-confounded” case. In fact, as of now, the coming major update to MAJIQ (MAJIQ V3) does not include the option to set adjusted values for confounders, which makes including the intercept effectively required to get meaningful results for most use-cases. It also adds to the need for users to consider transforming confounding factors before executing Moccasin so that the adjusted value of 0 is desirable (e.g. 10 - RIN rather than RIN). We would consider re-adding this option if there is strong interest.
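A minimal sketch of the transformation mentioned above, in plain Python. The sample names, the RIN values, and the column name `rin_deficit` are hypothetical; the point is only that a value of 0 in the transformed column corresponds to a fully intact sample (RIN = 10).

```python
# Hypothetical RIN values per sample.
rin = {"sample1": 9.5, "sample2": 7.0, "sample3": 8.5, "sample4": 10.0}

# Transform so that the default adjusted value of 0 means "RIN = 10".
rin_deficit = {sample: 10 - value for sample, value in rin.items()}

for sample, value in rin_deficit.items():
    print(sample, value)
```

With the transformed column in the model matrix, adjusting the confounder to 0 asks the model for values as if every sample had high RNA integrity.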


Best Regards,

Barry

Miriam Martínez

May 21, 2025, 8:53:03 AM
to Biociphers
Hi Barry,

Thank you very much for your time and detailed answer! I think that with all this information I understand much better how Moccasin works and I have more knowledge to design future analyses with this tool. Amazing work and assistance from the whole biociphers team. Hope you have a nice day.

Best,

Miriam