Normality assumptions in qtl2

Mark Sfeir

Jun 16, 2022, 3:15:59 PM
to R/qtl2 discussion
A few questions in this post that I'll number, but first, some context:

I'm feeling a little uneasy about the prospect of transforming (e.g., log-transforming) my lab's DO mouse phenotype data to better approximate normality, and I could use some guidance (or pointers to helpful reading) on how to interpret the normality assumptions of qtl2's different functions (and perhaps of other general statistical tests), and on which situations warrant a log or square-root transformation and which don't. I've also read that transformations can sometimes have the opposite of the intended effect, and I want to be aware of when not to use them.

I'm planning to plot histograms of the data to roughly assess their distribution, and I'm considering a more formal Shapiro-Wilk test on the untransformed data, but I also want to be sure which functions require the normality condition for proper operation. 1) For instance, is normality of the data only important when doing Haley-Knott regression, but not for a linear mixed model with a random polygenic effect (as may have been hinted at in the "Non-normal traits" section of this Kbroman presentation, though the reference there may have been strictly to generalized linear models and not to linear mixed models)?

I understand that binary data for logistic regression is a different case. 

2) And is it actually the distribution of the residuals that matters, or is normal distribution of the residuals practically equivalent to normal distribution of the data itself? I want to be sure that I check for normality of the right thing. 
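For concreteness, here is roughly what I have in mind for the initial check; just a minimal R sketch, assuming phe is a positive-valued numeric phenotype vector (the name and the log/sqrt candidates are placeholders):

    ## compare the raw distribution with log and square-root transformations
    ## (phe is a hypothetical phenotype vector; log and sqrt assume positive values)
    par(mfrow = c(1, 3))
    hist(phe, breaks = 30, main = "untransformed", xlab = "phenotype")
    hist(log(phe), breaks = 30, main = "log", xlab = "log(phenotype)")
    hist(sqrt(phe), breaks = 30, main = "sqrt", xlab = "sqrt(phenotype)")

    ## normal QQ plot and Shapiro-Wilk test on the untransformed values
    qqnorm(phe); qqline(phe)
    shapiro.test(phe)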

I'd appreciate input from anybody interested,
Mark Sfeir 

Karl Broman

Jun 18, 2022, 11:58:31 AM
to R/qtl2 discussion
I think normality of residuals is equally important for Haley-Knott regression and the linear mixed models. 

It is normality of residuals rather than just the trait distribution that is assumed, but unless there is some really big QTL effect, you should expect the trait distribution to be approximately normal.
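If you do want to look at residuals directly, one option is to fit a model with just your covariates (no QTL) and examine its residuals; a rough sketch, not specific to qtl2, with hypothetical trait and covariate names:

    ## dat assumed to be a data frame containing the trait and the covariates
    ## you would include in the scan (names here are hypothetical)
    fit <- lm(trait ~ sex + batch, data = dat)
    res <- residuals(fit)

    hist(res, breaks = 30, main = "residuals", xlab = "residual")
    qqnorm(res); qqline(res)
    shapiro.test(res)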

I don't think you need to be uneasy about transforming the phenotypes. I usually just pick between log, square-root, and untransformed. For genome-scale phenotypes like gene expression, it can be useful to just force normality using normal quantiles, as with the nqrank function in R/qtl1. I've never seen anyone question choices of transformation. The only thing to avoid would be selecting the transformation based on the results, like trying a bunch of transformations and picking the one that gave you the largest LOD scores. That would be bad.
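If it helps to see what that transformation amounts to, here is a small sketch of a rank-based normal-quantile transform in the spirit of nqrank, applied to a hypothetical phenotype vector y; the exact tie handling and scaling in nqrank may differ.

    ## convert ranks to normal quantiles; NAs are preserved
    ## (y is a hypothetical phenotype vector)
    nq_transform <- function(x)
        qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))

    y_nq <- nq_transform(y)

    ## or, with R/qtl installed:
    ## library(qtl)
    ## y_nq <- nqrank(y)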

karl

Mark Sfeir

Jun 22, 2022, 10:44:25 AM
to R/qtl2 discussion
Ok, thank you, let me process this...
-Mark

Mark Sfeir

Oct 13, 2022, 3:09:35 PM
to R/qtl2 discussion
Hi,
I have some new questions on this thread. I understand why picking a transformation that results in a higher LOD score is a bad approach. 

1) What about picking the transformation that, when running the Shapiro-Wilk test for normality on the transformed data, results in the highest p-value (and/or the smallest departure from normality as judged on an accompanying normal QQ plot)? That way we're going into the LOD score calculation with a transformed set of data that is as close to normal as reasonably possible. Does this sound like a viable approach?
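Concretely, I mean something like the following; just a sketch, with y a hypothetical positive-valued phenotype vector and these three transformations as example candidates:

    ## Shapiro-Wilk p-values for a few candidate transformations
    ## (y is a hypothetical phenotype vector)
    candidates <- list(untransformed = y,
                       log           = log(y),
                       sqrt          = sqrt(y))
    sapply(candidates, function(v) shapiro.test(v)$p.value)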

Also, in your last response in this thread, Karl, you mentioned the utility of forcing normality using the nqrank function in qtl1. I'm not sure whether I'm failing to grasp the distinction between using nqrank to transform "genome-scale phenotypes like gene expression" and using it on any other phenotype. 2) Would it be OK to use that nqrank approach on the general set of phenotype data for which neither a log transformation nor a square-root transformation leads to a more normally distributed dataset? If not, maybe I need to better understand why this type of transformation should only be used on genome-scale phenotypes.

Log transformation did help normalize some of our study's phenotypes, but not others.

I would appreciate any insight you provide on these two sets of related questions. 
Take care,
Mark

Karl Broman

Oct 13, 2022, 3:43:02 PM
to R/qtl2 discussion
An advantage of using the nqrank transformation for genome-scan phenotypes is that, since all traits have the same distribution, you can use the same significance threshold for each.

If you choose the transformation that gives the biggest LOD score, then you need to account for that when assessing the significance of the results.

You'd maybe be interested in Box-Cox transformations. For example, see Yang et al (2006), https://doi.org/10.1007/s10709-005-5577-z
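A minimal sketch of that approach using boxcox() from the MASS package, assuming a positive-valued trait and hypothetical covariates in a data frame dat:

    library(MASS)

    ## profile log-likelihood over a grid of lambda values
    ## (trait, sex, batch, and dat are hypothetical names)
    bc <- boxcox(trait ~ sex + batch, data = dat, lambda = seq(-2, 2, 0.1))
    lambda_hat <- bc$x[which.max(bc$y)]

    ## apply the chosen power transformation (lambda near 0 corresponds to log)
    trait_bc <- if (abs(lambda_hat) < 1e-8) log(dat$trait) else
                    (dat$trait^lambda_hat - 1) / lambda_hat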

karl

Mark Sfeir

Oct 13, 2022, 4:06:40 PM
to R/qtl2 discussion
Thank you, but again, I'm not going to choose transformations based on LOD score; I was only asking about choosing a transformation based on how much more normally distributed it renders the data, as assessed by a Shapiro-Wilk test p-value and/or a normal QQ plot.

But OK, that is good to know. Might it still be reasonable in some cases to use nqrank on non-genome-scan phenotype data for which no common transformation yields an even approximately normal distribution, with the understanding that the significance of the result would then have to be assessed individually?

I will look into Box-Cox transformations, thank you. 