PCA as covariate


Parisa BoodaghiMalidarreh

Sep 23, 2023, 4:07:41 PM
to plink2-users
Hi,
I have a genotype file and want to run a GWAS on it to find SNPs significantly associated with the trait I am working on. I have two options: the first is --glm allow-no-covars, and the second is --glm --covar out.eigenvec.
I am not sure how important it is to use the PCs as a covariate file, because doing so changes the results significantly.
Can anyone help me with this, or point me to a reference on when to use covariates and PCA?


Zuxi Cui

Sep 25, 2023, 3:57:32 PM
to Parisa BoodaghiMalidarreh, plink2-users
PCs are used to filter out ancestry outliers. You can refer to this detailed tutorial: https://github.com/JoniColeman/gwas_scripts
The tutorial uses EIGENSOFT to calculate the PCA; you get the same thing from PLINK2 --pca. According to the documentation, they use the same algorithm.
Usually, you adjust for the first few PCs, but not all of them, in your GWAS regression.
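
To illustrate "the first few PCs, but not all of them": a minimal Python sketch that trims an eigenvector table down to its ID columns plus the first few PC columns, so the result could be used as a smaller covariate file. The function name `keep_first_pcs` is hypothetical, and the layout (a header row with columns named PC1, PC2, ...) is an assumption about the file; check it against your own out.eigenvec before relying on it.

```python
def keep_first_pcs(eigenvec_text, n_pcs):
    """Return the table with only the ID columns and the first n_pcs PC columns."""
    lines = [ln for ln in eigenvec_text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Assumption: PC columns are named PC1, PC2, ... in the header row.
    pc_cols = [i for i, name in enumerate(header) if name.startswith("PC")]
    id_cols = [i for i in range(len(header)) if i not in pc_cols]
    keep = id_cols + pc_cols[:n_pcs]
    # Re-emit every row (header included) with only the kept columns.
    return "\n".join("\t".join(ln.split()[i] for i in keep) for ln in lines)


# Toy input with made-up values, only to show the shape of the data.
example = """#FID IID PC1 PC2 PC3
F1 S1 0.01 -0.02 0.03
F2 S2 -0.04 0.05 0.06
"""
print(keep_first_pcs(example, 2))
```

How many PCs to keep is a judgment call for your data; the sketch only shows the mechanics of selecting them.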

Terry


Parisa BoodaghiMalidarreh

Oct 12, 2023, 12:29:31 PM
to plink2-users
Thank you so much for your help.
I have another question, which I have asked here several times without getting a response.
I have genotype data for about 3000 people as a training set and 900 people as a test set. They all come from one dataset and have the same distribution of cases and controls. I applied the same QC to both groups (--mind 0.01 --maf 0.01 --hwe 1e-10) and (--indep-pairwise 100 5 0.2).
Then I ran the GWAS on the training data with --glm and the PCs as --covar, which gave me the p-values and ORs.
Finally, I want to use the OR values with --score to calculate the polygenic risk score in the test (unseen) data. I expected a considerable difference between the scores of cases and controls in the test data, but there is no difference at all. However, when I repeat this on the training data that the GWAS came from, I do see a considerable difference between cases and controls.
Can you help me with this problem? I think I am overfitting, but I do not know how to solve it.
It is urgent for me.



Matthew Maher

Oct 12, 2023, 7:31:37 PM
to Parisa BoodaghiMalidarreh, plink2-users
As far as I know, PRSes are generally calculated using BETAs, not ORs.

So you could either convert the values yourself (BETA = ln(OR)) or just ask PLINK2 to include the BETA in the --glm output by adding something like 'cols=+beta' to the --glm switch.
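
A minimal Python sketch of the manual conversion BETA = ln(OR) mentioned above. The function name `or_to_beta` is hypothetical, and the odds ratios in the loop are made-up illustration values, not results from any dataset.

```python
import math


def or_to_beta(odds_ratio):
    """Convert an odds ratio to a log-odds effect size (BETA = ln(OR))."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.log(odds_ratio)


# Illustration values only.
for odds_ratio in (0.5, 1.0, 2.0):
    print(odds_ratio, round(or_to_beta(odds_ratio), 4))
```

An OR of 1 (no effect) maps to a BETA of 0, and ORs of 2 and 0.5 map to weights of equal size and opposite sign, which is what an additive score needs.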

see:
I hope that helps


Parisa BoodaghiMalidarreh

Oct 13, 2023, 2:28:37 AM
to plink2-users
Thank you so much. Do you mean that in the columns for --score I should use the BETA column instead of the OR column?
I did that, but nothing changed. I can still see a difference between the PRS of cases and controls in the training data used for --glm, but I cannot see any difference between the PRS of cases and controls in the test (unseen) data.
Do you have any suggestions for me?
Is it likely that the --glm model is overfitted to my dataset?

Matthew Maher

Oct 13, 2023, 10:23:47 AM
to Parisa BoodaghiMalidarreh, plink2-users
Perhaps there really is no signal in the data? Does a QQ plot of the GWAS actually show enrichment for significant associations?

I don't believe it's valid (or proves anything) to test the PRS in cases vs. controls in the very data that was the source GWAS population, since I think that would always show a difference. To test that, perhaps try to randomize/scramble all the ID-phenotype associations and rerun the GWAS. Just by random chance, there will be some SNPs that associate with the now-random phenotype in this specific cohort. I would expect that a PRS calculated from these random winner SNPs would still show a case-control difference. But that's just circular math, I believe.
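
The circularity argument can be sketched with a small pure-Python simulation. Everything here is an assumption made for illustration: the genotypes are random with no true effect, the phenotype is assigned at random, and the "GWAS" is just a per-SNP difference in mean allele count between cases and controls, not the logistic regression that --glm runs. The function name `simulate` and all sizes are hypothetical.

```python
import random


def simulate(n_people=200, n_snps=300, top_k=20, seed=1):
    """Return (train_gap, test_gap): case-minus-control mean PRS in each cohort."""
    rng = random.Random(seed)

    def cohort():
        # Genotypes: 0/1/2 allele counts at frequency 0.5, with no true effect.
        geno = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
                for _ in range(n_people)]
        # Phenotype: half cases, half controls, assigned at random.
        pheno = [1] * (n_people // 2) + [0] * (n_people - n_people // 2)
        rng.shuffle(pheno)
        return geno, pheno

    def mean_diff(values, pheno):
        cases = [v for v, y in zip(values, pheno) if y == 1]
        ctrls = [v for v, y in zip(values, pheno) if y == 0]
        return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)

    train_g, train_y = cohort()
    test_g, test_y = cohort()

    # Stand-in "GWAS": case-minus-control mean allele count per SNP (training only).
    weights = [mean_diff([row[j] for row in train_g], train_y)
               for j in range(n_snps)]
    # Keep the "winner" SNPs with the largest absolute effect.
    top = sorted(range(n_snps), key=lambda j: abs(weights[j]), reverse=True)[:top_k]

    def prs_gap(geno, pheno):
        scores = [sum(weights[j] * row[j] for j in top) for row in geno]
        return mean_diff(scores, pheno)

    return prs_gap(train_g, train_y), prs_gap(test_g, test_y)


train_gap, test_gap = simulate()
print(f"train gap: {train_gap:.3f}, test gap: {test_gap:.3f}")
```

With these weights, the training gap equals the sum of the squared weights of the selected SNPs, so it is positive by construction even though the data contain no signal. The test gap has no such guarantee and should hover around zero.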

Parisa BoodaghiMalidarreh

Oct 20, 2023, 2:48:01 PM
to plink2-users
Thanks,
I have the following QQ plot.
For this result I added BMI as a covariate alongside PC1 and PC2. What do you think about the plot? Does it show any significant SNPs?
I have another question: my phenotype is in the .fam file that I pass in with --bfile. Do I also need to supply it again with a separate phenotype option?
Screenshot 2023-10-20 at 1.44.24 PM.png

Matthew Maher

Oct 24, 2023, 8:04:44 PM
to plink2-users
Most GWAS QQ plots will show a strong line right up the diagonal (i.e. the P-value distribution is as expected by chance) with (hopefully) an uplifted tail towards the end, where some limited # of SNPs end up more significant than expected by chance. Your plot seems to show a massive amount of inflation, i.e. a very large # of unexpectedly significant variants across a wide spectrum of P-values.
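
One common way to put a number on this kind of inflation is the genomic inflation factor (lambda), which is not mentioned in the post itself and is added here as an illustration. A minimal Python sketch, assuming you have the GWAS P-values as a list of floats; the function name `genomic_inflation` is hypothetical and the two P-value lists are synthetic.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic divided by its null median."""
    # Convert each two-sided P-value to a chi-square statistic; skip invalid values.
    chi2 = [NormalDist().inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    return median(chi2) / CHI2_1DF_MEDIAN


# Synthetic inputs: evenly spaced P-values mimic the null (points on the diagonal);
# squaring them shifts the whole distribution towards small values (inflation).
n = 1000
null_p = [(i + 0.5) / n for i in range(n)]
inflated_p = [p ** 2 for p in null_p]
print(round(genomic_inflation(null_p), 2), round(genomic_inflation(inflated_p), 2))
```

A value near 1 corresponds to a QQ plot that hugs the diagonal; values well above 1 correspond to the across-the-board lift described above.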

I could be wrong, but I would think this is saying that something is fundamentally different about your cases and controls. Possibly ancestry? You said you added 2 PCs, but I believe a more typical # might be 10. At a coarse level, you could project the cases and controls onto an ancestry-informative reference set (e.g. 1KG) to see if the plots look similar (e.g. it's not all European cases and Asian controls).

But more fundamentally, you need to confirm that your cases and controls are (except for Case/Control status) equivalent - e.g. did they come from the same experimental process?  were they similarly mixed together across batches?  etc.
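
A quick way to look at the batch question is to tabulate the fraction of cases in each batch. This is a minimal Python sketch; the function name `case_fraction_by_batch`, the batch labels, and the sample list are all hypothetical, and real batch information would have to come from your own lab or sample records.

```python
from collections import defaultdict


def case_fraction_by_batch(samples):
    """samples: iterable of (batch_label, is_case) pairs -> {batch: case fraction}."""
    counts = defaultdict(lambda: [0, 0])  # batch -> [n_cases, n_total]
    for batch, is_case in samples:
        counts[batch][0] += int(is_case)
        counts[batch][1] += 1
    return {batch: cases / total for batch, (cases, total) in counts.items()}


# Made-up example: two plates with different case/control mixes.
samples = [("plate1", True), ("plate1", True), ("plate1", False),
           ("plate2", False), ("plate2", False), ("plate2", True)]
print(case_fraction_by_batch(samples))
```

If the case fraction differs sharply between batches, batch effects become a plausible source of case-control differences and are worth ruling out.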

I hope that helps - that's all I got.