issues with explained variance in PCA plots


Matteo Ungaro

Jul 8, 2024, 1:30:53 PM
to plink2-users
Hi there,

I generated two PCAs, one for SNPs and one for INDELs (I'm posting only the SNP commands), using the SGDP panel as follows:

  • bcftools norm --threads 32 -m+ SGDP_panel_clean.vcf.gz | bcftools view --threads 32 -m2 -M2 -v snps -Oz -o SGDP_panel_bi_snps.vcf.gz
  • plink2 --vcf SGDP_panel_bi_snps.vcf.gz --set-missing-var-ids @:#:\$r:\$a --rm-dup --indep-pairwise 200kb 0.5 --not-chr X,Y,MT,EBV --vcf-half-call m --out SGDP_panel_bi_snps
  • plink2 --vcf SGDP_panel_bi_snps.vcf.gz --set-missing-var-ids @:#:\$r:\$a --not-chr
    X,Y,MT,EBV --vcf-half-call m --maf 0.05 --extract SGDP_panel_bi_snps.prune.in
    --make-pgen --pca --out SGDP_panel_bi_snps
However, the variance explained by both the first and second PCs is quite high, as shown in the attached images. Is there any reason for that?

Thanks in advance. If helpful, I can post the eigenvector and eigenvalue outputs.

Matteo

[Attachments: snp.png, indel.png]


Christopher Chang

Jul 8, 2024, 3:09:44 PM
to plink2-users
Can you clarify why you think this is a problem?

Matteo Ungaro

Jul 8, 2024, 3:46:44 PM
to plink2-users
Hi Chris, 

Indeed, initially I thought everything was fine.

However, when I consulted other people in genetics/population genetics, they pointed out that such values are "abnormally" high for human populations. They also noted that the shape is consistent with what is observed on other large-scale panels, e.g. 1KG. So my main aim is to get another opinion on whether everything was done correctly and the results sound reasonable. Let me know, thanks!

Matteo

Chris Chang

Jul 8, 2024, 7:24:30 PM
to Matteo Ungaro, plink2-users
Assuming that something analogous to "--hwe keep-fewhet" was used to remove obvious too-many-hets variants from the initial VCF, the commands look fine. The SGDP has especially high geographic diversity, so it is not that surprising to me that a larger share of variation lines up with the top two PCs than you typically see elsewhere.
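
When comparing such percentages across panels, it may also help to check how the per-PC share is computed. The thread does not show the plotting code, so the following is only a minimal Python sketch under stated assumptions: the eigenvalue file is assumed to hold one value per line, the numbers are made up, and `variance_share`, `read_eigenvalues`, and `hypothetical_total` are illustrative names, not part of plink2. It shows that dividing by the sum of only the retained eigenvalues (plink2 --pca keeps the top 10 by default) gives larger shares than dividing by an estimate of the total variance.

```python
def variance_share(eigenvalues, total_variance=None):
    """Return each eigenvalue as a fraction of total_variance.

    If total_variance is None, the sum of the supplied eigenvalues is used.
    That overstates the shares when only the top few PCs were retained.
    """
    if total_variance is None:
        total_variance = sum(eigenvalues)
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return [ev / total_variance for ev in eigenvalues]


def read_eigenvalues(path):
    """Read one eigenvalue per line from a text file (layout is an assumption)."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


# Made-up eigenvalues for illustration only; these are NOT from the thread.
top_eigenvalues = [30.0, 15.0, 5.0, 2.5, 2.5, 2.0, 1.5, 1.5, 1.0, 1.0]

# Hypothetical total variance, e.g. the trace of the relationship matrix.
# The caller has to supply this; it is not in the eigenvalue file.
hypothetical_total = 300.0

share_of_retained = variance_share(top_eigenvalues)
share_of_total = variance_share(top_eigenvalues, hypothetical_total)

print(f"PC1 share of retained PCs:   {share_of_retained[0]:.1%}")
print(f"PC1 share of total variance: {share_of_total[0]:.1%}")
```

With real data, `read_eigenvalues("SGDP_panel_bi_snps.eigenval")` would replace the made-up list, assuming that file name follows from the `--out` prefix used above. Whichever denominator is chosen, it needs to be the same one used for the panels being compared against.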


Matteo Ungaro

Jul 9, 2024, 4:11:58 AM
to plink2-users
Hi Chris thanks,

I haven't used any filter on the hets, though I might give it a try. However, the explained variance is 5-10 fold higher than what is seen in other cases, so I'm not sure such a filter would bring the values down much.

What you mentioned makes more sense in the context of these analyses; in fact, I had overlooked that the SGDP harbors much greater diversity than any other human panel. Thanks again for the insight!

Matteo