Reusing PCA loadings file by PLINK

806 views
Skip to first unread message

Sulev Reisberg

unread,
Oct 23, 2018, 3:18:55 AM10/23/18
to plink2-users
PLINK is a fast and convenient tool to do a PCA for the samples. It also provides a loadings matrix as an output (when using var-wts flag).

Is it possible to reuse the calculated loadings matrix as an input for calculating the PCA positions for some other (new) samples? I want to see where are my new samples located on the "old" PCA plot.

It seems to be "doable" by using "--score" function, but the problem is that new sample data need to be scaled (centered) beforehand scoring. Unfortunately, I didn't find an option to do the scaling with PLINK.. I believe the scaling is conducted inside the "pca" function but seems that it can't be run separately.

Of course, alternatively I could do this in R, but R is much worse than PLINK when dealing with >600K SNPs in the data...

So my question is whether it is somehow still doable with PLINK.

Sulev Reisberg

unread,
Oct 23, 2018, 4:36:17 AM10/23/18
to plink2-users
I found that PCA has an option to "standardize variance", but I'm not sure I'm using it correctly. Here are my steps.

#conduct PCA with 2 principal components
./plink2 --bfile mybfile --pca 2 var-wts --make-rel square --out pca

#variant weights now stored to pca.eigenvec.var
#PC1 values given in column 3:
head pca.eigenvec

#FID IID PC1 PC2
SAMPLE1 SAMPLE1 -0.000336055 0.0258372
SAMPLE2 SAMPLE2 -5.94524e-05 0.0263202
SAMPLE3 SAMPLE3 0.000611151 0.0255399


#try to apply PC1 for the same data (to check whether the scores match)
./plink2 --bfile mybfile --variance-standardize --score pca.eigenvec.var 2 4 5 header #2=variant id col, 4=minor allele col, 5=PC1, header skips header line

#expect to see the same PC1 values in column 5 (but do not see)
head plink2.sscore

#FID IID NMISS_ALLELE_CT NAMED_ALLELE_DOSAGE_SUM SCORE1_AVG
SAMPLE1 SAMPLE1 1273440 300233 -0.0317194
SAMPLE3 SAMPLE2 1272066 299363 -0.0310998
SAMPLE3 SAMPLE3 1273690 298153 -0.0293866

Christopher Chang

unread,
Oct 23, 2018, 11:54:22 AM10/23/18
to plink2-users
You have the right idea, but yes, there's a difference between the --variance-standardize flag, which affects phenotypes and covariates, and --score's 'variance-standardize' modifier, which affects genotypes during --score computation.  The correct syntax should be

  plink2 --bfile mybfile --score pca.eigenvec.var 2 4 5 header variance-standardize

Sulev Reisberg

unread,
Oct 24, 2018, 2:11:31 AM10/24/18
to plink2...@googlegroups.com
Thank you,

it helped a lot as the principal component values from "pca" and loadings from the same pca result+"score" are now much closer to each other. However, still not exactly the same, e.g 0.000611151 from PCA and 0.00267541 from loadings+"score". Probably it is normal to have such difference due to rounding issues (SNP count in my case is >600,000) or can there be another reason?

Sulev

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Christopher Chang

unread,
Oct 30, 2018, 10:53:08 AM10/30/18
to plink2-users
Oh, forgot to mention that the variance-standardization is based on allele frequencies. So to make --score replicate --pca more precisely:
1. Use --freq to export reference-dataset allele frequencies for the variants you're projecting.
2. Use --read-freq to load these frequencies during your --score run.

Isobel Howard

unread,
Jan 16, 2023, 8:13:20 AM1/16/23
to plink2-users
Hi there,

I am trying to project a set of new samples on to principal components built on 1000 genomes data. When I come to graph the results of PC1 and PC2, the projected PC1 of the new samples is as expected, but the PC2 of the new samples is much smaller than expected. I have used the --freq and --read-freq flags to include reference allele frequencies. Any idea why I may be having this problem?

Christopher Chang

unread,
Jan 16, 2023, 1:28:58 PM1/16/23
to plink2-users
See https://www.cog-genomics.org/plink/2.0/score#pca_project for the current documentation on this; there is a remark about the different scaling of --score and --pca output.
Reply all
Reply to author
Forward
0 new messages