plink --pca --projection?

3,963 views
Skip to first unread message

Wei-Min Chen

unread,
Jun 17, 2017, 3:26:35 AM6/17/17
to plink2-users
It looks PCA has been well implemented in PLINK. I am wondering if a PCA projection analysis can be implemented as well? E.g., in a hapmap-rooted PCA, each sample is projected onto the hapmap PC space. Now that plink --king-cutoff option is available for selecting a subset of unrelated individuals, often I would like to project related individuals onto the unrelated PC space. 

For the implementation, samples for PCA (e.g., hapmap samples) can be indicated as 1 in  the 6th column of the .fam file, while samples to be projected can be indicated as 2. This is how is implemented in KING. However, PCA in KING is substantially slower than PLINK (even with the use of LAPACK) so it would be nice if we can use PLINK for PCA instead. 

Do you think it is possible or worth the effort?

Thanks!

Christopher Chang

unread,
Jun 20, 2017, 12:38:19 PM6/20/17
to plink2-users
Yes, it's likely that something like PLINK 1.9's --pca-clusters/--pca-cluster-names will eventually make it into 2.0.  With that said, PCA projection is actually already supported, the workflow is just a bit more convoluted for now.

Step 1: Export allele frequencies and PCA variant weights from your reference dataset.  E.g.
  plink2 --bfile hapmap --freq --pca var-wts --out pca_hapmap

Step 2: Use --score to compute the necessary dot products with the variant weights.  E.g.
  plink2 --bfile mydata --read-freq pca_hapmap.afreq --score pca_hapmap.eigenvec.var 2 3 header-read no-mean-imputation variance-normalize --score-col-nums 5-14 --out pca_proj_mydata

These PCs will be scaled a bit differently from pca_hapmap.eigenvec (you need to multiply or divide the PCs by sqrt(eigenvalue) to put them on the same scale).  One way to make the PCs directly comparable is to also run step 2 on the original dataset.

Wei-Min Chen

unread,
Jun 20, 2017, 1:33:09 PM6/20/17
to plink2-users
Thanks for the helpful instruction on how to get it done in PLINK 1.9! I'll give it a try later.

Christopher Chang

unread,
Jun 20, 2017, 1:37:38 PM6/20/17
to plink2-users
To clarify, the detailed instructions are for PLINK 2.0.  They do not work for PLINK 1.9 (which is missing most of the necessary --score options), but 1.9 lets you use --pca-clusters/--pca-cluster-names if you've merged the hapmap data with your new genomes.

Wei-Min Chen

unread,
Jun 20, 2017, 1:42:22 PM6/20/17
to plink2-users
Great! That's even better. So I do not need to go back to PLINK 1.9.

Francesco Bettella

unread,
May 14, 2019, 9:37:39 AM5/14/19
to plink2-users
hello there,

not sure this thread is still relevant but i bumped into it looking for ways to project relatives onto PCs calculated from a maximal set of unrelated individuals.
plink 1.9's online documentation says:

If clusters are defined (via --within/--family), you can base the principal components off a subset of samples and then project everyone else onto those PCs with --pca-cluster-names and/or --pca-clusters

which kind of sounds exactly like what I want to do. however, the clusters defined by the '--family' options are *not* really what I need. rather, they would enable the calculation of PCs within a family and the eventual projection of other families onto those PCs. of course one could define a cluster by selecting only *one* representative from each family and leave the relatives for another complementary cluster. but this would be a quite indirect and overly complicated way of using the '--family' clustering functionality.
am I missing something obvious or is this statement in the documentation slightly misleading?

cheers,
Francesco

Christopher Chang

unread,
May 14, 2019, 11:55:13 AM5/14/19
to plink2-users
Yes, in practice, you wouldn’t want to use this with —family. I’ll just delete the “/—family” from that sentence, since it’ll still be technically correct.

Francesco Bettella

unread,
May 14, 2019, 11:58:28 AM5/14/19
to Christopher Chang, plink2-users
thank you Christopher.

just sanity checking (my own).. ;)

cheers,
F

On Tue, 14 May 2019 at 17:55, Christopher Chang <chrch...@gmail.com> wrote:
>
> Yes, in practice, you wouldn’t want to use this with —family. I’ll just delete the “/—family” from that sentence, since it’ll still be technically correct.
>
> --
> You received this message because you are subscribed to a topic in the Google Groups "plink2-users" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/plink2-users/W6DL5-hs_Q4/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to plink2-users...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/ec6730f9-a3d1-4910-b16d-5c0f8823136a%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
Message has been deleted

Kate Shim

unread,
Jul 22, 2019, 10:53:09 AM7/22/19
to plink2-users
Hi Christopher, 

Thank you very much for your kind help and detailed instruction.
I tried to follow the steps in your instruction with 1kG data and got results as below. 
Sample HG00096 was in the original dataset, so I guess row A (projected PCs) is supposed to be the same as row B (PCs from 1kG including HG00096).
As trying to put the values on the same scale, I found a conversion formula with negligible errors. (Row D and E) 
Does this formula (computedPC=projectedPC/(-sqrt(eigenvalue)/2)) make sense? Would this formula work consistently in any situations?   

Sample: HG00096PC1PC2PC3PC4PC5PC6PC7PC8PC9PC10
Apca_proj_HG00096.sscore-0.09594720.1515180.02434920.02646598.83E-050.00193974-0.00885685-0.000327014-0.00370184-0.00283137
Bpca_1kg_pruned.eigenvec0.0133146-0.0344978-0.0118639-0.0171922-8.94E-05-0.002016070.009780140.0003801630.005562810.00433521
Cpca_1kg_pruned.eigenval207.71577.162716.8499.479163.902093.702853.280422.959741.771361.70621
DA/(-sqrt(C)/2)0.0133146-0.0344977-0.0118639-0.0171922-0.0000894-0.00201610.00978010.00038020.00556280.0043352
EB-D0.0000000-0.00000010.00000000.00000002.12E-100.00000000.00000000.00000000.00000000.0000000

Thanks, 
Kate


2017년 6월 21일 수요일 오전 1시 38분 19초 UTC+9, Christopher Chang 님의 말:

Christopher Chang

unread,
Jul 23, 2019, 12:36:05 AM7/23/19
to plink2-users
Something like that always works, yes; just make sure you're working with a build from 31 Jan 2019 or later (there was an eigenvalue-reporting bugfix then).

Camila Tagliabue

unread,
Aug 26, 2019, 9:06:52 PM8/26/19
to plink2-users

Hi! I've been trying to use the steps described here to do a PCA projection analysis, but for some reason the process gets stuck when reading the .afreq file. This is what appears on the screen:


Options in effect:

--bfile datosbhr10

--extract bhr10armenians

--out proy10arm-sobre13

--read-freq b13pca.afreq

--score b13pca.eigenvec.var 2 3 header-read no-mean-imputation variance-normalize

--score-col-nums 5-14


Start time: Mon Aug 26 20:02:11 2019

Note: --score's 'variance-normalize' modifier has been renamed to the more

precise 'variance-standardize'.

7841 MiB RAM detected; reserving 3920 MiB for main workspace.

Using up to 4 compute threads.

466 samples (85 females, 378 males, 3 ambiguous; 466 founders) loaded from

datosbhr10.fam.

206495 variants loaded from datosbhr10.bim.

Note: No phenotype data present.

--extract: 0 variants remaining.

--read-freq: PLINK 2 --freq file detected.


And it stays like that for hours until I stop it. It doesn’t give me any error message or anything.

Any idea why or how to solve it?


Thanks

Christopher Chang

unread,
Aug 26, 2019, 9:27:50 PM8/26/19
to plink2-users
This is probably because there are no variants after —extract; if so, I’ll post a fix for the bug today (it should immediately error out instead of getting into a loop).

With that said, you need to synchronize the IDs in your —extract file with those in your .bim file.

Camila Tagliabue

unread,
Aug 26, 2019, 10:11:34 PM8/26/19
to plink2-users
The problem was that I was trying to use the --extract with family IDs, now I found the --keep-fam and it works. Thank you!

Isobel Howard

unread,
Mar 24, 2023, 7:51:14 AM3/24/23
to plink2-users
Hi there,

Apologies for posting on an old thread, it is the most relevant one I could find. I have followed the guidance above and performed PCA on the build 37 phase 3 reference data from 1000 Genomes. 
I then performed step 2 on the output of this so that pca projection would be directly comparable between reference and the new samples I wan to project.
When trying to project the new samples I got this error, and can't find any guidance online as to the problem:

Error: --score variance-standardize failure for variant '37:1:2106896:T:C':

estimated allele frequency is zero or NaN, but not all dosages are zero. (This

is possible when e.g. allele frequencies are estimated from founders, but the

allele is only observed in nonfounders.)


Any tips would be appreciated as to why I am getting this error. Thank you!


Christopher Chang

unread,
Mar 24, 2023, 6:52:26 PM3/24/23
to plink2-users
As noted in the --pca documentation, it is important to remove low-allele-frequency variants from your reference dataset before performing PCA.  "--maf 0.01" should do the trick for 1000 Genomes phase 3.

Roughly speaking, once allele frequency gets too low, a variant's contribution to the PCA computation starts becoming wildly unstable.  In the extreme case where an allele has a frequency of 0 in your reference dataset but does appear in your own data, the calculation simply fails (divide by zero).
Reply all
Reply to author
Forward
0 new messages